Bedrock + Mantle Benchmark Report

Scope: all runs (full history)

24475 calls · 11078 ok · 13397 failed/gaps · 25 models · 15 regions · 11 runs in DB profile

Window: — → —. All figures are averages over every matching call (sample counts shown per row / in cell tooltips).

Recommendations — best model per use-case

This measures speed, not answer quality. Rankings reward whichever model returns tokens fastest — usually the smallest. Use them to pick the fastest model at an acceptable quality tier, paired with your own quality judgement (e.g. Opus/Sonnet for hard work, a small model for high-volume simple turns). Chat ranks by time-to-first-token; the rest by total wall-clock time.
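The ranking rule described above (chat by time-to-first-token, everything else by total wall-clock time) can be sketched as follows. This is a minimal illustration, not the report's implementation: `ModelStats`, its field names and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    """Hypothetical per-model aggregate for one use-case."""
    name: str
    ttfb_s: float   # mean time to first token, seconds
    total_s: float  # mean total wall-clock time, seconds


def rank_models(stats, use_case, top_n=3):
    """Chat ranks by TTFB; every other use-case ranks by total time."""
    if use_case == "chat":
        key = lambda m: m.ttfb_s
    else:
        key = lambda m: m.total_s
    return [m.name for m in sorted(stats, key=key)[:top_n]]


# Made-up numbers, chosen so the two orderings differ.
sample = [
    ModelStats("model-a", ttfb_s=0.20, total_s=3.0),
    ModelStats("model-b", ttfb_s=0.50, total_s=1.2),
    ModelStats("model-c", ttfb_s=0.35, total_s=2.0),
]
chat_ranking = rank_models(sample, "chat")
coding_ranking = rank_models(sample, "coding")
```

In this made-up sample the model with the fastest first token finishes last on total time, which is why the two rankings are kept separate.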

Chat / low-latency

Short interactive turns. Dominated by TTFB — how fast the first token arrives.

  1. 🥇 deepseek-r1 (geo) · 0.13s first token
  2. 🥈 llama4-scout (geo) · 0.26s first token
  3. 🥉 llama4-maverick (geo) · 0.26s first token

Coding

Code generation and structured output. Ranked by total time to a complete answer — the real wait, folding in first-token latency and generation speed.

  1. 🥇 nova-micro · 1.05s total · 485 tok/s
  2. 🥈 nova-2-lite (geo) · 1.76s total · 303 tok/s
  3. 🥉 llama4-scout (geo) · 1.78s total · 222 tok/s

Creative / long-form

Essays, stories, summaries. Ranked by total time to finish: a model with blistering throughput but a slow first token is not actually fast here.

  1. 🥇 nova-micro · 1.51s total · 298 tok/s
  2. 🥈 nemotron-nano-12b · 1.93s total · 194 tok/s
  3. 🥉 nova-2-lite (geo) · 2.76s total · 203 tok/s

Reasoning / math

Multi-step reasoning. Total time to a complete answer.

  1. 🥇 nova-micro · 1.09s total · 570 tok/s
  2. 🥈 llama4-scout (geo) · 1.39s total · 175 tok/s
  3. 🥉 llama4-maverick (geo) · 1.86s total · 196 tok/s

Speed: time to first token (TTFB)

Time until the first token arrives — the latency you feel in a chat box. Lower is better. Bars green→red = best→worst.
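As a rough sketch of how TTFB can be timed on the client side, the helper below wraps any iterable of response chunks. It is an assumption about the measurement approach, not the benchmark's own code; `measure_ttfb` and `fake_stream` are hypothetical names, and the fake stream stands in for a real streaming response.

```python
import time


def measure_ttfb(stream, clock=time.perf_counter):
    """Return (ttfb_seconds, total_seconds, chunks) for an iterable of chunks.

    The clock starts when iteration starts; a real client would start it
    just before sending the request. ttfb is None if nothing arrives.
    """
    start = clock()
    ttfb = None
    chunks = []
    for chunk in stream:
        if ttfb is None:
            ttfb = clock() - start  # the first chunk has arrived
        chunks.append(chunk)
    total = clock() - start
    return ttfb, total, chunks


def fake_stream(delay_s=0.05, n_chunks=3):
    """Stand-in for a streaming response: waits, then yields chunks."""
    time.sleep(delay_s)
    for i in range(n_chunks):
        yield f"chunk-{i}"


ttfb, total, chunks = measure_ttfb(fake_stream())
```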

Throughput: output tokens / second

Sustained generation speed once streaming starts. Higher is better. Buffered (non-streaming) models excluded — their throughput isn't client-measurable.
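A minimal sketch of the throughput calculation and the buffered exclusion, using the rule stated in the glossary (a reply delivered in ≤2 chunks, or at more than 1000 tok/s, counts as buffered). The function name, the input shape and the choice to measure the window from first to last chunk are assumptions made for illustration.

```python
MAX_BUFFERED_CHUNKS = 2      # a reply in <= 2 chunks is treated as buffered
MAX_PLAUSIBLE_TPS = 1000.0   # faster than this is treated as buffered


def streaming_metrics(chunk_times_s, output_tokens):
    """Return throughput and inter-token latency, or None if buffered.

    chunk_times_s: arrival time of each chunk, in seconds since the request.
    Assumption: throughput = output tokens / (last chunk time - first chunk time).
    """
    n = len(chunk_times_s)
    window = chunk_times_s[-1] - chunk_times_s[0] if n else 0.0
    if n <= MAX_BUFFERED_CHUNKS or window <= 0:
        return None  # not truly streamed: metrics are not measurable
    tps = output_tokens / window
    if tps > MAX_PLAUSIBLE_TPS:
        return None  # implausibly fast: the server dumped the reply
    return {"throughput_tps": tps, "inter_token_ms": 1000.0 / tps}


streamed = streaming_metrics([0.25, 0.75, 1.25], output_tokens=150)
buffered = streaming_metrics([0.30, 0.31], output_tokens=200)
```

Calls that return `None` here would be left out of the throughput and inter-token averages while still contributing TTFB and total time.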

Latency by region — pick a model

Same model, different regional endpoint = pure routing/geo latency. Pick a model to see its TTFB in every region where it is available, sorted fastest first (green→red). Lower is better. The full heat-matrix for all models is below.

Heat-matrix — mean TTFB (s), every model in every region

fast mid slow  ·  · = not available/run  ·  hover a cell for sample count
Region columns, in order: ap-northeast-1, ap-northeast-2, ap-northeast-3, ap-south-1, ap-southeast-1, ap-southeast-2, eu-central-1, eu-north-1, eu-west-1, eu-west-2, eu-west-3, us-east-1, us-east-2, us-west-1, us-west-2. Rows with fewer than 15 values cover only the regions where the model was available and run.

Model                          Mean TTFB (s)
llama4-scout (geo)             0.27  0.26  0.27  0.24
nova-micro (direct)            0.33  0.21  0.25
llama4-maverick (geo)          0.28  0.32  0.30  0.26
nemotron-nano-12b (direct)     0.39  0.35  0.39  0.36  0.36  0.38  0.37  0.40
nova-2-lite (geo profile)      0.33  0.39  0.41  0.39  0.42  0.41  0.43  0.39  0.42
nemotron-nano-9b (direct)      0.46  0.41  0.49  0.48  0.64  0.44  0.40  0.55
llama3.3-70b (geo)             0.51  0.50  0.49
nova-2-lite                    0.50  0.53  0.63  0.60  0.54  0.53  0.60  0.49  0.54  0.54  0.43  0.41  0.40  0.37
deepseek-r1 (geo)              0.42  0.77  0.53
haiku-4.5 (geo profile)        0.77  0.73  0.66  0.85  0.87  0.85  0.88  0.87  0.90  0.86  0.82  0.85
nemotron-nano3-30b (direct)    0.41  0.40  0.35  0.42  0.48  5.15  0.35  0.39
nemotron-super3-120b (direct)  2.59  0.46  0.61  0.48  1.63  0.42  1.06
haiku-4.5                      0.99  0.98  1.05  1.11  0.99  1.03  2.12  1.20  1.11  1.19  1.25  1.12  1.08  1.05  1.23
sonnet-4.6                     1.21  1.27  1.20  1.27  1.23  1.25  1.20  1.31  1.39  1.27  1.34  1.61  1.24  1.18  1.33
opus-4.7                       1.43  1.40  1.36  1.49  1.29  1.26  1.48  1.58  1.61  1.58  1.65  1.44  1.40  1.51  1.39
opus-4.8                       2.20  2.24  2.42  2.02  2.59  2.37  2.02  2.11  2.02  2.05  2.19  2.36  2.02  2.28  2.10
sonnet-5                       2.50  2.51  2.69  2.88  2.61  2.76  2.88  2.44  2.36  2.63  2.95  2.37  2.45  2.68  2.76
fable-5 (geo)                  3.41  3.49  3.56  3.54
deepseek-v3.2 (direct)         1.10  0.57  2.54  0.77  24.56  0.69  0.89  0.74
llama3.1-405b (direct)         7.44
gpt-5.4
gpt-5.5
gpt-oss-120b
gpt-oss-20b
grok-4.3

Trend over time

TTFB per run for the fastest models: solid line = median (p50), shaded band = p5–p95 spread. A widening band means the tail is degrading even if the median looks fine, which is the same signal that fire rate captures. Each point is one benchmark run.
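The median and p5–p95 band for one run can be computed with the standard library, as sketched below. The sample values are made up, and the choice of the `inclusive` quantile method is an assumption; the report does not say which percentile definition it uses.

```python
import statistics


def run_band(ttfb_samples_s):
    """Return p5, p50 and p95 of one run's TTFB samples (needs >= 2 samples)."""
    # n=20 gives 19 cut points at 5% steps: index 0 is p5, index -1 is p95.
    cuts = statistics.quantiles(ttfb_samples_s, n=20, method="inclusive")
    return {
        "p5": cuts[0],
        "p50": statistics.median(ttfb_samples_s),
        "p95": cuts[-1],
    }


# Made-up runs: same median, but run_b has a slow tail.
run_a = [0.28, 0.29, 0.30, 0.30, 0.31, 0.31, 0.32, 0.33, 0.34, 0.35]
run_b = [0.28, 0.29, 0.30, 0.30, 0.31, 0.31, 0.32, 0.33, 1.50, 2.40]
band_a = run_band(run_a)
band_b = run_band(run_b)
```

Here both runs share the same median, but `band_b["p95"]` is far higher, which is the widening-band case the chart is meant to expose.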

Global vs Geo vs In-region — the cost of the router

These three are not interchangeable. in-region = the model is served in the region you called (true residency). geo = a regional inference profile (us./eu./…) that routes within a geography. global = the cross-region router. Empirically (months of sampling at 5-minute intervals), global is a latency tax, not a win: it queues rather than rerouting to free capacity, so it runs slower than direct profiles and degrades hardest under load. The comparison is like-for-like, restricted to base models present in more than one scope: haiku-4.5, nova-2-lite. Lower is better.

Scope   Calls  TTFB avg (s)  TTFB p95 (s)  Fire rate
geo     1373   0.643         1.105         0%
global  1893   0.848         1.440         0%

Scope × region — mean TTFB (s)

Same comparison broken out by region. Hover a cell for call count and fire count.

Region          geo   global
ap-northeast-1  0.55  0.75
ap-northeast-2  ·     0.76
ap-northeast-3  0.73  1.05
ap-south-1      ·     0.87
ap-southeast-1  ·     0.80
ap-southeast-2  0.66  0.78
eu-central-1    0.62  1.32
eu-north-1      0.64  0.90
eu-west-1       0.62  0.80
eu-west-2       0.88  0.87
eu-west-3       0.65  0.89
us-east-1       0.65  0.78
us-east-2       0.64  0.75
us-west-1       0.61  0.72
us-west-2       0.64  0.80

Time of day — UTC hour (0 hour-buckets, 0 calls)

Mean TTFB and fire rate by UTC hour, across all runs. Bedrock capacity shifts with the clock — empirically the EU window (UTC ~15:00–03:00) runs hotter than the US window. This fills in as scheduled runs (00/06/12/18) accumulate; with runs clustered in one window it'll look flat. Lower is better.

Per-model detail

best mid worst — cells heat-graded within each column. OK = calls succeeded / attempted (the sample size).
Model                          Provider  OK (n)   TTFB avg (s)  TTFB p95 (s)  TTFB σ  Fire rate  Correct  Total avg (s)  Throughput (tok/s)  Inter-token (ms)  Cost/call  Buffered
llama4-scout (geo)             bedrock   262/979  0.259         0.460         ±0.10   0%         100%     1.69           174                 6.6                          21
nova-micro (direct)            bedrock   197/979  0.263         0.738         ±0.19   0%         81%      0.97           432                 2.8                          17
llama4-maverick (geo)          bedrock   262/979  0.288         0.529         ±0.16   0%         100%     1.65           197                 6.0                          21
nemotron-nano-12b (direct)     bedrock   524/979  0.375         0.633         ±0.13   0%         85%      1.62           184                 5.7                          204
nova-2-lite (geo profile)      bedrock   589/979  0.399         0.636         ±0.23   0%         86%      1.66           269                 4.4                          97
nemotron-nano-9b (direct)      bedrock   524/979  0.483         0.732         ±0.37   1%         14%      3.22           150                 7.1                          52
llama3.3-70b (geo)             bedrock   197/979  0.500         1.626         ±0.55   1%         98%      2.51           129                 9.8                          16
nova-2-lite                    bedrock   914/979  0.508         0.800         ±0.17   0%         83%      1.72           277                 4.2                          151
deepseek-r1 (geo)              bedrock   197/979  0.573         3.707         ±1.43   8%         0%       3.17           n/a                 2.5                          172
haiku-4.5 (geo profile)        bedrock   784/979  0.826         1.194         ±0.24   0%         100%     2.53           157                 7.9               $0.0023    66
nemotron-nano3-30b (direct)    bedrock   515/979  0.935         0.713         ±6.39   1%         64%      2.52           208                 6.1                          216
nemotron-super3-120b (direct)  bedrock   411/979  0.989         1.942         ±5.04   3%         68%      3.37           143                 8.9                          162
haiku-4.5                      bedrock   979/979  1.166         1.704         ±1.94   1%         100%     2.87           162                 7.5               $0.0023    81
sonnet-4.6                     bedrock   979/979  1.289         2.318         ±1.04   3%         100%     5.74           94                  15.2              $0.0072    78
opus-4.7                       bedrock   979/979  1.459         2.331         ±0.44   2%         91%      4.96           182                 10.6              $0.0146    156
opus-4.8                       bedrock   979/979  2.199         5.406         ±1.46   19%        96%      5.69           137                 11.4              $0.0145    168
sonnet-5                       bedrock   978/979  2.630         6.573         ±1.81   31%        73%      5.88           192                 10.7              $0.0089    209
fable-5 (geo)                  bedrock   222/979  3.500         6.269         ±1.32   76%        79%      7.94           163                 42.0              $0.0153    79
deepseek-v3.2 (direct)         bedrock   521/979  3.975         23.566        ±10.21  17%        94%      10.43          55                  23.2                         150
llama3.1-405b (direct)         bedrock   65/979   7.438         28.007        ±7.22   77%        97%      7.45           n/a                 0.0                          65
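The Cost/call column is defined in the glossary as token counts multiplied by per-1M-token rates, averaged over calls. A minimal sketch of that arithmetic follows; the function name, the token counts and the rates are hypothetical and are not taken from the AWS Price List.

```python
from statistics import fmean


def cost_per_call(input_tokens, output_tokens, input_rate_per_m, output_rate_per_m):
    """Dollar cost of one call given per-1M-token input and output rates."""
    return (input_tokens * input_rate_per_m + output_tokens * output_rate_per_m) / 1_000_000


# Hypothetical rates: $1.00 per 1M input tokens, $5.00 per 1M output tokens.
calls = [(1000, 500), (200, 100)]  # (input_tokens, output_tokens) per call
costs = [cost_per_call(i, o, 1.00, 5.00) for i, o in calls]
mean_cost = fmean(costs)
```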

By prompt category

Category (what it tests)      Calls  TTFB avg (s)  Total avg (s)  Throughput (tok/s)
Trivial greeting              1925   1.018         1.19           232
    A bare 'Hi': a pure latency probe, almost no generation.
Factual recall                1925   1.065         1.12           284
    A one-word factual question: a short knowledge lookup.
Math / reasoning              3750   1.293         4.37           185
    Multi-step word problems and arithmetic with working shown.
Code generation & debugging   3750   1.238         6.12           160
    Write a function with tests; explain and fix buggy code.
Creative writing              3750   1.327         3.52           108
    A short noir scene and a haiku: open-ended generation.
Summarisation                 1875   1.405         2.21           185
    Condense a paragraph into a fixed number of bullet points.
Structured output             1875   1.386         2.31           224
    Return strict JSON only; tests format adherence.
Instruction following         1875   1.313         1.49           213
    Exact-format constraints (casing, separators, no extra text).
Long-form generation          1875   1.673         13.54          130
    A thorough multi-section technical explanation: sustained output.
Large-context load            1875   2.040         2.25           271
    A large (~12k token) input the model must read before answering. Exposes queueing/throughput differences that small prompts hide: this is the regime where global vs geo vs in-region latency actually diverges (per empirical Bedrock data).

The actual prompts used

"Fastest model" = the model with the lowest mean TTFB for that specific prompt, across all regions/runs (green cell shows the model + its mean TTFB).

ID: greeting · Category: Trivial greeting · Max tokens: 64 · Fastest model (TTFB): llama4-maverick (geo) 0.24s
Prompt: Hi

ID: factual_short · Category: Factual recall · Max tokens: 32 · Fastest model (TTFB): sonnet-5 2.10s
Prompt: What is the capital of Australia? Answer in one word.

ID: reasoning_math · Category: Math / reasoning · Max tokens: 600 · Fastest model (TTFB): nova-micro 0.24s
Prompt: A train leaves City A at 9:00am travelling at 60 mph. A second train leaves City B, 240 miles away, at 10:00am travelling at 80 mph toward City A. At what time do they meet? Show your working.

ID: math_arithmetic · Category: Math / reasoning · Max tokens: 300 · Fastest model (TTFB): llama4-scout (geo) 0.24s
Prompt: Compute 47 * 89 + 1337 - 256 / 4. Show each step.

ID: code_generation · Category: Code generation & debugging · Max tokens: 800 · Fastest model (TTFB): llama4-scout (geo) 0.23s
Prompt: Write a Python function `merge_intervals(intervals)` that merges overlapping intervals given as a list of [start, end] pairs. Include a docstring and a couple of inline tests.

ID: code_debug · Category: Code generation & debugging · Max tokens: 500 · Fastest model (TTFB): llama4-scout (geo) 0.25s
Prompt: This Python is buggy: def avg(xs): return sum(xs) / len(xs) Explain the failure modes and give a corrected version.

ID: creative_story · Category: Creative writing · Max tokens: 500 · Fastest model (TTFB): llama4-scout (geo) 0.22s
Prompt: Write a 150-word noir-style opening scene about a detective who only investigates crimes that happen in libraries.

ID: creative_poem · Category: Creative writing · Max tokens: 128 · Fastest model (TTFB): nova-micro 0.19s
Prompt: Write a haiku about distributed systems failing gracefully.

ID: summarisation · Category: Summarisation · Max tokens: 300 · Fastest model (TTFB): nova-micro 0.20s
Prompt: Summarise the following in exactly three bullet points: Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models from leading AI companies via a single API, along with capabilities to build generative AI applicatio…

ID: structured_json · Category: Structured output · Max tokens: 300 · Fastest model (TTFB): nova-micro 0.19s
Prompt: Return ONLY valid JSON (no prose) describing three programming languages, each with keys: name, year_created, paradigm.

ID: instruction_following · Category: Instruction following · Max tokens: 128 · Fastest model (TTFB): nova-micro 0.27s
Prompt: List the planets of the solar system in order from the sun. Output them comma-separated on a single line, all lowercase, no other text.

ID: long_generation · Category: Long-form generation · Max tokens: 1200 · Fastest model (TTFB): nova-micro 0.19s
Prompt: Explain how TCP congestion control works, covering slow start, congestion avoidance, fast retransmit and fast recovery. Aim for a thorough, well-structured explanation.

ID: large_context · Category: Large-context load · Max tokens: 80 · Fastest model (TTFB): llama4-scout (geo) 0.46s
Prompt: Section 1. In the distributed ledger subsystem, node 1 maintains a replicated log with quorum size 4 and a heartbeat interval of 57 milliseconds; its committed offset is 1007 and its term number is 2. When a partition heals, node 1 reconciles by comparing v…
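Several of the prompts above have deterministic answers, which the glossary says are graded by exact string/JSON checks with no LLM judge. The sketch below shows what such checks could look like for three of them. The function names and the exact acceptance rules (whitespace trimming, a top-level JSON list of three objects) are assumptions, not the report's actual grader.

```python
import json

REQUIRED_KEYS = {"name", "year_created", "paradigm"}  # keys named in the structured_json prompt
PLANETS = ["mercury", "venus", "earth", "mars", "jupiter", "saturn", "uranus", "neptune"]


def grade_factual(reply):
    """factual_short: the one-word answer must be exactly 'Canberra'."""
    return reply.strip() == "Canberra"


def grade_planets(reply):
    """instruction_following: lowercase, comma-separated, in order, nothing else."""
    return [part.strip() for part in reply.strip().split(",")] == PLANETS


def grade_structured_json(reply):
    """structured_json: valid JSON only. Assumed shape: a list of three objects."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return False  # prose around the JSON makes the whole reply fail
    return (
        isinstance(data, list)
        and len(data) == 3
        and all(isinstance(item, dict) and REQUIRED_KEYS.issubset(item) for item in data)
    )


good_json = json.dumps(
    [{"name": n, "year_created": y, "paradigm": p}
     for n, y, p in [("C", 1972, "procedural"), ("Haskell", 1990, "functional"), ("Python", 1991, "multi-paradigm")]]
)
```

Strict checks like these penalise a reply that is right in substance but wrapped in extra text, which matches the format-adherence intent of these prompts.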

By region — routing / geo latency, like-for-like

Fair comparison: only the 5 models that ran successfully in every region are pooled here, so a region isn't flattered or penalised by which models happened to be available. Lower TTFB is better.

Region          Geo  Calls  TTFB avg (s)  Throughput (tok/s)
us-east-2       US   325    1.639         166
ap-northeast-1  APJ  325    1.666         144
ap-northeast-2  APJ  325    1.679         151
eu-west-1       EU   335    1.698         157
eu-north-1      EU   325    1.729         154
ap-southeast-2  APJ  325    1.734         149
us-west-1       US   325    1.740         143
ap-southeast-1  APJ  325    1.743         151
ap-northeast-3  APJ  325    1.745         152
eu-west-2       EU   325    1.745         163
ap-south-1      APJ  324    1.750         148
us-west-2       US   325    1.765         148
us-east-1       US   335    1.779         151
eu-west-3       EU   325    1.876         159
eu-central-1    EU   325    1.939         141
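The like-for-like pooling behind this table (only models that succeeded in every region are counted) can be sketched as a set comparison followed by a per-region mean. The record shape and function name are hypothetical, and the sketch assumes every region of interest has at least one successful call.

```python
from collections import defaultdict
from statistics import fmean


def pooled_ttfb_by_region(records):
    """records: iterable of (model, region, ttfb_seconds) for successful calls.

    Returns (models present in every region, mean TTFB per region over those models).
    """
    records = list(records)
    regions_by_model = defaultdict(set)
    for model, region, _ in records:
        regions_by_model[model].add(region)
    all_regions = set().union(*regions_by_model.values()) if regions_by_model else set()
    common = {m for m, seen in regions_by_model.items() if seen == all_regions}

    samples = defaultdict(list)
    for model, region, ttfb in records:
        if model in common:  # drop models missing from any region
            samples[region].append(ttfb)
    return common, {region: fmean(values) for region, values in samples.items()}


# Made-up data: model "b" is very fast but only ran in r1, so it is excluded.
records = [
    ("a", "r1", 1.0), ("a", "r2", 2.0),
    ("b", "r1", 0.1),
    ("c", "r1", 3.0), ("c", "r2", 4.0),
]
common_models, region_means = pooled_ttfb_by_region(records)
```

Without the filter, the fast model that only ran in r1 would pull that region's mean down and make it look better than r2 for reasons unrelated to routing.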

Glossary — what the metrics mean

TTFB — time to first token (s)
Wall-clock from request to first streamed token: the "is it responding yet" latency. Good <1s, poor >3s. Covers network round-trip, queueing, inference-profile routing and prompt prefill.
TTFB p95 (s)
95th-percentile TTFB — the slow tail (~1 call in 20). A p95 far above the average means inconsistent latency (cold starts, contention).
TTFB σ (std dev)
Spread of TTFB across this model's calls. Low = predictable; high = jittery.
Fire rate
% of calls whose TTFB breached a per-category threshold (trivial 2s … large-context 6s). Borrowed from months of empirical Bedrock sampling: a single number capturing both latency and reliability — a model can have a decent average but a nasty tail, and fire rate catches that where the mean hides it. Lower is better.
Correct
Auto-graded answer correctness on the deterministic prompts — the factual question (Canberra), the exact-format planet list, strict-JSON output, the value planted in the large-context document, and the two maths problems. Graded by exact string/JSON checks, no LLM judge. Creative/summarisation prompts have no deterministic answer and aren't counted. This turns "fastest" into "fastest while actually right". Higher is better.
Cost/call
Mean $ per call: exact input/output token counts × per-1M-token rates. Rates come from the live AWS Price List API, fetched per region and routing scope (global/geo/in-region rows are priced separately, and global endpoints are often cheaper), with published Anthropic pricing as the fallback for models not yet in the Price List. Refresh rates with python pricing.py --refresh. Lower is better.
Total avg (s)
Average full wall-clock per call. Depends on output length, so compare within a category. This is the metric the use-case rankings use (except chat).
Throughput — output tokens / second
Sustained generation speed during streaming. Good >150, sluggish <60. What makes a long answer finish fast.
Inter-token latency (ms)
Mean gap between output tokens — inverse of throughput. Low & steady = smooth.
Buffered (count)
Calls where the server didn't truly stream — it withheld then dumped the reply in ≤2 chunks (or at an impossible >1000 tok/s). For those, throughput and inter-token latency are not measurable and are excluded from averages. A high count means "streaming metrics N/A for this model" — its TTFB and total time are still valid. Neither good nor bad for quality.
OK (n)
Successful calls / attempted — the sample size behind every average in that row. With --repeat N each combo runs N times and all are counted.
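To make the fire rate definition above concrete, here is a minimal sketch. Only the two thresholds the glossary states (trivial 2s, large-context 6s) are included; the category keys, the function name and the strict "greater than" comparison are assumptions, and the sample latencies are made up.

```python
from statistics import fmean

# Only the two thresholds given in the glossary; the others are not stated.
THRESHOLDS_S = {"trivial": 2.0, "large_context": 6.0}


def fire_rate(calls, thresholds_s=THRESHOLDS_S):
    """calls: iterable of (category, ttfb_seconds). Returns percent of calls that fired."""
    calls = list(calls)
    if not calls:
        return 0.0
    fired = sum(1 for category, ttfb in calls if ttfb > thresholds_s[category])
    return 100.0 * fired / len(calls)


# Made-up sample: nine fast calls and one slow outlier.
trivial_calls = [("trivial", 0.3)] * 9 + [("trivial", 5.0)]
mean_ttfb = fmean(ttfb for _, ttfb in trivial_calls)
rate = fire_rate(trivial_calls)
```

In this sample the mean TTFB stays under one second while one call in ten breaches the threshold, which is the "decent average, nasty tail" case the metric is meant to catch.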

How the tests were run — methodology

Method

Every enabled model is attempted in every enabled region (a model×region cross-product) against the full prompt suite. Each prompt has a max_tokens cap so short tasks stay cheap and long tasks aren't truncated. Averages are over all matching calls; with --repeat N each combo is measured N times and every result is recorded (not best-of), so the sample counts shown are real.
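The model×region cross-product with repeats can be sketched with `itertools.product`. This is an illustration of the enumeration only, not the benchmark's runner: `plan_calls` is a hypothetical name, and the short model, region and prompt lists are a subset picked from this report.

```python
from itertools import product


def plan_calls(models, regions, prompts, repeat=1):
    """Yield one (model, region, prompt_id, max_tokens, attempt) tuple per call.

    prompts maps prompt ID to its max_tokens cap. Every attempt is yielded,
    so all results can be recorded rather than keeping a best-of.
    """
    for model, region, (prompt_id, max_tokens) in product(models, regions, prompts.items()):
        for attempt in range(repeat):
            yield model, region, prompt_id, max_tokens, attempt


models = ["haiku-4.5", "nova-2-lite"]
regions = ["us-east-1", "eu-west-1"]
prompts = {"greeting": 64, "long_generation": 1200}
plan = list(plan_calls(models, regions, prompts, repeat=3))
```

With 2 models, 2 regions, 2 prompts and 3 repeats, the plan holds 24 calls, which is why sample counts grow quickly as models and regions are enabled.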

Caveat: a single run is a snapshot — latencies vary with time of day, load and cold starts. Use --repeat and compare runs over time for stable figures.