Bedrock + Mantle Benchmark Report

Scope: all runs (full history)

92241 calls · 50133 ok · 42108 failed/gaps · 25 models · 15 regions · 26 runs in DB profile

Window: 30 Jun 13:38 → 1 Jul 13:50. All figures are averages over every matching call (sample counts shown per row / in cell tooltips).

Recommendations — best model per use-case

This measures speed, not answer quality. Rankings reward whichever model returns tokens fastest — usually the smallest. Use them to pick the fastest model at an acceptable quality tier, paired with your own quality judgement (e.g. Opus/Sonnet for hard work, a small model for high-volume simple turns). Chat ranks by time-to-first-token; the rest by total wall-clock time.

Chat / low-latency

Short interactive turns. Dominated by TTFB — how fast the first token arrives.

  1. 🥇 grok-4.3 · 0.00s first token
  2. 🥈 deepseek-r1 geo · 0.09s first token
  3. 🥉 llama4-maverick geo · 0.38s first token

Coding

Code generation and structured output. Ranked by total time to a complete answer — the real wait, folding in first-token latency and generation speed.

  1. 🥇 nova-micro · 1.92s total · 508 tok/s
  2. 🥈 llama4-maverick geo · 2.19s total · 237 tok/s
  3. 🥉 llama4-scout geo · 2.38s total · 248 tok/s

Creative / long-form

Essays, stories, summaries. Ranked by total time to finish: a model with blistering throughput but a slow first token is not actually fast here.

  1. 🥇 nova-micro · 1.88s total · 299 tok/s
  2. 🥈 nemotron-nano-12b · 2.51s total · 198 tok/s
  3. 🥉 nemotron-nano3-30b · 2.71s total · 215 tok/s

Reasoning / math

Multi-step reasoning. Total time to a complete answer.

  1. 🥇 nova-micro · 1.58s total · 575 tok/s
  2. 🥈 llama4-maverick geo · 1.77s total · 190 tok/s
  3. 🥉 llama4-scout geo · 1.96s total · 190 tok/s
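
The ranking rule described above (chat by time-to-first-token, every other use-case by total wall-clock time) can be sketched roughly as follows. This is an illustrative sketch, not the report's actual code: the `ModelStats` class, the `rank` function and the sample figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ModelStats:
    name: str
    ttfb_s: float   # mean time to first token, in seconds
    total_s: float  # mean total wall-clock time, in seconds


def rank(stats, use_case):
    """Return model names, fastest first, for the given use-case."""
    # Chat is ranked by time-to-first-token; everything else by total time.
    if use_case == "chat":
        key = lambda s: s.ttfb_s
    else:
        key = lambda s: s.total_s
    return [s.name for s in sorted(stats, key=key)]


# Hypothetical figures, for illustration only.
sample = [
    ModelStats("model-a", ttfb_s=0.9, total_s=1.9),
    ModelStats("model-b", ttfb_s=0.4, total_s=2.2),
    ModelStats("model-c", ttfb_s=0.6, total_s=2.4),
]

print(rank(sample, "chat"))    # ['model-b', 'model-c', 'model-a']
print(rank(sample, "coding"))  # ['model-a', 'model-b', 'model-c']
```

Note how the same three models swap places depending on the metric: the model with the slowest first token can still finish first overall.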

Speed: time to first token (TTFB)

Time until the first token arrives — the latency you feel in a chat box. Lower is better. Bars green→red = best→worst.

Throughput: output tokens / second

Sustained generation speed once streaming starts. Higher is better. Buffered (non-streaming) models excluded — their throughput isn't client-measurable.

Latency by region — pick a model

Same model, different regional endpoint = pure routing/geo latency. Pick a model to see its TTFB in every region it's available, sorted fastest first (green→red). Lower is better. Full heat-matrix for all models is beneath.

Heat-matrix — mean TTFB (s), every model in every region

Legend: fast / mid / slow · "·" = not available/run · hover a cell for sample count
Model  ap-northeast-1  ap-northeast-2  ap-northeast-3  ap-south-1  ap-southeast-1  ap-southeast-2  eu-central-1  eu-north-1  eu-west-1  eu-west-2  eu-west-3  us-east-1  us-east-2  us-west-1  us-west-2
llama4-maverick (geo)  0.38  0.37  0.44  0.58
nemotron-nano3-30b (direct)  0.68  0.59  0.62  0.45  0.47  0.52  0.47  0.53
deepseek-r1 (geo)  0.57  0.60  0.59
nemotron-nano-12b (direct)  1.03  0.58  1.09  0.39  0.61  0.46  0.85  0.56
nemotron-nano-9b (direct)  0.68  0.80  0.94  0.46  0.65  0.54  0.64  1.25
llama3.3-70b (geo)  1.06  0.60  0.92
gpt-oss-20b  1.27  1.07  1.44  0.56  0.72  0.59  0.53  1.15  0.96  1.02
nova-micro (direct)  0.88  0.78  1.23
llama4-scout (geo)  0.35  0.78  2.34  0.40
grok-4.3  0.96  1.00
nemotron-super3-120b (direct)  0.77  2.05  0.53  0.56  2.12  0.51  0.64
nova-2-lite (geo profile)  2.41  0.70  1.06  1.13  0.61  1.75  1.14  1.46  2.21
sonnet-4.6  1.46  1.51  1.63  1.48  1.88  1.55  1.22  1.43  1.41  1.17  1.27  1.43  1.30  1.30  1.31
haiku-4.5  1.81  1.69  1.64  1.59  1.64  1.75  1.17  1.34  1.16  1.19  1.24  1.37  1.28  1.43  1.38
opus-4.7  1.66  1.72  1.66  1.63  2.01  1.55  1.54  1.82  1.56  1.57  1.68  1.57  1.58  1.59  1.58
nova-2-lite  2.88  2.50  2.30  2.13  1.90  1.29  1.82  1.16  1.09  1.10  1.86  1.66  2.16  1.37
gpt-5.5  1.93  1.70
deepseek-v3.2 (direct)  1.52  0.71  1.33  1.08  6.31  1.46  2.26  0.81
gpt-oss-120b  1.87  2.48  2.83  1.71  1.33  1.14  1.07  4.10  1.57  2.50
opus-4.8  2.48  2.30  2.67  2.49  2.18  2.31  1.79  1.90  1.96  1.71  1.83  1.97  2.60  2.17  2.08
gpt-5.4  1.68  3.78  1.01
haiku-4.5 (geo profile)  2.48  3.08  1.98  1.79  2.13  1.24  1.27  2.69  2.82  1.98  3.82  2.01
sonnet-5  2.75  2.62  2.62  2.89  2.33  3.00  2.68  2.45  2.79  2.50  2.50  2.39  2.39  2.73  2.87
fable-5 (geo)  3.28  3.43  3.50  3.37
llama3.1-405b (direct)  9.29

Trend over time

TTFB per run for the fastest models: solid line = median (p50), shaded band = p5–p95 spread. A widening band means the tail is degrading even if the median looks fine, which is the same signal that fire rate captures. Each point is one benchmark run.
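
As a rough illustration of how the median and the p5–p95 band for one run could be derived, here is a minimal sketch using Python's standard library. The `run_band` helper, the choice of the "inclusive" quantile method and the sample values are assumptions, not taken from the report's code.

```python
import statistics


def run_band(ttfb_samples):
    """Return (p5, p50, p95) for one run's TTFB samples, in seconds."""
    # quantiles(n=20) returns 19 cut points at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(ttfb_samples, n=20, method="inclusive")
    return cuts[0], statistics.median(ttfb_samples), cuts[-1]


# Hypothetical runs: identical medians, but the second has a heavy tail.
steady = [0.40, 0.42, 0.45, 0.45, 0.47, 0.50]
degraded = [0.40, 0.42, 0.45, 0.45, 0.47, 3.00]

for label, samples in (("steady", steady), ("degraded", degraded)):
    p5, p50, p95 = run_band(samples)
    print(f"{label}: p50={p50:.2f}s  band width={p95 - p5:.2f}s")
```

Both runs report the same median, yet the band of the second is far wider, which is exactly the case the shaded band is meant to expose.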

Global vs Geo vs In-region — the cost of the router

These three are not interchangeable. in-region = model served in the region you called (true residency). geo = a regional inference profile (us./eu./…) routing within a geography. global = the cross-region router. Empirically (months of 5-min sampling) global is a latency tax, not a win — it queues rather than rerouting to free capacity, so it runs slower than direct profiles and degrades hardest under load. Compared like-for-like over base models present in more than one scope: haiku-4.5, nova-2-lite. Lower is better.

Scope    Calls   TTFB avg (s)   TTFB p95 (s)   Fire rate
geo      6855    1.895          2.252          4%
global   9457    1.617          2.204          4%

Scope × region — mean TTFB (s)

Same comparison broken out by region. Hover a cell for call count and fire count.

Scope  ap-northeast-1  ap-northeast-2  ap-northeast-3  ap-south-1  ap-southeast-1  ap-southeast-2  eu-central-1  eu-north-1  eu-west-1  eu-west-2  eu-west-3  us-east-1  us-east-2  us-west-1  us-west-2
geo  2.45  ·  3.08  ·  ·  1.98  1.24  1.60  1.19  1.27  1.65  2.29  1.56  2.65  2.11
global  2.35  2.10  1.64  1.94  1.88  1.82  1.23  1.58  1.16  1.14  1.17  1.61  1.47  1.80  1.38

Time of day — UTC hour (24 hour-buckets, 50133 calls)

Mean TTFB and fire rate by UTC hour, across all runs. Bedrock capacity shifts with the clock — empirically the EU window (UTC ~15:00–03:00) runs hotter than the US window. This fills in as scheduled runs (00/06/12/18) accumulate; with runs clustered in one window it'll look flat. Lower is better.
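
A minimal sketch of the hour bucketing described above, assuming each call is stored with a Unix timestamp. The `ttfb_by_utc_hour` helper and its input shape are hypothetical, and only the mean TTFB is computed here.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean


def ttfb_by_utc_hour(calls):
    """calls: iterable of (unix_timestamp, ttfb_s). Returns {utc_hour: mean TTFB}."""
    buckets = defaultdict(list)
    for ts, ttfb in calls:
        # Convert explicitly to UTC so the local timezone cannot shift the bucket.
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        buckets[hour].append(ttfb)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


# Hypothetical calls: two at 00:xx UTC, one at 06:00 UTC.
calls = [(0, 1.0), (1800, 2.0), (6 * 3600, 0.5)]
by_hour = ttfb_by_utc_hour(calls)
print(by_hour)  # {0: 1.5, 6: 0.5}
```

Hours with no calls are simply absent from the result, which matches the note that the view only fills in as scheduled runs accumulate.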

Per-model detail

best mid worst — cells heat-graded within each column. OK = calls succeeded / attempted (the sample size).
Model                          Provider          OK (n)     TTFB avg (s)  TTFB p95 (s)  TTFB σ   Fire rate  Correct  Total avg (s)  Throughput (tok/s)  Inter-token (ms)  Cost/call  Buffered
llama4-maverick (geo)          bedrock           1256/4905  0.441         0.604         ±1.71    0%         100%     1.85           194                 6.2               $0.0004    105
nemotron-nano3-30b (direct)    bedrock           728/1365   0.542         0.752         ±0.17    0%         64%      2.03           215                 5.8               $0.0001    305
deepseek-r1 (geo)              bedrock           844/4143   0.587         3.400         ±1.28    8%         0%       2.96           n/a                 2.4               $0.0027    731
nemotron-nano-12b (direct)     bedrock           728/1365   0.697         0.837         ±1.74    1%         87%      2.14           180                 6.5               $0.0004    284
nemotron-nano-9b (direct)      bedrock           728/1365   0.745         0.892         ±2.55    1%         16%      4.25           145                 9.5               $0.0003    64
llama3.3-70b (geo)             bedrock           981/4905   0.860         2.368         ±3.41    3%         97%      2.69           139                 9.1               $0.0006    83
gpt-oss-20b                    mantle            2494/3788  0.931         1.406         ±1.31    0%         61%      2.29           408                 3.7               $0.0002    866
nova-micro (direct)            bedrock           968/4905   0.963         1.212         ±4.95    2%         86%      1.67           421                 3.0               $0.0001    93
llama4-scout (geo)             bedrock           1278/4905  0.971         0.617         ±6.08    1%         100%     2.30           197                 5.9               $0.0003    104
grok-4.3                       mantle            197/1723   0.985         3.905         ±1.66    7%         ·        2.88           n/a                 2.3               $0.0015    162
nemotron-super3-120b (direct)  bedrock           624/1340   1.036         1.708         ±3.31    4%         69%      3.03           155                 7.1               $0.0004    235
nova-2-lite (geo profile)      bedrock           2936/4905  1.386         2.861         ±5.94    4%         85%      2.73           253                 4.7               $0.0011    518
sonnet-4.6                     bedrock           4890/4905  1.424         2.573         ±1.58    4%         99%      6.00           86                  15.7              $0.0062    406
haiku-4.5                      bedrock           4901/4905  1.445         2.070         ±0.98    2%         100%     3.24           157                 8.2               $0.0019    420
opus-4.7                       bedrock           4889/4905  1.647         2.514         ±1.59    3%         92%      5.27           179                 10.3              $0.0123    859
nova-2-lite                    bedrock           4556/4905  1.802         2.956         ±6.74    5%         82%      3.09           262                 4.6               $0.0011    796
gpt-5.5                        mantle_responses  421/3345   1.815         4.121         ±2.37    18%        ·        5.56           176                 11.4              ·          145
deepseek-v3.2 (direct)         bedrock           2387/4544  1.936         8.316         ±5.10    10%        96%      7.77           63                  21.3              $0.0010    649
gpt-oss-120b                   mantle            2646/3970  2.066         5.816         ±2.41    16%        61%      3.59           344                 3.8               $0.0003    651
opus-4.8                       bedrock           4884/4905  2.164         4.399         ±2.71    15%        93%      5.62           140                 10.5              $0.0122    923
gpt-5.4                        mantle_responses  728/3508   2.180         10.458        ±4.09    13%        100%     9.73           58                  78.5              ·          68
haiku-4.5 (geo profile)        bedrock           3919/4905  2.276         1.844         ±7.79    4%         100%     4.03           159                 7.7               $0.0019    340
sonnet-5                       bedrock           1559/1560  2.634         6.019         ±2.46    29%        73%      5.82           189                 10.7              $0.0089    375
fable-5 (geo)                  bedrock           264/1365   3.395         5.534         ±1.16    71%        84%      7.62           165                 55.4              $0.0154    101
llama3.1-405b (direct)         bedrock           327/4905   9.292         33.249        ±11.92   79%        97%      9.37           n/a                 2.1               $0.0017    288

By prompt category

Category / what it tests       Calls   TTFB avg (s)  Total avg (s)  Throughput (tok/s)
Trivial greeting               7335    1.620         1.84           229
  A bare 'Hi': pure latency probe, almost no generation.
Factual recall                 7335    1.591         1.68           253
  A one-word factual question: short, knowledge lookup.
Math / reasoning               14662   1.749         4.59           201
  Multi-step word problems and arithmetic with working shown.
Code generation & debugging    14653   1.782         6.46           181
  Write a function with tests; explain and fix buggy code.
Creative writing               14660   1.703         3.79           127
  A short noir scene and a haiku: open-ended generation.
Summarisation                  7328    1.580         2.30           213
  Condense a paragraph into a fixed number of bullet points.
Structured output              7327    1.487         2.35           256
  Return strict JSON only: tests format adherence.
Instruction following          7325    1.313         1.53           220
  Exact-format constraints (casing, separators, no extra text).
Long-form generation           7311    1.623         13.22          141
  A thorough multi-section technical explanation: sustained output.
Large-context load             4305    1.982         2.22           260
  A large (~12k token) input the model must read before answering. Exposes queueing/throughput differences that small prompts hide: the regime where global vs geo vs in-region latency actually diverges (per empirical Bedrock data).

The actual prompts used

"Fastest model" = the model with the lowest mean TTFB for that specific prompt, across all regions/runs (green cell shows the model + its mean TTFB).

Category · ID · Max tokens · Fastest model (TTFB)
  Prompt

Trivial greeting · greeting · 64 · llama4-maverick geo 0.38s
  Hi
Factual recall · factual_short · 32 · gpt-oss-20b 0.97s
  What is the capital of Australia? Answer in one word.
Math / reasoning · reasoning_math · 600 · llama4-maverick geo 0.38s
  A train leaves City A at 9:00am travelling at 60 mph. A second train leaves City B, 240 miles away, at 10:00am travelling at 80 mph toward City A. At what time do they meet? Show your working.
Math / reasoning · math_arithmetic · 300 · llama4-maverick geo 0.38s
  Compute 47 * 89 + 1337 - 256 / 4. Show each step.
Code generation & debugging · code_generation · 800 · llama4-maverick geo 0.36s
  Write a Python function `merge_intervals(intervals)` that merges overlapping intervals given as a list of [start, end] pairs. Include a docstring and a couple of inline tests.
Code generation & debugging · code_debug · 500 · llama4-maverick geo 0.37s
  This Python is buggy: def avg(xs): return sum(xs) / len(xs) Explain the failure modes and give a corrected version.
Creative writing · creative_story · 500 · llama4-maverick geo 0.37s
  Write a 150-word noir-style opening scene about a detective who only investigates crimes that happen in libraries.
Creative writing · creative_poem · 128 · llama4-maverick geo 0.37s
  Write a haiku about distributed systems failing gracefully.
Summarisation · summarisation · 300 · llama4-maverick geo 0.39s
  Summarise the following in exactly three bullet points: Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models from leading AI companies via a single API, along with capabilities to build generative AI applicatio…
Structured output · structured_json · 300 · nemotron-nano-12b 0.53s
  Return ONLY valid JSON (no prose) describing three programming languages, each with keys: name, year_created, paradigm.
Instruction following · instruction_following · 128 · llama4-scout geo 0.37s
  List the planets of the solar system in order from the sun. Output them comma-separated on a single line, all lowercase, no other text.
Long-form generation · long_generation · 1200 · llama4-maverick geo 0.41s
  Explain how TCP congestion control works, covering slow start, congestion avoidance, fast retransmit and fast recovery. Aim for a thorough, well-structured explanation.
Large-context load · large_context · 80 · llama4-maverick geo 0.64s
  Section 1. In the distributed ledger subsystem, node 1 maintains a replicated log with quorum size 4 and a heartbeat interval of 57 milliseconds; its committed offset is 1007 and its term number is 2. When a partition heals, node 1 reconciles by comparing v…

By region — routing / geo latency, like-for-like

Fair comparison: only the 5 models that ran successfully in every region are pooled here, so a region isn't flattered or penalised by which models happened to be available. Lower TTFB is better.

Region          Geo  Calls  TTFB avg (s)  Throughput (tok/s)
eu-west-2       EU   1409   1.490         145
eu-central-1    EU   1410   1.521         144
eu-west-3       EU   1410   1.578         146
eu-west-1       EU   1408   1.618         142
us-east-1       US   1404   1.641         141
us-west-2       US   1412   1.681         141
eu-north-1      EU   1407   1.685         139
us-west-1       US   1412   1.703         146
us-east-2       US   1411   1.745         145
ap-northeast-2  APJ  1405   1.867         145
ap-southeast-2  APJ  1411   1.878         143
ap-south-1      APJ  1404   1.879         139
ap-northeast-1  APJ  1410   1.922         144
ap-northeast-3  APJ  1405   1.953         140
ap-southeast-1  APJ  1405   1.956         142
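
The like-for-like pooling used for this table (keep only models that succeeded in every region, then average per region) could look something like the sketch below. The `pooled_region_ttfb` function, the tuple layout and the sample calls are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def pooled_region_ttfb(calls, regions):
    """calls: list of (model, region, ttfb_s) for successful calls.

    Returns (models present in every region, {region: mean TTFB over those models}).
    """
    seen = defaultdict(set)
    for model, region, _ in calls:
        seen[model].add(region)
    # Keep only models that ran successfully in every region.
    wanted = set(regions)
    common = {m for m, rs in seen.items() if rs >= wanted}
    per_region = defaultdict(list)
    for model, region, ttfb in calls:
        if model in common:
            per_region[region].append(ttfb)
    return common, {r: mean(v) for r, v in per_region.items()}


# Hypothetical data: model-b ran in only one region, so it is excluded.
calls = [
    ("model-a", "eu-west-2", 1.0),
    ("model-a", "us-east-1", 2.0),
    ("model-b", "eu-west-2", 0.2),
]
common, table = pooled_region_ttfb(calls, ["eu-west-2", "us-east-1"])
print(common)  # {'model-a'}
print(table)   # {'eu-west-2': 1.0, 'us-east-1': 2.0}
```

Without the filter, eu-west-2 would average 0.6s purely because a fast model happened to be available only there, which is the distortion the pooling avoids.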

Glossary — what the metrics mean

TTFB — time to first token (s)
Wall-clock from request to first streamed token: the "is it responding yet" latency. Good <1s, poor >3s. Covers network round-trip, queueing, inference-profile routing and prompt prefill.
TTFB p95 (s)
95th-percentile TTFB — the slow tail (~1 call in 20). A p95 far above the average means inconsistent latency (cold starts, contention).
TTFB σ (std dev)
Spread of TTFB across this model's calls. Low = predictable; high = jittery.
Fire rate
% of calls whose TTFB breached a per-category threshold (trivial 2s … large-context 6s). Borrowed from months of empirical Bedrock sampling: a single number capturing both latency and reliability — a model can have a decent average but a nasty tail, and fire rate catches that where the mean hides it. Lower is better.
Correct
Auto-graded answer correctness on the deterministic prompts — the factual question (Canberra), the exact-format planet list, strict-JSON output, the value planted in the large-context document, and the two maths problems. Graded by exact string/JSON checks, no LLM judge. Creative/summarisation prompts have no deterministic answer and aren't counted. This turns "fastest" into "fastest while actually right". Higher is better.
Cost/call
Mean $ per call: exact input/output token counts × per-1M-token rates. Rates come from the live AWS Price List API (fetched per region and routing scope: global, geo and in-region rows are priced separately, and global endpoints are often cheaper), with published Anthropic pricing as the fallback for models not yet in the Price List. Refresh rates with python pricing.py --refresh. Lower is better.
Total avg (s)
Average full wall-clock per call. Depends on output length, so compare within a category. This is the metric the use-case rankings use (except chat).
Throughput — output tokens / second
Sustained generation speed during streaming. Good >150, sluggish <60. What makes a long answer finish fast.
Inter-token latency (ms)
Mean gap between output tokens — inverse of throughput. Low & steady = smooth.
Buffered (count)
Calls where the server didn't truly stream — it withheld then dumped the reply in ≤2 chunks (or at an impossible >1000 tok/s). For those, throughput and inter-token latency are not measurable and are excluded from averages. A high count means "streaming metrics N/A for this model" — its TTFB and total time are still valid. Neither good nor bad for quality.
OK (n)
Successful calls / attempted — the sample size behind every average in that row. With --repeat N each combo runs N times and all are counted.
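
To make the glossary definitions concrete, here is a hedged sketch of how the per-model TTFB statistics, fire rate, buffered check and cost per call might be computed. All function names and category keys are hypothetical. Only the two thresholds named above (trivial 2s, large-context 6s) are included, and the choice of population standard deviation and the "inclusive" quantile method are assumptions.

```python
import statistics

# Per-category TTFB thresholds in seconds. The glossary names only these two
# endpoints; thresholds for the other categories are not given, so they are
# left out here (a call in another category would raise KeyError).
THRESHOLDS = {"trivial": 2.0, "large_context": 6.0}


def summarise(calls):
    """calls: list of (category, ttfb_s) tuples for one model."""
    ttfbs = [t for _, t in calls]
    fired = sum(1 for cat, t in calls if t > THRESHOLDS[cat])
    return {
        "ttfb_avg": statistics.mean(ttfbs),
        # Last of the 19 cut points from quantiles(n=20) is the 95th percentile.
        "ttfb_p95": statistics.quantiles(ttfbs, n=20, method="inclusive")[-1],
        "ttfb_sigma": statistics.pstdev(ttfbs),
        "fire_rate": fired / len(calls),
    }


def is_buffered(chunk_count, output_tokens, stream_seconds):
    """Buffered: reply arrived in <=2 chunks, or at an impossible >1000 tok/s."""
    if chunk_count <= 2:
        return True
    return stream_seconds > 0 and output_tokens / stream_seconds > 1000


def cost_per_call(input_tokens, output_tokens, in_rate_per_1m, out_rate_per_1m):
    """Token counts times per-1M-token rates, in dollars."""
    return (input_tokens * in_rate_per_1m + output_tokens * out_rate_per_1m) / 1_000_000


# Hypothetical calls and rates, for illustration only.
calls = [("trivial", 0.5), ("trivial", 0.6), ("trivial", 2.5), ("large_context", 1.9)]
summary = summarise(calls)
print(summary["ttfb_avg"], summary["fire_rate"])  # 1.375 0.25
print(is_buffered(2, 300, 1.5), is_buffered(40, 300, 1.5))
print(cost_per_call(1000, 500, 1.0, 5.0))
```

In the sample, the 1.9s large-context call does not fire because its threshold is 6s, while the 2.5s trivial call does: the same latency can be fine for one category and a breach for another.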

How the tests were run — methodology

Method

Every enabled model is attempted in every enabled region (a model×region cross-product) against the full prompt suite. Each prompt has a max_tokens cap so short tasks stay cheap and long tasks aren't truncated. Averages are over all matching calls; with --repeat N each combo is measured N times and every result is recorded (not best-of), so the sample counts shown are real.

Caveat: a single run is a snapshot — latencies vary with time of day, load and cold starts. Use --repeat and compare runs over time for stable figures.
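
The model×region cross-product with repeats described under Method can be sketched as below. This is an assumed shape, not the benchmark's actual scheduler: `plan_calls` and the model names are hypothetical, while the prompt IDs and max_tokens caps are taken from the prompt table above.

```python
from itertools import product


def plan_calls(models, regions, prompts, repeat=1):
    """Yield one (model, region, prompt_id, max_tokens, attempt) per planned call."""
    for model, region, (prompt_id, max_tokens) in product(models, regions, prompts.items()):
        # Every repeat is its own recorded call (no best-of selection).
        for attempt in range(repeat):
            yield model, region, prompt_id, max_tokens, attempt


# Prompt IDs and caps from the prompt table; model names are placeholders.
prompts = {"greeting": 64, "factual_short": 32, "long_generation": 1200}
plan = list(plan_calls(["model-a", "model-b"], ["us-east-1", "eu-west-2"], prompts, repeat=3))
print(len(plan))  # 2 models x 2 regions x 3 prompts x 3 repeats = 36
```

Because each attempt is yielded separately, the sample count behind an average grows linearly with the repeat value, which is why the counts shown per row reflect every measurement taken.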