Bedrock + Mantle Benchmark Report

Scope: all runs (full history)

92241 calls · 50133 ok · 42108 failed/gaps · 25 models · 15 regions · 26 runs in DB · profile: —

Window: 30 Jun 13:38 → 1 Jul 13:50. All figures are averages over every matching call (sample counts shown per row / in cell tooltips).

Recommendations — best model per use-case

This measures speed, not answer quality. Rankings reward whichever model returns tokens fastest — usually the smallest. Use them to pick the fastest model at an acceptable quality tier, paired with your own quality judgement (e.g. Opus/Sonnet for hard work, a small model for high-volume simple turns). Chat ranks by time-to-first-token; the rest by total wall-clock time.

Chat / low-latency

Short interactive turns. Dominated by TTFB — how fast the first token arrives.

🥇 grok-4.3 0.00s first token
🥈 deepseek-r1 geo 0.09s first token
🥉 llama4-maverick geo 0.38s first token

Coding

Code generation and structured output. Ranked by total time to a complete answer — the real wait, folding in first-token latency and generation speed.

🥇 nova-micro 1.92s total · 508 tok/s
🥈 llama4-maverick geo 2.19s total · 237 tok/s
🥉 llama4-scout geo 2.38s total · 248 tok/s

Creative / long-form

Essays, stories, summaries. Ranked by total time to finish: a model with blistering throughput but a slow first token is not actually fast here.

🥇 nova-micro 1.88s total · 299 tok/s
🥈 nemotron-nano-12b 2.51s total · 198 tok/s
🥉 nemotron-nano3-30b 2.71s total · 215 tok/s

Reasoning / math

Multi-step reasoning. Total time to a complete answer.

🥇 nova-micro 1.58s total · 575 tok/s
🥈 llama4-maverick geo 1.77s total · 190 tok/s
🥉 llama4-scout geo 1.96s total · 190 tok/s
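
The ranking rule described above (chat by time-to-first-token, every other use-case by total wall-clock time) can be sketched roughly as follows. The function name `top_three`, the row keys and the sample figures are illustrative assumptions, not the report's actual code or data.

```python
# Illustrative rows only: the keys and numbers are made up for this sketch.
ROWS = [
    {"model": "a", "ttfb_s": 0.4, "total_s": 3.0},
    {"model": "b", "ttfb_s": 0.9, "total_s": 1.9},
    {"model": "c", "ttfb_s": 0.6, "total_s": 2.5},
    {"model": "d", "ttfb_s": 1.2, "total_s": 2.0},
]

def top_three(rows, use_case):
    """Return the three fastest model names for a use-case.

    Chat is ranked by time-to-first-token; everything else by total time.
    """
    key = "ttfb_s" if use_case == "chat" else "total_s"
    ranked = sorted(rows, key=lambda row: row[key])
    return [row["model"] for row in ranked[:3]]

print(top_three(ROWS, "chat"))
print(top_three(ROWS, "coding"))
```

Because the sort key is the only thing that changes, a model with a fast first token but slow generation can lead the chat list and still miss the other three.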

Speed: time to first token (TTFB)

Time until the first token arrives — the latency you feel in a chat box. Lower is better. Bars green→red = best→worst.

Throughput: output tokens / second

Sustained generation speed once streaming starts. Higher is better. Buffered (non-streaming) models excluded — their throughput isn't client-measurable.

Latency by region — pick a model

Same model, different regional endpoint = pure routing/geo latency. Pick a model to see its TTFB in every region it's available, sorted fastest first (green→red). Lower is better. Full heat-matrix for all models is beneath.


Heat-matrix — mean TTFB (s), every model in every region

Legend: fast · mid · slow (cell colour) · ✗ = not available/not run · hover a cell for sample count

Model	ap-northeast-1	ap-northeast-2	ap-northeast-3	ap-south-1	ap-southeast-1	ap-southeast-2	eu-central-1	eu-north-1	eu-west-1	eu-west-2	eu-west-3	us-east-1	us-east-2	us-west-1	us-west-2
llama4-maverick (geo)	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	0.38	0.37	0.44	0.58
nemotron-nano3-30b (direct)	0.68	✗	✗	0.59	✗	0.62	✗	✗	0.45	0.47	✗	0.52	0.47	✗	0.53
deepseek-r1 (geo)	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	0.57	0.60	✗	0.59
nemotron-nano-12b (direct)	1.03	✗	✗	0.58	✗	1.09	✗	✗	0.39	0.61	✗	0.46	0.85	✗	0.56
nemotron-nano-9b (direct)	0.68	✗	✗	0.80	✗	0.94	✗	✗	0.46	0.65	✗	0.54	0.64	✗	1.25
llama3.3-70b (geo)	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	1.06	0.60	✗	0.92
gpt-oss-20b	1.27	✗	✗	1.07	✗	1.44	0.56	0.72	0.59	0.53	✗	1.15	0.96	✗	1.02
nova-micro (direct)	✗	✗	✗	✗	✗	0.88	✗	✗	✗	0.78	✗	1.23	✗	✗	✗
llama4-scout (geo)	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	0.35	0.78	2.34	0.40
grok-4.3	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	0.96	✗	✗	1.00
nemotron-super3-120b (direct)	✗	✗	✗	✗	✗	0.77	2.05	✗	0.53	0.56	✗	2.12	0.51	✗	0.64
nova-2-lite (geo profile)	2.41	✗	✗	✗	✗	✗	0.70	1.06	1.13	✗	0.61	1.75	1.14	1.46	2.21
sonnet-4.6	1.46	1.51	1.63	1.48	1.88	1.55	1.22	1.43	1.41	1.17	1.27	1.43	1.30	1.30	1.31
haiku-4.5	1.81	1.69	1.64	1.59	1.64	1.75	1.17	1.34	1.16	1.19	1.24	1.37	1.28	1.43	1.38
opus-4.7	1.66	1.72	1.66	1.63	2.01	1.55	1.54	1.82	1.56	1.57	1.68	1.57	1.58	1.59	1.58
nova-2-lite	2.88	2.50	✗	2.30	2.13	1.90	1.29	1.82	1.16	1.09	1.10	1.86	1.66	2.16	1.37
gpt-5.5	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	1.93	1.70	✗	✗
deepseek-v3.2 (direct)	1.52	✗	✗	0.71	✗	1.33	✗	1.08	✗	6.31	✗	1.46	2.26	✗	0.81
gpt-oss-120b	1.87	✗	✗	2.48	✗	2.83	1.71	1.33	1.14	1.07	✗	4.10	1.57	✗	2.50
opus-4.8	2.48	2.30	2.67	2.49	2.18	2.31	1.79	1.90	1.96	1.71	1.83	1.97	2.60	2.17	2.08
gpt-5.4	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	1.68	3.78	✗	1.01
haiku-4.5 (geo profile)	2.48	✗	3.08	✗	✗	1.98	1.79	2.13	1.24	1.27	2.69	2.82	1.98	3.82	2.01
sonnet-5	2.75	2.62	2.62	2.89	2.33	3.00	2.68	2.45	2.79	2.50	2.50	2.39	2.39	2.73	2.87
fable-5 (geo)	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	3.28	3.43	3.50	3.37
llama3.1-405b (direct)	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	9.29

Trend over time

TTFB per run for the fastest models: solid line = median (p50), shaded band = p5–p95 spread. A widening band means the tail is degrading even if the median looks fine — the same signal fire rate captures. Each point is one benchmark run.

Global vs Geo vs In-region — the cost of the router

These three are not interchangeable. in-region = model served in the region you called (true residency). geo = a regional inference profile (us./eu./…) routing within a geography. global = the cross-region router. Empirically (months of 5-min sampling) global is a latency tax, not a win: it queues rather than rerouting to free capacity, so it runs slower than direct profiles and degrades hardest under load. In this dataset, however, the pooled figures below show global with a lower mean TTFB than geo (1.617 s vs 1.895 s) at the same 4% fire rate, so that pattern is not reproduced here, and no in-region row appears in the comparison. Compared like-for-like over base models present in more than one scope: haiku-4.5, nova-2-lite. Lower is better.

Scope	Calls	TTFB avg (s)	TTFB p95 (s)	Fire rate
geo	6855	1.895	2.252	4%
global	9457	1.617	2.204	4%
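
As a cross-check, the two scope rows can be rebuilt from the per-model detail table. This is a minimal sketch that assumes the scope average is a call-weighted mean over successful calls, and that the unsuffixed haiku-4.5 and nova-2-lite rows are the global scope. Both are assumptions, and `pooled_mean` is a hypothetical helper name.

```python
# (ok_calls, ttfb_avg_s) copied from the per-model detail table.
ROWS = {
    # haiku-4.5 (geo profile), nova-2-lite (geo profile)
    "geo": [(3919, 2.276), (2936, 1.386)],
    # haiku-4.5, nova-2-lite (assumed here to be the global-scope rows)
    "global": [(4901, 1.445), (4556, 1.802)],
}

def pooled_mean(rows):
    """Call-weighted mean TTFB: total seconds divided by total calls."""
    total_calls = sum(calls for calls, _ in rows)
    total_seconds = sum(calls * avg for calls, avg in rows)
    return total_calls, total_seconds / total_calls

for scope, rows in ROWS.items():
    calls, mean = pooled_mean(rows)
    print(f"{scope:6s} calls={calls} ttfb_avg={mean:.3f}s")
```

Under those assumptions the sketch reproduces the table's call counts (6855 and 9457) and TTFB averages (1.895 s and 1.617 s).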

Scope × region — mean TTFB (s)

Same comparison broken out by region. Hover a cell for call count and fire count.

Scope	ap-northeast-1	ap-northeast-2	ap-northeast-3	ap-south-1	ap-southeast-1	ap-southeast-2	eu-central-1	eu-north-1	eu-west-1	eu-west-2	eu-west-3	us-east-1	us-east-2	us-west-1	us-west-2
geo	2.45	·	3.08	·	·	1.98	1.24	1.60	1.19	1.27	1.65	2.29	1.56	2.65	2.11
global	2.35	2.10	1.64	1.94	1.88	1.82	1.23	1.58	1.16	1.14	1.17	1.61	1.47	1.80	1.38

Time of day — UTC hour (24 hour-buckets, 50133 calls)

Mean TTFB and fire rate by UTC hour, across all runs. Bedrock capacity shifts with the clock — empirically the EU window (UTC ~15:00–03:00) runs hotter than the US window. This fills in as scheduled runs (00/06/12/18) accumulate; with runs clustered in one window it'll look flat. Lower is better.

Per-model detail

Legend: best · mid · worst. Cells are heat-graded within each column. OK = calls succeeded / attempted (the sample size).

Model	Provider	OK (n)	TTFB avg (s)	TTFB p95 (s)	TTFB σ	Fire rate	Correct	Total avg (s)	Throughput (tok/s)	Inter-token (ms)	Cost/call	Buffered
llama4-maverick (geo)	bedrock	1256/4905	0.441	0.604	±1.71	0%	100%	1.85	194	6.2	$0.0004	105
nemotron-nano3-30b (direct)	bedrock	728/1365	0.542	0.752	±0.17	0%	64%	2.03	215	5.8	$0.0001	305
deepseek-r1 (geo)	bedrock	844/4143	0.587	3.400	±1.28	8%	0%	2.96	n/a	2.4	$0.0027	731
nemotron-nano-12b (direct)	bedrock	728/1365	0.697	0.837	±1.74	1%	87%	2.14	180	6.5	$0.0004	284
nemotron-nano-9b (direct)	bedrock	728/1365	0.745	0.892	±2.55	1%	16%	4.25	145	9.5	$0.0003	64
llama3.3-70b (geo)	bedrock	981/4905	0.860	2.368	±3.41	3%	97%	2.69	139	9.1	$0.0006	83
gpt-oss-20b	mantle	2494/3788	0.931	1.406	±1.31	0%	61%	2.29	408	3.7	$0.0002	866
nova-micro (direct)	bedrock	968/4905	0.963	1.212	±4.95	2%	86%	1.67	421	3.0	$0.0001	93
llama4-scout (geo)	bedrock	1278/4905	0.971	0.617	±6.08	1%	100%	2.30	197	5.9	$0.0003	104
grok-4.3	mantle	197/1723	0.985	3.905	±1.66	7%	—	2.88	n/a	2.3	$0.0015	162
nemotron-super3-120b (direct)	bedrock	624/1340	1.036	1.708	±3.31	4%	69%	3.03	155	7.1	$0.0004	235
nova-2-lite (geo profile)	bedrock	2936/4905	1.386	2.861	±5.94	4%	85%	2.73	253	4.7	$0.0011	518
sonnet-4.6	bedrock	4890/4905	1.424	2.573	±1.58	4%	99%	6.00	86	15.7	$0.0062	406
haiku-4.5	bedrock	4901/4905	1.445	2.070	±0.98	2%	100%	3.24	157	8.2	$0.0019	420
opus-4.7	bedrock	4889/4905	1.647	2.514	±1.59	3%	92%	5.27	179	10.3	$0.0123	859
nova-2-lite	bedrock	4556/4905	1.802	2.956	±6.74	5%	82%	3.09	262	4.6	$0.0011	796
gpt-5.5	mantle_responses	421/3345	1.815	4.121	±2.37	18%	—	5.56	176	11.4	—	145
deepseek-v3.2 (direct)	bedrock	2387/4544	1.936	8.316	±5.10	10%	96%	7.77	63	21.3	$0.0010	649
gpt-oss-120b	mantle	2646/3970	2.066	5.816	±2.41	16%	61%	3.59	344	3.8	$0.0003	651
opus-4.8	bedrock	4884/4905	2.164	4.399	±2.71	15%	93%	5.62	140	10.5	$0.0122	923
gpt-5.4	mantle_responses	728/3508	2.180	10.458	±4.09	13%	100%	9.73	58	78.5	—	68
haiku-4.5 (geo profile)	bedrock	3919/4905	2.276	1.844	±7.79	4%	100%	4.03	159	7.7	$0.0019	340
sonnet-5	bedrock	1559/1560	2.634	6.019	±2.46	29%	73%	5.82	189	10.7	$0.0089	375
fable-5 (geo)	bedrock	264/1365	3.395	5.534	±1.16	71%	84%	7.62	165	55.4	$0.0154	101
llama3.1-405b (direct)	bedrock	327/4905	9.292	33.249	±11.92	79%	97%	9.37	n/a	2.1	$0.0017	288

By prompt category

Category — what it tests	Calls	TTFB avg (s)	Total avg (s)	Throughput (tok/s)
Trivial greeting A bare 'Hi' — pure latency probe, almost no generation.	7335	1.620	1.84	229
Factual recall A one-word factual question — short, knowledge lookup.	7335	1.591	1.68	253
Math / reasoning Multi-step word problems and arithmetic with working shown.	14662	1.749	4.59	201
Code generation & debugging Write a function with tests; explain and fix buggy code.	14653	1.782	6.46	181
Creative writing A short noir scene and a haiku — open-ended generation.	14660	1.703	3.79	127
Summarisation Condense a paragraph into a fixed number of bullet points.	7328	1.580	2.30	213
Structured output Return strict JSON only — tests format adherence.	7327	1.487	2.35	256
Instruction following Exact-format constraints (casing, separators, no extra text).	7325	1.313	1.53	220
Long-form generation A thorough multi-section technical explanation — sustained output.	7311	1.623	13.22	141
Large-context load A large (~12k token) input the model must read before answering. Exposes queueing/throughput differences that small prompts hide — the regime where global vs geo vs in-region latency actually diverges (per empirical Bedrock data).	4305	1.982	2.22	260

The actual prompts used

"Fastest model" = the model with the lowest mean TTFB for that specific prompt, across all regions/runs (green cell shows the model + its mean TTFB).

Category	ID	Max tokens	Fastest model (TTFB)	Prompt
Trivial greeting	greeting	64	llama4-maverick geo 0.38s	Hi
Factual recall	factual_short	32	gpt-oss-20b 0.97s	What is the capital of Australia? Answer in one word.
Math / reasoning	reasoning_math	600	llama4-maverick geo 0.38s	A train leaves City A at 9:00am travelling at 60 mph. A second train leaves City B, 240 miles away, at 10:00am travelling at 80 mph toward City A. At what time do they meet? Show your working.
Math / reasoning	math_arithmetic	300	llama4-maverick geo 0.38s	Compute 47 * 89 + 1337 - 256 / 4. Show each step.
Code generation & debugging	code_generation	800	llama4-maverick geo 0.36s	Write a Python function `merge_intervals(intervals)` that merges overlapping intervals given as a list of [start, end] pairs. Include a docstring and a couple of inline tests.
Code generation & debugging	code_debug	500	llama4-maverick geo 0.37s	This Python is buggy: def avg(xs): return sum(xs) / len(xs) Explain the failure modes and give a corrected version.
Creative writing	creative_story	500	llama4-maverick geo 0.37s	Write a 150-word noir-style opening scene about a detective who only investigates crimes that happen in libraries.
Creative writing	creative_poem	128	llama4-maverick geo 0.37s	Write a haiku about distributed systems failing gracefully.
Summarisation	summarisation	300	llama4-maverick geo 0.39s	Summarise the following in exactly three bullet points: Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models from leading AI companies via a single API, along with capabilities to build generative AI applicatio…
Structured output	structured_json	300	nemotron-nano-12b 0.53s	Return ONLY valid JSON (no prose) describing three programming languages, each with keys: name, year_created, paradigm.
Instruction following	instruction_following	128	llama4-scout geo 0.37s	List the planets of the solar system in order from the sun. Output them comma-separated on a single line, all lowercase, no other text.
Long-form generation	long_generation	1200	llama4-maverick geo 0.41s	Explain how TCP congestion control works, covering slow start, congestion avoidance, fast retransmit and fast recovery. Aim for a thorough, well-structured explanation.
Large-context load	large_context	80	llama4-maverick geo 0.64s	Section 1. In the distributed ledger subsystem, node 1 maintains a replicated log with quorum size 4 and a heartbeat interval of 57 milliseconds; its committed offset is 1007 and its term number is 2. When a partition heals, node 1 reconciles by comparing v…
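
The report states that correctness on the deterministic prompts is graded by exact string/JSON checks with no LLM judge. A minimal sketch of what such checks could look like for three of the prompts above follows. The function names, the tolerance for a trailing full stop, and the assumption that the JSON reply is a top-level list are all illustrative, not the benchmark's actual graders.

```python
import json

PLANETS = ["mercury", "venus", "earth", "mars",
           "jupiter", "saturn", "uranus", "neptune"]
REQUIRED_KEYS = {"name", "year_created", "paradigm"}

def grade_factual(reply: str) -> bool:
    """factual_short: one word, 'Canberra' (case-insensitive, optional full stop)."""
    return reply.strip().rstrip(".").lower() == "canberra"

def grade_planets(reply: str) -> bool:
    """instruction_following: one line, all lowercase, comma-separated, no extra text."""
    text = reply.strip()
    if "\n" in text or text != text.lower():
        return False
    return [part.strip() for part in text.split(",")] == PLANETS

def grade_json(reply: str) -> bool:
    """structured_json: valid JSON only, three objects each carrying the required keys."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return False  # any surrounding prose makes the reply invalid JSON
    if not isinstance(data, list) or len(data) != 3:
        return False
    return all(isinstance(item, dict) and REQUIRED_KEYS.issubset(item)
               for item in data)
```

Checks like these are strict by design: a correct answer wrapped in a sentence of explanation fails, which is what a format-adherence test is meant to catch.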

By region — routing / geo latency, like-for-like

Fair comparison: only the 5 models that ran successfully in every region are pooled here, so a region isn't flattered or penalised by which models happened to be available. Lower TTFB is better.

Region	Geo	Calls	TTFB avg (s)	Throughput (tok/s)
eu-west-2	EU	1409	1.490	145
eu-central-1	EU	1410	1.521	144
eu-west-3	EU	1410	1.578	146
eu-west-1	EU	1408	1.618	142
us-east-1	US	1404	1.641	141
us-west-2	US	1412	1.681	141
eu-north-1	EU	1407	1.685	139
us-west-1	US	1412	1.703	146
us-east-2	US	1411	1.745	145
ap-northeast-2	APJ	1405	1.867	145
ap-southeast-2	APJ	1411	1.878	143
ap-south-1	APJ	1404	1.879	139
ap-northeast-1	APJ	1410	1.922	144
ap-northeast-3	APJ	1405	1.953	140
ap-southeast-1	APJ	1405	1.956	142
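
The like-for-like rule above (pool only models that succeeded in every region) can be sketched as below. The function name `like_for_like`, the tuple layout and the choice of a per-call mean rather than a mean of model means are assumptions made for illustration.

```python
from collections import defaultdict

def like_for_like(calls):
    """calls: list of (model, region, ttfb_s) tuples for successful calls.

    Returns {region: mean TTFB}, pooled only over models seen in every region.
    """
    all_regions = {region for _, region, _ in calls}
    regions_by_model = defaultdict(set)
    for model, region, _ in calls:
        regions_by_model[model].add(region)
    # Keep only models with at least one successful call in every region.
    common = {m for m, seen in regions_by_model.items() if seen == all_regions}

    totals = defaultdict(lambda: [0.0, 0])  # region -> [sum of seconds, call count]
    for model, region, ttfb in calls:
        if model in common:
            totals[region][0] += ttfb
            totals[region][1] += 1
    return {region: total / n for region, (total, n) in totals.items()}

sample = [("a", "r1", 1.0), ("a", "r2", 2.0), ("b", "r1", 9.0)]
print(like_for_like(sample))
```

In the sample, model "b" only ran in r1, so its slow 9.0 s call is excluded and r1 is not penalised for hosting it.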

Glossary — what the metrics mean

TTFB — time to first token (s): Wall-clock from request to first streamed token: the "is it responding yet" latency. Good <1s, poor >3s. Covers network round-trip, queueing, inference-profile routing and prompt prefill.
TTFB p95 (s): 95th-percentile TTFB — the slow tail (~1 call in 20). A p95 far above the average means inconsistent latency (cold starts, contention).
TTFB σ (std dev): Spread of TTFB across this model's calls. Low = predictable; high = jittery.
Fire rate: % of calls whose TTFB breached a per-category threshold (trivial 2s … large-context 6s). Borrowed from months of empirical Bedrock sampling: a single number capturing both latency and reliability — a model can have a decent average but a nasty tail, and fire rate catches that where the mean hides it. Lower is better.
Correct: Auto-graded answer correctness on the deterministic prompts — the factual question (Canberra), the exact-format planet list, strict-JSON output, the value planted in the large-context document, and the two maths problems. Graded by exact string/JSON checks, no LLM judge. Creative/summarisation prompts have no deterministic answer and aren't counted. This turns "fastest" into "fastest while actually right". Higher is better.
Cost/call: Mean $ per call: exact input/output token counts × per-1M-token rates. Rates come from the live AWS Price List API (fetched per region and routing scope; global, geo and in-region rows are priced separately, and global endpoints are often cheaper), with published Anthropic pricing as the fallback for models not yet in the Price List. Refresh rates with python pricing.py --refresh. Lower is better.
Total avg (s): Average full wall-clock per call. Depends on output length, so compare within a category. This is the metric the use-case rankings use (except chat).
Throughput — output tokens / second: Sustained generation speed during streaming. Good >150, sluggish <60. What makes a long answer finish fast.
Inter-token latency (ms): Mean gap between output tokens — inverse of throughput. Low & steady = smooth.
Buffered (count): Calls where the server didn't truly stream — it withheld then dumped the reply in ≤2 chunks (or at an impossible >1000 tok/s). For those, throughput and inter-token latency are not measurable and are excluded from averages. A high count means "streaming metrics N/A for this model" — its TTFB and total time are still valid. Neither good nor bad for quality.
OK (n): Successful calls / attempted — the sample size behind every average in that row. With --repeat N each combo runs N times and all are counted.
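
To make the latency definitions above concrete, here is a rough sketch of how the TTFB average, p95, σ, fire rate and the buffered flag could be computed from raw samples. It assumes a nearest-rank p95, a population standard deviation and a single threshold per call set (the report uses per-category thresholds). The function names are hypothetical and the report's own formulas may differ.

```python
import math
import statistics

def summarise_ttfb(samples, threshold_s):
    """samples: TTFB values in seconds; threshold_s: fire threshold for this category."""
    ordered = sorted(samples)
    # Nearest-rank p95: smallest value with at least 95% of samples at or below it.
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return {
        "avg": statistics.fmean(ordered),
        "p95": ordered[rank - 1],
        "sigma": statistics.pstdev(ordered),
        # Fire rate: share of calls whose TTFB breached the threshold.
        "fire_rate": sum(t > threshold_s for t in ordered) / len(ordered),
    }

def is_buffered(chunk_count, output_tokens, stream_seconds):
    """Glossary rule: reply arrived in <=2 chunks, or faster than 1000 tok/s."""
    if chunk_count <= 2 or stream_seconds <= 0:
        return True
    return output_tokens / stream_seconds > 1000

stats = summarise_ttfb([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], threshold_s=8)
print(stats)
```

With the ten sample values, two calls exceed the 8 s threshold, so the fire rate is 20% even though the mean is only 5.5 s. That is the tail signal the mean hides.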

How the tests were run — methodology

Method

Every enabled model is attempted in every enabled region (a model×region cross-product) against the full prompt suite. Each prompt has a max_tokens cap so short tasks stay cheap and long tasks aren't truncated. Averages are over all matching calls; with --repeat N each combo is measured N times and every result is recorded (not best-of), so the sample counts shown are real.
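
The model×region cross-product with repeats can be sketched like this. `build_plan` is a hypothetical helper and the two-model, two-region selection is illustrative. The max_tokens caps are the ones listed for those prompt IDs in the prompt table.

```python
from itertools import product

def build_plan(models, regions, prompts, repeat=1):
    """Yield one (model, region, prompt_id, max_tokens, attempt) tuple per planned call."""
    for model, region, (prompt_id, max_tokens) in product(models, regions, prompts.items()):
        for attempt in range(repeat):
            # Every attempt is recorded, not just the best one.
            yield model, region, prompt_id, max_tokens, attempt

plan = list(build_plan(
    models=["haiku-4.5", "nova-micro"],
    regions=["us-east-1", "eu-west-2"],
    prompts={"greeting": 64, "long_generation": 1200},
    repeat=3,
))
print(len(plan))  # 2 models x 2 regions x 2 prompts x 3 repeats = 24
```

Planning every combination up front means a model that is absent in a region still produces a planned call, which can then be recorded as a gap rather than silently skipped.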

Streaming. Every call is streamed, so TTFB is genuine first-token time.
Region ids. Models with a global. id use one id valid in all regions (the clean geo comparison); geo models use the regional inference profile prefix (us./eu./au./jp./apac.); direct models use the on-demand id (available only in some regions, so gaps elsewhere are expected and shown).
Three paths. bedrock = Converse API; mantle = gpt-oss via OpenAI chat-completions; mantle_responses = gpt-5.x via the Responses API on the bedrock-mantle host.
Concurrency. Threaded; token counts come from each provider's usage event. Fast-fail timeouts mean a model absent in a region is recorded as a gap, not a hang.
AWS profile: —. Every run is stored in SQLite, so the trend chart and history accumulate automatically.
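
Since every call is streamed, first-token time can be taken directly from the stream. The sketch below shows one way to do that over any iterable of text chunks. It is provider-agnostic and does not use the Bedrock or mantle client APIs. `time_stream` and the injectable `clock` parameter are assumptions for illustration, and it assumes the request is issued lazily when iteration starts.

```python
import time

def time_stream(chunks, clock=time.perf_counter):
    """Consume an iterable of text chunks; return (ttfb_s, total_s, chunk_count).

    ttfb_s is None if no non-empty chunk ever arrives.
    """
    start = clock()
    ttfb = None
    count = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty keep-alive chunks
        if ttfb is None:
            ttfb = clock() - start  # first real content marks time-to-first-token
        count += 1
    return ttfb, clock() - start, count

# Deterministic demo with a fake clock instead of real network timing.
ticks = iter([0.0, 0.4, 1.5])
ttfb, total, n = time_stream(["Hel", "", "lo"], clock=lambda: next(ticks))
print(ttfb, total, n)
```

The returned chunk count is the kind of signal a buffered-reply check can use: a reply that lands in one or two chunks has a valid TTFB and total time but no meaningful inter-token timing.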

Caveat: a single run is a snapshot — latencies vary with time of day, load and cold starts. Use --repeat and compare runs over time for stable figures.