Bedrock + Mantle Benchmark Report

Scope: all runs (full history)

24475 calls · 11078 ok · 13397 failed/gaps · 25 models · 15 regions · 11 runs in DB · profile: (not set)

Window: — → —. All figures are averages over every matching call (sample counts shown per row / in cell tooltips).

Recommendations — best model per use-case

This measures speed, not answer quality. Rankings reward whichever model returns tokens fastest — usually the smallest. Use them to pick the fastest model at an acceptable quality tier, paired with your own quality judgement (e.g. Opus/Sonnet for hard work, a small model for high-volume simple turns). Chat ranks by time-to-first-token; the rest by total wall-clock time.
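The ranking rule described above can be sketched as follows. This is a minimal illustration, not the report's actual code: the `rank_models` helper and the record layout are hypothetical, and the sample figures are overall averages from the per-model table, so they will not reproduce the per-use-case podiums below.

```python
from typing import Dict, List

# Hypothetical per-model averages (field names are illustrative only).
STATS: Dict[str, Dict[str, float]] = {
    "llama4-scout (geo)": {"ttfb_s": 0.259, "total_s": 1.69},
    "nova-micro (direct)": {"ttfb_s": 0.263, "total_s": 0.97},
    "haiku-4.5": {"ttfb_s": 1.166, "total_s": 2.87},
}


def rank_models(stats: Dict[str, Dict[str, float]], use_case: str, top: int = 3) -> List[str]:
    """Rank by time-to-first-token for chat, by total wall-clock time otherwise."""
    key = "ttfb_s" if use_case == "chat" else "total_s"
    return sorted(stats, key=lambda model: stats[model][key])[:top]


print(rank_models(STATS, "chat"))
print(rank_models(STATS, "coding"))
```

Speed is the only input here; quality tiering still has to be applied by hand, for example by passing in only the models you consider acceptable for the task.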

Chat / low-latency

Short interactive turns. Dominated by TTFB — how fast the first token arrives.

🥇 deepseek-r1 (geo) · 0.13s first token
🥈 llama4-scout (geo) · 0.26s first token
🥉 llama4-maverick (geo) · 0.26s first token

Coding

Code generation and structured output. Ranked by total time to a complete answer — the real wait, folding in first-token latency and generation speed.

🥇 nova-micro · 1.05s total · 485 tok/s
🥈 nova-2-lite (geo) · 1.76s total · 303 tok/s
🥉 llama4-scout (geo) · 1.78s total · 222 tok/s

Creative / long-form

Essays, stories, summaries. Ranked by total time to finish: a model with blistering throughput but a slow first token is not actually fast here.

🥇 nova-micro · 1.51s total · 298 tok/s
🥈 nemotron-nano-12b · 1.93s total · 194 tok/s
🥉 nova-2-lite (geo) · 2.76s total · 203 tok/s

Reasoning / math

Multi-step reasoning. Total time to a complete answer.

🥇 nova-micro · 1.09s total · 570 tok/s
🥈 llama4-scout (geo) · 1.39s total · 175 tok/s
🥉 llama4-maverick (geo) · 1.86s total · 196 tok/s

Speed: time to first token (TTFB)

Time until the first token arrives — the latency you feel in a chat box. Lower is better. Bars green→red = best→worst.

Throughput: output tokens / second

Sustained generation speed once streaming starts. Higher is better. Buffered (non-streaming) models excluded — their throughput isn't client-measurable.
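A minimal sketch of how buffered calls might be excluded from the throughput average, using the glossary's rule (a reply delivered in 2 chunks or fewer, or at more than 1000 tok/s, counts as buffered). Computing throughput as output tokens divided by the time after the first token is an assumption, as are the function and field names.

```python
from typing import Dict, Iterable, Optional

MAX_BUFFERED_CHUNKS = 2        # glossary: reply dumped in <= 2 chunks
MAX_PLAUSIBLE_TOK_S = 1000.0   # glossary: faster than this is treated as buffered


def streaming_throughput(output_tokens: int, ttfb_s: float,
                         total_s: float, chunks: int) -> Optional[float]:
    """Return tok/s for a truly streamed call, or None if it looks buffered."""
    if chunks <= MAX_BUFFERED_CHUNKS:
        return None
    generation_s = total_s - ttfb_s  # assumed definition of the streaming window
    if generation_s <= 0:
        return None
    rate = output_tokens / generation_s
    return None if rate > MAX_PLAUSIBLE_TOK_S else rate


def mean_throughput(calls: Iterable[Dict]) -> Optional[float]:
    """Average tok/s over streamed calls only; buffered calls are skipped."""
    rates = [r for r in (streaming_throughput(**c) for c in calls) if r is not None]
    return sum(rates) / len(rates) if rates else None
```

Returning `None` rather than 0 keeps buffered calls out of the mean instead of dragging it down, which matches the "n/a" throughput cells in the per-model table.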

Latency by region — pick a model

Same model, different regional endpoint = pure routing/geo latency. Pick a model to see its TTFB in every region it's available, sorted fastest first (green→red). Lower is better. Full heat-matrix for all models is beneath.


Heat-matrix — mean TTFB (s), every model in every region

Legend: cells are colour-graded fast → mid → slow; ✗ (or ·) = not available / not run. In the interactive report, hovering a cell shows its sample count.

Model	ap-northeast-1	ap-northeast-2	ap-northeast-3	ap-south-1	ap-southeast-1	ap-southeast-2	eu-central-1	eu-north-1	eu-west-1	eu-west-2	eu-west-3	us-east-1	us-east-2	us-west-1	us-west-2
llama4-scout (geo)	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	0.27	0.26	0.27	0.24
nova-micro (direct)	✗	✗	✗	✗	✗	0.33	✗	✗	✗	0.21	✗	0.25	✗	✗	✗
llama4-maverick (geo)	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	0.28	0.32	0.30	0.26
nemotron-nano-12b (direct)	0.39	✗	✗	0.35	✗	0.39	✗	✗	0.36	0.36	✗	0.38	0.37	✗	0.40
nova-2-lite (geo profile)	0.33	✗	✗	✗	✗	✗	0.39	0.41	0.39	✗	0.42	0.41	0.43	0.39	0.42
nemotron-nano-9b (direct)	0.46	✗	✗	0.41	✗	0.49	✗	✗	0.48	0.64	✗	0.44	0.40	✗	0.55
llama3.3-70b (geo)	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	0.51	0.50	✗	0.49
nova-2-lite	0.50	0.53	✗	0.63	0.60	0.54	0.53	0.60	0.49	0.54	0.54	0.43	0.41	0.40	0.37
deepseek-r1 (geo)	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	0.42	0.77	✗	0.53
haiku-4.5 (geo profile)	0.77	✗	0.73	✗	✗	0.66	0.85	0.87	0.85	0.88	0.87	0.90	0.86	0.82	0.85
nemotron-nano3-30b (direct)	0.41	✗	✗	0.40	✗	0.35	✗	✗	0.42	0.48	✗	5.15	0.35	✗	0.39
nemotron-super3-120b (direct)	✗	✗	✗	✗	✗	2.59	0.46	✗	0.61	0.48	✗	1.63	0.42	✗	1.06
haiku-4.5	0.99	0.98	1.05	1.11	0.99	1.03	2.12	1.20	1.11	1.19	1.25	1.12	1.08	1.05	1.23
sonnet-4.6	1.21	1.27	1.20	1.27	1.23	1.25	1.20	1.31	1.39	1.27	1.34	1.61	1.24	1.18	1.33
opus-4.7	1.43	1.40	1.36	1.49	1.29	1.26	1.48	1.58	1.61	1.58	1.65	1.44	1.40	1.51	1.39
opus-4.8	2.20	2.24	2.42	2.02	2.59	2.37	2.02	2.11	2.02	2.05	2.19	2.36	2.02	2.28	2.10
sonnet-5	2.50	2.51	2.69	2.88	2.61	2.76	2.88	2.44	2.36	2.63	2.95	2.37	2.45	2.68	2.76
fable-5 (geo)	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	3.41	3.49	3.56	3.54
deepseek-v3.2 (direct)	1.10	✗	✗	0.57	✗	2.54	✗	0.77	✗	24.56	✗	0.69	0.89	✗	0.74
llama3.1-405b (direct)	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	7.44
gpt-5.4	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗
gpt-5.5	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗
gpt-oss-120b	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗
gpt-oss-20b	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗
grok-4.3	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗

Trend over time

TTFB per run for the fastest models: solid line = median (p50), shaded band = p5–p95 spread. A widening band means the tail is degrading even if the median looks fine, which is the same signal that fire rate captures. Each point is one benchmark run.
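The p5/p50/p95 band for one run could be computed along these lines. This is a sketch only: linear interpolation between ranks is an assumed convention (the report does not say which percentile method it uses), and the helper names are hypothetical.

```python
from typing import Dict, List, Sequence


def percentile(sorted_vals: Sequence[float], q: float) -> float:
    """Percentile q (0-100) of pre-sorted values, linear interpolation between ranks."""
    if not sorted_vals:
        raise ValueError("no samples")
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def run_band(ttfbs: List[float]) -> Dict[str, float]:
    """Median plus the p5-p95 band (and its width) for one benchmark run."""
    vals = sorted(ttfbs)
    p5, p50, p95 = (percentile(vals, q) for q in (5, 50, 95))
    return {"p5": p5, "p50": p50, "p95": p95, "band_width": p95 - p5}
```

Comparing `band_width` across runs gives a single number for the "widening band" signal: it can grow while `p50` stays flat.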

Global vs Geo vs In-region — the cost of the router

These three are not interchangeable. in-region = model served in the region you called (true residency). geo = a regional inference profile (us./eu./…) routing within a geography. global = the cross-region router. Empirically (months of 5-min sampling) global is a latency tax, not a win — it queues rather than rerouting to free capacity, so it runs slower than direct profiles and degrades hardest under load. Compared like-for-like over base models present in more than one scope: haiku-4.5, nova-2-lite. Lower is better.

Scope	Calls	TTFB avg (s)	TTFB p95 (s)	Fire rate
geo	1373	0.643	1.105	0%
global	1893	0.848	1.440	0%
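The like-for-like restriction used for this table can be illustrated as below: only base models that appear in more than one scope contribute to the per-scope mean. The tuple layout and the `like_for_like` name are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Call = Tuple[str, str, float]  # (base_model, scope, ttfb_s), hypothetical layout


def like_for_like(calls: Iterable[Call]) -> Dict[str, float]:
    """Mean TTFB per scope, restricted to base models present in >1 scope."""
    calls = list(calls)
    scopes_by_model = defaultdict(set)
    for base, scope, _ in calls:
        scopes_by_model[base].add(scope)
    shared = {m for m, scopes in scopes_by_model.items() if len(scopes) > 1}

    totals = defaultdict(float)
    counts = defaultdict(int)
    for base, scope, ttfb in calls:
        if base in shared:
            totals[scope] += ttfb
            counts[scope] += 1
    return {scope: totals[scope] / counts[scope] for scope in totals}
```

Without the filter, a scope that happens to host only slow models would look worse for reasons unrelated to routing.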

Scope × region — mean TTFB (s)

Same comparison broken out by region. Hover a cell for call count and fire count.

Scope	ap-northeast-1	ap-northeast-2	ap-northeast-3	ap-south-1	ap-southeast-1	ap-southeast-2	eu-central-1	eu-north-1	eu-west-1	eu-west-2	eu-west-3	us-east-1	us-east-2	us-west-1	us-west-2
geo	0.55	·	0.73	·	·	0.66	0.62	0.64	0.62	0.88	0.65	0.65	0.64	0.61	0.64
global	0.75	0.76	1.05	0.87	0.80	0.78	1.32	0.90	0.80	0.87	0.89	0.78	0.75	0.72	0.80

Time of day — UTC hour (0 hour-buckets, 0 calls)

Mean TTFB and fire rate by UTC hour, across all runs. Bedrock capacity shifts with the clock — empirically the EU window (UTC ~15:00–03:00) runs hotter than the US window. This fills in as scheduled runs (00/06/12/18) accumulate; with runs clustered in one window it'll look flat. Lower is better.
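A sketch of the hour-of-day bucketing, assuming each stored call carries a Unix timestamp, its TTFB and a precomputed "fired" flag. These field choices and the function name are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, Tuple

Call = Tuple[float, float, bool]  # (unix_timestamp, ttfb_s, fired), assumed layout


def by_utc_hour(calls: Iterable[Call]) -> Dict[int, Dict[str, float]]:
    """Group calls by UTC hour and report call count, mean TTFB and fire rate."""
    buckets = defaultdict(list)
    for ts, ttfb, fired in calls:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        buckets[hour].append((ttfb, fired))

    summary = {}
    for hour, rows in sorted(buckets.items()):
        n = len(rows)
        summary[hour] = {
            "calls": n,
            "ttfb_avg": sum(t for t, _ in rows) / n,
            "fire_rate": sum(1 for _, f in rows if f) / n,
        }
    return summary
```

With runs scheduled at 00/06/12/18 UTC, only four buckets would ever fill in, which is why the chart needs several days of runs before any pattern is visible.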

Per-model detail

Legend: best → mid → worst; cells are heat-graded within each column. OK = calls succeeded / attempted (the sample size).

Model	Provider	OK (n)	TTFB avg (s)	TTFB p95 (s)	TTFB σ	Fire rate	Correct	Total avg (s)	Throughput (tok/s)	Inter-token (ms)	Cost/call	Buffered
llama4-scout (geo)	bedrock	262/979	0.259	0.460	±0.10	0%	100%	1.69	174	6.6	—	21
nova-micro (direct)	bedrock	197/979	0.263	0.738	±0.19	0%	81%	0.97	432	2.8	—	17
llama4-maverick (geo)	bedrock	262/979	0.288	0.529	±0.16	0%	100%	1.65	197	6.0	—	21
nemotron-nano-12b (direct)	bedrock	524/979	0.375	0.633	±0.13	0%	85%	1.62	184	5.7	—	204
nova-2-lite (geo profile)	bedrock	589/979	0.399	0.636	±0.23	0%	86%	1.66	269	4.4	—	97
nemotron-nano-9b (direct)	bedrock	524/979	0.483	0.732	±0.37	1%	14%	3.22	150	7.1	—	52
llama3.3-70b (geo)	bedrock	197/979	0.500	1.626	±0.55	1%	98%	2.51	129	9.8	—	16
nova-2-lite	bedrock	914/979	0.508	0.800	±0.17	0%	83%	1.72	277	4.2	—	151
deepseek-r1 (geo)	bedrock	197/979	0.573	3.707	±1.43	8%	0%	3.17	n/a	2.5	—	172
haiku-4.5 (geo profile)	bedrock	784/979	0.826	1.194	±0.24	0%	100%	2.53	157	7.9	$0.0023	66
nemotron-nano3-30b (direct)	bedrock	515/979	0.935	0.713	±6.39	1%	64%	2.52	208	6.1	—	216
nemotron-super3-120b (direct)	bedrock	411/979	0.989	1.942	±5.04	3%	68%	3.37	143	8.9	—	162
haiku-4.5	bedrock	979/979	1.166	1.704	±1.94	1%	100%	2.87	162	7.5	$0.0023	81
sonnet-4.6	bedrock	979/979	1.289	2.318	±1.04	3%	100%	5.74	94	15.2	$0.0072	78
opus-4.7	bedrock	979/979	1.459	2.331	±0.44	2%	91%	4.96	182	10.6	$0.0146	156
opus-4.8	bedrock	979/979	2.199	5.406	±1.46	19%	96%	5.69	137	11.4	$0.0145	168
sonnet-5	bedrock	978/979	2.630	6.573	±1.81	31%	73%	5.88	192	10.7	$0.0089	209
fable-5 (geo)	bedrock	222/979	3.500	6.269	±1.32	76%	79%	7.94	163	42.0	$0.0153	79
deepseek-v3.2 (direct)	bedrock	521/979	3.975	23.566	±10.21	17%	94%	10.43	55	23.2	—	150
llama3.1-405b (direct)	bedrock	65/979	7.438	28.007	±7.22	77%	97%	7.45	n/a	0.0	—	65

By prompt category

Category: what it tests	Calls	TTFB avg (s)	Total avg (s)	Throughput (tok/s)
Trivial greeting: A bare 'Hi'. Pure latency probe, almost no generation.	1925	1.018	1.19	232
Factual recall: A one-word factual question. Short, knowledge lookup.	1925	1.065	1.12	284
Math / reasoning: Multi-step word problems and arithmetic with working shown.	3750	1.293	4.37	185
Code generation & debugging: Write a function with tests; explain and fix buggy code.	3750	1.238	6.12	160
Creative writing: A short noir scene and a haiku. Open-ended generation.	3750	1.327	3.52	108
Summarisation: Condense a paragraph into a fixed number of bullet points.	1875	1.405	2.21	185
Structured output: Return strict JSON only. Tests format adherence.	1875	1.386	2.31	224
Instruction following: Exact-format constraints (casing, separators, no extra text).	1875	1.313	1.49	213
Long-form generation: A thorough multi-section technical explanation. Sustained output.	1875	1.673	13.54	130
Large-context load: A large (~12k token) input the model must read before answering. Exposes queueing/throughput differences that small prompts hide; this is the regime where global vs geo vs in-region latency actually diverges (per empirical Bedrock data).	1875	2.040	2.25	271

The actual prompts used

"Fastest model" = the model with the lowest mean TTFB for that specific prompt, across all regions/runs (green cell shows the model + its mean TTFB).

Category	ID	Max tokens	Fastest model (TTFB)	Prompt
Trivial greeting	greeting	64	llama4-maverick geo 0.24s	Hi
Factual recall	factual_short	32	sonnet-5 2.10s	What is the capital of Australia? Answer in one word.
Math / reasoning	reasoning_math	600	nova-micro 0.24s	A train leaves City A at 9:00am travelling at 60 mph. A second train leaves City B, 240 miles away, at 10:00am travelling at 80 mph toward City A. At what time do they meet? Show your working.
Math / reasoning	math_arithmetic	300	llama4-scout geo 0.24s	Compute 47 * 89 + 1337 - 256 / 4. Show each step.
Code generation & debugging	code_generation	800	llama4-scout geo 0.23s	Write a Python function `merge_intervals(intervals)` that merges overlapping intervals given as a list of [start, end] pairs. Include a docstring and a couple of inline tests.
Code generation & debugging	code_debug	500	llama4-scout geo 0.25s	This Python is buggy: def avg(xs): return sum(xs) / len(xs) Explain the failure modes and give a corrected version.
Creative writing	creative_story	500	llama4-scout geo 0.22s	Write a 150-word noir-style opening scene about a detective who only investigates crimes that happen in libraries.
Creative writing	creative_poem	128	nova-micro 0.19s	Write a haiku about distributed systems failing gracefully.
Summarisation	summarisation	300	nova-micro 0.20s	Summarise the following in exactly three bullet points: Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models from leading AI companies via a single API, along with capabilities to build generative AI applicatio…
Structured output	structured_json	300	nova-micro 0.19s	Return ONLY valid JSON (no prose) describing three programming languages, each with keys: name, year_created, paradigm.
Instruction following	instruction_following	128	nova-micro 0.27s	List the planets of the solar system in order from the sun. Output them comma-separated on a single line, all lowercase, no other text.
Long-form generation	long_generation	1200	nova-micro 0.19s	Explain how TCP congestion control works, covering slow start, congestion avoidance, fast retransmit and fast recovery. Aim for a thorough, well-structured explanation.
Large-context load	large_context	80	llama4-scout geo 0.46s	Section 1. In the distributed ledger subsystem, node 1 maintains a replicated log with quorum size 4 and a heartbeat interval of 57 milliseconds; its committed offset is 1007 and its term number is 2. When a partition heals, node 1 reconciles by comparing v…

By region — routing / geo latency, like-for-like

Fair comparison: only the 5 models that ran successfully in every region are pooled here, so a region isn't flattered or penalised by which models happened to be available. Lower TTFB is better.

Region	Geo	Calls	TTFB avg (s)	Throughput (tok/s)
us-east-2	US	325	1.639	166
ap-northeast-1	APJ	325	1.666	144
ap-northeast-2	APJ	325	1.679	151
eu-west-1	EU	335	1.698	157
eu-north-1	EU	325	1.729	154
ap-southeast-2	APJ	325	1.734	149
us-west-1	US	325	1.740	143
ap-southeast-1	APJ	325	1.743	151
ap-northeast-3	APJ	325	1.745	152
eu-west-2	EU	325	1.745	163
ap-south-1	APJ	324	1.750	148
us-west-2	US	325	1.765	148
us-east-1	US	335	1.779	151
eu-west-3	EU	325	1.876	159
eu-central-1	EU	325	1.939	141
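The pooling rule behind this table can be sketched as follows: keep only models with at least one successful call in every region, then average TTFB per region over those models. The data layout and helper names are assumptions, not the report's code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Set, Tuple

Call = Tuple[str, str, float]  # (model, region, ttfb_s) for one successful call


def models_in_every_region(ok_calls: Iterable[Call], regions: Sequence[str]) -> Set[str]:
    """Models that succeeded at least once in each of the given regions."""
    wanted = set(regions)
    seen = defaultdict(set)
    for model, region, _ in ok_calls:
        seen[model].add(region)
    return {model for model, got in seen.items() if wanted <= got}


def pooled_region_means(ok_calls: Iterable[Call], regions: Sequence[str]) -> Dict[str, float]:
    """Mean TTFB per region, using only models available everywhere."""
    ok_calls = list(ok_calls)
    pooled = models_in_every_region(ok_calls, regions)
    totals = defaultdict(float)
    counts = defaultdict(int)
    for model, region, ttfb in ok_calls:
        if model in pooled:
            totals[region] += ttfb
            counts[region] += 1
    return {region: totals[region] / counts[region] for region in totals}
```

Because the pooled set is the same in every region, differences between rows reflect routing and geography rather than model mix.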

Glossary — what the metrics mean

TTFB — time to first token (s): Wall-clock from request to first streamed token: the "is it responding yet" latency. Good <1s, poor >3s. Covers network round-trip, queueing, inference-profile routing and prompt prefill.
TTFB p95 (s): 95th-percentile TTFB — the slow tail (~1 call in 20). A p95 far above the average means inconsistent latency (cold starts, contention).
TTFB σ (std dev): Spread of TTFB across this model's calls. Low = predictable; high = jittery.
Fire rate: % of calls whose TTFB breached a per-category threshold (trivial 2s … large-context 6s). Borrowed from months of empirical Bedrock sampling: a single number capturing both latency and reliability — a model can have a decent average but a nasty tail, and fire rate catches that where the mean hides it. Lower is better.
Correct: Auto-graded answer correctness on the deterministic prompts — the factual question (Canberra), the exact-format planet list, strict-JSON output, the value planted in the large-context document, and the two maths problems. Graded by exact string/JSON checks, no LLM judge. Creative/summarisation prompts have no deterministic answer and aren't counted. This turns "fastest" into "fastest while actually right". Higher is better.
Cost/call: Mean $ per call: exact input/output token counts × per-1M-token rates. Rates come from the live AWS Price List API (fetched per region and routing scope: global/geo/in-region rows are priced separately, and global endpoints are often cheaper) with published Anthropic pricing as the fallback for models not yet in the Price List. Refresh rates with python pricing.py --refresh. Lower is better.
Total avg (s): Average full wall-clock per call. Depends on output length, so compare within a category. This is the metric the use-case rankings use (except chat).
Throughput — output tokens / second: Sustained generation speed during streaming. Good >150, sluggish <60. What makes a long answer finish fast.
Inter-token latency (ms): Mean gap between output tokens — inverse of throughput. Low & steady = smooth.
Buffered (count): Calls where the server didn't truly stream — it withheld then dumped the reply in ≤2 chunks (or at an impossible >1000 tok/s). For those, throughput and inter-token latency are not measurable and are excluded from averages. A high count means "streaming metrics N/A for this model" — its TTFB and total time are still valid. Neither good nor bad for quality.
OK (n): Successful calls / attempted — the sample size behind every average in that row. With --repeat N each combo runs N times and all are counted.
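Two of the glossary formulas, fire rate and cost/call, can be written out as a short sketch. Several things here are assumptions: only the two thresholds the glossary states (trivial 2s, large-context 6s) are filled in and the others are deliberately left out rather than guessed, "breached" is read as strictly greater than, and the token rates in the usage line are made-up placeholders, not real prices.

```python
from typing import Dict, Iterable, Tuple

# Only the two thresholds stated in the glossary; the rest are not guessed here.
THRESHOLDS_S: Dict[str, float] = {"trivial": 2.0, "large-context": 6.0}


def fire_rate(calls: Iterable[Tuple[str, float]],
              thresholds: Dict[str, float] = THRESHOLDS_S) -> float:
    """Fraction of (category, ttfb_s) calls whose TTFB exceeds its category threshold."""
    calls = list(calls)
    if not calls:
        return 0.0
    fired = sum(1 for category, ttfb in calls if ttfb > thresholds[category])
    return fired / len(calls)


def cost_per_call(input_tokens: int, output_tokens: int,
                  in_rate_per_m: float, out_rate_per_m: float) -> float:
    """Dollar cost of one call from token counts and per-1M-token rates."""
    return (input_tokens * in_rate_per_m + output_tokens * out_rate_per_m) / 1_000_000


sample = [("trivial", 0.3), ("trivial", 2.5), ("large-context", 5.0), ("large-context", 7.2)]
print(fire_rate(sample))                    # two of the four calls breach their threshold
print(cost_per_call(1000, 500, 1.0, 5.0))   # placeholder rates, not real pricing
```

A category missing from the threshold table raises `KeyError`, which is intentional in this sketch: a silently defaulted threshold would distort the rate.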

How the tests were run — methodology

Method

Every enabled model is attempted in every enabled region (a model×region cross-product) against the full prompt suite. Each prompt has a max_tokens cap so short tasks stay cheap and long tasks aren't truncated. Averages are over all matching calls; with --repeat N each combo is measured N times and every result is recorded (not best-of), so the sample counts shown are real.
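The model×region cross-product with repeats can be pictured like this. It is a minimal sketch: `plan_calls` is a hypothetical name, and the real runner presumably also applies the per-prompt max_tokens cap and the enabled/disabled flags.

```python
from itertools import product
from typing import Iterator, Sequence, Tuple


def plan_calls(models: Sequence[str], regions: Sequence[str],
               prompts: Sequence[str], repeat: int = 1) -> Iterator[Tuple[str, str, str, int]]:
    """Yield every (model, region, prompt, attempt) combination to be measured."""
    for model, region, prompt in product(models, regions, prompts):
        for attempt in range(repeat):
            yield model, region, prompt, attempt


plan = list(plan_calls(["model-a", "model-b"], ["us-east-1", "eu-west-1"],
                       ["greeting", "large_context"], repeat=2))
print(len(plan))  # 2 models x 2 regions x 2 prompts x 2 repeats
```

Every planned call produces one recorded row, whether it succeeds or not, which is how gaps end up in the attempted count.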

Streaming. Every call is streamed, so TTFB is genuine first-token time.
Region ids. Models with a global. id use one id valid in all regions (the clean geo comparison); geo models use the regional inference profile prefix (us./eu./au./jp./apac.); direct models use the on-demand id, which exists only in some regions (gaps elsewhere are expected and shown).
Three paths. bedrock = Converse API; mantle = gpt-oss via OpenAI chat-completions; mantle_responses = gpt-5.x via the Responses API on the bedrock-mantle host.
Concurrency. Threaded; token counts come from each provider's usage event. Fast-fail timeouts mean a model absent in a region is recorded as a gap, not a hang.
AWS profile: —. Every run is stored in SQLite, so the trend chart and history accumulate automatically.
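To illustrate what "genuine first-token time" means for a streamed call, here is a provider-agnostic sketch that times any iterator of text chunks. It does not call Bedrock; the `time_stream` name, the injectable clock and the choice to skip empty chunks are assumptions for illustration.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def time_stream(chunks: Iterable[str],
                clock: Callable[[], float] = time.perf_counter
                ) -> Tuple[Optional[float], float, int]:
    """Consume a stream and return (ttfb_s, total_s, chunk_count).

    ttfb_s is None if the stream ended without any non-empty chunk.
    """
    start = clock()
    ttfb = None
    count = 0
    for chunk in chunks:
        if not chunk:          # assumed: empty events carry no tokens
            continue
        if ttfb is None:
            ttfb = clock() - start
        count += 1
    total = clock() - start
    return ttfb, total, count
```

The chunk count returned here is what a buffered-call check would inspect: a reply that arrives in one or two chunks has a valid TTFB and total time but no meaningful streaming throughput.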

Caveat: a single run is a snapshot — latencies vary with time of day, load and cold starts. Use --repeat and compare runs over time for stable figures.