Scope: all runs (full history)
Window: 30 Jun 13:38 → 1 Jul 13:50. All figures are averages over every matching call (sample counts shown per row / in cell tooltips).
This measures speed, not answer quality. Rankings reward whichever model returns tokens fastest — usually the smallest. Use them to pick the fastest model at an acceptable quality tier, paired with your own quality judgement (e.g. Opus/Sonnet for hard work, a small model for high-volume simple turns). Chat ranks by time-to-first-token; the rest by total wall-clock time.
Short interactive turns. Dominated by TTFB — how fast the first token arrives.
Code generation and structured output. Ranked by total time to a complete answer — the real wait, folding in first-token latency and generation speed.
Essays, stories, summaries. Ranked by total time to finish: a model with blistering throughput but a slow first token is not actually fast here.
Multi-step reasoning. Total time to a complete answer.
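To see why a fast stream cannot rescue a slow first token, total time can be approximated as first-token latency plus generation time. This is a minimal sketch under that simplifying assumption; the function name and both model profiles are hypothetical and are not taken from the benchmark data.

```python
def estimated_total_seconds(ttfb_s: float, output_tokens: int, tokens_per_s: float) -> float:
    """Approximate wall-clock time: wait for the first token, then stream the rest."""
    return ttfb_s + output_tokens / tokens_per_s


# Two hypothetical models, each producing a 200-token answer.
quick_start = estimated_total_seconds(ttfb_s=0.5, output_tokens=200, tokens_per_s=100.0)
fast_stream = estimated_total_seconds(ttfb_s=3.0, output_tokens=200, tokens_per_s=400.0)

print(quick_start)  # 2.5
print(fast_stream)  # 3.5
```

The second profile streams four times faster but still finishes a second later, which is the case the total-time rankings are meant to expose.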
Time until the first token arrives — the latency you feel in a chat box. Lower is better. Bars green→red = best→worst.
Sustained generation speed once streaming starts. Higher is better. Buffered (non-streaming) models excluded — their throughput isn't client-measurable.
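One way these two metrics could be derived on the client is from token arrival timestamps. The sketch below assumes one timestamp per token (real streams may deliver several tokens per chunk) and uses hypothetical names throughout; it is not the benchmark's own code. A response whose tokens all arrive at the same instant is treated as buffered, so no throughput is reported for it.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class StreamTiming:
    ttfb_s: float
    total_s: float
    throughput_tok_s: Optional[float]  # None when the response was buffered
    inter_token_ms: Optional[float]


def summarise_stream(request_start: float, token_times: Sequence[float]) -> StreamTiming:
    """Summarise one call from the request start time and per-token arrival times (seconds)."""
    if not token_times:
        raise ValueError("no tokens received")
    first, last = token_times[0], token_times[-1]
    ttfb = first - request_start
    total = last - request_start
    gaps = len(token_times) - 1
    stream_s = last - first
    if gaps == 0 or stream_s <= 0:
        # Everything arrived at once: generation speed is not observable from the client.
        return StreamTiming(ttfb, total, None, None)
    return StreamTiming(ttfb, total, gaps / stream_s, 1000.0 * stream_s / gaps)


streamed = summarise_stream(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
buffered = summarise_stream(0.0, [2.0, 2.0, 2.0])
```

Here `streamed` has a TTFB of 0.5 s, a throughput of 4 tok/s and an inter-token gap of 250 ms, while `buffered` reports only TTFB and total time.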
Same model, different regional endpoint = pure routing/geo latency. Pick a model to see its TTFB in every region it's available, sorted fastest first (green→red). Lower is better. Full heat-matrix for all models is beneath.
| Model | ap-northeast-1 | ap-northeast-2 | ap-northeast-3 | ap-south-1 | ap-southeast-1 | ap-southeast-2 | eu-central-1 | eu-north-1 | eu-west-1 | eu-west-2 | eu-west-3 | us-east-1 | us-east-2 | us-west-1 | us-west-2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| llama4-maverick (geo) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 0.38 | 0.37 | 0.44 | 0.58 |
| nemotron-nano3-30b (direct) | 0.68 | ✗ | ✗ | 0.59 | ✗ | 0.62 | ✗ | ✗ | 0.45 | 0.47 | ✗ | 0.52 | 0.47 | ✗ | 0.53 |
| deepseek-r1 (geo) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 0.57 | 0.60 | ✗ | 0.59 |
| nemotron-nano-12b (direct) | 1.03 | ✗ | ✗ | 0.58 | ✗ | 1.09 | ✗ | ✗ | 0.39 | 0.61 | ✗ | 0.46 | 0.85 | ✗ | 0.56 |
| nemotron-nano-9b (direct) | 0.68 | ✗ | ✗ | 0.80 | ✗ | 0.94 | ✗ | ✗ | 0.46 | 0.65 | ✗ | 0.54 | 0.64 | ✗ | 1.25 |
| llama3.3-70b (geo) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 1.06 | 0.60 | ✗ | 0.92 |
| gpt-oss-20b | 1.27 | ✗ | ✗ | 1.07 | ✗ | 1.44 | 0.56 | 0.72 | 0.59 | 0.53 | ✗ | 1.15 | 0.96 | ✗ | 1.02 |
| nova-micro (direct) | ✗ | ✗ | ✗ | ✗ | ✗ | 0.88 | ✗ | ✗ | ✗ | 0.78 | ✗ | 1.23 | ✗ | ✗ | ✗ |
| llama4-scout (geo) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 0.35 | 0.78 | 2.34 | 0.40 |
| grok-4.3 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 0.96 | ✗ | ✗ | 1.00 |
| nemotron-super3-120b (direct) | ✗ | ✗ | ✗ | ✗ | ✗ | 0.77 | 2.05 | ✗ | 0.53 | 0.56 | ✗ | 2.12 | 0.51 | ✗ | 0.64 |
| nova-2-lite (geo profile) | 2.41 | ✗ | ✗ | ✗ | ✗ | ✗ | 0.70 | 1.06 | 1.13 | ✗ | 0.61 | 1.75 | 1.14 | 1.46 | 2.21 |
| sonnet-4.6 | 1.46 | 1.51 | 1.63 | 1.48 | 1.88 | 1.55 | 1.22 | 1.43 | 1.41 | 1.17 | 1.27 | 1.43 | 1.30 | 1.30 | 1.31 |
| haiku-4.5 | 1.81 | 1.69 | 1.64 | 1.59 | 1.64 | 1.75 | 1.17 | 1.34 | 1.16 | 1.19 | 1.24 | 1.37 | 1.28 | 1.43 | 1.38 |
| opus-4.7 | 1.66 | 1.72 | 1.66 | 1.63 | 2.01 | 1.55 | 1.54 | 1.82 | 1.56 | 1.57 | 1.68 | 1.57 | 1.58 | 1.59 | 1.58 |
| nova-2-lite | 2.88 | 2.50 | ✗ | 2.30 | 2.13 | 1.90 | 1.29 | 1.82 | 1.16 | 1.09 | 1.10 | 1.86 | 1.66 | 2.16 | 1.37 |
| gpt-5.5 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 1.93 | 1.70 | ✗ | ✗ |
| deepseek-v3.2 (direct) | 1.52 | ✗ | ✗ | 0.71 | ✗ | 1.33 | ✗ | 1.08 | ✗ | 6.31 | ✗ | 1.46 | 2.26 | ✗ | 0.81 |
| gpt-oss-120b | 1.87 | ✗ | ✗ | 2.48 | ✗ | 2.83 | 1.71 | 1.33 | 1.14 | 1.07 | ✗ | 4.10 | 1.57 | ✗ | 2.50 |
| opus-4.8 | 2.48 | 2.30 | 2.67 | 2.49 | 2.18 | 2.31 | 1.79 | 1.90 | 1.96 | 1.71 | 1.83 | 1.97 | 2.60 | 2.17 | 2.08 |
| gpt-5.4 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 1.68 | 3.78 | ✗ | 1.01 |
| haiku-4.5 (geo profile) | 2.48 | ✗ | 3.08 | ✗ | ✗ | 1.98 | 1.79 | 2.13 | 1.24 | 1.27 | 2.69 | 2.82 | 1.98 | 3.82 | 2.01 |
| sonnet-5 | 2.75 | 2.62 | 2.62 | 2.89 | 2.33 | 3.00 | 2.68 | 2.45 | 2.79 | 2.50 | 2.50 | 2.39 | 2.39 | 2.73 | 2.87 |
| fable-5 (geo) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 3.28 | 3.43 | 3.50 | 3.37 |
| llama3.1-405b (direct) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 9.29 |
TTFB per run for the fastest models: solid line = median (p50), shaded band = p5–p95 spread. A widening band means the tail is degrading even if the median looks fine, which is the same signal that the fire rate captures. Each point is one benchmark run.
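The band for a single run could be computed along these lines. This is a sketch using Python's `statistics.quantiles` with the inclusive method; the dashboard's actual percentile method is not stated, and the sample values (in milliseconds) are made up for illustration.

```python
import statistics


def ttfb_band(samples):
    """Return (p5, p50, p95) for one run's TTFB samples."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")  # 99 cut points
    return cuts[4], cuts[49], cuts[94]


# Hypothetical healthy run: 101 samples spread evenly from 400 ms to 500 ms.
healthy = list(range(400, 501))
# Hypothetical degraded run: same bulk, but the slowest ten calls take 2000 ms.
degraded = list(range(400, 491)) + [2000] * 10

print(ttfb_band(healthy))   # (405.0, 450.0, 495.0)
print(ttfb_band(degraded))  # (405.0, 450.0, 2000.0)
```

Both runs share a 450 ms median; only the p95 edge of the band reveals the slow tail in the second one.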
These three are not interchangeable.
in-region = model served in the region you called (true residency).
geo = a regional inference profile (us./eu./…) routing within a geography.
global = the cross-region router. Empirically (months of 5-min sampling) global is a latency tax, not a win: it queues rather than rerouting to free capacity, so it runs slower than direct profiles and degrades hardest under load.
Compared like-for-like over base models present in more than one scope: haiku-4.5, nova-2-lite. Lower is better.
| Scope | Calls | TTFB avg (s) | TTFB p95 (s) | Fire rate |
|---|---|---|---|---|
| geo | 6855 | 1.895 | 2.252 | 4% |
| global | 9457 | 1.617 | 2.204 | 4% |
Same comparison broken out by region. Hover a cell for call count and fire count.
| Scope | ap-northeast-1 | ap-northeast-2 | ap-northeast-3 | ap-south-1 | ap-southeast-1 | ap-southeast-2 | eu-central-1 | eu-north-1 | eu-west-1 | eu-west-2 | eu-west-3 | us-east-1 | us-east-2 | us-west-1 | us-west-2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| geo | 2.45 | · | 3.08 | · | · | 1.98 | 1.24 | 1.60 | 1.19 | 1.27 | 1.65 | 2.29 | 1.56 | 2.65 | 2.11 |
| global | 2.35 | 2.10 | 1.64 | 1.94 | 1.88 | 1.82 | 1.23 | 1.58 | 1.16 | 1.14 | 1.17 | 1.61 | 1.47 | 1.80 | 1.38 |
Mean TTFB and fire rate by UTC hour, across all runs. Bedrock capacity shifts with the clock — empirically the EU window (UTC ~15:00–03:00) runs hotter than the US window. This fills in as scheduled runs (00/06/12/18) accumulate; with runs clustered in one window it'll look flat. Lower is better.
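The hourly bucketing described above might look like the following. It is a sketch only: it assumes each sample carries a Unix timestamp, covers mean TTFB but not fire rate (whose definition is not given here), and uses a hypothetical function name.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import fmean


def mean_ttfb_by_utc_hour(samples):
    """samples: iterable of (unix_timestamp_s, ttfb_s). Returns {utc_hour: mean_ttfb_s}."""
    buckets = defaultdict(list)
    for ts, ttfb in samples:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        buckets[hour].append(ttfb)
    return {hour: fmean(vals) for hour, vals in sorted(buckets.items())}


# Hypothetical samples: two calls in the 00:00 UTC hour, one in the 06:00 UTC hour.
hourly = mean_ttfb_by_utc_hour([(0, 1.0), (1800, 2.0), (6 * 3600, 3.0)])
print(hourly)  # {0: 1.5, 6: 3.0}
```

Hours with no samples are simply absent from the result, which matches the note that the view fills in as scheduled runs accumulate.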
| Model | Provider | OK (n) | TTFB avg (s) | TTFB p95 (s) | TTFB σ | Fire rate | Correct | Total avg (s) | Throughput (tok/s) | Inter-token (ms) | Cost/call | Buffered |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| llama4-maverick (geo) | bedrock | 1256/4905 | 0.441 | 0.604 | ±1.71 | 0% | 100% | 1.85 | 194 | 6.2 | $0.0004 | 105 |
| nemotron-nano3-30b (direct) | bedrock | 728/1365 | 0.542 | 0.752 | ±0.17 | 0% | 64% | 2.03 | 215 | 5.8 | $0.0001 | 305 |
| deepseek-r1 (geo) | bedrock | 844/4143 | 0.587 | 3.400 | ±1.28 | 8% | 0% | 2.96 | n/a | 2.4 | $0.0027 | 731 |
| nemotron-nano-12b (direct) | bedrock | 728/1365 | 0.697 | 0.837 | ±1.74 | 1% | 87% | 2.14 | 180 | 6.5 | $0.0004 | 284 |
| nemotron-nano-9b (direct) | bedrock | 728/1365 | 0.745 | 0.892 | ±2.55 | 1% | 16% | 4.25 | 145 | 9.5 | $0.0003 | 64 |
| llama3.3-70b (geo) | bedrock | 981/4905 | 0.860 | 2.368 | ±3.41 | 3% | 97% | 2.69 | 139 | 9.1 | $0.0006 | 83 |
| gpt-oss-20b | mantle | 2494/3788 | 0.931 | 1.406 | ±1.31 | 0% | 61% | 2.29 | 408 | 3.7 | $0.0002 | 866 |
| nova-micro (direct) | bedrock | 968/4905 | 0.963 | 1.212 | ±4.95 | 2% | 86% | 1.67 | 421 | 3.0 | $0.0001 | 93 |
| llama4-scout (geo) | bedrock | 1278/4905 | 0.971 | 0.617 | ±6.08 | 1% | 100% | 2.30 | 197 | 5.9 | $0.0003 | 104 |
| grok-4.3 | mantle | 197/1723 | 0.985 | 3.905 | ±1.66 | 7% | — | 2.88 | n/a | 2.3 | $0.0015 | 162 |
| nemotron-super3-120b (direct) | bedrock | 624/1340 | 1.036 | 1.708 | ±3.31 | 4% | 69% | 3.03 | 155 | 7.1 | $0.0004 | 235 |
| nova-2-lite (geo profile) | bedrock | 2936/4905 | 1.386 | 2.861 | ±5.94 | 4% | 85% | 2.73 | 253 | 4.7 | $0.0011 | 518 |
| sonnet-4.6 | bedrock | 4890/4905 | 1.424 | 2.573 | ±1.58 | 4% | 99% | 6.00 | 86 | 15.7 | $0.0062 | 406 |
| haiku-4.5 | bedrock | 4901/4905 | 1.445 | 2.070 | ±0.98 | 2% | 100% | 3.24 | 157 | 8.2 | $0.0019 | 420 |
| opus-4.7 | bedrock | 4889/4905 | 1.647 | 2.514 | ±1.59 | 3% | 92% | 5.27 | 179 | 10.3 | $0.0123 | 859 |
| nova-2-lite | bedrock | 4556/4905 | 1.802 | 2.956 | ±6.74 | 5% | 82% | 3.09 | 262 | 4.6 | $0.0011 | 796 |
| gpt-5.5 | mantle_responses | 421/3345 | 1.815 | 4.121 | ±2.37 | 18% | — | 5.56 | 176 | 11.4 | — | 145 |
| deepseek-v3.2 (direct) | bedrock | 2387/4544 | 1.936 | 8.316 | ±5.10 | 10% | 96% | 7.77 | 63 | 21.3 | $0.0010 | 649 |
| gpt-oss-120b | mantle | 2646/3970 | 2.066 | 5.816 | ±2.41 | 16% | 61% | 3.59 | 344 | 3.8 | $0.0003 | 651 |
| opus-4.8 | bedrock | 4884/4905 | 2.164 | 4.399 | ±2.71 | 15% | 93% | 5.62 | 140 | 10.5 | $0.0122 | 923 |
| gpt-5.4 | mantle_responses | 728/3508 | 2.180 | 10.458 | ±4.09 | 13% | 100% | 9.73 | 58 | 78.5 | — | 68 |
| haiku-4.5 (geo profile) | bedrock | 3919/4905 | 2.276 | 1.844 | ±7.79 | 4% | 100% | 4.03 | 159 | 7.7 | $0.0019 | 340 |
| sonnet-5 | bedrock | 1559/1560 | 2.634 | 6.019 | ±2.46 | 29% | 73% | 5.82 | 189 | 10.7 | $0.0089 | 375 |
| fable-5 (geo) | bedrock | 264/1365 | 3.395 | 5.534 | ±1.16 | 71% | 84% | 7.62 | 165 | 55.4 | $0.0154 | 101 |
| llama3.1-405b (direct) | bedrock | 327/4905 | 9.292 | 33.249 | ±11.92 | 79% | 97% | 9.37 | n/a | 2.1 | $0.0017 | 288 |
| Category | What it tests | Calls | TTFB avg (s) | Total avg (s) | Throughput (tok/s) |
|---|---|---|---|---|---|
| Trivial greeting | A bare 'Hi': pure latency probe, almost no generation. | 7335 | 1.620 | 1.84 | 229 |
| Factual recall | A one-word factual question: short, knowledge lookup. | 7335 | 1.591 | 1.68 | 253 |
| Math / reasoning | Multi-step word problems and arithmetic with working shown. | 14662 | 1.749 | 4.59 | 201 |
| Code generation & debugging | Write a function with tests; explain and fix buggy code. | 14653 | 1.782 | 6.46 | 181 |
| Creative writing | A short noir scene and a haiku: open-ended generation. | 14660 | 1.703 | 3.79 | 127 |
| Summarisation | Condense a paragraph into a fixed number of bullet points. | 7328 | 1.580 | 2.30 | 213 |
| Structured output | Return strict JSON only: tests format adherence. | 7327 | 1.487 | 2.35 | 256 |
| Instruction following | Exact-format constraints (casing, separators, no extra text). | 7325 | 1.313 | 1.53 | 220 |
| Long-form generation | A thorough multi-section technical explanation: sustained output. | 7311 | 1.623 | 13.22 | 141 |
| Large-context load | A large (~12k token) input the model must read before answering. Exposes queueing/throughput differences that small prompts hide, the regime where global vs geo vs in-region latency actually diverges (per empirical Bedrock data). | 4305 | 1.982 | 2.22 | 260 |
"Fastest model" = the model with the lowest mean TTFB for that specific prompt, across all regions/runs (green cell shows the model + its mean TTFB).
| Category | ID | Max tokens | Fastest model (TTFB) | Prompt |
|---|---|---|---|---|
| Trivial greeting | greeting | 64 | llama4-maverick geo 0.38s | Hi |
| Factual recall | factual_short | 32 | gpt-oss-20b 0.97s | What is the capital of Australia? Answer in one word. |
| Math / reasoning | reasoning_math | 600 | llama4-maverick geo 0.38s | A train leaves City A at 9:00am travelling at 60 mph. A second train leaves City B, 240 miles away, at 10:00am travelling at 80 mph toward City A. At what time do they meet? Show your working. |
| Math / reasoning | math_arithmetic | 300 | llama4-maverick geo 0.38s | Compute 47 * 89 + 1337 - 256 / 4. Show each step. |
| Code generation & debugging | code_generation | 800 | llama4-maverick geo 0.36s | Write a Python function `merge_intervals(intervals)` that merges overlapping intervals given as a list of [start, end] pairs. Include a docstring and a couple of inline tests. |
| Code generation & debugging | code_debug | 500 | llama4-maverick geo 0.37s | This Python is buggy: def avg(xs): return sum(xs) / len(xs) Explain the failure modes and give a corrected version. |
| Creative writing | creative_story | 500 | llama4-maverick geo 0.37s | Write a 150-word noir-style opening scene about a detective who only investigates crimes that happen in libraries. |
| Creative writing | creative_poem | 128 | llama4-maverick geo 0.37s | Write a haiku about distributed systems failing gracefully. |
| Summarisation | summarisation | 300 | llama4-maverick geo 0.39s | Summarise the following in exactly three bullet points: Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models from leading AI companies via a single API, along with capabilities to build generative AI applicatio… |
| Structured output | structured_json | 300 | nemotron-nano-12b 0.53s | Return ONLY valid JSON (no prose) describing three programming languages, each with keys: name, year_created, paradigm. |
| Instruction following | instruction_following | 128 | llama4-scout geo 0.37s | List the planets of the solar system in order from the sun. Output them comma-separated on a single line, all lowercase, no other text. |
| Long-form generation | long_generation | 1200 | llama4-maverick geo 0.41s | Explain how TCP congestion control works, covering slow start, congestion avoidance, fast retransmit and fast recovery. Aim for a thorough, well-structured explanation. |
| Large-context load | large_context | 80 | llama4-maverick geo 0.64s | Section 1. In the distributed ledger subsystem, node 1 maintains a replicated log with quorum size 4 and a heartbeat interval of 57 milliseconds; its committed offset is 1007 and its term number is 2. When a partition heals, node 1 reconciles by comparing v… |
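The "fastest model" rule used in this table (lowest mean TTFB per prompt, pooled across regions and runs) can be sketched as below. The record layout, function name and sample values are assumptions for illustration; only the prompt IDs come from the table.

```python
from collections import defaultdict
from statistics import fmean


def fastest_per_prompt(calls):
    """calls: iterable of (prompt_id, model, ttfb_s). Returns {prompt_id: (model, mean_ttfb_s)}."""
    samples = defaultdict(list)
    for prompt_id, model, ttfb in calls:
        samples[(prompt_id, model)].append(ttfb)

    best = {}
    for (prompt_id, model), vals in samples.items():
        mean = fmean(vals)
        # Keep the model with the lowest mean for each prompt.
        if prompt_id not in best or mean < best[prompt_id][1]:
            best[prompt_id] = (model, mean)
    return best


# Hypothetical measurements for two placeholder models.
winners = fastest_per_prompt([
    ("greeting", "model-a", 0.5),
    ("greeting", "model-a", 0.25),
    ("greeting", "model-b", 0.5),
    ("structured_json", "model-a", 1.0),
    ("structured_json", "model-b", 0.5),
])
print(winners["greeting"])         # ('model-a', 0.375)
print(winners["structured_json"])  # ('model-b', 0.5)
```

Because the mean is taken over every call, a model available in only a few fast regions can win a prompt without being the fastest everywhere.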
Fair comparison: only the 5 models that ran successfully in every region are pooled here, so a region isn't flattered or penalised by which models happened to be available. Lower TTFB is better.
| Region | Geo | Calls | TTFB avg (s) | Throughput (tok/s) |
|---|---|---|---|---|
| eu-west-2 | EU | 1409 | 1.490 | 145 |
| eu-central-1 | EU | 1410 | 1.521 | 144 |
| eu-west-3 | EU | 1410 | 1.578 | 146 |
| eu-west-1 | EU | 1408 | 1.618 | 142 |
| us-east-1 | US | 1404 | 1.641 | 141 |
| us-west-2 | US | 1412 | 1.681 | 141 |
| eu-north-1 | EU | 1407 | 1.685 | 139 |
| us-west-1 | US | 1412 | 1.703 | 146 |
| us-east-2 | US | 1411 | 1.745 | 145 |
| ap-northeast-2 | APJ | 1405 | 1.867 | 145 |
| ap-southeast-2 | APJ | 1411 | 1.878 | 143 |
| ap-south-1 | APJ | 1404 | 1.879 | 139 |
| ap-northeast-1 | APJ | 1410 | 1.922 | 144 |
| ap-northeast-3 | APJ | 1405 | 1.953 | 140 |
| ap-southeast-1 | APJ | 1405 | 1.956 | 142 |
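The pooling rule behind this table could be implemented roughly as follows. The sketch assumes "ran successfully in every region" means at least one successful call per region, and the record layout, function name and values are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def pooled_region_ttfb(calls):
    """calls: list of (model, region, ok, ttfb_s). Returns {region: mean_ttfb_s}."""
    regions = {region for _, region, _, _ in calls}

    # Regions in which each model had at least one successful call.
    ok_regions = defaultdict(set)
    for model, region, ok, _ in calls:
        if ok:
            ok_regions[model].add(region)
    everywhere = {m for m, seen in ok_regions.items() if seen == regions}

    # Pool only successful calls from models that succeeded in every region.
    pooled = defaultdict(list)
    for model, region, ok, ttfb in calls:
        if ok and model in everywhere:
            pooled[region].append(ttfb)
    return {region: fmean(vals) for region, vals in pooled.items()}


# Hypothetical data: model "b" is very fast in eu-west-2 but fails in us-east-1.
fair = pooled_region_ttfb([
    ("a", "eu-west-2", True, 1.0),
    ("a", "us-east-1", True, 2.0),
    ("b", "eu-west-2", True, 0.2),
    ("b", "us-east-1", False, 0.0),
])
print(fair)  # {'eu-west-2': 1.0, 'us-east-1': 2.0}
```

Dropping model "b" keeps its fast eu-west-2 result from flattering that region relative to one where it never succeeded.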
Every enabled model is attempted in every enabled region (a model×region cross-product) against the full prompt suite. Each prompt has a max_tokens cap so short tasks stay cheap and long tasks aren't truncated. Averages are over all matching calls; with --repeat N each combo is measured N times and every result is recorded (not best-of), so the sample counts shown are real.
Caveat: a single run is a snapshot — latencies vary with time of day, load and cold starts. Use --repeat and compare runs over time for stable figures.
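The run plan described above amounts to a model × region × prompt cross-product, repeated N times. A minimal sketch with `itertools.product` follows; the function name and tuple layout are hypothetical, the harness's real internals are not shown in this report, and the model, region and prompt values are a small subset taken from the tables above.

```python
from itertools import product


def plan_calls(models, regions, prompts, repeat=1):
    """Yield one (model, region, prompt_id, max_tokens, attempt) tuple per planned call.

    prompts maps prompt_id -> max_tokens cap.
    """
    for model, region, (prompt_id, max_tokens) in product(models, regions, prompts.items()):
        for attempt in range(repeat):
            yield model, region, prompt_id, max_tokens, attempt


plan = list(plan_calls(
    models=["haiku-4.5", "sonnet-4.6"],
    regions=["us-east-1", "eu-west-2"],
    prompts={"greeting": 64, "long_generation": 1200},
    repeat=3,
))
print(len(plan))  # 24 = 2 models x 2 regions x 2 prompts x 3 repeats
print(plan[0])    # ('haiku-4.5', 'us-east-1', 'greeting', 64, 0)
```

Every attempt appears as its own entry, which mirrors the statement that each result is recorded rather than reduced to a best-of.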