Scope: all runs (full history)
Window: — → —. All figures are averages over every matching call (sample counts shown per row / in cell tooltips).
This measures speed, not answer quality. Rankings reward whichever model returns tokens fastest — usually the smallest. Use them to pick the fastest model at an acceptable quality tier, paired with your own quality judgement (e.g. Opus/Sonnet for hard work, a small model for high-volume simple turns). Chat ranks by time-to-first-token; the rest by total wall-clock time.
Short interactive turns. Dominated by TTFB — how fast the first token arrives.
Code generation and structured output. Ranked by total time to a complete answer — the real wait, folding in first-token latency and generation speed.
Essays, stories, summaries. Ranked by total time to finish: a model with blistering throughput but a slow first token is not actually fast here.
Multi-step reasoning. Total time to a complete answer.
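As a rough illustration of the ranking rule above, the sketch below sorts models by TTFB for chat and by total time for everything else. The `rank_models` helper and the category keys other than chat are hypothetical names, not the dashboard's own. The sample rows reuse overall averages from the per-model table purely as example inputs.

```python
# Hypothetical category keys; only "chat" is named in the text above.
RANK_METRIC = {
    "chat": "ttfb_s",        # time-to-first-token
    "code": "total_s",       # total wall-clock time
    "writing": "total_s",
    "reasoning": "total_s",
}


def rank_models(stats, category):
    """Return model names sorted fastest-first for the given category."""
    metric = RANK_METRIC[category]
    return sorted(stats, key=lambda name: stats[name][metric])


# Example inputs (overall averages, used here only for illustration).
stats = {
    "llama4-scout (geo)": {"ttfb_s": 0.259, "total_s": 1.69},
    "nova-micro (direct)": {"ttfb_s": 0.263, "total_s": 0.97},
    "haiku-4.5": {"ttfb_s": 1.166, "total_s": 2.87},
}
print(rank_models(stats, "chat"))
print(rank_models(stats, "code"))
```

With these inputs the two rankings differ at the top, which is why the metric is chosen per category rather than globally.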
Time until the first token arrives — the latency you feel in a chat box. Lower is better. Bars green→red = best→worst.
Sustained generation speed once streaming starts. Higher is better. Buffered (non-streaming) models excluded — their throughput isn't client-measurable.
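A minimal sketch of how these two figures could be derived on the client from token arrival timestamps. It assumes throughput is counted over the streaming window only (first token to last token); the benchmark's exact formula is not stated here, and `summarise_stream` and `StreamTiming` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StreamTiming:
    ttfb_s: float
    total_s: float
    throughput_tok_s: Optional[float]  # None when not measurable
    inter_token_ms: Optional[float]


def summarise_stream(request_start: float, token_times: List[float]) -> StreamTiming:
    """Derive latency figures from the arrival time of each streamed token."""
    if not token_times:
        raise ValueError("no tokens received")
    first, last = token_times[0], token_times[-1]
    ttfb = first - request_start
    total = last - request_start
    window = last - first          # time spent streaming after the first token
    gaps = len(token_times) - 1    # number of inter-token intervals
    if gaps < 1 or window <= 0:
        # Buffered response: everything arrives at once, so there is
        # no client-side generation window to divide by.
        return StreamTiming(ttfb, total, None, None)
    return StreamTiming(ttfb, total, gaps / window, 1000 * window / gaps)


streamed = summarise_stream(0.0, [0.5, 0.6, 0.7, 0.8, 0.9])
buffered = summarise_stream(0.0, [2.0, 2.0, 2.0])
print(streamed)
print(buffered)
```

The buffered case returns `None` for throughput rather than a huge or infinite number, which mirrors the exclusion described above.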
Same model, different regional endpoint = pure routing/geo latency. Pick a model to see its TTFB in every region where it is available, sorted fastest first (green→red). Lower is better. The full heat matrix for all models is below (mean TTFB in seconds; ✗ = no result for that model in that region).
| Model | ap-northeast-1 | ap-northeast-2 | ap-northeast-3 | ap-south-1 | ap-southeast-1 | ap-southeast-2 | eu-central-1 | eu-north-1 | eu-west-1 | eu-west-2 | eu-west-3 | us-east-1 | us-east-2 | us-west-1 | us-west-2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| llama4-scout (geo) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 0.27 | 0.26 | 0.27 | 0.24 |
| nova-micro (direct) | ✗ | ✗ | ✗ | ✗ | ✗ | 0.33 | ✗ | ✗ | ✗ | 0.21 | ✗ | 0.25 | ✗ | ✗ | ✗ |
| llama4-maverick (geo) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 0.28 | 0.32 | 0.30 | 0.26 |
| nemotron-nano-12b (direct) | 0.39 | ✗ | ✗ | 0.35 | ✗ | 0.39 | ✗ | ✗ | 0.36 | 0.36 | ✗ | 0.38 | 0.37 | ✗ | 0.40 |
| nova-2-lite (geo profile) | 0.33 | ✗ | ✗ | ✗ | ✗ | ✗ | 0.39 | 0.41 | 0.39 | ✗ | 0.42 | 0.41 | 0.43 | 0.39 | 0.42 |
| nemotron-nano-9b (direct) | 0.46 | ✗ | ✗ | 0.41 | ✗ | 0.49 | ✗ | ✗ | 0.48 | 0.64 | ✗ | 0.44 | 0.40 | ✗ | 0.55 |
| llama3.3-70b (geo) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 0.51 | 0.50 | ✗ | 0.49 |
| nova-2-lite | 0.50 | 0.53 | ✗ | 0.63 | 0.60 | 0.54 | 0.53 | 0.60 | 0.49 | 0.54 | 0.54 | 0.43 | 0.41 | 0.40 | 0.37 |
| deepseek-r1 (geo) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 0.42 | 0.77 | ✗ | 0.53 |
| haiku-4.5 (geo profile) | 0.77 | ✗ | 0.73 | ✗ | ✗ | 0.66 | 0.85 | 0.87 | 0.85 | 0.88 | 0.87 | 0.90 | 0.86 | 0.82 | 0.85 |
| nemotron-nano3-30b (direct) | 0.41 | ✗ | ✗ | 0.40 | ✗ | 0.35 | ✗ | ✗ | 0.42 | 0.48 | ✗ | 5.15 | 0.35 | ✗ | 0.39 |
| nemotron-super3-120b (direct) | ✗ | ✗ | ✗ | ✗ | ✗ | 2.59 | 0.46 | ✗ | 0.61 | 0.48 | ✗ | 1.63 | 0.42 | ✗ | 1.06 |
| haiku-4.5 | 0.99 | 0.98 | 1.05 | 1.11 | 0.99 | 1.03 | 2.12 | 1.20 | 1.11 | 1.19 | 1.25 | 1.12 | 1.08 | 1.05 | 1.23 |
| sonnet-4.6 | 1.21 | 1.27 | 1.20 | 1.27 | 1.23 | 1.25 | 1.20 | 1.31 | 1.39 | 1.27 | 1.34 | 1.61 | 1.24 | 1.18 | 1.33 |
| opus-4.7 | 1.43 | 1.40 | 1.36 | 1.49 | 1.29 | 1.26 | 1.48 | 1.58 | 1.61 | 1.58 | 1.65 | 1.44 | 1.40 | 1.51 | 1.39 |
| opus-4.8 | 2.20 | 2.24 | 2.42 | 2.02 | 2.59 | 2.37 | 2.02 | 2.11 | 2.02 | 2.05 | 2.19 | 2.36 | 2.02 | 2.28 | 2.10 |
| sonnet-5 | 2.50 | 2.51 | 2.69 | 2.88 | 2.61 | 2.76 | 2.88 | 2.44 | 2.36 | 2.63 | 2.95 | 2.37 | 2.45 | 2.68 | 2.76 |
| fable-5 (geo) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 3.41 | 3.49 | 3.56 | 3.54 |
| deepseek-v3.2 (direct) | 1.10 | ✗ | ✗ | 0.57 | ✗ | 2.54 | ✗ | 0.77 | ✗ | 24.56 | ✗ | 0.69 | 0.89 | ✗ | 0.74 |
| llama3.1-405b (direct) | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 7.44 |
| gpt-5.4 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| gpt-5.5 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| gpt-oss-120b | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| gpt-oss-20b | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| grok-4.3 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
TTFB per run for the fastest models: solid line = median (p50), shaded band = p5–p95 spread. A widening band means the tail is degrading even if the median looks fine, which is the same signal that fire rate captures. Each point is one benchmark run.
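The band in this chart can be sketched as three percentiles per run. The example below uses linear interpolation between sorted samples, which is one common percentile convention and an assumption here; `percentile` and `run_band` are hypothetical helper names.

```python
def percentile(values, q):
    """Linear-interpolation percentile of values, with q in [0, 100]."""
    xs = sorted(values)
    if not xs:
        raise ValueError("need at least one sample")
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def run_band(ttfb_samples):
    """Return (p5, p50, p95) for one benchmark run."""
    return tuple(percentile(ttfb_samples, q) for q in (5, 50, 95))


# Two illustrative runs with the same median but a different tail.
steady = [0.5] * 19 + [0.6]
degraded = [0.5] * 19 + [3.0]
for name, run in (("steady", steady), ("degraded", degraded)):
    p5, p50, p95 = run_band(run)
    print(name, "median", p50, "band width", round(p95 - p5, 3))
```

Both runs report the same median, but the second has a wider band, which is the tail degradation the chart is meant to expose.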
These three scopes are not interchangeable.
in-region = model served in the region you called (true residency).
geo = a regional inference profile (us./eu./…) routing within a geography.
global = the cross-region router. Empirically (months of 5-min sampling), global is a latency tax, not a win: it queues rather than rerouting to free capacity, so it runs slower than direct profiles and degrades hardest under load.
Compared like-for-like over the base models present in more than one scope: haiku-4.5, nova-2-lite. Lower is better.
| Scope | Calls | TTFB avg (s) | TTFB p95 (s) | Fire rate |
|---|---|---|---|---|
| geo | 1373 | 0.643 | 1.105 | 0% |
| global | 1893 | 0.848 | 1.440 | 0% |
Same comparison broken out by region (mean TTFB in seconds; · = no data). Hover a cell for call count and fire count.
| Scope | ap-northeast-1 | ap-northeast-2 | ap-northeast-3 | ap-south-1 | ap-southeast-1 | ap-southeast-2 | eu-central-1 | eu-north-1 | eu-west-1 | eu-west-2 | eu-west-3 | us-east-1 | us-east-2 | us-west-1 | us-west-2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| geo | 0.55 | · | 0.73 | · | · | 0.66 | 0.62 | 0.64 | 0.62 | 0.88 | 0.65 | 0.65 | 0.64 | 0.61 | 0.64 |
| global | 0.75 | 0.76 | 1.05 | 0.87 | 0.80 | 0.78 | 1.32 | 0.90 | 0.80 | 0.87 | 0.89 | 0.78 | 0.75 | 0.72 | 0.80 |
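The like-for-like rule used in the scope comparison above (only base models present in more than one scope are pooled) could be expressed roughly as follows. The record layout and the `scope_means` name are assumptions for illustration, and the sample values are made up.

```python
from collections import defaultdict
from statistics import mean


def scope_means(calls):
    """calls: iterable of (base_model, scope, ttfb_s) tuples.

    Returns mean TTFB per scope, restricted to base models that were
    measured in at least two scopes.
    """
    calls = list(calls)
    scopes_by_model = defaultdict(set)
    for base, scope, _ in calls:
        scopes_by_model[base].add(scope)
    shared = {m for m, scopes in scopes_by_model.items() if len(scopes) > 1}

    pooled = defaultdict(list)
    for base, scope, ttfb in calls:
        if base in shared:
            pooled[scope].append(ttfb)
    return {scope: mean(vals) for scope, vals in pooled.items()}


# Illustrative values only; sonnet-4.6 appears in one scope and is dropped.
sample = [
    ("haiku-4.5", "geo", 0.8), ("haiku-4.5", "global", 1.2),
    ("nova-2-lite", "geo", 0.4), ("nova-2-lite", "global", 0.6),
    ("sonnet-4.6", "global", 1.3),
]
print(scope_means(sample))
```

Dropping single-scope models keeps a slow model that only exists in one scope from skewing that scope's average.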
Mean TTFB and fire rate by UTC hour, across all runs. Bedrock capacity shifts with the clock — empirically the EU window (UTC ~15:00–03:00) runs hotter than the US window. This fills in as scheduled runs (00/06/12/18) accumulate; with runs clustered in one window it'll look flat. Lower is better.
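A small sketch of the hourly bucketing behind this view, covering mean TTFB only. It assumes each call carries a timezone-aware timestamp; `mean_ttfb_by_utc_hour` is a hypothetical name.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import mean


def mean_ttfb_by_utc_hour(samples):
    """samples: iterable of (aware_datetime, ttfb_s). Returns {utc_hour: mean}."""
    buckets = defaultdict(list)
    for ts, ttfb in samples:
        if ts.tzinfo is None:
            raise ValueError("timestamps must be timezone-aware")
        buckets[ts.astimezone(timezone.utc).hour].append(ttfb)
    return {hour: mean(vals) for hour, vals in sorted(buckets.items())}


cest = timezone(timedelta(hours=2))
samples = [
    (datetime(2024, 6, 1, 0, 5, tzinfo=timezone.utc), 0.9),
    (datetime(2024, 6, 1, 2, 10, tzinfo=cest), 1.1),   # 00:10 UTC
    (datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc), 0.6),
]
print(mean_ttfb_by_utc_hour(samples))
```

Hours with no runs are simply absent from the result, which matches the note that the view fills in as scheduled runs accumulate.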
| Model | Provider | OK (n) | TTFB avg (s) | TTFB p95 (s) | TTFB σ | Fire rate | Correct | Total avg (s) | Throughput (tok/s) | Inter-token (ms) | Cost/call | Buffered |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| llama4-scout (geo) | bedrock | 262/979 | 0.259 | 0.460 | ±0.10 | 0% | 100% | 1.69 | 174 | 6.6 | — | 21 |
| nova-micro (direct) | bedrock | 197/979 | 0.263 | 0.738 | ±0.19 | 0% | 81% | 0.97 | 432 | 2.8 | — | 17 |
| llama4-maverick (geo) | bedrock | 262/979 | 0.288 | 0.529 | ±0.16 | 0% | 100% | 1.65 | 197 | 6.0 | — | 21 |
| nemotron-nano-12b (direct) | bedrock | 524/979 | 0.375 | 0.633 | ±0.13 | 0% | 85% | 1.62 | 184 | 5.7 | — | 204 |
| nova-2-lite (geo profile) | bedrock | 589/979 | 0.399 | 0.636 | ±0.23 | 0% | 86% | 1.66 | 269 | 4.4 | — | 97 |
| nemotron-nano-9b (direct) | bedrock | 524/979 | 0.483 | 0.732 | ±0.37 | 1% | 14% | 3.22 | 150 | 7.1 | — | 52 |
| llama3.3-70b (geo) | bedrock | 197/979 | 0.500 | 1.626 | ±0.55 | 1% | 98% | 2.51 | 129 | 9.8 | — | 16 |
| nova-2-lite | bedrock | 914/979 | 0.508 | 0.800 | ±0.17 | 0% | 83% | 1.72 | 277 | 4.2 | — | 151 |
| deepseek-r1 (geo) | bedrock | 197/979 | 0.573 | 3.707 | ±1.43 | 8% | 0% | 3.17 | n/a | 2.5 | — | 172 |
| haiku-4.5 (geo profile) | bedrock | 784/979 | 0.826 | 1.194 | ±0.24 | 0% | 100% | 2.53 | 157 | 7.9 | $0.0023 | 66 |
| nemotron-nano3-30b (direct) | bedrock | 515/979 | 0.935 | 0.713 | ±6.39 | 1% | 64% | 2.52 | 208 | 6.1 | — | 216 |
| nemotron-super3-120b (direct) | bedrock | 411/979 | 0.989 | 1.942 | ±5.04 | 3% | 68% | 3.37 | 143 | 8.9 | — | 162 |
| haiku-4.5 | bedrock | 979/979 | 1.166 | 1.704 | ±1.94 | 1% | 100% | 2.87 | 162 | 7.5 | $0.0023 | 81 |
| sonnet-4.6 | bedrock | 979/979 | 1.289 | 2.318 | ±1.04 | 3% | 100% | 5.74 | 94 | 15.2 | $0.0072 | 78 |
| opus-4.7 | bedrock | 979/979 | 1.459 | 2.331 | ±0.44 | 2% | 91% | 4.96 | 182 | 10.6 | $0.0146 | 156 |
| opus-4.8 | bedrock | 979/979 | 2.199 | 5.406 | ±1.46 | 19% | 96% | 5.69 | 137 | 11.4 | $0.0145 | 168 |
| sonnet-5 | bedrock | 978/979 | 2.630 | 6.573 | ±1.81 | 31% | 73% | 5.88 | 192 | 10.7 | $0.0089 | 209 |
| fable-5 (geo) | bedrock | 222/979 | 3.500 | 6.269 | ±1.32 | 76% | 79% | 7.94 | 163 | 42.0 | $0.0153 | 79 |
| deepseek-v3.2 (direct) | bedrock | 521/979 | 3.975 | 23.566 | ±10.21 | 17% | 94% | 10.43 | 55 | 23.2 | — | 150 |
| llama3.1-405b (direct) | bedrock | 65/979 | 7.438 | 28.007 | ±7.22 | 77% | 97% | 7.45 | n/a | 0.0 | — | 65 |
| Category | What it tests | Calls | TTFB avg (s) | Total avg (s) | Throughput (tok/s) |
|---|---|---|---|---|---|
| Trivial greeting | A bare 'Hi': pure latency probe, almost no generation. | 1925 | 1.018 | 1.19 | 232 |
| Factual recall | A one-word factual question: a short knowledge lookup. | 1925 | 1.065 | 1.12 | 284 |
| Math / reasoning | Multi-step word problems and arithmetic with working shown. | 3750 | 1.293 | 4.37 | 185 |
| Code generation & debugging | Write a function with tests; explain and fix buggy code. | 3750 | 1.238 | 6.12 | 160 |
| Creative writing | A short noir scene and a haiku: open-ended generation. | 3750 | 1.327 | 3.52 | 108 |
| Summarisation | Condense a paragraph into a fixed number of bullet points. | 1875 | 1.405 | 2.21 | 185 |
| Structured output | Return strict JSON only; tests format adherence. | 1875 | 1.386 | 2.31 | 224 |
| Instruction following | Exact-format constraints (casing, separators, no extra text). | 1875 | 1.313 | 1.49 | 213 |
| Long-form generation | A thorough multi-section technical explanation: sustained output. | 1875 | 1.673 | 13.54 | 130 |
| Large-context load | A large (~12k token) input the model must read before answering. Exposes queueing/throughput differences that small prompts hide; this is the regime where global vs geo vs in-region latency actually diverges (per empirical Bedrock data). | 1875 | 2.040 | 2.25 | 271 |
"Fastest model" = the model with the lowest mean TTFB for that specific prompt, across all regions/runs (green cell shows the model + its mean TTFB).
| Category | ID | Max tokens | Fastest model (TTFB) | Prompt |
|---|---|---|---|---|
| Trivial greeting | greeting | 64 | llama4-maverick geo 0.24s | Hi |
| Factual recall | factual_short | 32 | sonnet-5 2.10s | What is the capital of Australia? Answer in one word. |
| Math / reasoning | reasoning_math | 600 | nova-micro 0.24s | A train leaves City A at 9:00am travelling at 60 mph. A second train leaves City B, 240 miles away, at 10:00am travelling at 80 mph toward City A. At what time do they meet? Show your working. |
| Math / reasoning | math_arithmetic | 300 | llama4-scout geo 0.24s | Compute 47 * 89 + 1337 - 256 / 4. Show each step. |
| Code generation & debugging | code_generation | 800 | llama4-scout geo 0.23s | Write a Python function `merge_intervals(intervals)` that merges overlapping intervals given as a list of [start, end] pairs. Include a docstring and a couple of inline tests. |
| Code generation & debugging | code_debug | 500 | llama4-scout geo 0.25s | This Python is buggy: def avg(xs): return sum(xs) / len(xs) Explain the failure modes and give a corrected version. |
| Creative writing | creative_story | 500 | llama4-scout geo 0.22s | Write a 150-word noir-style opening scene about a detective who only investigates crimes that happen in libraries. |
| Creative writing | creative_poem | 128 | nova-micro 0.19s | Write a haiku about distributed systems failing gracefully. |
| Summarisation | summarisation | 300 | nova-micro 0.20s | Summarise the following in exactly three bullet points: Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models from leading AI companies via a single API, along with capabilities to build generative AI applicatio… |
| Structured output | structured_json | 300 | nova-micro 0.19s | Return ONLY valid JSON (no prose) describing three programming languages, each with keys: name, year_created, paradigm. |
| Instruction following | instruction_following | 128 | nova-micro 0.27s | List the planets of the solar system in order from the sun. Output them comma-separated on a single line, all lowercase, no other text. |
| Long-form generation | long_generation | 1200 | nova-micro 0.19s | Explain how TCP congestion control works, covering slow start, congestion avoidance, fast retransmit and fast recovery. Aim for a thorough, well-structured explanation. |
| Large-context load | large_context | 80 | llama4-scout geo 0.46s | Section 1. In the distributed ledger subsystem, node 1 maintains a replicated log with quorum size 4 and a heartbeat interval of 57 milliseconds; its committed offset is 1007 and its term number is 2. When a partition heals, node 1 reconciles by comparing v… |
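The "fastest model" rule for the table above (lowest mean TTFB per prompt, pooled across regions and runs) can be sketched as an argmin over per-model means. The tuple layout and `fastest_per_prompt` are hypothetical, and the sample values are illustrative.

```python
from collections import defaultdict
from statistics import mean


def fastest_per_prompt(calls):
    """calls: iterable of (prompt_id, model, ttfb_s).

    Returns {prompt_id: (model, mean_ttfb)} for the model with the lowest
    mean TTFB on each prompt.
    """
    samples = defaultdict(list)
    for prompt_id, model, ttfb in calls:
        samples[(prompt_id, model)].append(ttfb)

    best = {}
    for (prompt_id, model), vals in samples.items():
        avg = mean(vals)
        if prompt_id not in best or avg < best[prompt_id][1]:
            best[prompt_id] = (model, avg)
    return best


# Model "a" has the single fastest call, but "b" wins on the mean.
calls = [
    ("greeting", "a", 0.30), ("greeting", "a", 0.50),
    ("greeting", "b", 0.35), ("greeting", "b", 0.35),
]
print(fastest_per_prompt(calls))
```

Ranking on the mean rather than the best single call means one lucky response does not decide the winner.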
Fair comparison: only the 5 models that ran successfully in every region are pooled here, so a region isn't flattered or penalised by which models happened to be available. Lower TTFB is better.
| Region | Geo | Calls | TTFB avg (s) | Throughput (tok/s) |
|---|---|---|---|---|
| us-east-2 | US | 325 | 1.639 | 166 |
| ap-northeast-1 | APJ | 325 | 1.666 | 144 |
| ap-northeast-2 | APJ | 325 | 1.679 | 151 |
| eu-west-1 | EU | 335 | 1.698 | 157 |
| eu-north-1 | EU | 325 | 1.729 | 154 |
| ap-southeast-2 | APJ | 325 | 1.734 | 149 |
| us-west-1 | US | 325 | 1.740 | 143 |
| ap-southeast-1 | APJ | 325 | 1.743 | 151 |
| ap-northeast-3 | APJ | 325 | 1.745 | 152 |
| eu-west-2 | EU | 325 | 1.745 | 163 |
| ap-south-1 | APJ | 324 | 1.750 | 148 |
| us-west-2 | US | 325 | 1.765 | 148 |
| us-east-1 | US | 335 | 1.779 | 151 |
| eu-west-3 | EU | 325 | 1.876 | 159 |
| eu-central-1 | EU | 325 | 1.939 | 141 |
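The fair-comparison filter for the region table above amounts to keeping only models with a successful call in every region before averaging. A minimal sketch, with hypothetical function names and made-up sample values:

```python
from collections import defaultdict
from statistics import mean


def models_in_every_region(ok_calls, regions):
    """ok_calls: iterable of (model, region, ttfb_s) for successful calls."""
    seen = defaultdict(set)
    for model, region, _ in ok_calls:
        seen[model].add(region)
    required = set(regions)
    return {model for model, got in seen.items() if got >= required}


def pooled_region_ttfb(ok_calls, regions):
    """Mean TTFB per region over models that succeeded in all regions."""
    ok_calls = list(ok_calls)
    keep = models_in_every_region(ok_calls, regions)
    buckets = defaultdict(list)
    for model, region, ttfb in ok_calls:
        if model in keep:
            buckets[region].append(ttfb)
    return {region: mean(vals) for region, vals in buckets.items()}


regions = ["us-east-1", "eu-west-1"]
ok_calls = [
    ("haiku-4.5", "us-east-1", 1.0), ("haiku-4.5", "eu-west-1", 1.2),
    ("llama4-scout (geo)", "us-east-1", 0.2),  # US only, so excluded
]
print(pooled_region_ttfb(ok_calls, regions))
```

Without the filter, us-east-1 would look faster here only because a fast US-only model is counted there and nowhere else.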
Every enabled model is attempted in every enabled region (a model×region cross-product) against the full prompt suite. Each prompt has a max_tokens cap so short tasks stay cheap and long tasks aren't truncated. Averages are over all matching calls; with --repeat N each combo is measured N times and every result is recorded (not best-of), so the sample counts shown are real.
Caveat: a single run is a snapshot — latencies vary with time of day, load and cold starts. Use --repeat and compare runs over time for stable figures.
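The model×region cross-product with `--repeat N` described above could be planned along these lines. This is a sketch, not the benchmark's actual code: `plan_calls` is a hypothetical name, and the two prompts with their caps are taken from the prompt table as examples.

```python
from itertools import product


def plan_calls(models, regions, prompts, repeat=1):
    """Yield one (model, region, prompt_id, max_tokens, attempt) per call.

    prompts maps prompt_id -> max_tokens cap. Every attempt is yielded so
    each result can be recorded individually (not best-of).
    """
    for model, region, (prompt_id, max_tokens) in product(
        models, regions, prompts.items()
    ):
        for attempt in range(repeat):
            yield model, region, prompt_id, max_tokens, attempt


prompts = {"greeting": 64, "long_generation": 1200}
plan = list(plan_calls(
    ["haiku-4.5", "nova-2-lite"],
    ["us-east-1", "eu-west-1", "ap-south-1"],
    prompts,
    repeat=2,
))
print(len(plan), "calls planned")
print(plan[0])
```

The call count grows as models × regions × prompts × repeat, so the per-prompt caps matter for keeping a full run affordable.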