Why I modeled LLM rate limiting on TCP congestion control
June 2025
When I built PromptForge, I needed to push LLM API providers as fast as they would allow without getting rate limited into silence. The naive approach — read the docs, hard-code the RPM limit, add a sleep — has an obvious failure mode: the stated limit is often wrong. Providers throttle differently by tier, by model, by account age, and sometimes by time of day. The limit in the docs is a ceiling, not a guarantee.
I wanted a system that would discover the real limit automatically and then sustain it indefinitely. The more I thought about this, the more it sounded like a problem the networking community had already solved — in 1988, with TCP congestion control.
The TCP analogy
TCP's job is to saturate a network link without overwhelming it. It doesn't know the link's capacity in advance — it discovers it by sending data and watching what happens. When packets get through, it sends more. When packets drop, it backs off. Over time it converges on the actual throughput ceiling of the link.
LLM rate limiting is the same problem. The "link" is the provider's API. The "packets" are requests. "Packet drop" is a 429 response. The goal is to saturate the provider without tripping the rate limiter.
Slow start
TCP slow start begins at a small congestion window and doubles it every round-trip time until the first loss event. I adapted this for LLM requests:
- Start at RPM_stated × 1.5 / 30s, slightly above the stated limit to probe whether the actual ceiling is higher.
- Every 30 seconds without a 429: increase the rate by 50%.
- First 429: record the current rate as the ceiling estimate, halve the rate, and switch to congestion avoidance.
The 30-second window is important. LLM providers bucket their rate limits per minute (sometimes per 10 seconds), so probing faster than 30s gives noisy signal — you might hit a bucket boundary rather than a real limit.
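The slow-start rules above can be sketched in a few lines of Python. This is an illustrative sketch, not PromptForge's actual code: the class name SlowStart, its fields, and the on_window_end method are hypothetical, and it reads "RPM_stated × 1.5 / 30s" as a starting rate of 1.5 times the stated RPM, re-evaluated once per 30-second window.

```python
from dataclasses import dataclass
from typing import Optional

WINDOW_SECONDS = 30  # probing window described in the post


@dataclass
class SlowStart:
    """Hypothetical slow-start phase: grow 50% per clean window until the first 429."""

    rpm_stated: float
    rate: float = 0.0
    ceiling_estimate: Optional[float] = None
    mode: str = "slow_start"

    def __post_init__(self) -> None:
        # Start slightly above the documented limit to probe for headroom.
        self.rate = self.rpm_stated * 1.5

    def on_window_end(self, saw_429: bool) -> float:
        """Call once per window; returns the target rate (RPM) for the next window."""
        if self.mode != "slow_start":
            return self.rate  # another phase owns the rate from here on
        if saw_429:
            # First 429: remember where it happened, halve the rate, change phase.
            self.ceiling_estimate = self.rate
            self.rate = self.rate / 2
            self.mode = "congestion_avoidance"
        else:
            # Clean window: increase the rate by 50%.
            self.rate *= 1.5
        return self.rate
```

The caller is assumed to aggregate responses over each 30-second window and pass in a single boolean, so one stray 429 at a bucket boundary still counts as one loss event per window.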
Congestion avoidance
Once we've had a 429, we know roughly where the ceiling is. TCP switches to additive increase / multiplicative decrease (AIMD): increase the window by one segment per RTT (additive), halve it on loss (multiplicative).
My adaptation: after the rate halves, increase by 1 RPM per 30s window. This is more conservative than TCP's AIMD because provider 429s carry a backoff penalty: hitting one again within a short window often results in a longer cooldown than the first. So I want to approach the ceiling slowly the second time.
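As a rough sketch, the per-window update in congestion avoidance could look like the function below. The name aimd_step and the floor_rpm parameter are assumptions added for illustration; the post does not say whether the real controller enforces a minimum rate.

```python
def aimd_step(rate_rpm: float, saw_429: bool, floor_rpm: float = 1.0) -> float:
    """One 30-second window of additive increase / multiplicative decrease.

    floor_rpm is a hypothetical safety floor so repeated 429s never drive
    the rate to zero.
    """
    if saw_429:
        # Multiplicative decrease: halve the rate on a 429.
        return max(floor_rpm, rate_rpm / 2)
    # Additive increase: +1 RPM per clean window.
    return rate_rpm + 1.0


# Example trace over four windows: clean, clean, 429, clean.
rate = 100.0
for saw_429 in (False, False, True, False):
    rate = aimd_step(rate, saw_429)
```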
The TPM ceiling problem
RPM is only half the story. Providers also impose tokens-per-minute (TPM) limits, and a request that sends a 4000-token prompt counts much more than a 50-token one. You can stay under RPM and still get rate-limited on TPM.
I added a rolling p95 token window: every outgoing request records its token count with a timestamp. Every 30 seconds, the controller computes the p95 token count per request in the last window and estimates the effective RPM ceiling implied by the TPM limit:
effective_rpm = TPM_limit / p95_tokens_per_request
The actual rate is capped at min(rpm_ceiling_estimate, effective_rpm). This means the controller self-adjusts for heavy prompts: if you switch from short prompts to long ones mid-job, the TPM ceiling kicks in automatically.
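A minimal sketch of the rolling token window and the resulting cap, under some assumptions: the names TokenWindow and capped_rate are hypothetical, the p95 uses the nearest-rank method (the post does not say which percentile definition is used), and an empty window falls back to the RPM ceiling alone.

```python
import time
from collections import deque
from typing import Deque, Optional, Tuple


class TokenWindow:
    """Rolling record of per-request token counts (hypothetical helper)."""

    def __init__(self, window_seconds: float = 30.0) -> None:
        self.window_seconds = window_seconds
        self._samples: Deque[Tuple[float, int]] = deque()  # (timestamp, tokens)

    def record(self, tokens: int, now: Optional[float] = None) -> None:
        ts = time.monotonic() if now is None else now
        self._samples.append((ts, tokens))

    def p95(self, now: Optional[float] = None) -> Optional[int]:
        ts = time.monotonic() if now is None else now
        # Drop samples that have aged out of the window.
        while self._samples and self._samples[0][0] < ts - self.window_seconds:
            self._samples.popleft()
        if not self._samples:
            return None
        counts = sorted(tokens for _, tokens in self._samples)
        # Nearest-rank p95 with integer math: ceil(0.95 * n) - 1.
        index = (95 * len(counts) + 99) // 100 - 1
        return counts[index]


def capped_rate(rpm_ceiling_estimate: float, tpm_limit: float,
                p95_tokens: Optional[int]) -> float:
    """Apply min(rpm_ceiling_estimate, effective_rpm) from the formula above."""
    if not p95_tokens:
        return rpm_ceiling_estimate  # no token data yet: RPM ceiling only
    effective_rpm = tpm_limit / p95_tokens
    return min(rpm_ceiling_estimate, effective_rpm)
```

With an illustrative TPM limit of 400,000, a p95 of 50 tokens leaves the RPM ceiling in charge, while a p95 of 4,000 tokens pulls the cap down to 100 RPM.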
OTel observability
The rate-learning state is exposed as OTel gauges: current rate, ceiling estimate, slow-start vs. congestion-avoidance mode, p95 token count, 429 count per window. These feed into Axiom dashboards so you can watch the controller converge on the ceiling in real time. The first few minutes of a big job look like a sawtooth — rate climbs, hits a 429, drops, climbs again — and then it plateaus at the actual ceiling and stays there.
Results
On a job with 50K prompts against a tier-1 OpenAI account, the controller converged on the actual ceiling (about 3,400 RPM, not the stated 3,500) within 4 minutes and sustained it for the rest of the job with a 429 rate under 0.3%. A hard-coded 3,500 RPM would have produced roughly 15-20% 429s and constant retry overhead.
What I got wrong first
My first version used exponential backoff on 429s — standard stuff. It worked but it was too conservative. After a 429, exponential backoff would back off to something like 60% of the current rate, then 30%, then 15%, and the job would crawl for minutes before recovering. The TCP model is more aggressive about recovery, which is the right call when the 429 penalty is known and bounded.
I also initially used per-request timers instead of 30-second windows. This was wrong — individual request latency varies too much (streaming responses, network jitter) to give a clean signal. Windowed aggregation is the right abstraction.
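To make the windowing idea concrete, here is a small sketch that buckets responses into fixed 30-second windows and counts 429s per window. The function name and the (timestamp, status) event shape are assumptions for illustration, not the controller's real data model.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_429s_per_window(events: Iterable[Tuple[float, int]],
                          window_seconds: float = 30.0) -> Dict[int, int]:
    """Map window index -> number of 429 responses seen in that window.

    events: (timestamp_seconds, http_status) pairs, in any order.
    """
    counts: Counter = Counter()
    for timestamp, status in events:
        if status == 429:
            # Integer window index: 0 covers [0, 30), 1 covers [30, 60), ...
            counts[int(timestamp // window_seconds)] += 1
    return dict(counts)


events = [(1.0, 200), (12.5, 429), (29.9, 429), (30.0, 200), (61.0, 429)]
per_window = count_429s_per_window(events)
```

The controller then only needs one signal per window (did any 429 occur?), which is insensitive to how long any individual request took.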