Paramjeet Pradhan

The problem

Running LLM inference at scale is not just an API problem — it's a systems problem. When you need to process millions of prompts, you hit three walls fast: memory (loading large batches), rate limits (provider TPM/RPM ceilings), and fault tolerance (one pod crash = lost work).

Most solutions are either a naive loop with a sleep, or a queue + worker pattern that still fails on large-scale restarts. I wanted something that could handle a 10M-prompt job with the same memory footprint as a 100-prompt job, recover from crashes in under 30 seconds, and push providers exactly as fast as they'll allow without hard-coding limits.

Architecture

The pipeline is event-driven:

User uploads a JSONL file to GCS.
Eventarc triggers a Cloud Run launcher.
Launcher provisions a dedicated GKE pod for the job.
Pod streams prompts from GCS one chunk at a time — memory stays constant.
Results write back to GCS as they complete.

One pod per job means zero cross-job interference. GKE handles pod scheduling; Pulumi handles the infrastructure as code.

The hard part: adaptive rate learning

Provider rate limits are not static. They vary by tier, model, time of day, and account history. Hard-coding RPM values means either leaving throughput on the table or getting 429s constantly.

I modeled the rate controller on TCP congestion control:

Slow start: Begin at RPM × 1.5 / 30s. Grow aggressively until the first 429.
Congestion avoidance: After a 429, halve the rate and switch to additive increase (+1 RPM per 30s window).
TPM ceiling: A rolling p95 token window caps the effective RPM so we never exceed the token-per-minute limit even when request count is low.

The system discovers the actual ceiling of the provider account automatically over the first few minutes of a job, then sustains that rate indefinitely. OTel gauges expose the rate-learning state in real time via Axiom.

Crash recovery

Every 30 seconds, the pod checkpoints its progress to GCS using a bitset — one bit per prompt, set when the result is written. If the pod dies, a new pod reads the bitset, skips completed prompts, and resumes from exactly where the last one left off. No duplicate processing, no lost work.

What I'd do differently

The current design provisions one pod per job, which means cold-start latency for small jobs (~15s). For jobs under ~1000 prompts, a shared worker pool would be faster. I'd add a job-size heuristic at the launcher level to route small jobs to a pool and large jobs to dedicated pods.

The rate-learning algorithm also doesn't account for provider-side burst capacity. Some APIs allow short bursts above the stated RPM. Detecting and exploiting burst windows would further improve throughput.

PromptForge

The problem

Architecture

The hard part: adaptive rate learning

Crash recovery

What I'd do differently