GCP Cloud Run returning 429 “Out of Instances” during spikes, and the instance concurrency + queueing strategy that stopped the throttling

November 22, 2025 by Andrew Smith

Imagine this: one fine morning, your app goes viral. 🚀 People are flooding in. But instead of celebrating, you’re staring at a screen full of scary 429 “Out of Instances” errors coming from Cloud Run on Google Cloud Platform (GCP). Yikes!

TLDR:

GCP Cloud Run can return 429 errors during traffic spikes if it can’t spin up new instances fast enough. This happens when you’ve hit your max instance limit or scaling lags behind demand. The fix? Raise per-instance concurrency and add smart request queueing. These changes help Cloud Run absorb spikes without panicking.

What is Cloud Run?

Cloud Run is Google’s serverless platform for running containerized apps. It’s easy to use. You send it a Docker container. It scales things up or down. You only pay for what you use. Sounds like magic, right?

Magic, until it’s not. When traffic surges faster than it can scale, it throws its hands up and gives you a:

429: Too Many Requests - Out of Instances

What Just Happened?

Cloud Run starts one or more instances of your container when requests come in. But starting instances isn’t instant—it takes time. Meanwhile, requests keep pouring in.

If all your running instances are busy, and Cloud Run can’t start a new one fast enough (or you’ve hit your instance cap), it starts rejecting incoming traffic.

Why 429s Hurt

  • Your users see errors
  • Your API clients start retrying (making things worse)
  • Search engines crawl your site less
  • It just feels… bad

Basically: traffic spikes are great ✨, unless your backend throws a tantrum.

What We Saw

We launched a campaign. Suddenly, we hit 15x normal traffic. Requests per second skyrocketed. Within seconds, error logs flooded with 429s. We were losing traffic, fast.

Logs said: “Out of instances”. We thought Cloud Run would auto-scale—but we missed something important.

Digging Into the Culprit

Our Cloud Run service had:

  • Max Instances: 100
  • Concurrency: 1

That means: only one request per instance, and up to 100 instances. So the most traffic we could handle at once was… 100 concurrent requests. That’s it.

Our app could serve multiple concurrent requests per container. But we didn’t tell Cloud Run that. So it kept spinning up more and more instances instead of reusing the ones it already had.

Concurrency to the Rescue

Cloud Run has a concurrency setting: the number of requests a single instance is allowed to handle at the same time.

We bumped ours up from 1 to 10. And boom 💥—each instance could now process 10x more traffic before Cloud Run had to spin up new ones.
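
For the record, this is roughly how that change looks with the gcloud CLI (my-service and the region are placeholders for your own values):

    gcloud run services update my-service \
      --concurrency=10 \
      --region=us-central1

The same --concurrency flag also works on gcloud run deploy, so you can bake it into your deployment pipeline.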

Benefits:

  • Fewer instances to start = faster scaling
  • Less cold-start lag
  • Cheaper! (fewer instances running means less cost)

But Wait… There’s Queueing

Increasing concurrency helped a lot. But we still needed a plan for those *super quick* spikes that Cloud Run could fumble.

Luckily, Cloud Run doesn’t immediately throw away traffic. If all instances are at capacity, it briefly holds incoming requests to see if an instance frees up or a new one starts. This is internal buffering.

But this buffering is pretty short, a few seconds at most. So we added our own lightweight request queue inside the application itself.

How Our Queueing Strategy Worked:

  1. Incoming requests hit our app
  2. If too many arrived at once, we added them to a short-lived in-memory queue
  3. The app fetched requests from the queue as threads became free

This smooths the spike. Instead of rejecting new requests immediately, we gave them a short window (about 1-2 seconds) to wait for an open thread.

Think of it like a fast-food line: If there are 10 cashiers and 30 people arrive at once, you don’t slam the door. You get them to wait a few seconds for their turn.

We also set a timeout: if a request couldn’t be picked up within 2 seconds, only then was it rejected. That way, we kept a good user experience without overloading our app.
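
Here’s a minimal sketch of that idea in Python, assuming a threaded service. The names and numbers (process, handle, the limits) are illustrative, not our exact production code:

    import threading

    MAX_CONCURRENT = 10    # requests one instance works on at once
    QUEUE_TIMEOUT_S = 2.0  # how long a request may wait for a free slot

    slots = threading.Semaphore(MAX_CONCURRENT)

    def process(request: str) -> str:
        # Placeholder for the real request handler.
        return f"ok: {request}"

    def handle(request: str) -> str:
        # Blocking on the semaphore *is* the short in-memory queue:
        # callers wait here until a running request frees a slot.
        if not slots.acquire(timeout=QUEUE_TIMEOUT_S):
            # Only reject once the spike has outlasted the buffer window.
            return "429 Too Many Requests"
        try:
            return process(request)
        finally:
            slots.release()

A brief burst now just stretches latency by up to two seconds instead of turning into instant 429s.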

We Found the Sweet Spot

Our app could comfortably process 10-15 concurrent requests per instance. So we:

  • Set Cloud Run concurrency = 15 (max concurrent requests per instance)
  • Added a 5-10 request queue per instance with 2s timeout
  • Increased Max Instances to 200, just in case

With this combo, we handled huge traffic spikes during product launches with zero 429s.
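
For the Cloud Run side of that combo, the settings translate to roughly this (the in-app queue lives in your own code; my-service and the region are placeholders):

    gcloud run services update my-service \
      --concurrency=15 \
      --max-instances=200 \
      --region=us-central1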

One Extra Trick: Pre-Warmed Instances

You can set a minimum number of instances so a warm pool is always waiting, instead of scaling from zero when traffic arrives.

We did this too. Before big launches, we set:

  • Min Instances = 5 to 10

Why? Because startup time still matters. If you know traffic is coming, load up early.
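
On Cloud Run that’s the --min-instances flag; again, the service name and region are placeholders:

    # Keep a warm pool ready before the big launch
    gcloud run services update my-service \
      --min-instances=5 \
      --region=us-central1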

Final Thoughts

Getting 429s from Cloud Run during spikes is like your app saying, “Too many people! I give up!” But there’s no need to panic.

With some tuning (concurrency, better limits, and smart queueing) you can turn chaos into calm. After all, success should be sweet, not stressful.

Takeaways

  • Don’t use concurrency = 1 unless you really need to
  • Add smart in-memory queueing to buffer short spikes
  • Set min instances ahead of big traffic events
  • Scale max instances generously

With these changes, Cloud Run becomes a resilient, auto-scaling hero ready for prime-time traffic. 🎉

So next time your app goes viral, celebrate. Cloud Run’s got your back. 😎