Scaling
Configure autoscaling, scale-to-zero, and idle timeout for your services.
SMLL supports two scaling modes, determined by your service's deployment mode: Always on services autoscale between fixed replica bounds, while On demand services scale to zero when idle.
Always-on scaling (HPA)
When your service's deployment mode is Always on and max_replicas > min_replicas, SMLL creates a Kubernetes Horizontal Pod Autoscaler (HPA) that automatically adjusts the replica count based on load.
- Min replicas: the minimum number of instances always running
- Max replicas: the upper bound for autoscaling
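The HPA SMLL creates behaves like a standard Kubernetes manifest along these lines (a sketch; the resource names, target metric, and utilization threshold here are illustrative assumptions, not SMLL's actual values):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service          # the Deployment backing your service
  minReplicas: 2              # Min replicas from your service settings
  maxReplicas: 10             # Max replicas from your service settings
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # assumed target; SMLL may scale on a different metric
```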
Scale-to-zero (on demand)
Services in On demand deployment mode scale down to zero replicas when idle, eliminating compute costs during inactive periods.
How it works
SMLL uses the KEDA HTTP add-on to manage scale-to-zero:
- An interceptor proxy sits in front of your service
- When traffic arrives and no replicas are running, KEDA spins up a new instance
- Traffic is held at the proxy until the instance is ready (typically a few seconds)
- After a period of no traffic, the service scales back to zero
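In KEDA terms, the flow above corresponds roughly to an HTTPScaledObject (a sketch using the KEDA HTTP add-on CRD; the host, service name, and port are hypothetical, and SMLL manages this resource for you):

```yaml
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: my-service
spec:
  hosts:
    - my-service.example.com   # hypothetical public host
  scaleTargetRef:
    name: my-service           # Deployment to scale
    kind: Deployment
    apiVersion: apps/v1
    service: my-service        # Service the interceptor proxies to
    port: 8080
  replicas:
    min: 0                     # scale-to-zero when idle
    max: 5
  scaledownPeriod: 300         # seconds of no traffic before scaling to zero
```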
Idle timeout
The idle timeout controls how long a service stays running after the last request:
- Default: 300 seconds (5 minutes)
- Configurable: set a custom value in the service settings
A shorter timeout saves costs but increases cold start frequency. A longer timeout keeps the service warm for follow-up requests.
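To see the tradeoff concretely, here is a small sketch (plain Python, not an SMLL API) that counts how many cold starts a given request pattern would incur under different idle timeouts:

```python
def count_cold_starts(request_times, idle_timeout):
    """Count cold starts for a sorted sequence of request timestamps (seconds).

    A cold start occurs on the first request, and whenever the gap since
    the previous request exceeds the idle timeout.
    """
    cold = 0
    last = None
    for t in request_times:
        if last is None or t - last > idle_timeout:
            cold += 1
        last = t
    return cold

# Requests at 0 s, 2 min, 20 min, 21 min:
times = [0, 120, 1200, 1260]
print(count_cold_starts(times, idle_timeout=300))  # default 300 s -> 2 cold starts
print(count_cold_starts(times, idle_timeout=60))   # shorter timeout -> 3 cold starts
```

With the 300-second default, the 2-minute follow-up hits a warm instance; cutting the timeout to 60 seconds turns that request into a third cold start.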
Configuring
- Go to your service settings
- Set Deployment mode to On demand
- Set Max replicas (min is automatically 0)
- Adjust Idle timeout if needed
- Save
Cold starts
The first request after scale-to-zero triggers a cold start. The latency depends on:
- Image pull time (cached images are faster)
- Application startup time
- Health check completion
To minimize cold-start latency, keep your Docker images small and make your application start quickly.
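One way to gauge your cold-start latency is to time the first request after an idle period against a warm follow-up. A minimal sketch (plain Python; in practice the callable would issue an HTTP request to your service, e.g. with urllib.request; here a sleep stands in for the cold start):

```python
import time

def measure_latency(fn):
    """Time a single call to fn; returns (result, seconds elapsed)."""
    start = time.monotonic()
    result = fn()
    return result, time.monotonic() - start

def fake_cold_request():
    # Stand-in for image pull + application startup + health check.
    time.sleep(0.05)
    return "ok"

result, latency = measure_latency(fake_cold_request)
print(f"{result} in {latency:.3f}s")
```

Running the same measurement twice in a row shows the cold/warm gap: the second call skips the startup components entirely.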
Quota limits
Scaling is bounded by your VPC's resource quotas:
| Tier | Max pods |
|---|---|
| Free | 20 |
| Pay-as-you-go | 100 |
See Quotas for details on requesting increases.
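As an illustration of how the quota bounds interact with Max replicas (plain Python; POD_QUOTAS mirrors the table above, and the validation function is hypothetical, not part of SMLL):

```python
# Pod quotas per tier, mirroring the table above.
POD_QUOTAS = {"free": 20, "pay-as-you-go": 100}

def validate_max_replicas(tier: str, max_replicas: int) -> int:
    """Reject a Max replicas setting that exceeds the tier's pod quota."""
    quota = POD_QUOTAS[tier]
    if max_replicas > quota:
        raise ValueError(
            f"max_replicas={max_replicas} exceeds the {tier} quota of {quota} pods"
        )
    return max_replicas

validate_max_replicas("free", 10)     # OK
# validate_max_replicas("free", 25)   # would raise ValueError
```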