Scaling
Configure autoscaling, scale-to-zero, and idle timeout for your services.
SMLL supports two scaling modes, determined by your service's deployment mode: Always on services autoscale between fixed replica bounds, while On demand services scale to zero when idle.
Always-on scaling (HPA)
When your service's deployment mode is Always on and max_replicas > min_replicas, SMLL creates a Kubernetes Horizontal Pod Autoscaler (HPA) that automatically adjusts the replica count based on load.
- Min replicas: the minimum number of instances always running
- Max replicas: the upper bound for autoscaling
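The HPA SMLL creates behaves like a standard Kubernetes manifest along these lines (a sketch; the resource names, target metric, and utilization threshold here are illustrative assumptions, not SMLL's actual values):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service          # the Deployment backing your service
  minReplicas: 2              # Min replicas from your service settings
  maxReplicas: 10             # Max replicas from your service settings
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # assumed target; SMLL may scale on a different metric
```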
Scale-to-zero (on demand)
Services in On demand deployment mode scale down to zero replicas when idle, eliminating compute costs during inactive periods.
How it works
SMLL uses the KEDA HTTP add-on to manage scale-to-zero:
- An interceptor proxy sits in front of your service
- When traffic arrives and no replicas are running, KEDA spins up a new instance
- Traffic is held at the proxy until the instance is ready (typically a few seconds)
- After a period of no traffic, the service scales back to zero
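In KEDA terms, the flow above corresponds roughly to an HTTPScaledObject (a sketch using the KEDA HTTP add-on CRD; the host, service name, and port are hypothetical, and SMLL manages this resource for you):

```yaml
apiVersion: http.keda.sh/v1alpha1
kind: HTTPScaledObject
metadata:
  name: my-service
spec:
  hosts:
    - my-service.example.com   # hypothetical public host
  scaleTargetRef:
    name: my-service           # Deployment to scale
    kind: Deployment
    apiVersion: apps/v1
    service: my-service        # Service the interceptor proxies to
    port: 8080
  replicas:
    min: 0                     # scale-to-zero when idle
    max: 5
  scaledownPeriod: 300         # seconds of no traffic before scaling to zero
```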
Idle timeout
The idle timeout controls how long a service stays running after the last request:
- Default: 300 seconds (5 minutes)
- Configurable: set a custom value in the service settings
A shorter timeout saves costs but increases cold start frequency. A longer timeout keeps the service warm for follow-up requests.
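To see the tradeoff concretely, here is a small sketch (plain Python, not an SMLL API) that counts how many cold starts a given request pattern would incur under different idle timeouts:

```python
def count_cold_starts(request_times, idle_timeout):
    """Count cold starts for a sorted sequence of request timestamps (seconds).

    A cold start occurs on the first request, and whenever the gap since
    the previous request exceeds the idle timeout.
    """
    cold = 0
    last = None
    for t in request_times:
        if last is None or t - last > idle_timeout:
            cold += 1
        last = t
    return cold

# Requests at 0 s, 2 min, 20 min, 21 min:
times = [0, 120, 1200, 1260]
print(count_cold_starts(times, idle_timeout=300))  # default 300 s -> 2 cold starts
print(count_cold_starts(times, idle_timeout=60))   # shorter timeout -> 3 cold starts
```

With the 300-second default, the 2-minute follow-up hits a warm instance; cutting the timeout to 60 seconds turns that request into a third cold start.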
Configuring
- Go to your service settings
- Set Deployment mode to On demand
- Set Max replicas (min is automatically 0)
- Adjust Idle timeout if needed
- Save
Cold starts
The first request after scale-to-zero triggers a cold start. The latency depends on:
- Image pull time (cached images are faster)
- Application startup time
- Health check completion
To minimize cold-start latency, keep your Docker images small and make your application start quickly.
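One way to gauge your cold-start latency is to time the first request after an idle period against a warm follow-up. A minimal sketch (plain Python; in practice the callable would issue an HTTP request to your service, e.g. with urllib.request; here a sleep stands in for the cold start):

```python
import time

def measure_latency(fn):
    """Time a single call to fn; returns (result, seconds elapsed)."""
    start = time.monotonic()
    result = fn()
    return result, time.monotonic() - start

def fake_cold_request():
    # Stand-in for image pull + application startup + health check.
    time.sleep(0.05)
    return "ok"

result, latency = measure_latency(fake_cold_request)
print(f"{result} in {latency:.3f}s")
```

Running the same measurement twice in a row shows the cold/warm gap: the second call skips the startup components entirely.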
Quota limits
Scaling is bounded by your VPC's resource quotas:
| Tier | Max pods |
|---|---|
| Free | 20 |
| Pay-as-you-go | 100 |
See Quotas for details on requesting increases.
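As an illustration of how the quota bounds interact with Max replicas (plain Python; POD_QUOTAS mirrors the table above, and the validation function is hypothetical, not part of SMLL):

```python
# Pod quotas per tier, mirroring the table above.
POD_QUOTAS = {"free": 20, "pay-as-you-go": 100}

def validate_max_replicas(tier: str, max_replicas: int) -> int:
    """Reject a Max replicas setting that exceeds the tier's pod quota."""
    quota = POD_QUOTAS[tier]
    if max_replicas > quota:
        raise ValueError(
            f"max_replicas={max_replicas} exceeds the {tier} quota of {quota} pods"
        )
    return max_replicas

validate_max_replicas("free", 10)     # OK
# validate_max_replicas("free", 25)   # would raise ValueError
```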