What is learning rate in machine learning?
- Published by Arnab Mondal · 3 min read
- Overview
- How learning rate affects training
- Choosing a learning rate
- Schedules and warmup
- Practical defaults
- Conclusion
Overview
The learning rate is a core hyperparameter that sets the step size used by optimization algorithms (like SGD or Adam) to update model weights based on the gradient of the loss. In short, it controls how aggressively or cautiously your model learns.
If the learning rate is too small, training crawls and can stall on plateaus or settle into poor local minima. If it's too large, the loss can oscillate or even explode, and training never converges.
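To make the role of the learning rate concrete, here is a minimal sketch of a single gradient-descent update in Python; the names `w`, `grad`, and `lr` are illustrative and not tied to any particular framework.

```python
def sgd_step(w, grad, lr):
    # The learning rate scales the gradient before it is
    # subtracted from the current weight.
    return w - lr * grad

# Example: one step with weight 0.5, gradient 2.0, learning rate 0.01
print(sgd_step(0.5, 2.0, 0.01))  # 0.48
```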
How learning rate affects training
- Too low: Slow convergence; the model can spend a long time descending a gentle slope.
- Just right: Steady progress toward a good minimum with stable loss curves.
- Too high: Divergence or oscillation; the loss may increase or fluctuate wildly.
Analogy: Think of gradient descent as hiking downhill in fog. Learning rate is your step size. Tiny steps (low LR) get you there slowly. Giant strides (high LR) risk overshooting and stumbling. A balanced stride gets you down efficiently.
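To see these three regimes in code, here is a toy sketch that runs plain gradient descent on the quadratic loss L(w) = (w - 3)^2; the specific learning rates are illustrative, not recommendations for a real model.

```python
# Gradient descent on a toy quadratic loss L(w) = (w - 3)^2,
# whose gradient is dL/dw = 2 * (w - 3).
def final_loss(lr, steps=100, w=0.0):
    for _ in range(steps):
        w -= lr * 2 * (w - 3.0)  # the learning rate scales every update
    return (w - 3.0) ** 2

print(final_loss(lr=1e-4))  # too low: barely moves, loss stays large
print(final_loss(lr=1e-1))  # about right: loss shrinks to nearly zero
print(final_loss(lr=1.1))   # too high: each step overshoots, loss explodes
```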
Choosing a learning rate
- Start simple: Common starting points are 1e-3 for Adam and 1e-2 for SGD (with momentum). Adjust from there.
- Use a learning rate sweep: Try values across several orders of magnitude (e.g., 1e-5 → 1e-1) and see where the loss improves fastest; a minimal sweep is sketched after this list.
- Early stopping & checkpoints: Combine with validation monitoring; back off if loss spikes.
Schedules and warmup
- Step decay: Reduce the LR by a fixed factor at certain epochs (e.g., multiply by 0.1 every 30 epochs).
- Cosine decay: Smoothly anneal LR to near-zero over training.
- Exponential decay: Multiply LR by a constant factor each step or epoch.
- Warmup: Start with a small LR and ramp up for a few epochs to stabilize early updates.
These strategies help you train fast early on and refine carefully later; a simple warmup-plus-cosine schedule is sketched below.
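As a concrete sketch, here is one common way to compute a per-step learning rate with linear warmup followed by cosine decay; the function name, parameters, and defaults are illustrative rather than a standard API.

```python
import math

def lr_at(step, base_lr=1e-3, warmup_steps=500, total_steps=10_000, min_lr=0.0):
    if step < warmup_steps:
        # Linear warmup: ramp from near zero up to the base learning rate.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay: smoothly anneal from base_lr down to min_lr.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```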
Practical defaults
- Adam/AdamW: start at 1e-3; lower to 3e-4 or 1e-4 for large models (wired up in the PyTorch sketch after this list).
- SGD+Momentum: start at 1e-2; pair with momentum 0.9 and consider a schedule.
- Vision Transformers / LLMs: use warmup + cosine; learning rates are often smaller (1e-4 to 1e-5) with large batch sizes.
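As a rough sketch of how these defaults translate into code, here is how they might be wired up in PyTorch; `model`, the step count, and the weight decay value are placeholders that you would still tune for your task.

```python
from torch import nn, optim

model = nn.Linear(128, 10)  # placeholder model

# AdamW default: start around 1e-3 (drop to 3e-4 or 1e-4 for larger models).
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# Alternative: SGD with momentum 0.9 starting at 1e-2, usually with a schedule.
# optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)

# Cosine decay over the run; call scheduler.step() after optimizer.step().
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)
```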
Conclusion
The learning rate is the most sensitive knob in your training loop. Tune it first, and pair it with a schedule and early stopping. You’ll get faster convergence, better stability, and stronger generalization.
Available for hire - If you're looking for a skilled full-stack developer with AI integration experience, feel free to reach out at hire@codewarnab.in