Building a Production-Ready Load Balancer from Scratch in Go

Arnab Mondal · 14 min read


The Itch That Started It All

It started with a question I couldn't shake: "What actually happens when thousands of requests hit my servers?"

Sure, I knew the textbook answer—load balancers distribute traffic across servers. But knowing something conceptually and understanding it deeply are two different things. I'd been deploying apps behind NGINX and AWS ALBs for years, treating them as black boxes. One day, that stopped being okay.

So I decided to build one from scratch. In Go. No frameworks, no shortcuts—just the standard library and a lot of curiosity.

What I discovered along the way wasn't just how load balancers work, but why every design decision matters when you're standing between users and your backend servers. This post is that journey—part case study, part tutorial. By the end, you'll understand the internals well enough to build your own.

What Is a Load Balancer, Really?

Before we write any code, let's get crystal clear on what we're building. A load balancer is essentially a traffic cop that sits between clients and your backend servers:

[Diagram: clients on one side, Backend 1, Backend 2, and Backend 3 on the other, with the load balancer in the middle performing four steps: 1. Receives, 2. Chooses, 3. Forwards, 4. Returns.]

When a request arrives, the load balancer:

  1. Receives the incoming HTTP request
  2. Chooses which backend server should handle it (using an algorithm like Round Robin)
  3. Forwards the request to that backend
  4. Returns the backend's response to the client

Simple in concept. Deceptively tricky in execution.
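
Before we add any balancing, it's worth seeing how little code the forwarding half takes. Here's a minimal, self-contained sketch (not the code we'll build in this post, and the backend address is just a placeholder) that proxies every request to a single backend using the standard library's httputil.ReverseProxy:

go
// Minimal sketch: forward everything to one backend, no balancing yet.
// The address below is a placeholder for illustration only.
package main

import (
    "log"
    "net/http"
    "net/http/httputil"
    "net/url"
)

func main() {
    target, err := url.Parse("http://localhost:8081")
    if err != nil {
        log.Fatal(err)
    }

    // NewSingleHostReverseProxy covers steps 3 and 4: it forwards the
    // request to the target and streams the response back to the client.
    proxy := httputil.NewSingleHostReverseProxy(target)

    log.Fatal(http.ListenAndServe(":8080", proxy))
}

Everything else in this post is about doing steps 1 and 2 well: picking a healthy backend and recovering when one fails.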

The Architecture We're Building

Here's what we'll implement:

  • Round Robin selection: Fair distribution across healthy backends
  • Health checks: Both passive (detect failures from real requests) and active (periodic probes)
  • Automatic failover: If a backend dies, skip it and try another
  • Retry logic: Don't give up on the first hiccup
  • Graceful shutdown: Finish in-flight requests before stopping
  • Connection pooling: Reuse connections for efficiency

Let's dive in.

Setting Up the Foundation

Request-Scoped Tracking with Context

The first challenge: how do we track attempts and retries per request when multiple requests are being handled concurrently? This is where Go's context.Context shines.

go
// Create unique types for context keys (avoids key collisions)
type attemptKey struct{}
type retryKey struct{}

// Extract attempt count from request context
func getAttempts(r *http.Request) int {
    if v, ok := r.Context().Value(attemptKey{}).(int); ok {
        return v
    }
    return 1 // First attempt
}

// Extract retry count from request context
func getRetries(r *http.Request) int {
    if v, ok := r.Context().Value(retryKey{}).(int); ok {
        return v
    }
    return 0
}

Why context instead of global variables? Each HTTP request runs in its own goroutine. If we used globals, requests would interfere with each other's counters. Context lets us carry request-specific data through the call chain safely.

Why empty struct types as keys? Using string keys like "attempts" could collide with other packages. Empty struct types are unique and take zero bytes—they're just type markers.
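
For completeness, the write side of this pattern is a single context.WithValue call: we attach an updated count to the request's context so whatever handles the request next sees it. The retry handler later in the post uses exactly this pattern.

go
// Sketch: bump the attempt count for the next hop in the chain.
attempts := getAttempts(r)
ctx := context.WithValue(r.Context(), attemptKey{}, attempts+1)
r = r.WithContext(ctx) // handlers that receive r from here on see the new count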

The Backend Struct

Each backend server needs to track its URL and health status, and have a ReverseProxy ready to forward requests:

go
type Backend struct {
    URL          *url.URL               // Parsed URL of this backend
    alive        bool                   // Is this backend healthy?
    mu           sync.RWMutex           // Protects the alive field
    ReverseProxy *httputil.ReverseProxy // Does the actual forwarding
}

func (b *Backend) SetAlive(alive bool) {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.alive = alive
}

func (b *Backend) IsAlive() bool {
    b.mu.RLock()
    defer b.mu.RUnlock()
    return b.alive
}

Notice the sync.RWMutex—this is critical. Multiple goroutines will read the alive status simultaneously (every request checks it), but only health checks write to it. An RWMutex allows multiple readers OR one writer, which is more efficient than a regular mutex.

The Server Pool: Managing Multiple Backends

The ServerPool is where the magic happens. It holds all backends and implements Round Robin selection:

go
type ServerPool struct {
    backends []*Backend
    current  uint64       // Atomic counter for round-robin
    mu       sync.RWMutex // Protects the backends slice
}

func (p *ServerPool) AddBackend(b *Backend) {
    p.mu.Lock()
    defer p.mu.Unlock()
    p.backends = append(p.backends, b)
}

Round Robin Selection

Round Robin is elegantly simple: Backend 1, then Backend 2, then Backend 3, then back to Backend 1. Like taking turns fairly.

go
func (p *ServerPool) NextIndex() int {
    p.mu.RLock()
    defer p.mu.RUnlock()

    if len(p.backends) == 0 {
        return 0
    }

    // Atomic increment + modulo wraps around
    return int(atomic.AddUint64(&p.current, 1) % uint64(len(p.backends)))
}

Why atomic? Multiple requests hit the load balancer concurrently. Without atomic.AddUint64, two goroutines could read the same value, both increment it, and both write back the same result—losing an increment. Atomic operations guarantee the increment happens without interruption.

Getting the Next Available Peer

Here's where it gets interesting. We don't just pick the next backend—we pick the next alive backend:

go
func (p *ServerPool) GetNextPeer() *Backend {
    p.mu.RLock()
    defer p.mu.RUnlock()

    if len(p.backends) == 0 {
        return nil
    }

    next := int(atomic.AddUint64(&p.current, 1) % uint64(len(p.backends)))
    limit := len(p.backends) + next // Full cycle limit

    for i := next; i < limit; i++ {
        idx := i % len(p.backends)

        if p.backends[idx].IsAlive() {
            // Update current if we skipped dead backends
            if i != next {
                atomic.StoreUint64(&p.current, uint64(idx))
            }
            return p.backends[idx]
        }
    }

    return nil // All backends are down
}

The loop checks up to one full rotation of backends. If the first choice is dead, we try the next, and so on. If we had to skip some dead backends, we update current so the next request doesn't waste time checking those same dead backends first.

Health Checks: The Heartbeat of Reliability

A load balancer is only as good as its knowledge of backend health. We implement two types of health checking:

Passive Health Checks

These happen automatically when a real request fails. If the reverse proxy can't reach a backend, we mark it as down:

go
serverPool.MarkBackendStatus(backendURL, false)

This is reactive—we discover failures from actual traffic.
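
MarkBackendStatus itself is just a lookup over the pool. Its implementation isn't shown in this post, but a plausible sketch, assuming backends are matched by their URL string, looks like this:

go
// Sketch: mark the backend with the given URL as up or down.
func (p *ServerPool) MarkBackendStatus(u *url.URL, alive bool) {
    p.mu.RLock()
    defer p.mu.RUnlock()

    for _, b := range p.backends {
        if b.URL.String() == u.String() {
            b.SetAlive(alive) // takes the backend's own lock, not the pool's
            return
        }
    }
}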

Active Health Checks

We also proactively probe backends at regular intervals:

go
func (p *ServerPool) HealthCheck(timeout time.Duration, path string) {
    // Copy backends slice to release lock quickly
    p.mu.RLock()
    backends := make([]*Backend, len(p.backends))
    copy(backends, p.backends)
    p.mu.RUnlock()

    var wg sync.WaitGroup

    for _, b := range backends {
        wg.Add(1)
        go func(backend *Backend) {
            defer wg.Done()

            alive := isBackendAlive(backend.URL, timeout, path)
            backend.SetAlive(alive)

            status := "up"
            if !alive {
                status = "down"
            }
            log.Printf("healthcheck: %s [%s]\n", backend.URL, status)
        }(b)
    }

    wg.Wait()
}

Why copy the slice? Without copying, we'd hold the lock for the entire duration of all health checks (potentially seconds), blocking other goroutines from adding/removing backends. Copying gives us a safe snapshot to work with while releasing the lock immediately.

Why pass b as a parameter? This is a classic Go gotcha. If we wrote go func() { ... backend ... }(), all goroutines would capture the same loop variable and see its final value. Passing it as a parameter gives each goroutine its own copy. (As of Go 1.22 the loop variable is scoped per iteration, so the capture is safe there, but passing it explicitly still makes the intent obvious.)

The Health Check Client

For efficiency, we use a shared HTTP client with connection pooling:

go
var healthCheckClient = &http.Client{
    Timeout: 2 * time.Second,
    Transport: &http.Transport{
        MaxIdleConns:        10,               // Total idle connections
        MaxIdleConnsPerHost: 2,                // Per-backend idle connections
        IdleConnTimeout:     30 * time.Second, // Close stale connections
    },
}

func isBackendAlive(base *url.URL, timeout time.Duration, path string) bool {
    u := *base // Copy the URL struct
    u.Path = path
    u.RawQuery = ""
    u.Fragment = ""

    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, "GET", u.String(), nil)
    if err != nil {
        return false
    }

    resp, err := healthCheckClient.Do(req)
    if err != nil {
        return false
    }
    defer resp.Body.Close()

    // 2xx or 3xx = alive
    return resp.StatusCode >= 200 && resp.StatusCode < 400
}

Why connection pooling? Creating TCP connections is expensive. If we have 10 backends and check health every 10 seconds, that's 10 new connections every 10 seconds—wasteful. With pooling, we reuse existing connections.
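
One piece is still missing: something has to call HealthCheck on a schedule. A minimal runner might look like the sketch below. The function name, the interval, and the cancellable context are my own choices here; the idea is that the healthCheckCancel() call you'll see in the shutdown section cancels the ctx passed into a loop like this one.

go
// Sketch: run a health-check pass every interval until ctx is cancelled.
func runHealthChecks(ctx context.Context, pool *ServerPool, interval, timeout time.Duration, path string) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()

    for {
        select {
        case <-ticker.C:
            pool.HealthCheck(timeout, path)
        case <-ctx.Done():
            return // graceful shutdown cancelled the health-check context
        }
    }
}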

The Load Balancer Handler

This is the core function that handles every incoming request:

go
func (cfg lbConfig) lb(w http.ResponseWriter, r *http.Request) {
    attempts := getAttempts(r)

    // Give up if we've tried too many backends
    if attempts > cfg.maxAttempts {
        log.Printf("%s %s: max attempts reached (%d)\n",
            r.RemoteAddr, r.URL.Path, attempts)
        http.Error(w, "Service not available", http.StatusServiceUnavailable)
        return
    }

    // Get the next available backend
    peer := serverPool.GetNextPeer()
    if peer == nil {
        http.Error(w, "Service not available", http.StatusServiceUnavailable)
        return
    }

    // Forward the request
    peer.ReverseProxy.ServeHTTP(w, r)
}

Clean and simple. The complexity is hidden in the reverse proxy's error handler.
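
The cfg receiver is a small configuration struct. Its exact shape isn't shown here, but from this handler and the retry logic below it plausibly looks something like the following; the field names are inferred from the code and the defaults are illustrative:

go
// Sketch: configuration the lb handler and the retry logic rely on.
type lbConfig struct {
    maxAttempts int           // how many different backends to try per request
    maxRetries  int           // how many times to retry the same backend
    retryDelay  time.Duration // pause between retries of the same backend
}

var defaultConfig = lbConfig{
    maxAttempts: 3,
    maxRetries:  2,
    retryDelay:  100 * time.Millisecond,
}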

The Retry Strategy: Don't Give Up Too Easily

When a backend fails, we don't immediately give up. Our strategy:

  1. Retry the same backend a few times (maybe it's a temporary glitch)
  2. If retries exhausted, mark the backend as down
  3. Try a different backend (if we haven't exceeded maxAttempts)
go
proxy.ErrorHandler = func(w http.ResponseWriter, r *http.Request, err error) {
    log.Printf("[%s] proxy error: %v\n", backendURL.Host, err)

    // Client disconnected?
    if r.Context().Err() != nil {
        http.Error(w, "Request cancelled", http.StatusRequestTimeout)
        return
    }

    retries := getRetries(r)

    // Retry same backend?
    if retries < cfg.maxRetries {
        select {
        case <-time.After(cfg.retryDelay):
            ctx := context.WithValue(r.Context(), retryKey{}, retries+1)
            proxy.ServeHTTP(w, r.WithContext(ctx))
        case <-r.Context().Done():
            http.Error(w, "Request cancelled", http.StatusRequestTimeout)
        }
        return
    }

    // Mark this backend as down (passive health check)
    serverPool.MarkBackendStatus(backendURL, false)

    // Try a different backend
    attempts := getAttempts(r)
    ctx := context.WithValue(r.Context(), attemptKey{}, attempts+1)
    cfg.lb(w, r.WithContext(ctx))
}

The select statement is elegant—it either waits for the retry delay OR detects if the client cancelled the request. No wasted retries on abandoned connections.

Setting Up Backends with Connection Tuning

Production-grade connections need proper timeouts:

go
proxy.Transport = &http.Transport{
    DialContext: (&net.Dialer{
        Timeout:   30 * time.Second, // Connection establishment timeout
        KeepAlive: 30 * time.Second, // Keep connections alive
    }).DialContext,
    MaxIdleConns:          100,              // Total idle connections
    MaxIdleConnsPerHost:   10,               // Per-backend idle connections
    IdleConnTimeout:       90 * time.Second, // Close stale connections
    TLSHandshakeTimeout:   10 * time.Second, // TLS timeout
    ExpectContinueTimeout: 1 * time.Second,  // 100-continue timeout
    ResponseHeaderTimeout: 10 * time.Second, // Header read timeout
}

Each timeout prevents a different failure mode:

  • Dialer Timeout: Don't wait forever trying to connect to unresponsive hosts
  • IdleConnTimeout: Clean up unused connections
  • ResponseHeaderTimeout: Detect backends that accept connections but never respond
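
To tie these pieces together, here's roughly how each backend could be built at startup: parse its URL, create a reverse proxy with the tuned transport, attach the error handler from the retry section, and register it with the pool. This is a sketch under my own naming, not the post's exact wiring code:

go
// Sketch: parse a comma-separated backend list and register one Backend per URL.
func addBackends(backendList string, transport http.RoundTripper, pool *ServerPool) error {
    for _, raw := range strings.Split(backendList, ",") {
        backendURL, err := url.Parse(strings.TrimSpace(raw))
        if err != nil {
            return fmt.Errorf("invalid backend URL %q: %w", raw, err)
        }

        proxy := httputil.NewSingleHostReverseProxy(backendURL)
        proxy.Transport = transport
        // proxy.ErrorHandler = ... // attach the retry/failover closure shown earlier

        b := &Backend{URL: backendURL, ReverseProxy: proxy}
        b.SetAlive(true) // optimistic: assume healthy until a health check says otherwise
        pool.AddBackend(b)
    }
    return nil
}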

Graceful Shutdown: Don't Drop Requests

When someone hits Ctrl+C, we don't want to drop in-flight requests. Here's the graceful shutdown sequence:

go
// Channel to receive OS signals
quit := make(chan os.Signal, 1)
signal.Notify(quit, syscall.SIGINT, syscall.SIGTERM)

// Start server in background
go func() {
    log.Printf("load balancer listening on http://localhost:%d\n", port)
    if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
        log.Fatalf("Server failed: %v", err)
    }
}()

// Wait for shutdown signal
<-quit
log.Println("shutting down load balancer...")

// Stop health checks
healthCheckCancel()

// Give requests 30 seconds to complete
shutdownCtx, shutdownCancel := context.WithTimeout(context.Background(), 30*time.Second)
defer shutdownCancel()

if err := srv.Shutdown(shutdownCtx); err != nil {
    log.Printf("Error during shutdown: %v\n", err)
} else {
    log.Println("load balancer stopped gracefully")
}

srv.Shutdown() stops accepting new requests but lets existing ones finish—up to our 30-second timeout.
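
For reference, the srv being shut down is an ordinary http.Server whose handler is the lb function from earlier. A minimal construction might look like this; the port comes from a flag and the extra timeouts are illustrative, not from the original:

go
// Sketch: the server whose Shutdown is called above.
srv := &http.Server{
    Addr:              fmt.Sprintf(":%d", port),
    Handler:           http.HandlerFunc(cfg.lb), // every request flows through the balancer
    ReadHeaderTimeout: 10 * time.Second,         // illustrative values
    IdleTimeout:       120 * time.Second,
}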

Running It: A Quick Demo

First, create a simple backend server to test with:

go
// backend/main.go
package main

import (
    "encoding/json"
    "flag"
    "fmt"
    "log"
    "net/http"
    "time"
)

type response struct {
    Name   string `json:"name"`
    Port   int    `json:"port"`
    Method string `json:"method"`
    Path   string `json:"path"`
    Time   string `json:"time"`
}

func main() {
    var port int
    var name string

    flag.IntVar(&port, "port", 8081, "Port to listen on")
    flag.StringVar(&name, "name", "backend-1", "Backend name")
    flag.Parse()

    mux := http.NewServeMux()

    // Health endpoint
    mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("ok"))
    })

    // Main endpoint - shows which backend handled the request
    mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        w.Header().Set("Content-Type", "application/json")
        json.NewEncoder(w).Encode(response{
            Name:   name,
            Port:   port,
            Method: r.Method,
            Path:   r.URL.Path,
            Time:   time.Now().Format(time.RFC3339Nano),
        })
    })

    srv := &http.Server{
        Addr:    fmt.Sprintf(":%d", port),
        Handler: mux,
    }

    log.Printf("backend %q listening on http://localhost:%d\n", name, port)
    log.Fatal(srv.ListenAndServe())
}

Now spin everything up:

bash
# Terminal 1: Start backend 1
go run backend/main.go -port 8081 -name backend-1

# Terminal 2: Start backend 2
go run backend/main.go -port 8082 -name backend-2

# Terminal 3: Start backend 3
go run backend/main.go -port 8083 -name backend-3

# Terminal 4: Start the load balancer
go run lb/main.go -backends "http://localhost:8081,http://localhost:8082,http://localhost:8083"

Test the round-robin distribution:

bash
# Hit the load balancer multiple times
curl http://localhost:8080/
curl http://localhost:8080/
curl http://localhost:8080/

You'll see responses from different backends in rotation:

json
{"name":"backend-1","port":8081,"method":"GET","path":"/","time":"..."}
{"name":"backend-2","port":8082,"method":"GET","path":"/","time":"..."}
{"name":"backend-3","port":8083,"method":"GET","path":"/","time":"..."}

Now kill one backend and watch the load balancer automatically route around it!

Key Takeaways

Building this load balancer taught me several things:

  1. Concurrency is everything: Go's goroutines and channels make concurrent health checks elegant. Without them, checking 10 backends would be painfully slow.

  2. Context is your friend: Request-scoped data like attempt counts should live in context, not globals. This keeps concurrent requests isolated.

  3. Defense in depth: We have both passive health checks (detect failures from real traffic) and active health checks (proactive probing). Either alone isn't enough.

  4. Timeouts everywhere: Every network operation needs a timeout. Without them, a single slow backend can exhaust all your connections.

  5. Graceful shutdown matters: In production, you don't want to drop requests mid-flight. The few extra lines of code are worth it.

What's Next?

This load balancer is functional but basic. In a production scenario, you might want to add:

  • Weighted Round Robin: Give more traffic to beefier backends
  • Least Connections: Route to the backend with fewest active connections
  • Sticky Sessions: Keep a user on the same backend (useful for stateful apps)
  • Rate Limiting: Prevent any single client from overwhelming your backends
  • Metrics/Observability: Export Prometheus metrics for monitoring
  • TLS termination: Handle HTTPS at the load balancer

The foundation we've built makes all of these extensions possible. The hard part—concurrent request handling, health checking, graceful failover—is already done.


Building this was one of those projects that made systems concepts click in a way that reading never could. I hope it does the same for you. If you build something cool with this foundation—or find bugs I missed—I'd love to hear about it.

Want to discuss distributed systems or Go? Feel free to reach out at hi@codewarnab.in
