
Load Testing a Chat Server with k6

Dan Castrillo

why load test a chat server

so Chattorum is this chat app i built: Go backend, WebSocket connections, a stock bot that fetches prices when you type /stock=AAPL. it works great when it's just me and a few friends poking at it. but i had no idea what would happen under real load. turns out neither did the server.

i picked k6 because it has native WebSocket support and the scripting is just JavaScript. you define scenarios declaratively and k6 handles the orchestration.

the six scenarios

i wrote six test scenarios. each one targets a different failure mode.

scenario 1: connection storm. 1000 WebSocket connections in 10 seconds. the "everyone joins the room at once" test. i don't care about messages here, just whether the server can handle the handshake flood.

import { check } from "k6"
import ws from "k6/ws"
 
export const options = {
  scenarios: {
    connection_storm: {
      executor: "ramping-vus",
      startVUs: 0,
      stages: [
        { duration: "10s", target: 1000 },
        { duration: "30s", target: 1000 },
        { duration: "10s", target: 0 },
      ],
    },
  },
}
 
export default function () {
  const res = ws.connect(
    "ws://localhost:8080/ws?room=loadtest",
    {},
    (socket) => {
      socket.on("open", () => {
        socket.setTimeout(() => socket.close(), 30000)
      })
    }
  )
  check(res, { "connected successfully": (r) => r && r.status === 101 })
}
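
running it is just k6 run with the script file; the end-of-test summary shows how many of those status-101 checks passed.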

scenario 2: active chat. 200 users sending messages continuously. every VU connects and sends a message every 500ms. this simulates a busy room where everyone is actually talking, not just lurking.
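
here's roughly what that script looks like. a sketch: the 2-minute duration and the payload are placeholder choices, everything else mirrors scenario 1.

import ws from "k6/ws"
 
export const options = {
  scenarios: {
    active_chat: {
      executor: "constant-vus",
      vus: 200,
      duration: "2m",
    },
  },
}
 
export default function () {
  ws.connect("ws://localhost:8080/ws?room=loadtest", {}, (socket) => {
    socket.on("open", () => {
      // one message every 500ms for the life of the connection
      socket.setInterval(() => {
        socket.send(`hello from VU ${__VU}`)
      }, 500)
      socket.setTimeout(() => socket.close(), 60000)
    })
  })
}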

scenario 3: stock bot stress. 50 users all spamming /stock=AAPL and /stock=MSFT as fast as they can. the stock bot makes an external API call for each command, so this tests whether the bot goroutine can keep up or if it becomes a backpressure nightmare.
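
the script is the same shape with a different inner loop. a sketch, with 50ms pacing standing in for "as fast as they can":

import ws from "k6/ws"
 
export const options = {
  scenarios: {
    stock_stress: { executor: "constant-vus", vus: 50, duration: "2m" },
  },
}
 
export default function () {
  ws.connect("ws://localhost:8080/ws?room=loadtest", {}, (socket) => {
    socket.on("open", () => {
      const commands = ["/stock=AAPL", "/stock=MSFT"]
      // no think time; hammer the bot with alternating commands
      socket.setInterval(() => {
        socket.send(commands[Math.floor(Math.random() * commands.length)])
      }, 50)
      socket.setTimeout(() => socket.close(), 60000)
    })
  })
}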

scenario 4: spike test. start at 100 users, suddenly jump to 500, hold for a minute, drop back down. this is the most realistic failure mode: your server is humming along fine and then something happens and traffic quintuples.

scenario 5: soak test. 200 users for 30 minutes straight. not a huge load, but sustained. this is where memory leaks, goroutine leaks, and file descriptor exhaustion show up.
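
scenarios 4 and 5 are pure scheduling: the connect-and-chat body from scenario 2 stays the same and only the executor config changes. a sketch of both (the ramp durations are guesses, and you'd normally run each from its own script):

export const options = {
  scenarios: {
    spike: {
      executor: "ramping-vus",
      startVUs: 100,
      stages: [
        { duration: "1m", target: 100 }, // humming along
        { duration: "10s", target: 500 }, // traffic quintuples
        { duration: "1m", target: 500 }, // hold
        { duration: "30s", target: 100 }, // drop back down
      ],
    },
    soak: {
      executor: "constant-vus",
      vus: 200,
      duration: "30m",
    },
  },
}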

scenario 6: mixed workload. a realistic distribution: 60% lurkers (connected but silent), 30% active chatters, 10% stock bot users. this is closest to what real traffic actually looks like.
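
this is where k6's per-scenario exec option earns its keep: each population gets its own exported function. a sketch, with the VU counts scaled to a 200-user total and the 5-minute duration and payloads made up:

import ws from "k6/ws"
 
export const options = {
  scenarios: {
    lurkers: { executor: "constant-vus", vus: 120, duration: "5m", exec: "lurk" },
    chatters: { executor: "constant-vus", vus: 60, duration: "5m", exec: "chat" },
    stockers: { executor: "constant-vus", vus: 20, duration: "5m", exec: "stock" },
  },
}
 
function connect(onOpen) {
  ws.connect("ws://localhost:8080/ws?room=loadtest", {}, (socket) => {
    socket.on("open", () => {
      onOpen(socket)
      socket.setTimeout(() => socket.close(), 60000)
    })
  })
}
 
// 60% lurkers: connected but silent
export function lurk() {
  connect(() => {})
}
 
// 30% chatters: a message every 500ms
export function chat() {
  connect((socket) => socket.setInterval(() => socket.send("hey"), 500))
}
 
// 10% stock bot users: a command every second
export function stock() {
  connect((socket) => socket.setInterval(() => socket.send("/stock=AAPL"), 1000))
}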

what broke

the connection storm exposed the problem. around 500 concurrent connections, message latency started climbing. by 700, some connections saw 2-3 second delays on messages. by 900, connections timed out entirely.

i added pprof instrumentation and the bottleneck was obvious: the hub goroutine. Chattorum had a single hub that managed all connections across all rooms. every message, every join, every leave funneled through one goroutine's select loop.

// the old hub — single goroutine handling everything
func (h *Hub) run() {
    for {
        select {
        case client := <-h.register:
            h.clients[client] = true
        case client := <-h.unregister:
            delete(h.clients, client)
        case message := <-h.broadcast:
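            // fan-out to every client on the server; register and
            // unregister starve until this loop finishes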
            for client := range h.clients {
                client.send <- message
            }
        }
    }
}

that broadcast case is the killer. when you have 1000 clients and you're iterating over all of them to send a message, you're blocking the register and unregister channels the entire time. new connections queue up behind message broadcasts. it's a classic single-writer bottleneck.

the fix: sharding by room

instead of one hub for the whole server, each chat room gets its own hub goroutine. a room with 50 users only iterates over 50 clients on broadcast, not 1000. rooms operate independently. a busy room can't block joins in a quiet room.

type RoomHub struct {
    room       string
    clients    map[*Client]bool
    register   chan *Client
    unregister chan *Client
    broadcast  chan []byte
}
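 
// newRoomHub mirrors the old hub's channels, scoped to one room.
// (a sketch; run() is the same select loop as before, just over
// this room's clients, so it isn't repeated here)
func newRoomHub(name string) *RoomHub {
    return &RoomHub{
        room:       name,
        clients:    make(map[*Client]bool),
        register:   make(chan *Client),
        unregister: make(chan *Client),
        broadcast:  make(chan []byte),
    }
}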
 
// each room gets its own goroutine
func (s *Server) getOrCreateRoom(name string) *RoomHub {
    s.mu.Lock()
    defer s.mu.Unlock()
    if hub, ok := s.rooms[name]; ok {
        return hub
    }
    hub := newRoomHub(name)
    s.rooms[name] = hub
    go hub.run()
    return hub
}

after the sharding fix, i re-ran all six scenarios. the connection storm handled 1000 connections with sub-100ms message latency. the soak test ran for 30 minutes with flat memory usage. the stock bot stress test was still a little slow but that's the external API, not the hub.

the whole exercise took a day: half writing the k6 scripts, half fixing the hub. without the load tests i never would have found this. manual testing with 3 connections tells you nothing about what happens at 500.
