
Load Testing a Chat Server with k6

Dan Castrillo

why load test a chat server

so Chattorum is this chat app i built: Go backend, WebSocket connections, a stock bot that fetches prices when you type /stock=AAPL. it works great when it's just me and a few friends poking at it. but i had no idea what would happen under real load. turns out neither did the server.

i picked k6 because it has native WebSocket support and the scripting is just JavaScript. you define scenarios declaratively and k6 handles the orchestration.

the six scenarios

i wrote six test scenarios. each one targets a different failure mode.

scenario 1: connection storm. 1000 WebSocket connections in 10 seconds. the "everyone joins the room at once" test. i don't care about messages here, just whether the server can handle the handshake flood.

import { check } from "k6"
import ws from "k6/ws"
 
export const options = {
  scenarios: {
    connection_storm: {
      executor: "ramping-vus",
      startVUs: 0,
      stages: [
        { duration: "10s", target: 1000 },
        { duration: "30s", target: 1000 },
        { duration: "10s", target: 0 },
      ],
    },
  },
}
 
export default function () {
  const res = ws.connect(
    "ws://localhost:8080/ws?room=loadtest",
    {},
    (socket) => {
      socket.on("open", () => {
        socket.setTimeout(() => socket.close(), 30000)
      })
    }
  )
  check(res, { "connected successfully": (r) => r && r.status === 101 })
}
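
running it is just k6 run with the script file; the end-of-test summary shows how many of those status-101 checks passed.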

scenario 2: active chat. 200 users sending messages continuously. every VU connects and sends a message every 500ms. this simulates a busy room where everyone is actually talking, not just lurking.
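
here's roughly what that script looks like. a sketch: the 2-minute duration and the payload are placeholder choices, everything else mirrors scenario 1.

import ws from "k6/ws"
 
export const options = {
  scenarios: {
    active_chat: {
      executor: "constant-vus",
      vus: 200,
      duration: "2m",
    },
  },
}
 
export default function () {
  ws.connect("ws://localhost:8080/ws?room=loadtest", {}, (socket) => {
    socket.on("open", () => {
      // one message every 500ms for the life of the connection
      socket.setInterval(() => {
        socket.send(`hello from VU ${__VU}`)
      }, 500)
      socket.setTimeout(() => socket.close(), 60000)
    })
  })
}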

scenario 3: stock bot stress. 50 users all spamming /stock=AAPL and /stock=MSFT as fast as they can. the stock bot makes an external API call for each command, so this tests whether the bot goroutine can keep up or if it becomes a backpressure nightmare.
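
the script is the same shape with a different inner loop. a sketch, with 50ms pacing standing in for "as fast as they can":

import ws from "k6/ws"
 
export const options = {
  scenarios: {
    stock_stress: { executor: "constant-vus", vus: 50, duration: "2m" },
  },
}
 
export default function () {
  ws.connect("ws://localhost:8080/ws?room=loadtest", {}, (socket) => {
    socket.on("open", () => {
      const commands = ["/stock=AAPL", "/stock=MSFT"]
      // no think time; hammer the bot with alternating commands
      socket.setInterval(() => {
        socket.send(commands[Math.floor(Math.random() * commands.length)])
      }, 50)
      socket.setTimeout(() => socket.close(), 60000)
    })
  })
}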

scenario 4: spike test. start at 100 users, suddenly jump to 500, hold for a minute, drop back down. this is the most realistic failure mode: your server is humming along fine and then something happens and traffic quintuples.

scenario 5: soak test. 200 users for 30 minutes straight. not a huge load, but sustained. this is where memory leaks, goroutine leaks, and file descriptor exhaustion show up.
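
scenarios 4 and 5 are pure scheduling: the connect-and-chat body from scenario 2 stays the same and only the executor config changes. a sketch of both (the ramp durations are guesses, and you'd normally run each from its own script):

export const options = {
  scenarios: {
    spike: {
      executor: "ramping-vus",
      startVUs: 100,
      stages: [
        { duration: "1m", target: 100 }, // humming along
        { duration: "10s", target: 500 }, // traffic quintuples
        { duration: "1m", target: 500 }, // hold
        { duration: "30s", target: 100 }, // drop back down
      ],
    },
    soak: {
      executor: "constant-vus",
      vus: 200,
      duration: "30m",
    },
  },
}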

scenario 6: mixed workload. a realistic distribution: 60% lurkers (connected but silent), 30% active chatters, 10% stock bot users. this is closest to what real traffic actually looks like.
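
this is where k6's per-scenario exec option earns its keep: each population gets its own exported function. a sketch, with the VU counts scaled to a 200-user total and the 5-minute duration and payloads made up:

import ws from "k6/ws"
 
export const options = {
  scenarios: {
    lurkers: { executor: "constant-vus", vus: 120, duration: "5m", exec: "lurk" },
    chatters: { executor: "constant-vus", vus: 60, duration: "5m", exec: "chat" },
    stockers: { executor: "constant-vus", vus: 20, duration: "5m", exec: "stock" },
  },
}
 
function connect(onOpen) {
  ws.connect("ws://localhost:8080/ws?room=loadtest", {}, (socket) => {
    socket.on("open", () => {
      onOpen(socket)
      socket.setTimeout(() => socket.close(), 60000)
    })
  })
}
 
// 60% lurkers: connected but silent
export function lurk() {
  connect(() => {})
}
 
// 30% chatters: a message every 500ms
export function chat() {
  connect((socket) => socket.setInterval(() => socket.send("hey"), 500))
}
 
// 10% stock bot users: a command every second
export function stock() {
  connect((socket) => socket.setInterval(() => socket.send("/stock=AAPL"), 1000))
}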

what broke

the connection storm exposed the problem. around 500 concurrent connections, message latency started climbing. by 700, some connections saw 2-3 second delays on messages. by 900, connections timed out entirely.

i added pprof instrumentation and the bottleneck was obvious: the hub goroutine. Chattorum had a single hub that managed all connections across all rooms. every message, every join, every leave funneled through one goroutine's select loop.

// the old hub — single goroutine handling everything
func (h *Hub) run() {
    for {
        select {
        case client := <-h.register:
            h.clients[client] = true
        case client := <-h.unregister:
            delete(h.clients, client)
        case message := <-h.broadcast:
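            // fan-out to every client on the server; register and
            // unregister starve until this loop finishes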
            for client := range h.clients {
                client.send <- message
            }
        }
    }
}

that broadcast case is the killer. when you have 1000 clients and you're iterating over all of them to send a message, you're blocking the register and unregister channels the entire time. new connections queue up behind message broadcasts. it's a classic single-writer bottleneck.

the fix: sharding by room

instead of one hub for the whole server, each chat room gets its own hub goroutine. a room with 50 users only iterates over 50 clients on broadcast, not 1000. rooms operate independently. a busy room can't block joins in a quiet room.

type RoomHub struct {
    room       string
    clients    map[*Client]bool
    register   chan *Client
    unregister chan *Client
    broadcast  chan []byte
}
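 
// newRoomHub mirrors the old hub's channels, scoped to one room.
// (a sketch; run() is the same select loop as before, just over
// this room's clients, so it isn't repeated here)
func newRoomHub(name string) *RoomHub {
    return &RoomHub{
        room:       name,
        clients:    make(map[*Client]bool),
        register:   make(chan *Client),
        unregister: make(chan *Client),
        broadcast:  make(chan []byte),
    }
}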
 
// each room gets its own goroutine
func (s *Server) getOrCreateRoom(name string) *RoomHub {
    s.mu.Lock()
    defer s.mu.Unlock()
    if hub, ok := s.rooms[name]; ok {
        return hub
    }
    hub := newRoomHub(name)
    s.rooms[name] = hub
    go hub.run()
    return hub
}

after the sharding fix, i re-ran all six scenarios. the connection storm handled 1000 connections with sub-100ms message latency. the soak test ran for 30 minutes with flat memory usage. the stock bot stress test was still a little slow but that's the external API, not the hub.

the whole exercise took a day: half writing the k6 scripts, half fixing the hub. without the load tests i never would have found this. manual testing with 3 connections tells you nothing about what happens at 500.
