Our SSE stream was perfect in local dev. On Railway it crashed the Node process twice on the first day. The dashboard sidebar indicator blinked red every few minutes, users saw gaps in the live template list, and one crash took the whole container down until the health check rebooted it. Three separate bugs were stacked on top of each other: Railway's undocumented 5-minute HTTP cap killing requests mid-frame, a known Next.js App Router ResponseAborted throw when a write lands after client abort, and a Node MaxListenersExceededWarning every time HMR reloaded the bus module. This post is the exact code PDF4.dev ships for all three.
Why SSE and not WebSockets
PDF4.dev's dashboard pushes seven event types: template.created, template.updated, template.deleted, three matching component events, and log.created when a PDF render completes. That list is one-way. The browser never needs to write back on the same channel, because every user action already goes through the REST API. Server-Sent Events is the protocol that matches that shape. It is plain HTTP with Content-Type: text/event-stream, which means no upgrade handshake, no sticky-session dance, and no special proxy configuration. Railway's edge proxy forwards it as-is once X-Accel-Buffering: no disables response buffering.
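To make the wire format concrete, here is a minimal sketch of the headers and frames described above. The names sseHeaders, formatEvent, and formatComment are hypothetical, and the Cache-Control value is an assumption, not taken from the PDF4.dev route. Only Content-Type and X-Accel-Buffering come from the text.

```typescript
// Hypothetical helpers illustrating the SSE wire format; not the PDF4.dev source.

// Response headers for an SSE stream. Content-Type and X-Accel-Buffering are
// the two named in the post; Cache-Control is an assumed, commonly used addition.
const sseHeaders: Record<string, string> = {
  "Content-Type": "text/event-stream",
  "Cache-Control": "no-cache, no-transform",
  "X-Accel-Buffering": "no",
};

// One event frame: a single `data:` line terminated by a blank line.
// JSON.stringify never emits raw newlines, so the payload stays on one line.
function formatEvent(event: unknown): string {
  return `data: ${JSON.stringify(event)}\n\n`;
}

// A comment frame starts with a colon. EventSource ignores it, which makes it
// suitable for keepalives and markers such as ": connected".
function formatComment(text: string): string {
  return `: ${text}\n\n`;
}

const exampleFrame = formatEvent({ type: "template.updated", id: "tpl_123" });
```

Because comment frames never reach the client's message handler, they can be sent freely without the dashboard code having to filter them out.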
WebSockets would add a duplex channel we do not use. Polling every 3 seconds would cost 20 HTTP requests per minute per open tab and still feel laggy on log arrivals. Managed services like Pusher, Ably, Liveblocks, or Convex solve real-time generically, but they each charge per-connection or per-message and add a second vendor to the outage surface. The scaling path from in-process SSE is clear: swap the in-memory EventEmitter for a Redis pub/sub subscription (Upstash serverless Redis is the obvious fit on Railway), keep the same route handler, keep the same client hook. Boring is the goal.
Gotcha 1: Railway's 5-minute HTTP cap
Railway enforces a hard 5-minute timeout on every HTTP request that flows through its edge proxy, SSE included. The cap is not configurable, keepalive comments do not extend it, and the proxy cuts the request mid-frame when the clock hits 300 seconds. The canonical discussion is Railway Station thread 08117407, which documents the exact symptom we hit: AbortSignal fires from the proxy side, not from the client closing the tab.
The naive approach is to set a 15-second keepalive interval and hope the client's built-in EventSource retry covers the seam. It does not. When the proxy kills the request at 5 minutes, the browser's retry timer is whatever the last retry: field told it (default 3 seconds in Chromium, but implementation-defined), and the reconnect lands on an unpredictable schedule. Events published during the gap are lost because the in-process EventEmitter has no replay buffer. Users see the sidebar indicator flash red, sometimes for a fraction of a second, sometimes for several seconds if the retry coincides with a redeploy.
The fix has two halves. The server closes proactively at 4 minutes, giving itself a full minute of safety margin against Railway's ceiling. The client rotates its EventSource at 3 minutes 45 seconds so the client always opens the new connection before the server closes the old one. The server constant lives in app/api/v1/events/route.ts:
    // Railway enforces a hard 5-minute timeout on every HTTP request
    // (long polling and SSE alike). Keepalive comments do NOT extend it. We proactively
    // close the stream at 4 minutes so the client's scheduled reconnect picks
    // up cleanly before Railway kills the connection mid-frame.
    const MAX_STREAM_DURATION_MS = 4 * 60 * 1000;
    const maxDurationTimer = setTimeout(() => {
      send(": reconnect\n\n");
      cleanup();
    }, MAX_STREAM_DURATION_MS);

The client constant lives in hooks/useLiveEvents.ts:
    // Railway enforces a 5-minute hard cap on every HTTP request. The server
    // proactively closes the SSE stream at 4 minutes; we rotate the connection
    // at 3m45s so the client always owns the reconnect window and never blocks
    // on a surprise server-side drop mid-frame.
    const SCHEDULED_ROTATION_MS = 3 * 60 * 1000 + 45 * 1000;

The 15-second overlap between client rotation at 3:45 and server close at 4:00 is deliberate. The client opens a fresh EventSource first and only then closes the old one, so any event published during the handoff reaches the new subscriber before the old one unsubscribes. The server timer is a backstop for a lagging or frozen client: if the browser's setTimeout misses its slot (tab throttled in background, debugger paused, laptop resumed from sleep), the server still cuts the stream before Railway does, and the browser's built-in EventSource retry takes over. The 4-minute server timer is the more dependable of the two because it runs in a Node process we control and is immune to browser throttling.
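The handoff order can be sketched independently of the browser. The following is a minimal illustration, not the useLiveEvents.ts implementation: RotatingConnection and the Closable interface are hypothetical names, and a real hook would pass something like () => new EventSource(url) as the open callback and call rotate() from a timer set to SCHEDULED_ROTATION_MS.

```typescript
// Sketch of the "open new, then close old" handoff. All names here are
// hypothetical except the two timing constants quoted from the post.
interface Closable {
  close(): void;
}

const RAILWAY_CAP_MS = 5 * 60 * 1000;
const MAX_STREAM_DURATION_MS = 4 * 60 * 1000;
const SCHEDULED_ROTATION_MS = 3 * 60 * 1000 + 45 * 1000;

// Gap between the client's rotation and the server's proactive close.
const overlapMs = MAX_STREAM_DURATION_MS - SCHEDULED_ROTATION_MS;
// Safety margin between the server's close and Railway's hard cap.
const marginMs = RAILWAY_CAP_MS - MAX_STREAM_DURATION_MS;

class RotatingConnection<T extends Closable> {
  private current: T;
  private readonly open: () => T;

  constructor(open: () => T) {
    this.open = open;
    this.current = open();
  }

  // Open the replacement first, then close the old connection, so there is
  // never a moment with zero open connections.
  rotate(): void {
    const next = this.open();
    const previous = this.current;
    this.current = next;
    previous.close();
  }

  close(): void {
    this.current.close();
  }
}

// Usage with a fake connection that records what happened.
const handoffLog: string[] = [];
let nextConnectionId = 1;
const rotating = new RotatingConnection(() => {
  const id = nextConnectionId++;
  handoffLog.push(`open ${id}`);
  return {
    close: () => {
      handoffLog.push(`close ${id}`);
    },
  };
});
rotating.rotate();
```

After the single rotate() call, the log shows the second connection opening before the first one closes, which is the property the 15-second overlap relies on.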
Gotcha 2: the Next.js App Router ResponseAborted crash
This is the one that crashed the Node process. Next.js App Router has a known issue where writing to a ReadableStream controller after the client aborts throws ResponseAborted synchronously, and if nothing catches it, the error escapes the route handler into the runtime where it is treated as an unhandled rejection. The tracking threads are vercel/next.js#56529 and vercel/next.js discussion #61972. Both are open at the time of writing.
The failure is a race. A user closes a dashboard tab. The browser tears down the EventSource and aborts the HTTP request. The Next.js runtime fires the abort event on request.signal a few milliseconds later. In that tiny window, another request to PUT /api/v1/templates/:id finishes, calls liveBus.publish(organizationId, { type: "template.updated", ... }), and the bus synchronously invokes every subscriber on that channel. One of those subscribers is the (event) => send(JSON.stringify(event)) closure inside the aborted stream. The closure calls controller.enqueue, which throws ResponseAborted, and because the bus emit is synchronous, the throw propagates up into whichever request handler triggered the publish. That second request then fails, and because there is no surrounding try/catch, the error surfaces through Node's unhandledRejection path and the process dies. We saw it twice in the first 90 minutes of staging traffic.
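The propagation can be reproduced without Next.js. The sketch below uses a hypothetical TinyBus as a stand-in for the in-process bus and a listener that always throws as a stand-in for the aborted stream's closure. It only illustrates how a synchronous fan-out carries a listener's error into the publisher.

```typescript
// Minimal reproduction of the propagation, with no Next.js involved.
// TinyBus and handleTemplateUpdate are hypothetical stand-ins.
type Listener = (event: unknown) => void;

class TinyBus {
  private listeners: Listener[] = [];

  subscribe(listener: Listener): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  // Synchronous fan-out: a listener that throws unwinds into the caller.
  publish(event: unknown): void {
    this.listeners.slice().forEach((listener) => listener(event));
  }
}

const tinyBus = new TinyBus();

// Stand-in for the stream closure whose controller was already aborted.
tinyBus.subscribe(() => {
  throw new Error("ResponseAborted");
});

// Stand-in for the unrelated PUT handler that publishes after saving.
function handleTemplateUpdate(): string {
  try {
    tinyBus.publish({ type: "template.updated" });
    return "200 OK";
  } catch (error) {
    // Without this catch the error would escape the handler entirely.
    return `crashed: ${(error as Error).message}`;
  }
}

const publishOutcome = handleTemplateUpdate();
```

The handler that fails is the one that published, not the one that owned the dead stream, which is why the crash looked unrelated to SSE at first.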
The fix has three parts and the order matters.
First, a closed flag local to the stream closure. Every write checks the flag and short-circuits if the stream is already torn down. Every try/catch around enqueue sets the flag on error, so the first failed write marks the stream dead and every subsequent write becomes a no-op. Second, a send helper wraps the only writable point in the stream so nothing in the route handler can call controller.enqueue directly. Third, the cleanup function unsubscribes from the bus before it closes the controller, which is the single most important line in this file:
    const encoder = new TextEncoder();
    const stream = new ReadableStream<Uint8Array>({
      start(controller) {
        let closed = false;
        const send = (chunk: string) => {
          if (closed) return;
          try {
            controller.enqueue(encoder.encode(chunk));
          } catch {
            // Client aborted or stream already torn down. Swallow and mark
            // closed so subsequent writes short-circuit.
            closed = true;
          }
        };
        send(": connected\n\n");
        const keepalive = setInterval(() => {
          send(": keepalive\n\n");
        }, 15_000);
        const unsubscribe = liveBus.subscribe(organizationId, (event) => {
          send(`data: ${JSON.stringify(event)}\n\n`);
        });
        const cleanup = () => {
          if (closed) return;
          closed = true;
          // Order matters: unsubscribe FIRST so no late callbacks can reach
          // controller.enqueue after close() runs.
          unsubscribe();
          clearInterval(keepalive);
          clearInterval(sessionRecheck);
          clearTimeout(maxDurationTimer);
          try {
            controller.close();
          } catch {
            // already closed
          }
        };
        request.signal.addEventListener("abort", cleanup);
      },
    });

The order is unsubscribe() first, then clearInterval, then controller.close() inside its own try/catch. Reverse that and a synchronous publish landing between controller.close() and unsubscribe() reaches a closed controller and throws again, defeating the whole guard. The closed = true assignment at the top of cleanup also makes the function idempotent: the abort listener and the 4-minute timer can both fire on the same disconnect, and the second call becomes a no-op. The try/catch inside send is the second belt: even if a publish somehow beats the unsubscribe (for example, a listener invoked synchronously during emit() before off() had a chance to run), the enqueue failure is caught and the stream marks itself closed instead of crashing.
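One interaction in the excerpt seems worth guarding against: send's catch block and cleanup's early return read the same closed flag, so a write that fails before the abort listener runs would make cleanup return before it reaches unsubscribe(). The sketch below shows one possible way to keep both properties (writes short-circuit after a failure, cleanup still runs exactly once) by using two flags. FakeController, subscribe, and attachStream are hypothetical stand-ins, not the PDF4.dev code.

```typescript
// Hypothetical stand-ins: FakeController mimics a controller whose enqueue
// throws after abort; subscribe mimics liveBus.subscribe for one channel.
type Handler = (event: unknown) => void;

class FakeController {
  aborted = false;
  chunks: string[] = [];
  enqueue(chunk: string): void {
    if (this.aborted) throw new Error("ResponseAborted");
    this.chunks.push(chunk);
  }
}

const handlers: Handler[] = [];
function subscribe(handler: Handler): () => void {
  handlers.push(handler);
  return () => {
    const index = handlers.indexOf(handler);
    if (index !== -1) handlers.splice(index, 1);
  };
}

function attachStream(controller: FakeController) {
  let writable = true; // cleared by a failed write or by cleanup
  let cleanedUp = false; // set only by cleanup, so teardown always runs once

  const send = (chunk: string): void => {
    if (!writable) return;
    try {
      controller.enqueue(chunk);
    } catch {
      writable = false; // stop writing, but leave teardown to cleanup
    }
  };

  const unsubscribe = subscribe((event) => {
    send(`data: ${JSON.stringify(event)}\n\n`);
  });

  const cleanup = (): void => {
    if (cleanedUp) return;
    cleanedUp = true;
    writable = false;
    unsubscribe(); // first, so no late callback reaches the controller
  };

  return { send, cleanup };
}

const fake = new FakeController();
const attached = attachStream(fake);
attached.send(": connected\n\n"); // succeeds while the client is connected
fake.aborted = true; // client goes away
handlers.slice().forEach((h) => h({ type: "template.updated" })); // late publish
attached.cleanup(); // abort listener fires afterwards and still unsubscribes
attached.cleanup(); // second call is a no-op
```

The late publish is swallowed by send, and the subsequent cleanup still removes the subscriber, so nothing stays pinned on the bus after the client disconnects.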
Gotcha 3: EventEmitter leaks and the globalThis singleton
The third bug is the most embarrassing because it only shows up in dev. Every time Next.js HMR reloads lib/events/bus.ts, a fresh new LiveBus() instance is constructed and exported. Route modules that re-import after the reload see the new instance. Modules that have already closed over the old reference (a running SSE stream inside a still-live ReadableStream, for example) keep publishing and subscribing on the old bus, which is now orphaned. After 11 HMR reloads with at least one open dashboard tab, Node prints MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 listeners added. The tracking threads are vercel/next.js discussion #48427 and vercel/next.js discussion #68572.
Two independent fixes, because each one covers a different failure mode. First, pin the bus to globalThis so HMR cannot replace it. Second, set a concrete listener cap instead of disabling the check entirely. Both together, from lib/events/bus.ts:
    class LiveBus {
      private emitter = new EventEmitter();
      constructor() {
        // Keep Node's leak detector active by setting a high but concrete
        // cap. 1000 comfortably covers dozens of orgs with multiple
        // dashboard tabs each, and anything past that is a real leak.
        this.emitter.setMaxListeners(1000);
      }
      // ...
    }
    // Survive HMR in dev and guarantee a single bus across all route modules.
    const globalForBus = globalThis as typeof globalThis & {
      __pdf4LiveBus?: LiveBus;
    };
    if (!globalForBus.__pdf4LiveBus) {
      globalForBus.__pdf4LiveBus = new LiveBus();
    }
    export const liveBus: LiveBus = globalForBus.__pdf4LiveBus;

The globalThis pin is the standard Next.js dev-time fix: any module that grabs liveBus gets the same instance regardless of how many times the module file is re-evaluated, because the instance lives on a key of the global object, not in module scope. setMaxListeners(1000) is the detail most snippets get wrong. The lazy fix is setMaxListeners(0), which disables the warning entirely, and it is also the fix that lets a real leak run unchecked until the process runs out of heap. 1000 is chosen to cover the worst realistic case (a single deployment handling ~100 concurrent organizations with up to 10 tabs each, which is already past the MAX_CONNECTIONS_PER_USER cap) with a comfortable margin, while still failing loudly on a cleanup regression. If the warning ever fires with this cap in place, it is a real bug.
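The effect of the pin is easy to check in isolation. The sketch below is a hypothetical illustration: SketchBus is a stand-in with a plain array of listeners (not the EventEmitter-backed LiveBus, whose body is elided above), and pinToGlobal is an assumed helper name. It simulates two evaluations of the module and confirms that a subscriber registered through the first is reachable through the second.

```typescript
// Hypothetical sketch: a generic globalThis-pinned singleton plus a tiny bus
// that can report its listener count. Names are illustrative only.
type BusListener = (event: unknown) => void;

class SketchBus {
  private listeners: BusListener[] = [];

  subscribe(listener: BusListener): () => void {
    this.listeners.push(listener);
    return () => {
      this.listeners = this.listeners.filter((l) => l !== listener);
    };
  }

  publish(event: unknown): void {
    this.listeners.slice().forEach((listener) => listener(event));
  }

  listenerCount(): number {
    return this.listeners.length;
  }
}

// Returns the instance stored under `key` on globalThis, creating it once.
function pinToGlobal<T>(key: string, create: () => T): T {
  const store = globalThis as unknown as Record<string, unknown>;
  if (store[key] === undefined) {
    store[key] = create();
  }
  return store[key] as T;
}

// Simulate the module being evaluated twice, as HMR would do.
const firstEvaluation = pinToGlobal("__sketchBus", () => new SketchBus());
const secondEvaluation = pinToGlobal("__sketchBus", () => new SketchBus());

const receivedEvents: unknown[] = [];
const unsubscribeFirst = firstEvaluation.subscribe((event) => {
  receivedEvents.push(event);
});
secondEvaluation.publish({ type: "log.created" });
unsubscribeFirst();
```

A listener count that returns to zero after unsubscribe is also a cheap assertion to keep in a test suite, since it catches the cleanup regressions the listener cap is meant to surface.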
Client-side details that matter
The client hook looks short but carries three subtleties that took longer to get right than the server. First, exponential backoff with jitter. After a shared outage (a Railway edge restart, a brief DNS blip, a bad deploy rolled back), every connected client will try to reconnect at the same moment unless the retry delay is randomized. A thundering herd on a single Node instance is a real problem even at small numbers of connections, and the formula from useLiveEvents.ts costs nothing:
    const base = Math.min(1000 * 2 ** retry, 30_000);
    const jitter = base * 0.3 * (Math.random() * 2 - 1);
    const delay = Math.max(250, Math.round(base + jitter));

The base delay doubles per attempt and is capped at 30 seconds before jitter is applied, and the spread is ±30% of that base, so the longest possible delay is about 39 seconds. The 250ms floor is a defensive guard so that no combination of inputs can schedule a near-instant retry; with a 1-second minimum base, the jittered delay does not drop below 700ms in practice. The retry counter resets on onopen, which means a successful reconnect after a long outage puts the client back at the 1-second base delay for the next failure rather than carrying over the previous exponent.
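The bounds of that formula can be checked by making the random source injectable. This is the same arithmetic as the excerpt wrapped in a function; the name retryDelayMs and the random parameter are assumptions added for illustration.

```typescript
// Same formula as the excerpt, with the random source injectable so the
// bounds can be checked deterministically. retryDelayMs is a hypothetical name.
function retryDelayMs(retry: number, random: () => number = Math.random): number {
  const base = Math.min(1000 * 2 ** retry, 30_000);
  const jitter = base * 0.3 * (random() * 2 - 1);
  return Math.max(250, Math.round(base + jitter));
}

// First retry: base is 1000 ms, so the delay spans 700 ms to just under 1300 ms.
const lowestFirstRetry = retryDelayMs(0, () => 0);
const middleFirstRetry = retryDelayMs(0, () => 0.5);

// Late retries: the base is capped at 30_000 ms before jitter is applied.
const cappedMiddle = retryDelayMs(10, () => 0.5);
// Math.random() is always below 1, so this value is an upper bound only.
const cappedUpperBound = retryDelayMs(10, () => 1);
```

Passing a fixed random function in tests avoids flaky assertions while leaving production behavior unchanged, since the default is still Math.random.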
Second, the onEvent callback is tracked via a ref (onEventRef.current = onEvent) so callers can pass inline closures. Without the ref, every parent re-render would produce a new function identity, the useEffect would re-run because its dependency changed, the EventSource would close and reopen, and the sidebar indicator would flap. Third, the hook returns { status } so the sidebar can render a green, yellow, or red dot without subscribing directly to the EventSource. That status is what users actually see when any of the three gotchas above strikes, and keeping it visible is the only reason we caught the first production crash in under ten minutes.
Security hardening worth naming
The security audit flagged three guards as non-negotiable for a multi-tenant SSE endpoint.
- Sanitize error messages before publishing: the log.created event includes a nullable error field sourced from failed render logs. Raw stack traces contain container filesystem paths (/app/.next/server/...), source-location suffixes, and occasional environment leakage. lib/events/notify.ts strips those before the event reaches the bus.
- Re-validate the session every 30 seconds inside the stream: the route's SESSION_RECHECK_INTERVAL_MS = 30 * 1000 timer calls getSessionUser(request) again and tears down the stream if the user signed out, the session was revoked, or the user's organization changed. Without this, a signed-out user would keep receiving events for up to 4 minutes until the scheduled rotation.
- Per-user connection cap of 10: MAX_CONNECTIONS_PER_USER = 10 with an in-memory counter keyed on userId. Over the limit, the route returns 429 with Retry-After: 30. Browsers already cap at ~6 streams per origin, but a scripted client holding a valid session could otherwise open thousands of streams and pin listeners on the bus.
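Two of these guards can be sketched in a few lines. Everything below is hypothetical: tryAcquire, release, and sanitizeError are assumed names, and the regular expression is only one plausible way to mask path-like strings, not the contents of lib/events/notify.ts. Only the cap of 10 and the kinds of data being stripped come from the list above.

```typescript
// Hypothetical sketches of the connection cap and the error sanitizer.
const MAX_CONNECTIONS_PER_USER = 10;

// In-memory counter keyed on userId. Object.create(null) avoids collisions
// with inherited keys such as "constructor".
const openConnections: Record<string, number> = Object.create(null);

// Returns true and counts the connection if the user is under the cap.
// On false, the caller would respond with 429 and Retry-After: 30.
function tryAcquire(userId: string): boolean {
  const current = openConnections[userId] ?? 0;
  if (current >= MAX_CONNECTIONS_PER_USER) return false;
  openConnections[userId] = current + 1;
  return true;
}

// Must run once per successful tryAcquire, for example from stream cleanup.
function release(userId: string): void {
  const current = openConnections[userId] ?? 0;
  if (current <= 1) {
    delete openConnections[userId];
  } else {
    openConnections[userId] = current - 1;
  }
}

// Keeps only the first line of an error (dropping stack frames), masks
// anything that looks like an absolute path with an optional :line:col
// suffix, and truncates the result.
function sanitizeError(message: string | null): string | null {
  if (message === null) return null;
  const [firstLine = ""] = message.split("\n");
  return firstLine
    .replace(/(?:\/[\w.@-]+){2,}(?::\d+){0,2}/g, "[path]")
    .slice(0, 200);
}

const sanitized = sanitizeError(
  "Error: ENOENT at /app/.next/server/chunks/12.js:10:5\n    at render (/app/lib/x.js:1:1)"
);
```

Pairing every successful tryAcquire with a release in the stream's cleanup path matters as much as the cap itself; a missed release would slowly lock legitimate users out.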
What we would do differently at scale
Single-instance Railway holds up fine for PDF4.dev at current volume, because every connected dashboard tab is subscribed to the same in-process EventEmitter and publishes are synchronous memory operations. The moment we run more than one container behind a load balancer, a publish on one box reaches zero subscribers on the other. The planned migration is Upstash Redis pub/sub: replace LiveBus.publish with redis.publish(organizationId, JSON.stringify(event)) and LiveBus.subscribe with a per-connection redis.subscribe, keep the same route handler, keep the exact same client hook. No dashboard code changes, no protocol change, no client rewrite. Greenfield projects whose real-time needs start at "collaborative editing with presence cursors" should probably skip all of this and start on a sync engine like Convex or Zero (Rocicorp), which handle client-side state synchronization out of the box. For a one-way dashboard push, SSE and a hardened route handler are still the right answer.



