Insights

Deep dives into distributed systems, AI integration patterns, and backend optimization. Documenting the engineering journey.

Featured

SYSTEM_DESIGN 2024.10.28

Scaling WebSockets for Real-Time AI Inference Streaming

Architecting a resilient, low-latency WebSocket cluster using Go and Redis to handle thousands of concurrent generation streams without dropping frames.


ML_OPS 2024.09.15

Deploying Llama 3 on Edge: A Kubernetes Approach

Strategies for minimizing memory footprint and maximizing inference speed when deploying large language models on resource-constrained edge clusters.



DISTRIBUTED_SYS 2024.08.02

The Subtle Art of Distributed Cache Invalidation

Why simply deleting keys isn't enough when you have multi-region replicas. A dive into timestamp-based tombstones and conflict-free convergence.


PERFORMANCE 2024.06.19

Dynamic Batching: Getting 3x More Out of Your GPUs

How adaptive request batching and continuous batching reshaped our inference economics — and the latency trade-offs you need to watch.
