Insights
Deep dives into distributed systems, AI integration patterns, and backend optimization. Documenting the engineering journey.
SYSTEM_DESIGN 2024.10.28
Scaling WebSockets for Real-Time AI Inference Streaming
Architecting a resilient, low-latency WebSocket cluster using Go and Redis to handle thousands of concurrent generation streams without dropping frames.
Read post
ML_OPS 2024.09.15
Deploying Llama 3 on Edge: A Kubernetes Approach
Strategies for minimizing memory footprint and maximizing inference speed when deploying large language models on constrained edge clusters.
Read post
DISTRIBUTED_SYS 2024.08.02
The Subtle Art of Distributed Cache Invalidation
Why simply deleting keys isn't enough when you have multi-region replicas. A dive into timestamp-based tombstones and conflict-free convergence.
Read post
PERFORMANCE 2024.06.19
Dynamic Batching: Getting 3x More Out of Your GPUs
How adaptive request batching and continuous batching reshaped our inference economics, and the latency trade-offs you need to watch.
Read post