C++ · Linux · Kernel‑Bypass · FPGA · Available for HFT roles

I build systems that
win in the last microsecond.

I'm Dev Kumar — a C++ software engineer with 7+ years architecting high-performance, low-latency systems on Linux. I obsess over cache lines, lock-free queues, packet paths and flamegraphs. My work lives where software, hardware and markets collide — DPDK, Solarflare Onload, FPGA offload, modern C++20.

Location · Chicago, IL
Stack · C++20 · Linux · DPDK · ef_vi · FPGA · CME MDP 3.0
Timezone · CST · UTC−06:00
p99 latency target
< 2.4 μs
wire → application
feed throughput
500k/s
CME multicast, per handler
experience
7yrs
C++ · Linux · Networking
uptime delivered
99.99%
mission-critical infra
01 · ETHOS

Why microseconds matter — and why I chose them.

Low-latency isn't a niche. It's a way of thinking — where every cache miss, every syscall, every branch mispredict is a design decision you have to own.

I didn't stumble into low-latency engineering. I'm drawn to it because it's the one place where software, hardware and mathematics fuse into a single discipline.

"You can't optimise what you don't measure, and you can't measure what you don't understand. HFT is the sharpest version of that loop."
— my working principle

Every line of C++ I write is a negotiation with a CPU cache, a TLB, a branch predictor, and a NIC. Kernel syscalls are not free. Virtual calls are not free. Allocation on the hot path is definitely not free. I build as if each instruction has a receipt.

There's nothing like watching a flamegraph hotspot disappear after an hour with perf, or seeing p99 collapse by 40% after a ring-buffer realignment. Trading is the ultimate proving ground — your correctness and your speed both translate directly into basis points.

I want to build at that edge: market-data handlers that unpack CME MDP 3.0 at line rate, strategy engines whose tail latency doesn't drift, FPGA parsers that shave a handful of microseconds off a critical path — and C++ code that my future self is proud to debug at 3 a.m. during a live incident.

— Dev
02 · ARCHITECTURE

The hot path — wire to strategy to wire.

Every microsecond on the critical path has a home. Below is the data-flow I obsess over: kernel-bypass ingress, lock-free parsing, strategy hand-off, and FPGA-offloaded egress.

INGRESS · MARKET DATA — EXCHANGE (CME · NYSE · ICE, UDP multicast, ~1 ms fiber) → SOLARFLARE NIC (ef_vi · DPDK kernel bypass, ~200 ns) → FPGA PARSER (PCIe · DMA, MDP 3.0 decode, ~400 ns) → LOCK-FREE RING (SPSC, cache-aligned, zero-copy handoff, ~80 ns) → BOOK BUILDER (order-book state, C++20 · ranges, ~1.2 μs) → STRATEGY (signals · risk, coroutines, ~800 ns)
EGRESS · ORDER ENTRY — ORDER GENERATOR (pre-trade risk, FIX · binary) → FPGA TX PATH (AXI-Stream, wire format) → NIC TX (user-space stack, busy-poll) → EXCHANGE (matching engine)
The hot path is cache-aligned and lock-free; the FPGA and NIC stages are hardware-offloaded.
Typical ingress path I've worked on; egress path with FPGA offload. Values reflect representative budgets I've delivered against in production.
03 · LATENCY BUDGET

Where the microseconds actually go.

Two views of the same system. Left: traditional kernel stack. Right: the kernel-bypass re-architecture I deployed in production. Same hardware, fundamentally different budget.

Before  ·  kernel TCP/UDP stack

syscall · sockets
~4.8 μs
kernel context switch
~3.1 μs
interrupt / softirq
~1.8 μs
copy to user buffer
~2.0 μs
application parse
~0.3 μs
total wire → application: ≈ 12 μs  ·  dominated by kernel overhead and copies. Tail latency spikes whenever the scheduler or IRQ steals a core.

After  ·  kernel-bypass + FPGA

NIC → userspace (ef_vi)
~320 ns
FPGA decode · DMA
~420 ns
lock-free ring handoff
~80 ns
book builder (C++20)
~1.20 μs
strategy signal gen
~380 ns
total wire → application: ≈ 2.4 μs  ·  ≈ 5× reduction. Tail latency bounded by busy-poll and core isolation — no scheduler, no softirq, no copies.
04 · STACK

The toolbox I reach for instinctively.

Seven years of building at the intersection of modern C++, Linux internals, networking hardware and FPGA fabric.

Modern C++

  • C++11 / 14 / 17 / 20 — concepts, ranges, coroutines
  • Templates, SFINAE, metaprogramming
  • RAII, move semantics, smart pointers
  • Lock-free data structures, atomics, memory order
  • Cache-line alignment, false-sharing avoidance
  • STL, Boost, Google Benchmark

Linux & Systems

  • Kernel internals, scheduling, NUMA
  • CPU pinning, isolcpus, nohz_full
  • epoll / kqueue, io_uring
  • IPC, shared memory, signals
  • perf, ftrace, eBPF, flamegraphs
  • gdb, valgrind, address/thread sanitizers

Concurrency

  • std::thread, pthreads, thread pools
  • Atomics, memory ordering, happens-before
  • SPSC / MPMC lock-free queues
  • Reader–writer locks, seqlocks
  • Futures / promises, coroutines
  • OpenMP for compute-heavy kernels

Low-Latency Networking

  • TCP/IP, UDP multicast, raw sockets
  • DPDK, Solarflare ef_vi, Onload
  • Zero-copy I/O, busy-poll, SO_BUSY_POLL
  • Binary protocols, SBE, FIX, CME MDP 3.0
  • PCIe, DMA, NIC timestamping (PTP/hw)
  • Wireshark, tcpdump, bpftrace

FPGA / Hardware

  • PCIe DMA, AXI-Stream, AXI-Lite
  • Host ↔ fabric C++ interface design
  • CRC / checksum / filter offload
  • Protocol decode in RTL-adjacent C++
  • Hardware timestamping, PPS sync
  • Vivado / Verilator familiarity

Trading & Markets

  • CME MDP 3.0 multicast feeds
  • Order-book construction & maintenance
  • Exchange connectivity patterns
  • Pre-trade risk, throttles, kill switches
  • Futures / options pricing intuition
  • Volatility modelling, hedging logic

Build & Tooling

  • CMake, Bazel, Conan
  • gcc, clang, LTO, PGO
  • Git, GitHub Actions, Jenkins
  • Docker, systemd, Ansible
  • TDD, Catch2 / GTest
  • Continuous benchmarking pipelines

Systems Design

  • Event-driven architectures
  • Multi-tiered trading-grade apps
  • Deterministic latency guarantees
  • Failover, redundancy, hot-standby
  • Design patterns for performance
  • Observability without overhead
05 · EXPERIENCE

Seven years of shipping latency-sensitive software.

Each role deepened the same obsession — from kernel modules on factory floors to kernel-bypass stacks powering AI services.

MAR 2025 — PRESENT

C++ Software Engineer

Sprouts AI · Chicago, IL

Architected a multi-threaded C++20 backend on Linux — concepts, ranges and coroutines alongside lock-free queues, std::atomic primitives and custom thread pools — sustaining sub-millisecond p99 across 500K+ daily requests. Re-engineered hot paths with move semantics, cache-line alignment and custom allocators. Deployed DPDK user-space stack with FPGA-offloaded packet parsing over PCIe DMA, cutting round-trip latency from 12 μs to 2.4 μs.

2.4 μs p99 wire-to-app, from 12 μs
2.5× throughput uplift on hot path
−40% tail latency via flamegraph-driven tuning
C++20 · DPDK · FPGA · lock-free · coroutines · perf · valgrind
AUG 2023 — MAR 2025

Software Engineer — C++ Systems

Resilience Inc · Chicago, IL

Built high-performance C++ backend components for high-traffic endpoints on Linux, sustaining 10K+ concurrent connections with deterministic response times. Modernised the legacy codebase to C++20 — concepts, ranges, coroutines — and bound sockets to Solarflare ef_vi for user-space packet processing, cutting wire-to-application latency by 70%.

−70% wire-to-app latency (ef_vi)
−90% memory leaks after modernisation
10K+ concurrent connections / node
Solarflare Onload · ef_vi · C++20 · Google Benchmark · epoll · microservices
APR 2019 — AUG 2023

Software Engineer — C++ / Systems Programming

Tata Steel · Remote

Wrote low-level C++ device drivers and kernel-space modules for real-time industrial monitoring, achieving deterministic 5 ms response times across 50+ concurrent processes. Offloaded CRC validation and filtering to FPGA over PCIe/DMA for 8× throughput. Architected scalable multi-tiered networking with TCP/UDP sockets, epoll event loops and custom binary protocols — 99.99% uptime across distributed systems. Also built financial-analytics systems processing CME feeds with futures/options pricing and hedging logic.

8× throughput gain via FPGA offload
5 ms deterministic response, 50+ procs
99.99% uptime on mission-critical infra
FPGA · PCIe / DMA · kernel modules · epoll · binary protocols · CME feeds
06 · THE LAB

Projects that prove the point.

Where I reach beyond my day job — building HFT-grade infrastructure in the open, with measurable performance I can defend line by line.

PROJECT · 001 · actively maintained

CME MDP 3.0
Multicast Feed Handler

A production-grade CME MDP 3.0 market-data feed handler in C++17 on Linux. Decodes SBE-encoded messages from UDP multicast groups, rebuilds the order book and hands ticks off via a lock-free SPSC ring. Benchmarked at 500K+ packets/sec with sub-microsecond parse latency and zero-copy hot path.

UDP multicast · 500k pps · 60 s window
C++17 · UDP multicast · SBE · lock-free ring · zero-copy · Linux
PROJECT · 002 · actively maintained

TradeSentinel
Surveillance & Anomaly Detection

Real-time trade surveillance platform: low-latency C++ market-data ingestion, multi-threaded order-flow analysis, and anomaly detection for aberrant price action. Designed as a trading-grade monitoring layer — the kind of tooling a quant desk would actually run alongside their strategies.

price vs mid chart · anomalies flagged at σ > 6
C++ · real-time · multi-threading · order flow · alerts
07 · RESEARCH & EDU

Peer-reviewed, published, and always learning.

IEEE · INCOFT 2021

Comparative study of movie recommendation systems using feature engineering and an improved error function

Published IEEE paper exploring feature-engineering techniques and loss-function design for large-scale recommender systems. DOI: 10.1109/INCOFT55651.2022.10094480.

EDUCATION
Illinois Institute of Technology
M.S. in Computer Science
National Institute of Technology, Calicut
B.Tech in Computer Science
Continuous self-study
C++ standards-committee papers · HFT engineering talks · Agner Fog micro-architecture guides
08 · CONTACT · OPEN FOR OPPORTUNITIES

Let's talk about
latency.

Hiring for a C++ engineer to sit on the hot path? I'm actively interviewing and would love to hear what you're building.