C++ · Linux · Kernel‑Bypass · FPGA · Available for HFT roles

I build systems that
win in the last microsecond.

I'm Dev Kumar — a C++ software engineer with 7+ years architecting high-performance, low-latency systems on Linux. I obsess over cache lines, lock-free queues, packet paths and flamegraphs. My work lives where software, hardware and markets collide — DPDK, Solarflare Onload, FPGA offload, modern C++20.

Location · Chicago, IL
Stack · C++20 · Linux · DPDK · ef_vi · FPGA · CME MDP 3.0
Timezone · CST · UTC−06:00
p99 latency target
< 2.4 μs
wire → application
feed throughput
500k/s
CME multicast, per handler
experience
7yrs
C++ · Linux · Networking
uptime delivered
99.99%
mission-critical infra
01 · ETHOS

Why microseconds matter — and why I chose them.

Low-latency isn't a niche. It's a way of thinking — where every cache miss, every syscall, every branch mispredict is a design decision you have to own.

I didn't stumble into low-latency engineering. I'm drawn to it because it's the one place where software, hardware and mathematics fuse into a single discipline.

"You can't optimise what you don't measure, and you can't measure what you don't understand. HFT is the sharpest version of that loop."
— my working principle

Every line of C++ I write is a negotiation with a CPU cache, a TLB, a branch predictor, and a NIC. Kernel syscalls are not free. Virtual calls are not free. Allocation on the hot path is definitely not free. I build as if each instruction has a receipt.

There's nothing like watching a flamegraph hotspot disappear after an hour with perf, or seeing p99 collapse by 40% after a ring-buffer realignment. Trading is the ultimate proving ground — your correctness and your speed both translate directly into basis points.

I want to build at that edge: market-data handlers that unpack CME MDP 3.0 at line rate, strategy engines whose tail latency doesn't drift, FPGA parsers that shave a handful of microseconds off a critical path — and C++ code that my future self is proud to debug at 3 a.m. during a live incident.

— Dev
02 · ARCHITECTURE

The hot path — wire to strategy to wire.

Every microsecond on the critical path has a home. Below is the data-flow I obsess over: kernel-bypass ingress, lock-free parsing, strategy hand-off, and FPGA-offloaded egress.

INGRESS · MARKET DATA — EXCHANGE (CME · NYSE · ICE, UDP multicast, ~1 ms fiber) → SOLARFLARE NIC (ef_vi · DPDK kernel bypass, ~200 ns) → FPGA PARSER (PCIe · DMA, MDP 3.0 decode, ~400 ns) → LOCK-FREE RING (SPSC, cache-aligned, zero-copy handoff, ~80 ns) → BOOK BUILDER (order-book state, C++20 · ranges, ~1.2 μs) → STRATEGY (signals · risk, coroutines, ~800 ns)
EGRESS · ORDER ENTRY — ORDER GENERATOR (pre-trade risk, FIX · binary) → FPGA TX PATH (AXI-Stream, wire format) → NIC TX (user-space stack, busy-poll) → EXCHANGE (matching engine)
The hot path is cache-aligned and lock-free; the FPGA and NIC stages are hardware-offloaded.
Typical ingress path I've worked on; egress path with FPGA offload. Values reflect representative budgets I've delivered against in production.
03 · LATENCY BUDGET

Where the microseconds actually go.

Two views of the same system. Left: traditional kernel stack. Right: the kernel-bypass re-architecture I deployed in production. Same hardware, fundamentally different budget.

Before  ·  kernel TCP/UDP stack

syscall · sockets
~4.8 μs
kernel context switch
~3.1 μs
interrupt / softirq
~1.8 μs
copy to user buffer
~2.0 μs
application parse
~0.3 μs
total wire → application: ≈ 12 μs  ·  dominated by kernel overhead and copies. Tail latency spikes whenever the scheduler or IRQ steals a core.

After  ·  kernel-bypass + FPGA

NIC → userspace (ef_vi)
~320 ns
FPGA decode · DMA
~420 ns
lock-free ring handoff
~80 ns
book builder (C++20)
~1.20 μs
strategy signal gen
~380 ns
total wire → application: ≈ 2.4 μs  ·  ≈ 5× reduction. Tail latency bounded by busy-poll and core isolation — no scheduler, no softirq, no copies.
04 · STACK

The toolbox I reach for instinctively.

Seven years of building at the intersection of modern C++, Linux internals, networking hardware and FPGA fabric.

Modern C++

  • C++11 / 14 / 17 / 20 — concepts, ranges, coroutines
  • Templates, SFINAE, metaprogramming
  • RAII, move semantics, smart pointers
  • Lock-free data structures, atomics, memory order
  • Cache-line alignment, false-sharing avoidance
  • STL, Boost, Google Benchmark

Linux & Systems

  • Kernel internals, scheduling, NUMA
  • CPU pinning, isolcpus, nohz_full
  • epoll / kqueue, io_uring
  • IPC, shared memory, signals
  • perf, ftrace, eBPF, flamegraphs
  • gdb, valgrind, address/thread sanitizers

Concurrency

  • std::thread, pthreads, thread pools
  • Atomics, memory ordering, happens-before
  • SPSC / MPMC lock-free queues
  • Reader–writer locks, seqlocks
  • Futures / promises, coroutines
  • OpenMP for compute-heavy kernels

Low-Latency Networking

  • TCP/IP, UDP multicast, raw sockets
  • DPDK, Solarflare ef_vi, Onload
  • Zero-copy I/O, busy-poll, SO_BUSY_POLL
  • Binary protocols, SBE, FIX, CME MDP 3.0
  • PCIe, DMA, NIC timestamping (PTP/hw)
  • Wireshark, tcpdump, bpftrace

FPGA / Hardware

  • PCIe DMA, AXI-Stream, AXI-Lite
  • Host ↔ fabric C++ interface design
  • CRC / checksum / filter offload
  • Protocol decode in RTL-adjacent C++
  • Hardware timestamping, PPS sync
  • Vivado / Verilator familiarity

Trading & Markets

  • CME MDP 3.0 multicast feeds
  • Order-book construction & maintenance
  • Exchange connectivity patterns
  • Pre-trade risk, throttles, kill switches
  • Futures / options pricing intuition
  • Volatility modelling, hedging logic

Build & Tooling

  • CMake, Bazel, Conan
  • gcc, clang, LTO, PGO
  • Git, GitHub Actions, Jenkins
  • Docker, systemd, Ansible
  • TDD, Catch2 / GTest
  • Continuous benchmarking pipelines

Systems Design

  • Event-driven architectures
  • Multi-tiered trading-grade apps
  • Deterministic latency guarantees
  • Failover, redundancy, hot-standby
  • Design patterns for performance
  • Observability without overhead
05 · EXPERIENCE

Seven years of shipping latency-sensitive software.

Each role deepened the same obsession — from kernel modules on factory floors to kernel-bypass stacks powering AI services.

MAR 2025 — PRESENT

C++ Software Engineer

Sprouts AI · Chicago, IL

Architected a multi-threaded C++20 backend on Linux — concepts, ranges and coroutines alongside lock-free queues, std::atomic primitives and custom thread pools — sustaining sub-millisecond p99 across 500K+ daily requests. Re-engineered hot paths with move semantics, cache-line alignment and custom allocators. Deployed DPDK user-space stack with FPGA-offloaded packet parsing over PCIe DMA, cutting round-trip latency from 12 μs to 2.4 μs.

2.4 μs p99 wire-to-app, from 12 μs
2.5× throughput uplift on hot path
−40% tail latency via flamegraph-driven tuning
C++20 · DPDK · FPGA · lock-free · coroutines · perf · valgrind
AUG 2023 — MAR 2025

Software Engineer — C++ Systems

Resilience Inc · Chicago, IL

Built high-performance C++ backend components for high-traffic endpoints on Linux, sustaining 10K+ concurrent connections with deterministic response times. Modernised the legacy codebase to C++20 — concepts, ranges, coroutines — and bound sockets to Solarflare ef_vi for user-space packet processing, cutting wire-to-application latency by 70%.

−70% wire-to-app latency (ef_vi)
−90% memory leaks after modernisation
10K+ concurrent connections / node
Solarflare Onload · ef_vi · C++20 · Google Benchmark · epoll · microservices
APR 2019 — AUG 2023

Software Engineer — C++ / Systems Programming

Tata Steel · Remote

Wrote low-level C++ device drivers and kernel-space modules for real-time industrial monitoring, achieving deterministic 5 ms response times across 50+ concurrent processes. Offloaded CRC validation and filtering to FPGA over PCIe/DMA for 8× throughput. Architected scalable multi-tiered networking with TCP/UDP sockets, epoll event loops and custom binary protocols — 99.99% uptime across distributed systems. Also built financial-analytics systems processing CME feeds with futures/options pricing and hedging logic.

8× throughput gain via FPGA offload
5 ms deterministic response, 50+ procs
99.99% uptime on mission-critical infra
FPGA · PCIe / DMA · kernel modules · epoll · binary protocols · CME feeds
06 · THE LAB

Projects that prove the point.

Where I reach beyond my day job — building HFT-grade infrastructure in the open, with measurable performance I can defend line by line.

PROJECT · 001 · actively maintained

CME MDP 3.0
Multicast Feed Handler

A production-grade CME MDP 3.0 market-data feed handler in C++17 on Linux. Decodes SBE-encoded messages from UDP multicast groups, rebuilds the order book and hands ticks off via a lock-free SPSC ring. Benchmarked at 500K+ packets/sec with sub-microsecond parse latency and zero-copy hot path.

UDP multicast · 500k pps · 60 s window
C++17 · UDP multicast · SBE · lock-free ring · zero-copy · Linux
PROJECT · 002 · actively maintained

TradeSentinel
Surveillance & Anomaly Detection

Real-time trade surveillance platform: low-latency C++ market-data ingestion, multi-threaded order-flow analysis, and anomaly detection for aberrant price action. Designed as a trading-grade monitoring layer — the kind of tooling a quant desk would actually run alongside their strategies.

price vs mid chart · anomalies flagged at σ > 6
C++ · real-time · multi-threading · order flow · alerts
07 · RESEARCH & EDU

Peer-reviewed, published, and always learning.

IEEE · INCOFT 2021

Comparative study of movie recommendation systems using feature engineering and an improved error function

Published IEEE paper exploring feature-engineering techniques and loss-function design for large-scale recommender systems. DOI: 10.1109/INCOFT55651.2022.10094480.

EDUCATION
Illinois Institute of Technology
M.S. in Computer Science
National Institute of Technology, Calicut
B.Tech in Computer Science
Continuous self-study
C++ standards-committee papers · HFT engineering talks · Agner Fog micro-architecture guides
08 · CONTACT · OPEN FOR OPPORTUNITIES

Let's talk about
latency.

Hiring for a C++ engineer to sit on the hot path? I'm actively interviewing and would love to hear what you're building.