Short intro, then most of our time on the project.
Background and the systems work I do.
How it finds wasted GPU time, validates fixes, and banks the savings.
Where AI infra is going — and why I want to build it.
Ph.D. in applied math. I've worked on algorithms, ML, data infra, databases, platforms, and infrastructure — startups to big companies. Now a production engineer at Meta on ML-training efficiency and reliability.
I bridge researchers and engineers — turning research needs into reliable systems, and system limits into feedback.
GPU hours aren't just cost — they're often the limit on how fast ML teams train, evaluate, and iterate.
Compute is a planning bottleneck. Saving GPU hours = creating capacity.
Shorter jobs = faster debugging, optimization, and iteration.
In H1 I used Claude Code to automate the optimization work. Now I'm working on Amdahl Loop to take myself out of the loop.
Performance has been my throughline: algorithms, ETL, databases, low-latency systems, ML platforms, GPU jobs.
With Claude Code and Codex: more hypotheses in flight, faster code understanding, more rigorous A/B tests.
Performance work needs codebase, business-logic, infra, and latency expertise in one head. Even then, win rates are low.
Correlation ≠ causation. Data shift and dependency noise bury the signal, so you need controlled A/B tests.
Hours to days per experiment, and single-threaded. One expert, one hypothesis at a time.
Using 4 GPUs for illustration. Only the green intervals are GPU-productive — but you pay for all GPU hours.
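A minimal sketch of the accounting behind that picture, assuming a job that holds all 4 GPUs for its whole lifetime. The phase names and durations are invented for illustration and are not measurements from any real job.

```python
# Hypothetical timeline for one 4-GPU job: (phase, wall-clock hours, GPU-productive?).
# All numbers are made up for illustration.
PHASES = [
    ("warmup", 0.5, False),
    ("train", 1.0, True),
    ("input_stall", 0.5, False),
    ("train", 1.0, True),
    ("checkpoint", 1.0, False),
]


def gpu_hours(phases, num_gpus):
    """Return (paid, productive) GPU hours for a job that holds num_gpus throughout."""
    paid = sum(hours for _, hours, _ in phases) * num_gpus
    productive = sum(hours for _, hours, useful in phases if useful) * num_gpus
    return paid, productive


paid, productive = gpu_hours(PHASES, num_gpus=4)
print(f"paid={paid} productive={productive} useful={productive / paid:.0%}")
```

In this made-up timeline, half of the paid GPU hours do productive work.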
Steady-state compute is ~half the timeline, so even an infinitely fast kernel caps end-to-end speedup near 2×. Cutting the serial phases (queue, warmup, input, checkpoint) pays more.
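The 2× cap is Amdahl's law. A small sketch, assuming exactly half of the wall-clock time is steady-state compute; the function name `amdahl_speedup` is my own label, not something from the project.

```python
import math


def amdahl_speedup(accelerated_fraction, factor):
    """End-to-end speedup when `accelerated_fraction` of the runtime gets `factor` times faster."""
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / factor
    return 1.0 / remaining


# Half the timeline is compute (assumed); the other half is untouched by kernel work.
print(amdahl_speedup(0.5, 2.0))       # kernel twice as fast
print(amdahl_speedup(0.5, math.inf))  # infinitely fast kernel: 2.0
```

Whatever the kernel speedup, the untouched half of the timeline sets the ceiling, which is why the serial phases are the better target.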
SM utilization is usually low. Real wins come from the mix of GPU, CPU, network, I/O, locality, and algorithm. A fast kernel can't help a data-starved GPU.
Kernel tuning is a specialist arms race. My leverage is infra + systems — scheduling, pipelines, storage, orchestration. Exactly where the waste lives.
Make performance work stateful and repeatable — every run finds a validated win or records why one failed.
Ranks the fleet by wasted time, reads code, classifies the bottleneck — intuition becomes a repeatable phase.
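One way the ranking step could look, as a sketch only. `JobStats`, its fields, and the sample jobs are hypothetical stand-ins; the real system's data model is not shown in these slides.

```python
from dataclasses import dataclass


@dataclass
class JobStats:
    """Hypothetical per-job summary; field names are illustrative."""
    name: str
    gpu_hours_paid: float
    gpu_hours_productive: float

    @property
    def wasted(self) -> float:
        return self.gpu_hours_paid - self.gpu_hours_productive


def rank_by_waste(jobs):
    """Sort jobs so the largest absolute GPU-hour waste comes first."""
    return sorted(jobs, key=lambda job: job.wasted, reverse=True)


# Made-up fleet, for illustration only.
fleet = [
    JobStats("eval-nightly", gpu_hours_paid=400.0, gpu_hours_productive=100.0),
    JobStats("pretrain-a", gpu_hours_paid=1000.0, gpu_hours_productive=900.0),
    JobStats("finetune-b", gpu_hours_paid=200.0, gpu_hours_productive=150.0),
]
for job in rank_by_waste(fleet):
    print(job.name, job.wasted)
```

Ranking by absolute waste rather than by utilization percentage puts the jobs with the most recoverable GPU hours first.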
Baseline vs variant on the same job, then N refute checks — a win survives only if the majority agrees.
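The majority rule can be sketched as below. I am assuming a strict majority, so ties and an empty set of checks both reject the win; the slides do not say how the real system handles either case.

```python
def survives(check_results):
    """True only if a strict majority of refute checks still support the win.

    `check_results` is an iterable of booleans, one per refute check
    (True means the check failed to refute the win).
    """
    votes = list(check_results)
    return sum(votes) * 2 > len(votes)


print(survives([True, True, False]))   # 2 of 3 agree
print(survives([True, False]))         # tie, treated as not surviving (assumption)
```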
Submits return a worker_run_id; the next cycle reconciles it. The ledger lets every cycle resume cold.
Verdicts are remembered, so nothing re-runs; prior wins inject fresh ideas — many experiments run in parallel, not one per expert.
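The submit, reconcile, and remember steps above can be sketched as follows. This is an assumption-heavy toy: the JSON-lines file, the `Ledger` class, and the `submit` and `fetch_verdict` callables are all hypothetical. Only the `worker_run_id` field name comes from the slides.

```python
import json
from pathlib import Path


class Ledger:
    """Append-only JSON-lines ledger; the latest record per hypothesis wins."""

    def __init__(self, path):
        self.path = Path(path)

    def record(self, hypothesis, **fields):
        with self.path.open("a") as f:
            f.write(json.dumps({"hypothesis": hypothesis, **fields}) + "\n")

    def state(self):
        """Rebuild current state from disk, so a cycle can start cold."""
        latest = {}
        if self.path.exists():
            for line in self.path.read_text().splitlines():
                entry = json.loads(line)
                latest[entry["hypothesis"]] = entry
        return latest


def run_cycle(ledger, hypotheses, submit, fetch_verdict):
    """One cycle: submit new hypotheses, reconcile pending runs, skip settled ones."""
    state = ledger.state()
    for hypothesis in hypotheses:
        entry = state.get(hypothesis)
        if entry is None:
            # New idea: submit it and remember the run id for the next cycle.
            ledger.record(hypothesis, worker_run_id=submit(hypothesis), verdict=None)
        elif entry["verdict"] is None:
            # Pending run: reconcile it if the worker has finished.
            verdict = fetch_verdict(entry["worker_run_id"])
            if verdict is not None:
                ledger.record(
                    hypothesis,
                    worker_run_id=entry["worker_run_id"],
                    verdict=verdict,
                )
        # Otherwise a verdict is already recorded, so nothing re-runs.
```

Because all state lives in the ledger file, `run_cycle` holds nothing in memory between calls, and a failed verdict is recorded the same way as a win.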
Eval jobs now run twice as fast on half the GPUs — a net 75% cut in GPU hours.
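The 75% figure follows from GPU hours = GPUs × wall-clock hours. A quick check of the arithmetic; the helper name is mine.

```python
def gpu_hour_cut(speedup, gpu_ratio):
    """Fractional GPU-hour saving when a job runs `speedup` times faster
    on `gpu_ratio` times as many GPUs."""
    return 1.0 - gpu_ratio / speedup


# Twice as fast on half the GPUs: 0.5 * 0.5 = 0.25 of the original GPU hours.
print(gpu_hour_cut(speedup=2.0, gpu_ratio=0.5))  # 0.75
```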
A personal prediction: most of the world's intelligence will be AI-generated — and cost will decide who gets to use it.
I'm a builder and an optimizer — I keep coming back to performance because I love making systems faster.