AGENDA

Presentation flow

Short intro, then most of our time on the project.

01

Self introduction

Background and the systems work I do.

02

Amdahl Loop project deep dive

How it finds wasted GPU time, validates fixes, and banks the savings.

03

Why RadixArk

Where AI infra is going — and why I want to build it.

BACKGROUND

Self introduction

Background

Ph.D. in applied math. I've worked on algorithms, ML, data infra, databases, platforms, and infrastructure — startups to big companies. Now a production engineer at Meta on ML-training efficiency and reliability.

I bridge researchers and engineers — turning research needs into reliable systems, and system limits into feedback.

Engineering philosophy

  • Faster feedback loop: shorten learning, debugging, and optimization cycles.
  • Verification: no change without verification.
  • Rigor: correlation is not causation.
PROJECT CONTEXT

Performance optimization for HPC jobs

GPU hours aren't just cost — they're often the limit on how fast ML teams train, evaluate, and iterate.

01

GPU-hours are scarce

Compute is a planning bottleneck. Saving GPU-hours = creating capacity.

02

Faster MLE loops

Shorter jobs = faster debugging, optimization, and iteration.

Every optimization pays twice: lower cost today, and more experiments with faster iteration tomorrow.
PROJECT DEEP DIVE

Amdahl Loop: AI-native performance optimization

In H1 I used Claude Code to automate the optimization work. Now I'm building Amdahl Loop to take myself out of the loop.

01

An old problem

Performance has been my throughline: algorithms, ETL, databases, low-latency systems, ML platforms, GPU jobs.

02

Coding agents raised the ceiling

With Claude Code and Codex: more hypotheses in flight, faster code understanding, more rigorous A/B.

$30M saved out of $40M total on offline eval  ·  50% fewer GPU hours at 2× throughput  ·  additional $40M savings identified
MOTIVATION

Why traditional optimization work is hard

01

Heroic guesswork

Needs codebase + business logic + infra + latency expertise in one head. Even then, win rates are low.

02

Verification is hard

Correlation ≠ causation. Data shift and dependency noise bury the signal — you need controlled A/B.

03

Hostile to human attention

Hours to days per experiment, and single-threaded. One expert, one hypothesis at a time.

The bottleneck is expert attention: low win rate × single-threaded × hours per try = a trickle of wins.
ANATOMY OF AN HPC JOB

Anatomy of a job: where accelerator time goes

Using 4 GPUs for illustration. Only the green intervals are GPU-productive — but you pay for all GPU hours.

[Figure: per-GPU timeline, time running left to right. Legend: idle (GPU allocated, doing nothing); low utilization (input-bound / partial occupancy); high utilization (productive compute).]
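
A rough sketch of the accounting behind this picture: total the time each GPU spends in each state and report the productive share of the allocation. The timelines, state names, and helper functions are hypothetical, made up for illustration; real numbers would come from fleet telemetry.

```python
from collections import defaultdict

# Hypothetical timelines: one list of (state, minutes) segments per GPU.
# States mirror the legend: "idle", "low", "high".
TIMELINES = [
    [("idle", 10), ("low", 5), ("high", 30), ("idle", 15)],
    [("idle", 10), ("low", 10), ("high", 25), ("idle", 15)],
    [("idle", 20), ("high", 30), ("idle", 10)],
    [("idle", 10), ("low", 5), ("high", 30), ("idle", 15)],
]


def minutes_by_state(timelines):
    """Sum the minutes spent in each state across all GPUs."""
    totals = defaultdict(float)
    for gpu in timelines:
        for state, minutes in gpu:
            totals[state] += minutes
    return dict(totals)


def productive_fraction(timelines):
    """Share of allocated GPU-minutes spent in the "high" state."""
    totals = minutes_by_state(timelines)
    allocated = sum(totals.values())
    return totals.get("high", 0.0) / allocated if allocated else 0.0


totals = minutes_by_state(TIMELINES)
print(totals)
print(f"productive share: {productive_fraction(TIMELINES):.0%}")
```

The bill covers every allocated minute, so the idle and low-utilization totals are the optimization budget.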
STRATEGY

Why I do not chase pure GPU FLOPs

Even infinitely fast compute still pays the fixed serial phases, so end-to-end speedup caps near 2×.
[Figure: bar chart of end-to-end speedup; the serial floor is always paid. Today (serial + compute): 1.0×. 2× faster compute: 1.33×. ∞ fast compute: 2.0× ceiling.]
01

Amdahl's law

Steady-state compute is ~half the timeline, so even an infinitely fast kernel caps near 2× end-to-end. Trimming the serial phases (queue, warmup, input, checkpoint) pays more.
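
The numbers in the chart follow directly from Amdahl's law. A minimal sketch, assuming compute is exactly half the timeline (the slide says "~half"); the function name is illustrative:

```python
import math


def amdahl_speedup(parallel_fraction: float, factor: float) -> float:
    """End-to-end speedup when only `parallel_fraction` of the runtime
    becomes `factor` times faster and the rest stays fixed."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if factor <= 0:
        raise ValueError("factor must be positive")
    serial = 1.0 - parallel_fraction
    denominator = serial + parallel_fraction / factor
    # With no serial part and infinite speedup there is no finite ceiling.
    return math.inf if denominator == 0 else 1.0 / denominator


# Assumption: steady-state compute is exactly half the timeline.
for factor in (1.0, 2.0, 10.0, math.inf):
    speedup = amdahl_speedup(0.5, factor)
    print(f"compute {factor}x faster -> {speedup:.2f}x end-to-end")
```

Shrinking the serial half instead raises the ceiling itself, which is why the serial phases are the better target.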

02

Optimization is system balance

SM utilization is usually low. Real wins come from the mix of GPU, CPU, network, I/O, locality, and algorithm. A fast kernel can't help a data-starved GPU.

03

Play to my edge

Kernel tuning is a specialist arms race. My leverage is infra + systems — scheduling, pipelines, storage, orchestration. Exactly where the waste lives.

Optimize the whole job timeline, not just FLOPs — that's where the accelerator-minutes hide.
WORKFLOW MECHANICS

How Amdahl Loop works

Make performance work stateful and repeatable — every run finds a validated win or records why one failed.

propose

  • scan: rank fleet jobs by idle-accelerator waste; pick one
  • profile: BPF, PyTorch Profiler, NCU; correlate dynamic info with static code analysis
  • hypothesize: classify the bottleneck → {needs_build, change}
  • submit: submit A/B experiments with the change

verify

  • collect: metrics for finished experiments
  • compare: paired A/B, baseline vs variant
  • refute: N adversarial checks; majority must agree
  • promote: human gate reviews surviving wins; verdict is always recorded

dream

  • search: periodically search optimization wins inside and outside the company
  • inject: seed fresh hypotheses into scan, so new ideas keep entering Amdahl Loop

Persistent memory across every step: refuted hypotheses are never retried; bottleneck analysis refreshes each cycle.
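
The scan step could look something like the sketch below: rank jobs by idle accelerator-hours and pick the worst offender. The `Job` fields, the waste formula, and the sample fleet are assumptions for illustration; the deck does not say how waste is measured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    gpus: int
    hours: float
    idle_fraction: float  # share of allocated time the GPUs sat idle

    @property
    def idle_gpu_hours(self) -> float:
        """Allocated GPU-hours that did no work."""
        return self.gpus * self.hours * self.idle_fraction


def rank_by_waste(jobs):
    """Return jobs sorted by idle GPU-hours, largest first."""
    return sorted(jobs, key=lambda job: job.idle_gpu_hours, reverse=True)


# Hypothetical fleet snapshot.
FLEET = [
    Job("eval-a", gpus=8, hours=10.0, idle_fraction=0.50),
    Job("train-b", gpus=64, hours=24.0, idle_fraction=0.05),
    Job("eval-c", gpus=16, hours=4.0, idle_fraction=0.25),
]

ranked = rank_by_waste(FLEET)
target = ranked[0]
print([job.name for job in ranked], "->", target.name)
```

Ranking by absolute idle hours rather than idle percentage means a large job with a small idle share can still outrank a small job that idles half the time.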
MECHANISMS

Every pain point gets a mechanism

01 Heroic guesswork

The agent does scan + hypothesize

Ranks the fleet by wasted time, reads code, classifies the bottleneck — intuition becomes a repeatable phase.

02 Correlation ≠ causation

Paired A/B + adversarial refute

Baseline vs variant on the same job, then N refute checks — a win survives only if the majority agrees.
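
A minimal sketch of the compare and refute steps, assuming runtimes are collected in matched pairs. The sample data, the 5% noise floor, and the three checks are hypothetical stand-ins; the real adversarial checks are not described in the deck.

```python
from statistics import mean


def pair_gains(baseline_s, variant_s):
    """Relative runtime reduction per baseline/variant pair (positive = faster)."""
    if len(baseline_s) != len(variant_s) or not baseline_s:
        raise ValueError("need equal-length, non-empty paired samples")
    return [(b - v) / b for b, v in zip(baseline_s, variant_s)]


def survives_refutation(votes):
    """True only if a strict majority of refute checks say the win holds."""
    return sum(votes) > len(votes) / 2


# Hypothetical runtimes in seconds; each pair ran on the same input.
baseline = [100.0, 120.0, 80.0, 200.0]
variant = [80.0, 90.0, 72.0, 150.0]

gains = pair_gains(baseline, variant)
gain = mean(gains)

NOISE_FLOOR = 0.05  # assumed threshold below which a gain counts as noise
votes = [
    gain > NOISE_FLOOR,                      # effect is above the noise floor
    all(g > 0 for g in gains),               # every pair improved
    mean(sorted(gains)[:-1]) > NOISE_FLOOR,  # still a win without the best pair
]
print(f"gain={gain:.1%}, survives={survives_refutation(votes)}")
```

Pairing on the same input is what guards against input drift: both arms see the same workload, so a shift in the data cannot masquerade as a speedup.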

03 Hours-to-days loop

Fully async, cron-paced

Submits return a worker_run_id; the next cycle reconciles it. The ledger lets every cycle resume cold.
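
One way to picture the async pattern: `submit` writes a pending entry and returns at once, and a later cycle reconciles it using nothing but the ledger. This is a sketch under assumptions: the JSON file layout is invented, and the `finished` dict stands in for whatever the scheduler reports. Only the `worker_run_id` name comes from the slide.

```python
import json
import tempfile
import uuid
from pathlib import Path


def load(ledger: Path) -> dict:
    """Read the ledger, or start empty if it does not exist yet."""
    return json.loads(ledger.read_text()) if ledger.exists() else {}


def submit(ledger: Path, hypothesis: str) -> str:
    """Record a pending run and return immediately with its id."""
    runs = load(ledger)
    worker_run_id = uuid.uuid4().hex
    runs[worker_run_id] = {"hypothesis": hypothesis, "status": "pending"}
    ledger.write_text(json.dumps(runs))
    return worker_run_id


def reconcile(ledger: Path, finished: dict) -> list:
    """Mark pending runs that have finished; return the ids just closed."""
    runs = load(ledger)
    done = [rid for rid, run in runs.items()
            if run["status"] == "pending" and rid in finished]
    for rid in done:
        runs[rid].update(status="finished", metrics=finished[rid])
    ledger.write_text(json.dumps(runs))
    return done


ledger_path = Path(tempfile.mkdtemp()) / "ledger.json"
run_id = submit(ledger_path, "prefetch input shards")

# Later cycles start cold: they know only the ledger file.
first = reconcile(ledger_path, finished={})
second = reconcile(ledger_path, finished={run_id: {"runtime_s": 81.0}})
print(first, second, load(ledger_path)[run_id]["status"])
```

Because no cycle holds state in memory, a cron job can die between submit and reconcile without losing the experiment.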

04 Low, single-threaded win rate

Persistent memory + hypothesis injection

Verdicts are remembered, so nothing re-runs; prior wins inject fresh ideas — many experiments run in parallel, not one per expert.

Every verdict is recorded — a refuted experiment is knowledge, not waste.
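
A small sketch of how recorded verdicts can keep refuted ideas from being retried. The four fields follow the verdict ledger named in this deck (hypothesis, evidence, result, decision); the class names and the normalization rule are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    hypothesis: str
    evidence: str
    result: str    # e.g. "faster", "no change"
    decision: str  # e.g. "promoted", "refuted"


@dataclass
class Memory:
    verdicts: dict = field(default_factory=dict)

    @staticmethod
    def key(hypothesis: str) -> str:
        """Normalize case and whitespace so trivial rephrasings collide."""
        return " ".join(hypothesis.lower().split())

    def record(self, verdict: Verdict) -> None:
        """Store the verdict, whether the experiment won or lost."""
        self.verdicts[self.key(verdict.hypothesis)] = verdict

    def fresh(self, candidates):
        """Keep only candidates that have no verdict yet."""
        return [c for c in candidates if self.key(c) not in self.verdicts]


memory = Memory()
memory.record(Verdict("Pin dataloader threads", "paired A/B, 4 runs",
                      "no change", "refuted"))
todo = memory.fresh(["pin  dataloader threads", "overlap checkpoint writes"])
print(todo)
```

Exact-string matching is the crudest possible dedupe; a real system would likely need a fuzzier notion of "same hypothesis".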
RESULT

Cheaper and faster

Eval jobs now deliver 2× throughput on the same footprint, which halves GPU hours on the targeted workloads.

ROI anecdote: ~$500K in tokens and experiments → ~$30M in savings.
Scope note: H1 changes were eval-side only; savings could be larger with training-side changes such as model export.
$30M
captured from offline optimization
50%
fewer GPU hours on targeted workloads
2×
throughput, same footprint
$40M
more savings identified

What changed technically

  • Reduced data starvation, host contention, communication, and lifecycle overhead.
  • Cut CPU serial work that left GPUs idle.
  • Cut frequent thread wakeups slowing MKL GEMM.
  • Fewer sync stalls (metric logging, range checks).
LEARNINGS

Learnings

AI Agent learning

  • Agentic trend: prompts → instructing agents → writing loops.
  • Second brain: persistent memory matters.
  • Verdict ledger: hypothesis, evidence, result, decision.
  • AI-native workflow: replacing traditional orchestration (e.g. Opus 4.8 workflows).
  • New runtime: Claude Code and Codex.

System learning

  • Verification throughput: tests become the bottleneck. Loosely, "P = NP in practice": if you can verify a solution cheaply, an agent can keep searching until it finds one.
  • Profiling first: find opportunities, trace code paths, verify results.
  • Chase the biggest bottleneck: borrowed techniques often don't move the needle.
  • Cluster noise: queue time, noisy neighbors, transient failures hide the effect.
  • Input drift: a variant can look faster just because the workload shifted.
WHY RADIXARK

A personal prediction: most of the world's intelligence will be AI-generated — and cost will decide who gets to use it.

Enterprise

  • Ownership: frontier capability, full control.
  • Inference gateways: one control point for models, cost, and policy.
  • End of token-maxing: scale needs cheaper, production-ready models.

Personal developers

  • Local inference: the local model is the new OS — fast, private, cheap.
  • Hybrid routing: cheap by default, escalate to frontier only when needed.

How I can help

I'm a builder and an optimizer — I keep coming back to performance because I love making systems faster.

FOUNDATION: Applied-math Ph.D.; I reason from first principles.
RANGE: Algorithms, databases, data infra, ML platforms, GPU clusters.
EDGE: I build and optimize the whole system.