AGENDA

Presentation flow

Short intro, then most of our time on the project.

01

Self introduction

Background and the systems work I do.

02

Amdahl Loop project deep dive

How it finds wasted GPU time, validates fixes, and banks the savings.

03

Why RadixArk

Where AI infra is going — and why I want to build it.

BACKGROUND

Self introduction

Background

Ph.D. in applied math. I've worked on algorithms, ML, data infra, databases, platforms, and infrastructure — startups to big companies. Now a production engineer at Meta on ML-training efficiency and reliability.

I bridge researchers and engineers — turning research needs into reliable systems, and system limits into feedback.

Engineering philosophy

Faster feedback loop: shorten learning, debugging, and optimization cycles.
Verification: no change without verification.
Rigor: correlation is not causation.

PROJECT CONTEXT

Performance optimization for HPC jobs

GPU hours aren't just cost — they're often the limit on how fast ML teams train, evaluate, and iterate.

01

GPU-hours are scarce

Compute is a planning bottleneck. Saving GPU-hours = creating capacity.

02

Faster MLE loops

Shorter jobs = faster debugging, optimization, and iteration.

Every optimization pays twice: lower cost today, and more experiments and faster iteration tomorrow.

PROJECT DEEP DIVE

Amdahl Loop: AI-native performance optimization

In H1 I used Claude Code to automate the optimization work. I am now building Amdahl Loop to take myself out of the loop.

01

An old problem

Performance has been my throughline: algorithms, ETL, databases, low-latency systems, ML platforms, GPU jobs.

02

Coding agents raised the ceiling

With Claude Code and Codex: more hypotheses in flight, faster code understanding, more rigorous A/B tests.

$30M saved out of $40M total on offline eval · 50% fewer GPU hours at 2× throughput · additional $40M savings identified

MOTIVATION

Why traditional optimization work is hard

01

Heroic guesswork

Needs codebase + business logic + infra + latency expertise in one head. Even then, win rates are low.

02

Verification is hard

Correlation ≠ causation. Data shift and dependency noise bury the signal — you need controlled A/B.

03

Hostile to human attention

Hours to days per experiment, and single-threaded. One expert, one hypothesis at a time.

The bottleneck is expert attention: low win rate × single-threaded × hours per try = a trickle of wins.

ANATOMY OF AN HPC JOB

Anatomy of a job: where accelerator time goes

Using 4 GPUs for illustration. Only the green intervals are GPU-productive — but you pay for all GPU hours.

[Figure: per-GPU timeline, time running left to right. Legend: idle (GPU allocated, doing nothing); low utilization (input-bound / partial occupancy); high utilization (productive compute).]
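The accounting behind this timeline can be sketched in a few lines. This is an illustrative sketch, not the project's tooling: the `Interval` type, the two utilization thresholds (5% and 60%), and the sample numbers are all assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """One stretch of time on one GPU with its average utilization (0.0 to 1.0)."""
    gpu: int
    start_s: float
    end_s: float
    utilization: float


def classify(utilization: float, idle_below: float = 0.05, high_from: float = 0.60) -> str:
    """Bucket an interval. Both thresholds are assumed values for illustration."""
    if utilization < idle_below:
        return "idle"
    if utilization < high_from:
        return "low"
    return "high"


def breakdown(intervals: list[Interval]) -> dict[str, float]:
    """Return the share of allocated GPU-seconds spent in each bucket."""
    totals = {"idle": 0.0, "low": 0.0, "high": 0.0}
    for iv in intervals:
        totals[classify(iv.utilization)] += iv.end_s - iv.start_s
    allocated = sum(totals.values())
    return {name: seconds / allocated for name, seconds in totals.items()}


# Hypothetical 2-GPU job, 100 s per GPU: all 200 GPU-seconds are billed.
sample = [
    Interval(0, 0, 30, 0.00),    # allocated, doing nothing
    Interval(0, 30, 50, 0.30),   # input-bound
    Interval(0, 50, 100, 0.90),  # productive compute
    Interval(1, 0, 40, 0.00),
    Interval(1, 40, 100, 0.85),
]
shares = breakdown(sample)
print(shares)
```

In this made-up sample only 55% of the billed GPU time is productive; the remaining 45% is the target.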

STRATEGY

Why I do not chase pure GPU FLOPs

Even infinitely fast compute still pays the fixed serial phases — so end-to-end speedup caps near 2×.

[Figure: three timeline bars, each split into serial and compute. today: 1.0×; 2× faster compute: 1.33×; ∞ fast compute: 2.0× ceiling. The serial floor is always paid.]

01

Amdahl's law

Steady-state compute is ~half the timeline, so even an infinitely fast kernel caps near 2× end-to-end. The serial phases (queue, warmup, input, checkpoint) are where the larger payoff is.
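The ceiling follows from Amdahl's law: if a fraction p of the timeline is steady-state compute and that part gets s times faster, end-to-end speedup is 1 / ((1 - p) + p / s). A minimal sketch, assuming p = 0.5 as stated above; the function name is only illustrative.

```python
import math


def end_to_end_speedup(compute_fraction: float, compute_speedup: float) -> float:
    """Amdahl's law: only the compute fraction benefits from the speedup."""
    if not 0.0 <= compute_fraction <= 1.0:
        raise ValueError("compute_fraction must be between 0 and 1")
    if compute_speedup <= 0:
        raise ValueError("compute_speedup must be positive")
    serial_fraction = 1.0 - compute_fraction
    return 1.0 / (serial_fraction + compute_fraction / compute_speedup)


p = 0.5  # compute is about half the timeline
print(round(end_to_end_speedup(p, 2.0), 2))       # 1.33
print(round(end_to_end_speedup(p, math.inf), 2))  # 2.0
```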

02

Optimization is system balance

SM utilization is usually low. Real wins come from the mix of GPU, CPU, network, I/O, locality, and algorithm. A fast kernel can't help a data-starved GPU.

03

Play to my edge

Kernel tuning is a specialist arms race. My leverage is infra + systems — scheduling, pipelines, storage, orchestration. Exactly where the waste lives.

Optimize the whole job timeline, not just FLOPs — that's where the accelerator-minutes hide.

WORKFLOW MECHANICS

How Amdahl Loop works

Make performance work stateful and repeatable — every run finds a validated win or records why one failed.

propose

scan: rank fleet jobs by idle-accelerator waste; pick one
→ profile: with BPF, PyTorch Profiler, and NCU, correlate dynamic info with static code analysis
→ hypothesize: classify the bottleneck → {needs_build, change}
→ submit: submit A/B experiments with the change

verify

collect: metrics for finished experiments
→ compare: paired A/B, baseline vs variant
→ refute: N adversarial checks; majority must agree
→ promote: human gate reviews surviving wins; verdict is always recorded

dream

search: periodically search for optimization wins inside and outside the company
→ inject: seed fresh hypotheses into scan, so new ideas keep entering Amdahl Loop

persistent memory across every step — refuted hypotheses are never retried; bottleneck analysis refreshes each cycle
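The scan step and the never-retry rule can be sketched together. This is a sketch under stated assumptions, not the actual Amdahl Loop code: `JobStats`, `pick_next`, the job names, and the hypothesis labels are hypothetical, and the memory is reduced to an in-process set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JobStats:
    job_id: str
    gpus: int
    wall_hours: float
    idle_fraction: float  # share of allocated GPU time with the GPU idle

    @property
    def idle_gpu_hours(self) -> float:
        return self.gpus * self.wall_hours * self.idle_fraction


def pick_next(
    jobs: list[JobStats],
    candidates: dict[str, list[str]],
    refuted: set[tuple[str, str]],
) -> Optional[tuple[str, str]]:
    """Walk jobs from most to least idle waste; skip hypotheses already refuted."""
    for job in sorted(jobs, key=lambda j: j.idle_gpu_hours, reverse=True):
        fresh = [h for h in candidates.get(job.job_id, [])
                 if (job.job_id, h) not in refuted]
        if fresh:
            return job.job_id, fresh[0]
    return None  # nothing left to try this cycle


# Hypothetical fleet snapshot.
jobs = [
    JobStats("eval_a", gpus=8, wall_hours=10.0, idle_fraction=0.5),   # 40 idle GPU-hours
    JobStats("train_b", gpus=64, wall_hours=2.0, idle_fraction=0.1),  # 12.8
    JobStats("eval_c", gpus=16, wall_hours=5.0, idle_fraction=0.4),   # 32
]
candidates = {
    "eval_a": ["prefetch_input"],
    "eval_c": ["overlap_checkpoint", "pin_dataloader_threads"],
}
refuted = {("eval_a", "prefetch_input")}  # remembered from an earlier cycle

choice = pick_next(jobs, candidates, refuted)
print(choice)
```

Here `eval_a` wastes the most, but its only hypothesis was already refuted, so the sketch moves on to `eval_c`.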

MECHANISMS

Every pain point gets a mechanism

01 Heroic guesswork

→

The agent does scan + hypothesize

Ranks the fleet by wasted time, reads code, classifies the bottleneck — intuition becomes a repeatable phase.

02 Correlation ≠ causation

→

Paired A/B + adversarial refute

Baseline vs variant on the same job, then N refute checks — a win survives only if the majority agrees.
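A minimal sketch of the paired comparison followed by a majority vote. The three refute checks and their thresholds (a 1.05× minimum gain, 80% of pairs improving, robustness to dropping the best pair) are assumptions chosen for illustration; the source does not say which checks Amdahl Loop runs.

```python
from statistics import mean
from typing import Callable, Sequence

Check = Callable[[Sequence[float], Sequence[float]], bool]


def paired_speedup(baseline_s: Sequence[float], variant_s: Sequence[float]) -> float:
    """Mean per-pair ratio baseline/variant; above 1.0 means the variant is faster."""
    if len(baseline_s) != len(variant_s) or not baseline_s:
        raise ValueError("need equal-length, non-empty paired samples")
    return mean(b / v for b, v in zip(baseline_s, variant_s))


def gain_is_material(b: Sequence[float], v: Sequence[float]) -> bool:
    return paired_speedup(b, v) >= 1.05


def most_pairs_improve(b: Sequence[float], v: Sequence[float]) -> bool:
    wins = sum(x > y for x, y in zip(b, v))
    return wins >= 0.8 * len(b)


def holds_without_best_pair(b: Sequence[float], v: Sequence[float]) -> bool:
    """Guard against a single lucky run carrying the whole result."""
    if len(b) < 2:
        return False
    best = max(range(len(b)), key=lambda i: b[i] / v[i])
    rest_b = [x for i, x in enumerate(b) if i != best]
    rest_v = [x for i, x in enumerate(v) if i != best]
    return paired_speedup(rest_b, rest_v) >= 1.05


def survives_refutation(b: Sequence[float], v: Sequence[float],
                        checks: Sequence[Check]) -> bool:
    """A win survives only if a strict majority of checks agree."""
    votes = [check(b, v) for check in checks]
    return sum(votes) > len(votes) / 2


CHECKS = [gain_is_material, most_pairs_improve, holds_without_best_pair]

# Hypothetical wall-clock seconds; each pair ran the same job under both arms.
baseline = [100.0, 104.0, 98.0, 110.0, 102.0]
variant = [80.0, 85.0, 79.0, 86.0, 99.0]
print(survives_refutation(baseline, variant, CHECKS))
```

Pairing each baseline run with a variant run on the same job is what keeps data shift and cluster noise from masquerading as a win.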

03 Hours-to-days loop

→

Fully async, cron-paced

Submits return a worker_run_id; the next cycle reconciles it. The ledger lets every cycle resume cold.
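The submit-then-reconcile pattern can be sketched as follows. Only `worker_run_id` and the idea of a ledger come from the text above; the JSON file layout, the function names, and the `launch` and `poll` callables (stand-ins for a real scheduler client) are assumptions.

```python
import json
from pathlib import Path
from typing import Callable, Optional


def load_ledger(path: Path) -> dict:
    """Cold start: everything a cycle needs is read back from disk."""
    return json.loads(path.read_text()) if path.exists() else {}


def save_ledger(path: Path, ledger: dict) -> None:
    path.write_text(json.dumps(ledger, indent=2, sort_keys=True))


def submit(ledger: dict, hypothesis: str, launch: Callable[[str], str]) -> str:
    """Record the returned worker_run_id and move on without waiting."""
    worker_run_id = launch(hypothesis)
    ledger[worker_run_id] = {"hypothesis": hypothesis, "status": "submitted", "metrics": None}
    return worker_run_id


def reconcile(ledger: dict, poll: Callable[[str], Optional[dict]]) -> list[str]:
    """Check every open run and mark the ones that have finished."""
    finished = []
    for run_id, entry in ledger.items():
        if entry["status"] != "submitted":
            continue
        metrics = poll(run_id)  # None means the run is still in flight
        if metrics is not None:
            entry.update(status="finished", metrics=metrics)
            finished.append(run_id)
    return finished


def run_cycle(path: Path, new_hypotheses: list[str],
              launch: Callable[[str], str],
              poll: Callable[[str], Optional[dict]]) -> list[str]:
    """One cron tick: resume from the ledger, reconcile, submit, persist."""
    ledger = load_ledger(path)
    finished = reconcile(ledger, poll)
    for hypothesis in new_hypotheses:
        submit(ledger, hypothesis, launch)
    save_ledger(path, ledger)
    return finished
```

Because each tick starts from the file rather than from process state, an experiment that takes days costs no waiting time: the cycle that submitted it can exit, and a later cycle picks up the result.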

04 Low, single-threaded win rate

→

Persistent memory + hypothesis injection

Verdicts are remembered, so nothing re-runs; prior wins inject fresh ideas — many experiments run in parallel, not one per expert.

Every verdict is recorded — a refuted experiment is knowledge, not waste.

RESULT

Cheaper and faster

Eval jobs now run twice as fast on half the GPUs — a net 75% cut in GPU hours.

ROI anecdote: ~$500K in tokens and experiments → ~$30M in savings.
Scope note: H1 changes were eval-side only; savings could be larger with training-side changes such as model export.
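The arithmetic behind these figures, as a small sketch. The job size (8 GPUs for 10 hours) is a made-up baseline; the dollar amounts are the approximate ones quoted above.

```python
def gpu_hours(gpus: int, wall_hours: float) -> float:
    """GPU-hours billed for a job: GPUs held times wall-clock hours."""
    return gpus * wall_hours


baseline = gpu_hours(gpus=8, wall_hours=10.0)         # hypothetical starting point

# Twice as fast on the same footprint: wall time halves.
same_footprint = gpu_hours(gpus=8, wall_hours=5.0)
print(1 - same_footprint / baseline)                  # 0.5

# Twice as fast on half the GPUs: both factors halve.
half_footprint = gpu_hours(gpus=4, wall_hours=5.0)
print(1 - half_footprint / baseline)                  # 0.75

# ROI from the anecdote: about $30M saved for about $500K spent.
roi = 30_000_000 / 500_000
print(roi)                                            # 60.0
```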

$30M

captured from offline optimization

50%

fewer GPU hours on targeted workloads

2×

throughput, same footprint

$40M

more savings identified

What changed technically

Reduced data starvation, host contention, communication, and lifecycle overhead.
Cut CPU serial work that left GPUs idle.
Cut the frequent thread wakeups that were slowing MKL GEMM.
Fewer sync stalls (metric logging, range checks).

LEARNINGS

Learnings

AI Agent learning

Agentic trend: prompts → instructing agents → writing loops.
Second brain: persistent memory matters.
Verdict ledger: hypothesis, evidence, result, decision.
AI-native workflow: replacing traditional orchestration (e.g. Opus 4.8 workflows).
New runtime: Claude Code and Codex.
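One way to picture a verdict-ledger entry, using the four fields named above. The `Verdict` class, the JSON-lines encoding, and the sample values are assumptions for illustration, not the project's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Verdict:
    hypothesis: str  # what was proposed
    evidence: str    # pointer to the A/B runs that tested it
    result: str      # what the comparison showed
    decision: str    # e.g. "promoted" or "refuted"


def to_line(verdict: Verdict) -> str:
    """Serialize one verdict as a single JSON line for an append-only log."""
    return json.dumps(asdict(verdict), sort_keys=True)


def already_decided(lines: list[str], hypothesis: str) -> bool:
    """True if the log already holds a verdict for this hypothesis."""
    return any(json.loads(line)["hypothesis"] == hypothesis for line in lines)


# Hypothetical entry.
log = [to_line(Verdict(
    hypothesis="prefetch_input on eval_a",
    evidence="runs run-1 and run-2",
    result="no measurable change",
    decision="refuted",
))]
print(already_decided(log, "prefetch_input on eval_a"))
```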

System learning

Verification throughput: tests become the bottleneck. In practice it behaves like P=NP: if you can cheaply verify a solution, an agent can keep searching until it finds one.
Profiling first: find opportunities, trace code paths, verify results.
Chase the biggest bottleneck: borrowed techniques often don't move the needle.
Cluster noise: queue time, noisy neighbors, transient failures hide the effect.
Input drift: a variant can look faster just because the workload shifted.

WHY RADIXARK

A personal prediction: most of the world's intelligence will be AI-generated — and cost will decide who gets to use it.

Enterprise

Ownership: frontier capability, full control.
Inference gateways: one control point for models, cost, and policy.
End of token-maxing: scale needs cheaper, production-ready models.

Personal developers

Local inference: the local model is the new OS — fast, private, cheap.
Hybrid routing: cheap by default, escalate to frontier only when needed.

How I can help

I'm a builder and an optimizer — I keep coming back to performance because I love making systems faster.

FOUNDATION: Applied-math Ph.D. I reason from first principles.

RANGE: Algorithms, databases, data infra, ML platforms, GPU clusters.

EDGE: I build and optimize the whole system.