San Francisco Bay Area
Prev. SpaceX (youngest SWE on mission-critical EIS Build Reliability team) & Google/YouTube (managed tech for a $1B+ partner portfolio). Yale & UIUC. Currently building ML evaluation infrastructure at Bespoke Labs AI. I design tasks that test whether AI agents can actually do data science work in a simulated retail environment with real-world messiness baked in. Think complex tasks involving timezone inconsistencies across regions, currency unit mismatches, misleading documentation, and biased training views. The goal is to build challenges that a skilled agent should be able to solve but a lazy one will fail. The hardest part is making things exactly the right amount of hard.
Building adversarial evaluation tasks for AI agents on a production-scale retail simulation. I design and evaluate data science challenges for high-performance LLM agents.
<1% acceptance rate, highly vetted global network of elite, independent product builders, engineers, and strategists.