Hi, I'm Arbit Chen

Building production-ready infrastructure for AI systems.

CTO at Arklex AI. I focus on making AI systems production-ready — reliable under load, predictable under failure, and efficient under constraint.

Led Kubernetes migration at Airbnb, saving $63M in infrastructure costs.

Co-built Gunrock, the 2018 Amazon Alexa Prize-winning conversational AI. Previously at Airbnb and HTC. NTU alum.

What I Do

Production AI Infrastructure, Failure-Aware Orchestration, Multi-Provider Reliability, Budget-Constrained Routing, Kubernetes, Distributed Systems, AWS, GCP

Approach

AI systems fail in predictable ways.

Most teams ignore reliability and cost constraints until production.

I design systems where routing, failure, and budget are first-class concerns.

Writing

LLM Routers Are Not Enough

Why per-request routing misses the point, and what workflow-level cost control looks like.

Reproducible Testing Reveals the Hidden Risk in Autonomous Agents: Idempotency

Why autonomous agents need deterministic testing and what idempotency failures look like in production.

Agents Should Be Tested Like Applications, Not Evaluated Like Models

The case for treating agent systems as software: with integration tests, not just evals.

Projects

Arklex Framework

Open Source

Agent-first organization framework — the official Python library for building structured, production-ready AI agent systems at Arklex.

TokenWise

Open Source

An open-source experiment in budget-aware LLM routing and multi-provider failover — inspired by production infrastructure patterns.