Model Routing for Cost: What to Actually Measure Before Switching a Workload from GPT-4 to Haiku
Most "use the cheaper model" posts skip the rigor. Real model routing decisions have four layers: token cost, quality regression on an eval set, latency impact, and governance risk. This article walks through each layer with the questions a platform engineer should answer before flipping a workload from a frontier model to a smaller one, plus an example routing rule expressed at the gateway layer. The gateway is the right place to enforce routing because it has identity and policy context the application does not.