A few months ago we released our routing platform into general availability. The next day, we launched our chatbot to widespread adoption and the top spot on Product Hunt. It’s been great to see end users get excited about the value of routing, with tens of thousands of people turning to Not Diamond’s hyper-personalized routing to answer the timeless question of “which model should I use?”
As anyone building a post-PMF AI product knows, model inference can get expensive very quickly. Providers like OpenAI and Anthropic raise large rounds to fund expansions in compute and research capacity, reportedly operating at steep negative margins in the process. Meanwhile, developers of generative AI products like Limitless rack up monthly API bills high enough to become memes. Like those who have gone before us, we found our costs rising quickly as our chatbot grew in popularity.
How did we address this? We dogfooded our own product.
Not Diamond is designed to maximize response quality above all else by sending each query to the best-suited model in the set of LLMs a developer defines. Beyond that, our API also lets teams specify explicit cost and latency tradeoffs. When a request includes a cost tradeoff, our algorithm evaluates whether the query can be fulfilled by a cheaper or faster model without degrading the quality of the response.
Up until mid-September, we were using Not Diamond’s default (quality-maximizing) routing algorithm. By implementing cost tradeoffs, we saved 51% in inference costs, representing $750,000 in savings on an annualized basis.
Without Not Diamond, this change would have required extensive engineering and even more extensive ongoing maintenance. We would have had to experiment with model playgrounds, run in-depth evaluations, and hand-craft brittle heuristics with if/else rules or regex (yes, we’ve seen people try to route with regex). And with every release of a new model—and every update of an old one—we would have had to re-run our entire evaluation pipeline and rebuild our routing architecture. We have seen companies throw entire teams of full-time engineers at this problem, easily spending more on headcount than any cost savings routing can achieve.
Instead, we decreased our costs by updating a single line of TypeScript: the line at the bottom of the snippet below.
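The call below is a representative sketch rather than our exact production code: the client import, model list, and parameter names are illustrative and may differ from the published SDK. The last line is the parameter that turns on the cost tradeoff.

```typescript
// Representative sketch of the routing call; client, model list, and
// parameter names are illustrative and may differ from the published SDK.
import { NotDiamond } from 'notdiamond';

const notDiamond = new NotDiamond({ apiKey: process.env.NOTDIAMOND_API_KEY });

const result = await notDiamond.modelSelect({
  messages: [{ role: 'user', content: 'Which model should I use?' }],
  llmProviders: [
    { provider: 'openai', model: 'gpt-4o' },
    { provider: 'openai', model: 'gpt-4o-mini' },
    { provider: 'anthropic', model: 'claude-3-5-sonnet-20240620' },
  ],
  // The one-line change: let the router fall back to a cheaper model whenever
  // it predicts no loss in response quality.
  tradeoff: 'cost',
});
```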
Any perceptive reader would naturally ask how this change affected our chatbot experience. While we did not roll the change out as a controlled experiment, we can borrow concepts from regression discontinuity design to explore whether users’ engagement with our chatbot shifted before and after the release on September 12th.
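The sketch below shows the core idea under simplifying assumptions: fit a line to a daily engagement metric on each side of the September 12th cutoff and estimate the gap between the two fits at the cutoff. The metric, data shape, and global linear fits are hypothetical; a production analysis would use local fits and a principled bandwidth.

```typescript
// Illustrative sketch (not our production analysis): a simple sharp
// regression-discontinuity estimate of the change in a daily engagement
// metric at the rollout. The data shape and metric name are hypothetical.

type DailyMetric = { date: Date; messagesPerUser: number };

const CUTOFF = new Date('2024-09-12'); // rollout date (year assumed for illustration)
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Ordinary least squares fit of y = a + b * x; returns intercept a and slope b.
function fitLine(points: { x: number; y: number }[]): { a: number; b: number } {
  const n = points.length;
  const meanX = points.reduce((s, p) => s + p.x, 0) / n;
  const meanY = points.reduce((s, p) => s + p.y, 0) / n;
  const cov = points.reduce((s, p) => s + (p.x - meanX) * (p.y - meanY), 0);
  const varX = points.reduce((s, p) => s + (p.x - meanX) ** 2, 0);
  const b = cov / varX;
  return { a: meanY - b * meanX, b };
}

// Fit one line to the days before the rollout and another to the days after,
// using "days since cutoff" as the running variable, then compare the two
// fitted values at day zero. An estimate near zero suggests no jump in
// engagement at the rollout.
function estimateDiscontinuity(data: DailyMetric[]): number {
  const points = data.map((d) => ({
    x: (d.date.getTime() - CUTOFF.getTime()) / MS_PER_DAY,
    y: d.messagesPerUser,
  }));
  const before = fitLine(points.filter((p) => p.x < 0));
  const after = fitLine(points.filter((p) => p.x >= 0));
  return after.a - before.a;
}
```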
As costs decreased by 51% on average, we saw no meaningful change in user engagement on either side of the rollout.
The results speak for themselves: changing one parameter in the Not Diamond API let us cut our chatbot inference costs by over 50% without meaningfully impacting user experience. Model routing gives teams a way to manage costs while maximizing response quality, avoiding both vendor lock-in and the friction of expensive in-house engineering.
You can learn more about how Not Diamond can help you manage your own inference costs by scheduling some time with our team.