The Not Diamond chat app collects human feedback on millions of LLM arena-style responses every month. Below we present our most recent evaluation results on various frontier models. Unlike other AI arena evaluation tools, our data is high-volume, grounded in real-world use cases, and drawn from a diverse global user base, with 87.8% of users residing outside the US.
We evaluate head-to-head model performance across 1,629,706 prompts recorded from October 17th to December 17th, 2024. Of these battles, 797,533 identified a clear winner, while the remaining 832,173 ended in a tie. Below, we report Elo scores and win rates for each pair of supported models:
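The post reports Elo scores without detailing how they are computed. As a rough illustration, here is a minimal sketch of the standard online Elo update applied to a stream of battle records, with ties scored as half a win for each side. The `battles` tuple format, the model names in the usage example, and the K-factor of 4 are assumptions for illustration, not Not Diamond's actual pipeline.

```python
from collections import defaultdict

def compute_elo(battles, k=4, base=10, scale=400, init=1000):
    """Compute online Elo ratings from a sequence of pairwise battles.

    Each battle is a (model_a, model_b, winner) tuple, where winner is
    "model_a", "model_b", or "tie". A tie counts as half a win per side.
    """
    ratings = defaultdict(lambda: float(init))
    for model_a, model_b, winner in battles:
        r_a, r_b = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the Elo model.
        e_a = 1 / (1 + base ** ((r_b - r_a) / scale))
        # Actual score of model_a for this battle.
        s_a = {"model_a": 1.0, "tie": 0.5, "model_b": 0.0}[winner]
        # Symmetric update: model_b gains exactly what model_a loses.
        ratings[model_a] += k * (s_a - e_a)
        ratings[model_b] += k * (e_a - s_a)
    return dict(ratings)

# Hypothetical usage with made-up battle records:
battles = [
    ("chatgpt-4o", "claude-3-5-sonnet", "model_a"),
    ("mistral-large-2", "chatgpt-4o", "tie"),
]
print(compute_elo(battles))
```

In practice, arena leaderboards often fit a Bradley-Terry model over the full battle log instead of running sequential Elo updates, since the resulting ratings are then independent of the order in which battles occurred.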
As can be seen, users prefer OpenAI's experimental ChatGPT-4o model over all other models, with strong performance from the Claude models as well as Mistral Large 2 and Perplexity. Interestingly, though, while OpenAI's model tops the chart, Anthropic was on the whole the most preferred provider:
This data was collected from a highly diverse user base distributed primarily across North America, Asia, and Europe:
Alongside tools like LMSYS’s Chatbot Arena and Artificial Analysis, our findings contribute to the community’s ability to evaluate AI models against human preferences. Notably, our analytics are based on a high volume of organic, diverse, multi-turn real-world data, rather than experimental, single-shot prompts.
If you would like to contribute to our work, explore partnerships, or deploy Not Diamond’s dynamic routing capabilities in your application, you can schedule a call.