The Not Diamond chat app collects human feedback on millions of LLM arena-style responses every month. Below we present our most recent evaluation results on various frontier models. Unlike other AI arena evaluation tools, our data is high-volume, grounded in real-world use cases, and drawn from a diverse global user base, with 87.8% of users residing outside the US.
We evaluate head-to-head model performance across 1,629,706 prompts recorded from October 17th to December 17th, 2024. Of these battles, 797,533 identified a clear winner, while the remaining 832,173 ended in a tie. Below, we report Elo scores and win rates for each pair of supported models:
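The post reports Elo scores without detailing how they are computed. As a rough illustration, here is a minimal sketch of the standard online Elo update applied to a stream of battle records, with ties scored as half a win for each side. The `battles` tuple format, the model names in the usage example, and the K-factor of 4 are assumptions for illustration, not Not Diamond's actual pipeline.

```python
from collections import defaultdict

def compute_elo(battles, k=4, base=10, scale=400, init=1000):
    """Compute online Elo ratings from a sequence of pairwise battles.

    Each battle is a (model_a, model_b, winner) tuple, where winner is
    "model_a", "model_b", or "tie". A tie counts as half a win per side.
    """
    ratings = defaultdict(lambda: float(init))
    for model_a, model_b, winner in battles:
        r_a, r_b = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the Elo model.
        e_a = 1 / (1 + base ** ((r_b - r_a) / scale))
        # Actual score of model_a for this battle.
        s_a = {"model_a": 1.0, "tie": 0.5, "model_b": 0.0}[winner]
        # Symmetric update: model_b gains exactly what model_a loses.
        ratings[model_a] += k * (s_a - e_a)
        ratings[model_b] += k * (e_a - s_a)
    return dict(ratings)

# Hypothetical usage with made-up battle records:
battles = [
    ("chatgpt-4o", "claude-3-5-sonnet", "model_a"),
    ("mistral-large-2", "chatgpt-4o", "tie"),
]
print(compute_elo(battles))
```

In practice, arena leaderboards often fit a Bradley-Terry model over the full battle log instead of running sequential Elo updates, since the resulting ratings are then independent of the order in which battles occurred.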
As can be seen, users prefer OpenAI's experimental ChatGPT-4o model over all other models, with strong performance from the Claude models as well as Mistral Large 2 and Perplexity. Interestingly, though, while OpenAI's model tops the chart, Anthropic was on the whole the most preferred provider:
This data was collected from a highly diverse user base distributed primarily across North America, Asia, and Europe:
Alongside tools like LMSYS’s Chatbot Arena and Artificial Analysis, our findings contribute to the community’s ability to evaluate AI models against human preferences. Notably, our analytics are based on a high volume of organic, diverse, multi-turn real-world data, rather than experimental, single-shot prompts.
If you would like to contribute to our work, explore partnerships, or deploy Not Diamond’s dynamic routing capabilities in your application, you can schedule a call.