AI Model Rankings on Human Preference from Not Diamond’s Arena Mode

October 22, 2024

Since launching our Not Diamond-powered chat app two months ago, we've collected millions of samples of human evaluation data on pairwise LLM responses and image generations. Our data offers two significant benefits over other AI arena evaluation tools. First, we significantly outpace other tools on sample volume, with over one million arena samples collected every month. Second, our users are not submitting experimental test queries: they're using Not Diamond as their daily chatbot to solve problems, ask questions, and learn new things. In this blog post, we share model rankings based on human preference and describe the most interesting findings from the data.

Not Diamond’s Arena Mode allows users to compare responses from two AI models. One model is recommended by our routing algorithm, while the other is chosen randomly.
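
As a rough illustration, the pairing logic can be sketched in a few lines of Python. The model identifiers and the `recommend_model` callable below are hypothetical stand-ins for illustration, not Not Diamond's actual API:

```python
import random

# Hypothetical model identifiers for illustration; not Not Diamond's actual list.
SUPPORTED_MODELS = [
    "claude-3-5-sonnet", "gpt-4o", "gpt-4-turbo",
    "mistral-large-2", "gemini-1.5-flash",
]

def pair_for_arena(prompt: str, recommend_model) -> tuple[str, str]:
    """Pair the router's recommended model with a uniformly random challenger."""
    recommended = recommend_model(prompt)  # the router's pick for this prompt
    # Sample the opponent uniformly from the remaining supported models.
    challenger = random.choice([m for m in SUPPORTED_MODELS if m != recommended])
    return recommended, challenger
```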

Battle statistics

We evaluate head-to-head model performance across 528,000 prompts recorded from September 17th to October 4th, 2024. Of these battles,

  • 276,000 battles identified a clear winner, while the remaining 252,000 battles ended in a tie,
  • 47,800 battles involved images submitted by a chatbot user, and
  • 18,000 image battles produced a clear winner, while approximately 30,000 ended in a tie.
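
Concretely, these figures amount to a simple aggregation over logged battle records, along the lines of the sketch below. The record fields are hypothetical stand-ins, not our actual schema:

```python
from collections import Counter

def summarize_battles(battles) -> Counter:
    """Tally decided battles, ties, and image battles from logged records.

    Assumes each record is a dict with hypothetical fields:
    {"winner": "model_a" | "model_b" | None, "is_image": bool}.
    """
    counts = Counter()
    for battle in battles:
        outcome = "decided" if battle["winner"] else "tied"
        counts[outcome] += 1
        if battle["is_image"]:
            counts[f"image_{outcome}"] += 1
    return counts
```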

Logging over 500,000 arena votes in this two-week period alone, our chatbot compares favorably with LMSYS's recent report of 2 million total votes collected over the past 18 months.

Below, we report Elo ratings and win rates for each pair of supported models:

As can be seen, Anthropic’s Claude 3.5 Sonnet is clearly preferred by users over all other models. Surprisingly, however, the next-best models are Mistral Large 2 and Perplexity. We find that Mistral Large 2 produces impressively high-quality responses to a wide variety of prompts, especially queries involving coding, writing, or translation. While Perplexity arguably does not qualify as a standalone model, developers and consumers are increasingly leveraging its specialized search-enhanced responses rather than building their own search tools, and its strong performance in Not Diamond demonstrates the value of high-quality real-time web access.
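
For readers unfamiliar with Elo ratings, the sketch below shows one standard way to derive them from pairwise votes: update both models after each battle, scoring a tie as half a win. The K-factor and initial rating here are common illustrative defaults, not necessarily the values behind our reported scores:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Win probability of A over B implied by the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings: dict, model_a: str, model_b: str,
               score_a: float, k: float = 32.0) -> None:
    """Apply one Elo update; score_a is 1.0 (A wins), 0.0 (B wins), or 0.5 (tie)."""
    ra = ratings.get(model_a, 1000.0)  # unseen models start at a default rating
    rb = ratings.get(model_b, 1000.0)
    ea = expected_score(ra, rb)  # A's expected score against B
    ratings[model_a] = ra + k * (score_a - ea)
    ratings[model_b] = rb + k * ((1.0 - score_a) - (1.0 - ea))
```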

Additionally, analyzing these ratings and head-to-head win probabilities surfaced some other interesting findings:

  • GPT-4 Turbo outperforms its successor GPT-4o by a thin margin, with the two models effectively tied (see the sketch after this list). While GPT-4o is much cheaper and faster, and would therefore represent the better option overall, our data suggests the actual quality gap between the two models is ambiguous. We theorize that preference data may differentiate the models more strongly on particular sub-domains of the distribution.
  • Among the tier of smaller models, Gemini Flash competes admirably on human preference, logging a small margin of victory over all OpenAI models.
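
To make the "effectively tied" observation concrete: under the Elo model, a small rating gap translates to a near-coin-flip head-to-head win probability. Reusing the hypothetical `expected_score` helper from the sketch above:

```python
# A 10-point Elo gap implies only a ~51.4% head-to-head win probability,
# i.e., the higher-rated model wins barely more often than a coin flip.
print(expected_score(1010.0, 1000.0))  # ≈ 0.514
```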

We have also aggregated these results by provider below:

Head-to-head image results

In addition to language model responses, Not Diamond also supports various image models. Below, we report their Elo ratings and win rates:

Users strictly preferred FLUX.1 to all other image models, including classic, popular options such as Stable Diffusion and DALL-E.

Conclusion

Alongside tools like LMSYS’s Chatbot Arena and Artificial Analysis, our findings contribute to our collective ability to evaluate AI models on human preference. Notably, our analytics are based on a high volume of real-world data that is organic, applied, and multi-turn, rather than experimental and single-shot.

Due to our privacy commitments, we will not be open-sourcing the chat data, but we will continue to publish aggregate, anonymized analytics that can help inform the community’s evaluation of novel AI models as they’re released. We also look forward to analyzing model performance across different domains, such as coding and open domain question answering. If you would like to contribute to our work, explore partnerships, or deploy Not Diamond’s dynamic routing capabilities in your application, you can email us or schedule a call.