Since launching our Not Diamond-powered chat app two months ago, we've collected millions of samples of human evaluation data on pairwise LLM responses and image generations. Our data offers two significant benefits over other AI arena evaluation tools. First, we far outpace other tools on sample volume, collecting over one million arena samples every month. Second, our users are not submitting experimental test queries: they're using Not Diamond as their daily chatbot to solve problems, ask questions, and learn new things. In this blog post, we share performance rankings based on human preference and describe the most interesting findings from the data.
We evaluate head-to-head model performance across 528,000 prompts recorded from September 17th to October 4th, 2024. In logging over 500,000 arena votes in under three weeks, our chatbot compares favorably with LMSYS's recent report that they have collected a total of 2 million votes over the past 18 months.
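For readers curious about the mechanics, here is a minimal sketch of how Elo-style ratings can be derived from a stream of pairwise votes like ours. The K-factor, initial rating, `compute_elo` helper, and model names below are illustrative assumptions for the sketch, not our production configuration:

```python
from collections import defaultdict

K = 32          # update step size (illustrative assumption, not our actual setting)
INITIAL = 1000  # starting rating assigned to every model

def compute_elo(battles, k=K, initial=INITIAL):
    """Compute Elo-style ratings from a sequence of pairwise votes.

    `battles` is an iterable of (model_a, model_b, score) tuples, where
    score is 1.0 if model_a won, 0.0 if model_b won, and 0.5 for a tie.
    """
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, score in battles:
        # Expected score of model_a under the logistic Elo model
        expected_a = 1.0 / (1.0 + 10 ** ((ratings[model_b] - ratings[model_a]) / 400))
        ratings[model_a] += k * (score - expected_a)
        ratings[model_b] += k * ((1.0 - score) - (1.0 - expected_a))
    return dict(ratings)

# Example usage with made-up votes:
votes = [
    ("claude-3.5-sonnet", "gpt-4o", 1.0),
    ("mistral-large-2", "gpt-4o", 1.0),
    ("claude-3.5-sonnet", "mistral-large-2", 0.5),
]
print(compute_elo(votes))
```

One property of this update rule worth noting: because each vote shifts ratings by at most K points, rankings computed over hundreds of thousands of votes are quite stable against individual noisy judgments.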
Below, we report Elo scores for each supported model and win rates for each pair of models:
As can be seen, Anthropic’s Claude 3.5 Sonnet is clearly preferred by users over all other models. Surprisingly, however, the next-best models are Mistral Large 2 and Perplexity. We find that Mistral Large 2 produces impressively high-quality responses to a wide variety of prompts, especially queries involving coding, writing, or translation. While it’s arguable that Perplexity does not qualify as a standalone model, developers and consumers are increasingly leveraging its specialized search-enhanced responses rather than building their own search tools, and its strong performance in Not Diamond demonstrates the value of high-quality real-time web access.
Additionally, analyzing these ratings and head-to-head win probabilities surfaced some other interesting findings:
We have also aggregated these results by provider below:
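There is more than one reasonable way to roll model-level results up to the provider level; one simple approach, sketched below, is to pool pairwise votes across each provider's models and recompute win rates. The `PROVIDER` mapping and `provider_win_rates` helper are hypothetical names for illustration, and this is not necessarily the exact aggregation behind our chart:

```python
from collections import defaultdict

# Hypothetical model-to-provider mapping; illustrative only.
PROVIDER = {
    "claude-3.5-sonnet": "Anthropic",
    "gpt-4o": "OpenAI",
    "mistral-large-2": "Mistral",
}

def provider_win_rates(battles):
    """Pool pairwise votes at the provider level and compute win rates.

    `battles` uses the same (model_a, model_b, score) format as above.
    Battles between two models from the same provider are skipped, since
    they carry no signal about cross-provider preference.
    """
    wins = defaultdict(float)
    totals = defaultdict(int)
    for model_a, model_b, score in battles:
        prov_a, prov_b = PROVIDER[model_a], PROVIDER[model_b]
        if prov_a == prov_b:
            continue  # intra-provider battle: no effect on provider rankings
        wins[prov_a] += score
        wins[prov_b] += 1.0 - score
        totals[prov_a] += 1
        totals[prov_b] += 1
    return {p: wins[p] / totals[p] for p in totals}

print(provider_win_rates(votes))
```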
In addition to language model responses, Not Diamond also supports various image models. Below, we report their Elo scores and win rates:
Users strictly preferred FLUX.1 to all other image models, including classic, popular options such as Stable Diffusion and DALL-E.
Alongside tools like LMSYS’s Chatbot Arena and Artificial Analysis, our findings contribute to our collective ability to evaluate AI models on human preference. Notably, our analytics are based on a high volume of real-world data that is organic, applied, and multi-turn, rather than experimental and single-shot.
Due to our privacy commitments, we will not be open-sourcing the chat data, but we will continue to publish aggregate, anonymized analytics that can help inform the community’s evaluation of novel AI models as they’re released. We also look forward to analyzing model performance across different domains, such as coding and open-domain question answering. If you would like to contribute to our work, explore partnerships, or deploy Not Diamond’s dynamic routing capabilities in your application, you can email us or schedule a call.