At a recent AI hackathon in San Francisco, a new artificial intelligence benchmark built on the classic arcade game Street Fighter III was unveiled. The open-source LLM Colosseum benchmark, created by Stan Girard and Quivr Brain, lets large language models (LLMs) fight each other in an emulated version of the game, producing unconventional yet impressive matches.
Enthusiast Matthew Berman demonstrated the beat-’em-up LLM tournament in a video and walked viewers through installing the project on their own computers to try it. Unlike traditional benchmarks, the Street Fighter III benchmark rewards smaller models with quicker response times, much as human players rely on fast reactions in a fight.
During the AI-vs-AI battles, each LLM makes real-time decisions by analyzing the game state and choosing among moves such as closing in, retreating, or executing attacks like fireballs and special moves. The fights appear fluid and strategic, with the AI players showing counter moves, blocks, and special attacks, although for now they can only play as Ken.
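To make that decision step concrete, here is a minimal Python sketch of how a game state might be turned into a prompt and a model's free-text reply mapped back to a legal move. It is not LLM Colosseum's actual code: the move names, the state fields (`own_health`, `opponent_health`, `distance`), and the helper functions `build_prompt` and `parse_move` are all assumptions made for illustration.

```python
# Hypothetical move vocabulary; these names are illustrative assumptions,
# not the project's actual move list.
ALLOWED_MOVES = ["move closer", "move away", "fireball", "special move", "block"]


def build_prompt(state: dict) -> str:
    """Describe the game state in text and list the moves the model may pick."""
    moves = ", ".join(ALLOWED_MOVES)
    return (
        f"You are playing Ken. Your health: {state['own_health']}. "
        f"Opponent health: {state['opponent_health']}. "
        f"Distance to opponent: {state['distance']}. "
        f"Reply with one of: {moves}."
    )


def parse_move(reply: str, default: str = "move closer") -> str:
    """Return the allowed move mentioned earliest in the reply.

    If the reply names no allowed move (for example, a hallucinated attack
    or a refusal), fall back to the default move.
    """
    text = reply.lower()
    found = [(text.find(move), move) for move in ALLOWED_MOVES if move in text]
    return min(found)[1] if found else default
```

In a sketch like this, the prompt would be sent to the model every few frames, which is one reason a fast model could land more moves than a slower, larger one.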
In Girard’s trials, OpenAI’s GPT-3.5 Turbo emerged as the top contender with an Elo rating of 1776. In a separate test run by Amazon’s Banjo Obayomi, Anthropic’s claude_3_haiku came out on top with an Elo rating of 1613 after 314 matches played across 14 LLMs.
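For readers unfamiliar with Elo ratings, the sketch below shows the standard update rule in Python: each match moves the winner's rating up and the loser's down, by an amount that depends on how surprising the result was. The K-factor of 32 and the starting ratings are assumptions chosen for illustration; the article does not say which parameters either leaderboard used.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that player A beats player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one match.

    score_a is 1.0 if A won, 0.0 if A lost, and 0.5 for a draw.
    k is the K-factor; 32 is a common default, assumed here.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two models that both start at 1500; the first one wins the match.
print(update_elo(1500, 1500, 1.0))  # (1516.0, 1484.0)
```

Because ratings only have meaning relative to the pool of opponents and matches behind them, the 1776 and 1613 figures come from different leaderboards and are not directly comparable.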
Obayomi also noted oddities such as AI hallucinations and safety guardrails affecting the LLMs’ beat-’em-up performance. While the new benchmark presents an intriguing challenge for LLMs, questions linger about its usefulness compared with more complex games, which could offer deeper insights but would also be harder to interpret.