Researchers have tested artificial intelligence’s ability to adapt quickly through the classic game Super Mario Bros.
The Claude 3.7 model excelled in fast responses and jump planning, while other models faced noticeable difficulties.
The experiment raised questions about how relevant game-based tests are to real-world AI capabilities.
As the quest to measure AI abilities continues, researchers are turning to a new approach that goes beyond traditional mathematical benchmarks and into the realm of games, an arena that is as fun as it is demanding.
Following Anthropic’s testing of its latest Claude 3.7 Sonnet model in Pokémon, a fresh attempt emerged using the iconic Super Mario Bros., a game released by Nintendo in 1985. This now serves as a new testing platform for AI’s capabilities, symbolizing a shift from classic logical puzzles to dynamic jumping challenges.
This innovative approach comes from the Hao AI lab at the University of California, San Diego, where researchers tested multiple advanced AI models using Super Mario as an evaluation tool. Rather than using traditional metrics, the team decided to assess AI in an environment that humans instinctively understand.
To carry out the experiment, they used an emulated version of the game combined with GamingAgent, a custom framework developed by the lab. This system gave the AI models basic control instructions, gameplay guidance, and real-time screenshots, and the models steered Mario by emitting Python code.

Although Super Mario Bros. is a relatively simple platformer, the Hao AI researchers found that it demanded complex planning and rapid adaptation. Success wasn't just a matter of computational power; it required strategic decisions and precise, sequential actions in a fast-changing environment.
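The loop described above, a screenshot goes in and a short snippet of Python action code comes out, can be sketched roughly as follows. This is a minimal illustration only, not GamingAgent's actual API: `capture_screenshot`, `query_model`, and the `press` helper are hypothetical stand-ins.

```python
# Minimal sketch of a screenshot-in, code-out game agent loop.
# All names are illustrative stand-ins, not GamingAgent's real interface.

def capture_screenshot(frame_id):
    """Stand-in for grabbing the current frame from the emulator."""
    return f"frame-{frame_id}"

def query_model(screenshot):
    """Stand-in for the model call: the real agent would send the image
    plus control instructions to the model and receive back a short
    snippet of Python action code, e.g. 'press("right")'."""
    return 'press("right")'

def run_agent(num_steps):
    actions = []

    def press(button):
        # In the real setup this would inject a keypress into the
        # emulator; here we just record which button was requested.
        actions.append(button)

    for step in range(num_steps):
        shot = capture_screenshot(step)
        action_code = query_model(shot)
        # Execute the model's reply as Python, exposing only `press`.
        exec(action_code, {"press": press})
    return actions
```

The key design point the lab's setup highlights is that the model's output is executable code rather than a label, so each round trip to the model translates directly into in-game actions.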
At the conclusion of the experiment, Anthropic's Claude 3.7 model stood out as the most impressive, responding rapidly and timing its jumps expertly while avoiding enemies. Claude 3.5 also performed well. The real surprise, however, was how models such as GPT-4o from OpenAI and Gemini 1.5 Pro from Google struggled to keep pace with the demands of the game.
Researchers highlighted timing as the decisive factor in this test: a fraction of a second can separate success from failure. Models that rely on deep logical reasoning process information in sequential steps, which makes them slower to react to rapidly changing scenes and often costs them the game.
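A bit of arithmetic shows why latency is so punishing here. NES titles such as Super Mario Bros. run at roughly 60 frames per second, so each frame lasts about 16.7 ms. The sketch below (illustrative numbers, not measured latencies from the experiment) counts how many frames of play elapse while a model is still deciding.

```python
FPS = 60  # Super Mario Bros. runs at roughly 60 frames per second

def frames_missed(latency_ms):
    """Whole frames that elapse while the model is still thinking.
    Uses integer math (frames = ms * fps / 1000) to avoid
    floating-point edge cases at exact frame boundaries."""
    return latency_ms * FPS // 1000

# A model answering in ~100 ms lets 6 frames of play slip by;
# one that deliberates for 3 seconds misses 180.
print(frames_missed(100))   # 6
print(frames_missed(3000))  # 180
```

At that frame rate, even a second of deliberation means Mario has already fallen into a pit the model was still reasoning about.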
While using games to assess AI abilities isn’t new, some experts are questioning how relevant these tests are to real-world AI. Games often simplify real-world complexity, offering limited training data compared to the unpredictability and intricacy of the actual world.
In this context, AI researcher Andrej Karpathy raised concerns about an "evaluation crisis" in the field, suggesting that current testing methods, especially those involving games, might not provide an accurate picture of true AI progress.
An interesting and somewhat amusing question arises: If AI struggles to navigate the Mushroom Kingdom, can we trust it to handle the complexities of the real world? While the Super Mario test is an exciting way to explore AI’s capabilities, it also serves as a reminder of the challenges that even seemingly simple tasks present. For those interested in exploring this further, the Hao AI lab has made the GamingAgent framework open-source on GitHub.