Template-Type: ReDIF-Paper 1.0
Author-Name: Fernando Perez-Cruz
Author-X-Name-First: Fernando
Author-X-Name-Last: Perez-Cruz
Author-Name: Hyun Song Shin
Author-X-Name-First: Hyun Song
Author-X-Name-Last: Shin
Title: Putting AI agents through their paces on general tasks
Abstract: Multimodal large language models (LLMs), trained on vast datasets, are becoming increasingly capable in many settings. However, the capabilities of such models are typically evaluated on narrow tasks, much like standard machine learning models trained for specific objectives. We take a different tack by putting the latest LLM agents through their paces on the general tasks involved in solving three popular games: Wordle, Face Quiz and Flashback. These games are easily tackled by humans, but they demand a degree of self-awareness and higher-level abilities to experiment, to learn from mistakes and to plan accordingly. We find that the LLM agents display mixed performance on these general tasks: they lack the awareness to learn from mistakes and the capacity for self-correction. LLMs' performance on the most complex cognitive subtasks may not be the limiting factor for their deployment in real-world environments. Instead, it would be important to evaluate the capabilities of AGI-aspiring LLMs through general tests that encompass the multiple cognitive tasks needed to solve complete, real-world applications.
Creation-Date: 2025-02
File-URL: https://www.bis.org/publ/work1245.pdf
File-Format: Application/pdf
File-Function: Full PDF document
File-URL: https://www.bis.org/publ/work1245.htm
File-Format: text/html
Number: 1245
Keywords: AI agents, LLM evaluation
Classification-JEL: C88
Handle: RePEc:bis:biswps:1245