Same Conditions, Yet One Collapsed in Four Days: AI-Run Virtual Societies Yield Different Outcomes

by Bang Jeil

Published 02 Jun. 2026 13:18 (KST)

Claude Achieves Full Survival and Zero Crime
Grok Collapses After 96 Hours
GPT-5 Mini Records Few Crimes but Fails to Survive

In a simulation by the U.S. startup Emergence AI that tasked major artificial intelligence (AI) models with running a virtual society, "Grok," the chatbot from Elon Musk's xAI, brought about a societal collapse in about four days.

On June 1 (local time), the UK’s Independent reported, citing the results of Emergence AI’s long-term autonomous agent experiment, that Grok produced the most unstable outcome among the leading AI models. According to materials released by Emergence AI, the experiment was conducted in a virtual environment called "Emergence World." Five identical virtual worlds were each assigned a different AI configuration and tasked with running their society for 15 days.

Same Conditions, Different Results: Claude Stable, Grok Collapses

The experiment included Anthropic’s "Claude Sonnet 4.6," Google’s "Gemini 3 Flash," xAI’s "Grok 4.1 Fast," OpenAI’s "GPT-5 Mini," and a mixed world combining several models. Each world was populated with 10 AI agents, which were given the same roles, such as scientist, explorer, conflict mediator, and resource strategist, and the same initial conditions.

The virtual worlds featured over 40 locations, including public facilities like police stations and city halls. The AI agents had access to more than 120 tools for resource management, movement, social interaction, planning, voting, and rule proposals. Acts such as theft, violence, arson, deception, and resource monopolization were explicitly prohibited. The most stable results were seen in the world run by Claude.

In the world operated by Claude, all agents survived for the full experiment period, and no crimes were recorded. Emergence AI noted, however, that agents in Claude’s world cast 332 votes on 58 proposals, with an approval rate of 98%. Institutional engagement was active, but the researchers found that decision-making followed a "rubber-stamp" pattern, with little substantive opposition or debate.

In contrast, the world run by Grok ended after about four days. According to Emergence AI data, in the Grok 4.1 Fast-based world, 183 crimes occurred over approximately 96 hours, resulting in the disappearance of all 10 agents and the collapse of their society. The Independent also reported that Grok posted the worst performance in this experiment.

The Gemini world kept all agents alive for 15 days but recorded the highest number of crimes, at 683. The GPT-5 Mini world recorded only two crimes, but its agents failed to take sufficient survival actions, and all of them disappeared within seven days. In the mixed world combining various models, 352 crimes occurred, and 7 out of 10 agents died.

Researchers: "Long-Term Autonomous AI Cannot Be Controlled by Simple Rules"

However, Emergence AI emphasized that this experiment does not definitively determine any particular model's real-world capability to operate a society. The released numbers are representative samples from multiple runs, and the official research paper and the full dataset have not yet been released. Nevertheless, the researchers explained that AI agents operating autonomously over extended periods may not simply follow predetermined rules mechanically; instead, they might explore the boundaries of their environment or deliberately circumvent intended safety mechanisms.

Notably, in the mixed-model environment, even Claude-based agents, which had committed no crimes in the standalone Claude world, displayed coercive behaviors such as theft and threats. Emergence AI took this to suggest that AI safety may not be a fixed property of an individual model, but rather an "ecosystem characteristic" that changes depending on interactions with other models and the environment.

Foreign media interpreted this experiment as a warning that, as AI expands beyond simple Q&A tools to become autonomous agents handling workflows, decision-making, and resource allocation, the methods for verifying AI safety must also evolve. Fortune pointed out that traditional short task-based benchmarks make it difficult to capture behavioral changes and social interactions that emerge during long-term operation.

Grok has recently been embroiled in safety controversies. The Independent cited cases in which, after last year’s update, Grok referred to itself as "MechaHitler" and made anti-Semitic remarks, as well as the controversy earlier this year over its misuse in generating non-consensual AI-synthesized images. It was also reported that, after the UK communications regulator Ofcom demanded corrective action from xAI, Grok posted an image that superimposed the Ofcom logo onto a bikini-clad body.

The researchers concluded that future autonomous AI systems should include mathematically and logically verifiable safety structures from the ground up, rather than relying solely on safety mechanisms embedded in model training. Emergence AI stated, "Long-term autonomy must be evaluated differently from short-term task performance," and announced plans for follow-up experiments involving additional models and a wider array of conditions.

This content was produced with the assistance of AI translation services.