AI Reads Even the Sarcasm in a Simple "Well Done"—Now Changes Facial Expressions [Reading Science]

by Kim Jonghwa

Published 18 Jun. 2026 08:00 (KST)

UNIST Analyzes Emotions in Voice to Generate Facial Expressions
Expresses Untrained Emotions... Accuracy Improved by 14 Percentage Points

"Well done."

The same words can be either a genuine compliment or a sarcastic remark. Humans can tell the difference just by listening to tone and intonation, but this is not an easy task for artificial intelligence (AI).

[Image] Comparison of emotion editing results between C-MET and existing methods. The research team fed sarcastic emotional speech into the same neutral-expression video and reported that C-MET most accurately reproduced the subtle facial changes characteristic of sarcasm, such as the corners of the mouth stretching widely sideways, while existing technologies failed to recreate these expressions properly. Provided by the research team.

A Korean research team has developed an AI technology that reads subtle emotions embedded in speech and alters the facial expression of a person in a video accordingly. The technology can express not only simple emotions like joy or sadness but also more complex ones such as sarcasm, empathy, and charisma, and is expected to make virtual humans, educational avatars, and counseling AI appear considerably more natural.

The Ulsan National Institute of Science and Technology (UNIST) announced on June 18 that Professor Taehwan Kim's research team at the Graduate School of Artificial Intelligence has developed an AI module called "C-MET (Cross-Modal Emotion Transfer)," which extracts emotions from voice signals and transforms the speaker's facial expression in a video to reflect the desired emotion.

Traditional face generation AI has limitations: it requires a reference image expressing a specific emotion or can only generate facial expressions within the range of emotions it has been trained on.

The research team focused not on the emotions themselves, but on the "variation" between emotions.

They quantified the difference between a neutral voice and an emotionally charged voice, and then designed the AI to learn how these changes manifest as facial expression changes. As a result, even when both content and emotion are mixed in speech, the AI can isolate and interpret only the emotional signals.

For example, the same sentence can be spoken with different intonations, leading to distinct movements in the corners of the mouth, eyebrows, or around the eyes.

In particular, instead of labeling emotions as "joy" or "sadness" during training, the model learns the differences between emotions. This allows it to express emotions it hasn't encountered during training, including subtle emotions such as sarcasm, empathy, or charisma.
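The idea of learning the "variation" rather than the emotion label can be illustrated with a toy sketch. This is not the C-MET implementation, which the article does not describe in detail. It assumes, purely for illustration, that a voice clip and a facial expression are each already encoded as a small vector, and that a plain linear map (fitted by least squares) stands in for the learned model. The names `emotion_delta`, `fit_delta_map`, and `edit_expression`, the vector sizes, and the synthetic data are all hypothetical.

```python
import numpy as np


def emotion_delta(emotional_emb, neutral_emb):
    """Difference between an emotional and a neutral embedding of the same content."""
    return np.asarray(emotional_emb, dtype=float) - np.asarray(neutral_emb, dtype=float)


def fit_delta_map(audio_deltas, face_deltas):
    """Fit a linear map W so that audio_deltas @ W approximates face_deltas.

    Assumption: a linear least-squares fit stands in for whatever model
    the real system trains.
    """
    W, *_ = np.linalg.lstsq(audio_deltas, face_deltas, rcond=None)
    return W


def edit_expression(neutral_face, audio_delta, W):
    """Shift a neutral expression vector by the predicted expression change."""
    return np.asarray(neutral_face, dtype=float) + audio_delta @ W


# Synthetic training data: 50 paired (voice change, expression change) examples.
rng = np.random.default_rng(0)
true_map = rng.normal(size=(4, 3))      # hidden relation used to generate the toy data
train_audio = rng.normal(size=(50, 4))  # voice deltas (emotional minus neutral)
train_face = train_audio @ true_map     # matching expression deltas
W = fit_delta_map(train_audio, train_face)

# A voice change that never appeared in training, e.g. a "sarcastic" reading.
neutral_voice = rng.normal(size=4)
sarcastic_voice = neutral_voice + np.array([0.5, -1.0, 0.2, 0.8])
delta = emotion_delta(sarcastic_voice, neutral_voice)

neutral_face = np.zeros(3)
edited_face = edit_expression(neutral_face, delta, W)
print(edited_face)
```

Because the map is fitted on differences and never sees an emotion label, it can be applied to any new voice change, which is the intuition behind expressing emotions that were not part of training. Subtracting the neutral reading of the same sentence also cancels the shared spoken content, leaving only the emotional signal.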

"Emotional Expression Without Photos"…Anticipation for Virtual Human Applications

The research team validated C-MET's performance by applying it to the latest talking face editing technology, "EDTalk."

As a result, emotional expression accuracy on MEAD (Multi-view Emotional Audio-visual Dataset), a multi-emotion audio-visual dataset, improved by about 14 percentage points, from 41.99% to 55.91%.

On another face generation model, "PD-FGC," accuracy also rose from 33.36% to 36.82%, a gain of about 3.5 percentage points, and inference speed improved as well. The results indicate that the technology can be applied to various face generation AIs rather than being tied to any particular model.
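The accuracy changes reported above are stated in percentage points, which is not the same as a relative percentage increase. The short sketch below works through the arithmetic using only the four accuracy figures given in the article. The helper names `point_gain` and `relative_gain` are made up for this example.

```python
def point_gain(before, after):
    """Absolute change in percentage points."""
    return after - before


def relative_gain(before, after):
    """Change expressed as a percentage of the starting value."""
    return (after - before) / before * 100


# Accuracy (%) before and after adding C-MET, as reported in the article.
results = {
    "EDTalk": (41.99, 55.91),
    "PD-FGC": (33.36, 36.82),
}

for model, (before, after) in results.items():
    print(
        f"{model}: +{point_gain(before, after):.2f} percentage points, "
        f"{relative_gain(before, after):.1f}% relative increase"
    )
```

For EDTalk, the difference of 13.92 percentage points is what the headline rounds to 14. Measured relative to the 41.99% starting point, the same change is an increase of roughly one third.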

Research team photo. Professor Taehwan Kim (left) and researcher Chanhyuk Choi. Courtesy of UNIST

The research team explained that the technology is highly versatile, as it can generate facial expressions from voice alone, without the need for high-quality reference images containing emotions.

Professor Taehwan Kim stated, "This research has practically overcome the limitations of previous technologies by enabling facial emotion changes in videos using only voice, without reference images." He added, "It serves as a foundational technology that can be used in a variety of fields, such as virtual human production, post-production in films and content, and emotional recognition AI."

This research was led by first author Chanhyuk Choi, a master's student at the UNIST Graduate School of Artificial Intelligence. The results have been accepted for presentation at CVPR 2026 (the IEEE/CVF Conference on Computer Vision and Pattern Recognition), one of the most prestigious conferences in artificial intelligence and computer vision.

This content was produced with the assistance of AI translation services.
