AI Reads Even the Sarcasm in a Simple "Well Done"—Now Changes Facial Expressions [Reading Science]
UNIST Analyzes Emotions in Voice to Generate Facial Expressions
Expresses Untrained Emotions... Accuracy Improved by 14 Percentage Points
"Well done."
Even the same words can be either a genuine compliment or a sarcastic remark. While humans can detect emotions just by listening to tone and intonation, this is not an easy task for artificial intelligence (AI).
Comparison of emotion editing results between C-MET and existing methods. The research team compared performance by inputting sarcastic emotional speech into the same neutral expression video and explained that C-MET most accurately reproduced the subtle facial changes characteristic of sarcasm, such as the corners of the mouth stretching widely sideways. In contrast, existing technologies failed to properly recreate these emotional expressions. Image provided by the research team.
A Korean research team has developed an AI technology that reads subtle emotions embedded in speech and alters the facial expression of a person in a video accordingly. This technology can express not only simple emotions like joy or sadness but also more complex emotions such as sarcasm, empathy, and charisma, greatly enhancing the naturalness of virtual humans, educational avatars, and counseling AI.
The Ulsan National Institute of Science and Technology (UNIST) announced on June 18 that Professor Taehwan Kim's research team at the Graduate School of Artificial Intelligence has developed an AI module called "C-MET (Cross-Modal Emotion Transfer)," which extracts emotions from voice signals and transforms the speaker's facial expression in a video to reflect the desired emotion.
Conventional face generation AI has two main limitations: it either requires a reference image expressing a specific emotion, or it can only generate facial expressions within the range of emotions it was trained on.
The research team focused not on the emotions themselves, but on the "variation" between emotions.
They quantified the difference between a neutral voice and an emotionally charged voice, and then designed the AI to learn how these changes manifest as facial expression changes. As a result, even when both content and emotion are mixed in speech, the AI can isolate and interpret only the emotional signals.
For example, the same sentence can be spoken with different intonations, leading to distinct movements in the corners of the mouth, eyebrows, or around the eyes.
In particular, instead of labeling emotions as "joy" or "sadness" during training, the model learns the differences between emotions. This allows it to express emotions it hasn't encountered during training, including subtle emotions such as sarcasm, empathy, or charisma.
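The idea of learning from the difference between a neutral voice and an emotional voice can be illustrated with a toy sketch. It is not the team's implementation: the vectors, the additive "content plus emotion" encoder, the `AUDIO_TO_FACE` weights, and every function name below are hypothetical stand-ins for what would be learned neural components. Under that simplifying assumption, subtracting the neutral embedding cancels the spoken content and leaves only the emotional shift, which is then mapped to a shift in facial expression.

```python
# Toy illustration only. All names and numbers are hypothetical; a real
# system would use learned audio and expression encoders, not fixed lists.

def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    """Element-wise difference a - b of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]

def apply_linear(weights, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

# Assumed emotion offset in audio-embedding space (stands in for "sarcasm").
SARCASM_OFFSET = [0.4, -0.1, 0.4]

def toy_audio_embedding(content, emotion_offset=None):
    """Toy encoder: the content vector plus an optional emotion offset."""
    if emotion_offset is None:
        return list(content)
    return add(content, emotion_offset)

# Two different sentences, each spoken neutrally and sarcastically.
sentence_a = [0.2, 0.5, 0.1]
sentence_b = [0.9, 0.1, 0.3]

# Subtracting the neutral embedding removes the content component.
shift_a = sub(toy_audio_embedding(sentence_a, SARCASM_OFFSET),
              toy_audio_embedding(sentence_a))
shift_b = sub(toy_audio_embedding(sentence_b, SARCASM_OFFSET),
              toy_audio_embedding(sentence_b))

# Assumed mapping from an audio shift to an expression shift
# (two toy expression axes: mouth-corner stretch, brow movement).
AUDIO_TO_FACE = [
    [1.0, 0.0, 0.5],
    [0.0, 2.0, 0.0],
]

neutral_face = [0.0, 0.0]
face_shift = apply_linear(AUDIO_TO_FACE, shift_a)
edited_face = add(neutral_face, face_shift)

print("audio shift, sentence A:", shift_a)
print("audio shift, sentence B:", shift_b)
print("edited face parameters:", edited_face)
```

In this toy setup both sentences yield the same shift up to floating-point rounding, which is the sense in which a difference isolates emotion from content. Because nothing in the sketch depends on an emotion label, any new offset, including one for an emotion never seen before, could be passed through the same mapping.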
"Emotional Expression Without Photos"…Anticipation for Virtual Human Applications
The research team validated C-MET's performance by applying it to the latest talking face editing technology, "EDTalk."
As a result, emotional expression accuracy on MEAD (Multi-view Emotional Audio-visual Dataset), a multi-emotion audio-visual dataset, improved by about 14 percentage points, from 41.99% to 55.91%.
In another face generation model, "PD-FGC," accuracy also increased from 33.36% to 36.82%. Inference speed also improved, confirming that the technology can be applied to various face generation AIs without being limited to any particular model.
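For readers who want to check the figures, the short sketch below recomputes the gains from the accuracies reported above. The helper names are illustrative, and the relative-gain figure is a derived number that does not appear in the team's announcement. It also shows why "percentage points" and "percent" are not interchangeable.

```python
# Accuracies (in %) as reported in the article: (before, after) applying C-MET.
REPORTED = {
    "EDTalk": (41.99, 55.91),
    "PD-FGC": (33.36, 36.82),
}

def percentage_point_gain(before, after):
    """Absolute difference between two percentages."""
    return after - before

def relative_gain(before, after):
    """Improvement expressed as a percent of the starting value."""
    return (after - before) / before * 100.0

for model, (before, after) in REPORTED.items():
    pp = percentage_point_gain(before, after)
    rel = relative_gain(before, after)
    print(f"{model}: +{pp:.2f} percentage points ({rel:.1f}% relative)")
```

The EDTalk gain of 13.92 percentage points is what the article rounds to "about 14." Measured against the starting accuracy, the same change is a relative improvement of roughly a third.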
Research team photo. Professor Taehwan Kim (left) and researcher Chanhyuk Choi. Courtesy of UNIST
The research team explained that the technology is highly versatile, as it can generate facial expressions from voice alone, without the need for high-quality reference images containing emotions.
Professor Taehwan Kim stated, "This research has practically overcome the limitations of previous technologies by enabling facial emotion changes in videos using only voice, without reference images." He added, "It serves as a foundational technology that can be used in a variety of fields, such as virtual human production, post-production in films and content, and emotional recognition AI."
This research was led by Chanhyuk Choi, a master's student at the UNIST Graduate School of Artificial Intelligence, as the first author. The results have been accepted for presentation at CVPR 2026, widely regarded as one of the most prestigious conferences in artificial intelligence and computer vision.
© The Asia Business Daily(www.asiae.co.kr). All rights reserved.