Can AI Read Human Emotions? Large Language Models Put to the Test

Posted October 27, 2025

Study Goal: measuring AI emotion recognition

Three general-purpose LLMs (GPT-4o, Gemini 2.0 Experimental, and Claude 3.5 Sonnet) were tested on the NimStim facial expression dataset to measure how accurately they identify emotions from face images and how that compares with human raters.

Human-level accuracy achieved by GPT-4o and Gemini

- GPT-4o: 86% accuracy, Cohen’s κ = 0.83
- Gemini 2.0 Experimental: 84% accuracy, Cohen’s κ = 0.81
- Claude 3.5 Sonnet: 74% accuracy, Cohen’s κ = 0.70
GPT-4o and Gemini both exceeded κ = 0.80, the range conventionally described as “almost perfect” agreement with ground truth, rivaling human performance.
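
For readers unfamiliar with the two metrics, here is a minimal sketch of how accuracy and Cohen's κ can be computed from a list of true labels and a list of model predictions. The labels below are invented for illustration and are not data from the study; the function names are hypothetical as well.

```python
from collections import Counter


def accuracy(truth, predicted):
    """Fraction of items where the predicted label equals the true label."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def cohens_kappa(truth, predicted):
    """Cohen's kappa: agreement between two label sequences, corrected for chance."""
    n = len(truth)
    p_observed = accuracy(truth, predicted)
    truth_counts = Counter(truth)
    pred_counts = Counter(predicted)
    # Chance agreement: probability that both sources pick the same label
    # if each chose independently according to its own label frequencies.
    p_expected = sum(
        (truth_counts[label] / n) * (pred_counts[label] / n)
        for label in truth_counts
    )
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical labels for eight images (not data from the study).
truth = ["happy", "happy", "fear", "fear", "sad", "sad", "surprise", "surprise"]
predicted = ["happy", "happy", "surprise", "fear", "sad", "sad", "surprise", "surprise"]

print(f"accuracy = {accuracy(truth, predicted):.3f}")
print(f"kappa    = {cohens_kappa(truth, predicted):.3f}")
```

Because κ subtracts the agreement expected by chance, it is always somewhat lower than raw accuracy, which matches the pattern in the figures reported above.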

Which emotions are easy or hard for AI?

- Easiest: Happy, Calm/Neutral, Surprise
- Hardest: Fear, which was misclassified as Surprise in 36–52% of cases
- GPT-4o consistently outperformed Claude 3.5 Sonnet on Calm/Neutral, Sad, Disgust, and Surprise
- Subtle emotions remain a key hurdle for AI models
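
A figure such as "Fear misclassified as Surprise" is a row-normalized confusion rate: of all images whose true label is Fear, the share the model labeled Surprise. The sketch below shows the calculation on made-up predictions; the data and the function name are assumptions for illustration, not taken from the study.

```python
from collections import Counter, defaultdict


def misclassification_rates(truth, predicted):
    """For each true label, return the share of items assigned to each predicted label."""
    counts = defaultdict(Counter)
    for t, p in zip(truth, predicted):
        counts[t][p] += 1
    return {
        t: {p: c / sum(row.values()) for p, c in row.items()}
        for t, row in counts.items()
    }


# Hypothetical predictions for ten images (not data from the study).
truth = ["fear"] * 5 + ["surprise"] * 5
predicted = ["fear", "fear", "fear", "surprise", "surprise"] + ["surprise"] * 5

rates = misclassification_rates(truth, predicted)
print(f"fear labeled as surprise: {rates['fear']['surprise']:.0%}")
```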

Demographics: no signs of bias

- Model performance did not vary by the sex or race of the actors
- This suggests the models generalize across the demographic groups represented in the dataset
- Consistent performance across groups matters for fair and safe applications in mental health settings
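
One simple way to look for this kind of variation is to compute accuracy separately for each demographic group and compare the results. The sketch below does only that, on invented records with placeholder group names "A" and "B". This summary does not describe the statistical test the study used, so a real analysis would add a significance test on top of this comparison.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return accuracy per group from (group, true_label, predicted_label) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, predicted in records:
        total[group] += 1
        correct[group] += int(truth == predicted)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical records (not data from the study); "A" and "B" are placeholder groups.
records = [
    ("A", "happy", "happy"), ("A", "fear", "surprise"),
    ("A", "sad", "sad"), ("A", "calm", "calm"),
    ("B", "happy", "happy"), ("B", "fear", "fear"),
    ("B", "sad", "sad"), ("B", "disgust", "sad"),
]

per_group = accuracy_by_group(records)
gap = max(per_group.values()) - min(per_group.values())
print(per_group, f"largest gap = {gap:.2f}")
```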

Real-World Impact: toward emotionally intelligent AI

AI models are developing socioemotional skills, opening doors for:
- Early detection of depression, anxiety, or suicidal ideation
- Adaptive virtual agents that respond to emotional cues
- Tools for behavioral healthcare and clinician support

Study Limitations

- Only static images were tested; real-life expressions are dynamic
- Dataset had limited age and ethnic diversity
- Prompt variations may have influenced results
Future work should focus on:
- Multimodal emotion recognition (audio + video)
- Transparent, open-weight models
- Clinical validation in real-world contexts

Key Takeaway: AI can read emotions, but subtlety matters

GPT-4o and Gemini 2.0 show human-level reliability, making AI promising for healthcare and social applications, yet nuanced emotions like fear still require careful attention.

Read the full study: “Evaluating the performance of general purpose large language models in identifying human facial emotions”