Facial Expression Recognition with Landmark Detection
A two-stage facial expression recognition pipeline using MediaPipe for landmark detection and a modified ResNet18 with self-attention for emotion classification.
Overview
This project develops a two-stage facial expression recognition pipeline on the FER2013 dataset:
- Facial landmark detection using MediaPipe to extract key points
- Emotion classification with a modified ResNet18 enhanced by self-attention.
Implemented in PyTorch, the system addresses class imbalance and low-resolution challenges, achieving 52.38% accuracy on the imbalanced test set, with a 4.2% improvement contributed by the attention mechanism.
Dataset
- FER2013 consists of 35,887 grayscale 48x48-pixel images across 7 emotion classes (Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral), split into training (28,709), public test (3,589), and private test (3,589) sets.
- The dataset presents challenges like class imbalance (e.g., Disgust: only 547 samples) and detection failures in ~18% of low-resolution images.
For more details, refer to the FER2013 dataset.
Core Challenge: Handling noisy, imbalanced facial data for accurate landmark extraction and emotion prediction.
Methodology
Data Preprocessing and Augmentation
Note: Images are normalized and augmented with random rotation, zoom, and horizontal flips to improve model robustness; a sketch of such a pipeline follows.
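A minimal sketch of such an augmentation pipeline using torchvision transforms (the specific rotation/zoom ranges and normalization statistics below are illustrative assumptions, not the project's exact values):

```python
from torchvision import transforms

# Assumed augmentation pipeline for 48x48 FER2013 crops replicated to 3 channels
train_transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),             # replicate grayscale to 3 channels
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomAffine(degrees=10, scale=(0.9, 1.1)),   # small rotation and zoom
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```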

Part 1: Facial Landmark Detection
MediaPipe extracts ground-truth landmarks, which train a ResNet18 regressor for 4 key points (eye centers, lip corners).
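A minimal sketch of how MediaPipe FaceMesh could supply these ground-truth points (the landmark indices and the helper name are illustrative assumptions; iris-center indices require `refine_landmarks=True`):

```python
import numpy as np
import mediapipe as mp

# Assumed FaceMesh indices for the 4 target points
TARGET_IDX = {"left_eye_center": 468, "right_eye_center": 473,
              "left_lip_corner": 61, "right_lip_corner": 291}

def extract_landmarks(gray_image: np.ndarray):
    """Return 8 normalized coordinates (x, y per point), or None if detection fails."""
    rgb = np.repeat(gray_image[..., None], 3, axis=-1).astype(np.uint8)  # MediaPipe expects RGB
    with mp.solutions.face_mesh.FaceMesh(static_image_mode=True,
                                         max_num_faces=1,
                                         refine_landmarks=True) as mesh:
        result = mesh.process(rgb)
    if not result.multi_face_landmarks:
        return None  # detection fails on roughly 18% of the low-resolution images
    lm = result.multi_face_landmarks[0].landmark
    return np.array([[lm[i].x, lm[i].y] for i in TARGET_IDX.values()]).reshape(-1)
```

The resulting normalized coordinates serve as regression targets for the model below.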
```python
import torch
import torch.nn as nn
from torchvision import models

class LandmarkDetectionModel(nn.Module):
    def __init__(self, num_landmarks=8):  # 4 points x (x, y)
        super().__init__()
        self.resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        self.resnet.fc = nn.Sequential(
            nn.Linear(self.resnet.fc.in_features, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, num_landmarks)
        )

    def forward(self, x):
        return self.resnet(x)
```
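The regressor can be trained directly against the extracted points with a coordinate regression loss; a minimal sketch (the optimizer settings and loader name are assumptions):

```python
import torch

landmark_model = LandmarkDetectionModel(num_landmarks=8)
criterion = torch.nn.MSELoss()                         # regress normalized (x, y) coordinates
optimizer = torch.optim.Adam(landmark_model.parameters(), lr=1e-4)

for images, targets in landmark_loader:                # landmark_loader is an assumed DataLoader
    optimizer.zero_grad()
    loss = criterion(landmark_model(images), targets)  # targets: (B, 8) MediaPipe coordinates
    loss.backward()
    optimizer.step()
```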
Part 2: Emotion Classification with Attention
The ResNet18 model is modified to include a self-attention mechanism, enhancing feature extraction and classification.
```python
class EmotionClassificationModel(nn.Module):
    def __init__(self, num_classes=7):
        super(EmotionClassificationModel, self).__init__()
        # Load pre-trained ResNet18
        self.resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Re-create the first conv layer and restore the pre-trained weights
        # (grayscale inputs are expected to be replicated to 3 channels upstream)
        original_weight = self.resnet.conv1.weight.clone()
        self.resnet.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        with torch.no_grad():
            self.resnet.conv1.weight = nn.Parameter(original_weight)
        # Replace the final fully connected layer for emotion classification
        num_features = self.resnet.fc.in_features
        self.resnet.fc = nn.Sequential(
            nn.Dropout(0.3),  # increased dropout from 0.2 to 0.3
            nn.Linear(num_features, 256),
            nn.ReLU(),
            nn.Dropout(0.3),  # second dropout layer
            nn.Linear(256, num_classes)
        )
        # Channel-wise attention over the final 512-channel feature map
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),  # global average pooling
            nn.Conv2d(512, 32, kernel_size=1),
            nn.ReLU(),
            nn.Conv2d(32, 512, kernel_size=1),
            nn.Sigmoid()
        )

    def forward(self, x):
        # Extract intermediate features from the ResNet backbone
        x = self.resnet.conv1(x)
        x = self.resnet.bn1(x)
        x = self.resnet.relu(x)
        x = self.resnet.maxpool(x)
        x = self.resnet.layer1(x)
        x = self.resnet.layer2(x)
        x = self.resnet.layer3(x)
        x = self.resnet.layer4(x)
        # Apply self-attention and re-weight the feature map
        att = self.attention(x)
        x = x * att
        x = self.resnet.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.resnet.fc(x)
        return x
```
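A quick sanity check of the classifier's forward pass (the 3-channel 48x48 input shape is an assumption matching FER2013 crops replicated to RGB):

```python
model = EmotionClassificationModel(num_classes=7)
dummy = torch.randn(4, 3, 48, 48)   # batch of 4 grayscale crops replicated to 3 channels
logits = model(dummy)
print(logits.shape)                 # torch.Size([4, 7])
```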
- Training: Class-weighted CrossEntropyLoss, AdamW (lr=0.0005, weight_decay=9e-2), early stopping.
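A minimal sketch of that training setup (the class counts are the commonly cited FER2013 training-set totals; the patience value, loader names, and validation helper are assumptions):

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = EmotionClassificationModel(num_classes=7).to(device)

# Inverse-frequency class weights (Angry, Disgust, Fear, Happy, Sad, Surprise, Neutral)
class_counts = torch.tensor([3995., 436., 4097., 7215., 4830., 3171., 4965.])
class_weights = class_counts.sum() / (len(class_counts) * class_counts)

criterion = nn.CrossEntropyLoss(weight=class_weights.to(device))
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=9e-2)

best_val_loss, patience, wait = float("inf"), 5, 0    # patience value is an assumption
for epoch in range(100):
    model.train()
    for images, labels in train_loader:               # train_loader is an assumed DataLoader
        optimizer.zero_grad()
        loss = criterion(model(images.to(device)), labels.to(device))
        loss.backward()
        optimizer.step()

    val_loss = validate(model, val_loader, criterion, device)  # hypothetical helper
    if val_loss < best_val_loss:
        best_val_loss, wait = val_loss, 0
        torch.save(model.state_dict(), "best_emotion_model.pt")
    else:
        wait += 1
        if wait >= patience:
            break                                     # early stopping
```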
Results & Analysis
Landmark Detection Performance
| Landmark | Average Error (Euclidean Distance) |
|---|---|
| Left Eye Center | 0.0382 |
| Right Eye Center | 0.0398 |
| Left Lip Corner | 0.0471 |
| Right Lip Corner | 0.0433 |
- Average error: 0.0386 (Euclidean distance in coordinates normalized to image size)
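For reference, this metric is a mean Euclidean distance computed in normalized image coordinates; a minimal sketch of how it could be calculated (tensor names and shapes are assumptions):

```python
import torch

def landmark_errors(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean Euclidean distance per landmark in [0, 1]-normalized coordinates.

    pred, target: tensors of shape (N, 8) holding (x, y) pairs for the 4 landmarks.
    """
    diff = (pred - target).view(-1, 4, 2)   # (N, 4, 2)
    return diff.norm(dim=-1).mean(dim=0)    # (4,) one average error per landmark
```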

The above visualization shows the predicted landmarks (blue) compared to the ground truth (red) on sample images. The model effectively captures key facial features, demonstrating its robustness against noise and low resolution.
Emotion Classification
The diagnostic plots:

The emotion classification model achieved the following performance metrics on the FER2013 dataset:
| Emotion | Accuracy | Sample Count |
|---|---|---|
| Happy | 82.82% | 1,983 |
| Surprise | 74.28% | 831 |
| Neutral | 73.64% | 1,234 |
| Fear | 11.74% | 764 |
| Disgust | 38.18% | 123 |
| Overall | 52.38% | 3,589 |
Key findings:
- Attention Mechanism Impact: Self-attention improved accuracy by 4.2% over the baseline ResNet18, demonstrating the value of focusing on emotion-relevant facial regions
- Class Imbalance Reality: Model excelled on majority classes but struggled with minority emotions, reflecting real-world dataset challenges
- Data Quality Constraints: 18% face detection failure rate highlighted limitations when working with low-resolution facial data
- Architecture Choices: ResNet18 with attention struck a good balance between model complexity and performance for this constrained dataset
Future Work
- Address class imbalance using oversampling, undersampling, or synthetic data (a sampler sketch follows after this list).
- Explore advanced methods like transfer learning, ensemble models, or Vision Transformers (ViTs) for improved feature extraction and accuracy.
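As one option for the oversampling idea above, a minimal sketch using PyTorch's WeightedRandomSampler (the dataset and label variables are assumptions):

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# train_labels is an assumed 1-D tensor of per-sample emotion labels (0-6)
class_counts = torch.bincount(train_labels, minlength=7).float()
sample_weights = 1.0 / class_counts[train_labels]       # rarer classes are drawn more often
sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(sample_weights),
                                replacement=True)
train_loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)  # train_dataset assumed
```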