This study explores how well Variational Autoencoders (VAEs) can learn separate, interpretable features from data, a concept known as disentanglement. We are conducting this research in two main phases:
- First Phase with dSprites Dataset:
  - Objective: We start with the dSprites dataset, which contains simple, synthetic images of shapes with varying attributes such as shape, size, and orientation. This controlled setting lets us evaluate how well VAEs disentangle these known factors.
  - Methods: We train a VAE augmented with a VGG perceptual loss and a Total Correlation (TC) penalty to encourage disentanglement, and we measure success with Mutual Information Gap (MIG) and Beta-VAE scores (a sketch of this objective follows the list).
  - Explainability: We apply three methods (Feature Ablation, Counterfactual Latent Traversals, and Clustering) to interpret the latent space learned by the VAE. We evaluate these methods on fidelity (how well their attributions correlate with the VAE’s reconstruction error) and stability (consistency of attributions across similar inputs).
- Second Phase with CelebA Dataset:
  - Objective: We plan to extend our study to the CelebA dataset, which contains real-world images of faces. This phase will test the robustness of our methods in a more complex and varied setting.
  - Focus: We will explore how well VAEs can disentangle facial attributes and how reliable our interpretability methods are when applied to real-world data.
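The first-phase objective combines a reconstruction term (a VGG perceptual loss plus a pixel-wise term), the standard KL term, and a TC penalty. The sketch below is a minimal PyTorch illustration under assumed interfaces: the VGG cut-off layer, the minibatch TC estimator, the `beta_tc` weight, and the encoder/decoder calling convention are assumptions for illustration, not the exact implementation used in the study.

```python
import math

import torch
import torch.nn.functional as F
from torchvision.models import vgg16


class VGGPerceptualLoss(torch.nn.Module):
    """Reconstruction loss measured in the feature space of frozen early VGG-16 layers."""

    def __init__(self, layer_idx: int = 8):  # assumed cut-off layer
        super().__init__()
        self.features = vgg16(weights="IMAGENET1K_V1").features[:layer_idx].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)

    def forward(self, x_hat, x):
        # dSprites images are single-channel; tile to 3 channels for VGG.
        if x.shape[1] == 1:
            x, x_hat = x.repeat(1, 3, 1, 1), x_hat.repeat(1, 3, 1, 1)
        return F.mse_loss(self.features(x_hat), self.features(x))


def log_gaussian(z, mu, logvar):
    # Element-wise log N(z | mu, exp(logvar)).
    return -0.5 * (logvar + (z - mu) ** 2 / logvar.exp() + math.log(2 * math.pi))


def total_correlation(z, mu, logvar):
    """Simple minibatch estimate of TC = KL(q(z) || prod_j q(z_j)),
    in the spirit of the beta-TC-VAE decomposition (assumed estimator)."""
    batch = z.shape[0]
    # log q(z_i | x_j) for every pair in the batch: shape (batch, batch, dim)
    log_q_pairs = log_gaussian(z.unsqueeze(1), mu.unsqueeze(0), logvar.unsqueeze(0))
    log_qz = torch.logsumexp(log_q_pairs.sum(dim=2), dim=1) - math.log(batch)
    log_qz_prod = (torch.logsumexp(log_q_pairs, dim=1) - math.log(batch)).sum(dim=1)
    return (log_qz - log_qz_prod).mean()


def vae_loss(x, x_hat, z, mu, logvar, perceptual, beta_tc=6.0):
    """Perceptual + pixel reconstruction, KL to the prior, and a weighted TC penalty.
    The decoder is assumed to output sigmoid probabilities in [0, 1]."""
    recon = perceptual(x_hat, x) + F.binary_cross_entropy(x_hat, x)
    kl = -0.5 * torch.mean((1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1))
    return recon + kl + beta_tc * total_correlation(z, mu, logvar)
```

In training, `perceptual = VGGPerceptualLoss()` would be instantiated once and reused across batches; the TC weight `beta_tc` is a tunable knob to be selected against the MIG and Beta-VAE scores examined in RQ1.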
Research Questions
- RQ1: How effective is a VAE trained with a VGG perceptual loss and a TC penalty at learning disentangled latent representations on the dSprites dataset, as measured by MIG and Beta-VAE scores?
- RQ2: How do Feature Ablation, Counterfactual Latent Traversals, and Clustering compare in explaining VAE latent spaces on dSprites, in terms of fidelity and stability?
- H1: The explainable AI (XAI) methods achieve stable attributions across different dSprites inputs, with stability scores between 0.1 and 0.5.
- H3: The XAI methods produce attributions with positive fidelity to the VAE’s reconstruction error on dSprites, with fidelity scores between 0.3 and 0.8.
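Testing H1 and H3 requires concrete definitions of the two criteria. The sketch below shows one plausible operationalisation, assuming fidelity is the Pearson correlation between the total attribution assigned to each input and the VAE reconstruction error on that input, and stability is the mean pairwise cosine similarity of attributions under small input perturbations; the noise scale, number of repeats, and the attribution interface are illustrative assumptions rather than the study's fixed protocol.

```python
import numpy as np


def fidelity(attribution_totals, recon_errors):
    """Pearson correlation, over a set of inputs, between the total attribution
    assigned to each input and the VAE's reconstruction error on that input."""
    return float(np.corrcoef(attribution_totals, recon_errors)[0, 1])


def stability(attribution_fn, x, noise_scale=0.01, n_repeats=10, seed=0):
    """Mean pairwise cosine similarity of attributions computed on slightly
    perturbed copies of the same input (higher = more consistent)."""
    rng = np.random.default_rng(seed)
    runs = [np.ravel(attribution_fn(x + noise_scale * rng.standard_normal(x.shape)))
            for _ in range(n_repeats)]
    runs = [a / (np.linalg.norm(a) + 1e-12) for a in runs]
    sims = [runs[i] @ runs[j]
            for i in range(n_repeats) for j in range(i + 1, n_repeats)]
    return float(np.mean(sims))


if __name__ == "__main__":
    # Toy check with a hypothetical attribution function (pixel deviation from the mean).
    dummy_attr = lambda img: (img - img.mean()) ** 2
    img = np.random.default_rng(1).random((64, 64))
    print("stability:", stability(dummy_attr, img))
    print("fidelity:", fidelity(np.array([0.2, 0.5, 0.9, 0.4]),
                                np.array([0.1, 0.6, 0.8, 0.3])))
```

In the actual experiments, `attribution_fn` would be one of the three XAI methods applied to the trained VAE, and the thresholds stated in H1 and H3 would be checked against these scores averaged over a sample of dSprites images.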