True to Tone? Quantifying Skin Tone Fidelity and Bias in Photographic-to-Virtual Human Pipelines

Published 2 Apr 2026 in cs.CV | (2604.02055v1)

Abstract: Accurate reproduction of facial skin tone is essential for realism, identity preservation, and fairness in Virtual Human (VH) rendering. However, most accessible avatar creation pipelines rely on photographic inputs that lack colorimetric calibration, which can introduce inconsistencies and bias. We propose a fully automatic and scalable methodology to systematically evaluate skin tone fidelity across the VH generation pipeline. Our approach defines a full workflow that integrates skin color and illumination extraction, texture recolorization, real-time rendering, and quantitative color analysis. Using facial images from the Chicago Face Database (CFD), we compare skin tone extraction strategies based on cheek-region sampling, following the literature, and multidimensional masking derived from full-face analysis. Additionally, we test both strategies with lighting isolation, using the pre-trained TRUST framework, employed without any training or optimization within our pipeline. Extracted skin tones are applied to MetaHuman textures and rendered under multiple lighting configurations. Skin tone consistency is evaluated objectively in the CIELAB color space using the $\Delta E$ metric and the Individual Typology Angle (ITA). The proposed methodology operates without manual intervention and, with the exception of pre-trained illumination compensation modules, the pipeline does not include learning or training stages, enabling low computational cost and large-scale evaluation. Using this framework, we generate and analyze approximately 19,848 rendered instances. Our results show phenotype-dependent behavior of extraction strategies and consistently higher colorimetric errors for darker skin tones.


Authors (6)

Summary

The paper introduces an automated pipeline that assesses skin tone fidelity using multiple extraction methods, illumination compensation, and real-time rendering techniques.
It reports that TRUST-based extraction lowers colorimetric error relative to the uncompensated methods: median ITA error is 15.08 (T-Cheek) and 12.83 (T-MMM), versus 29.32 (Cheek) and 24.89 (MMM).
Findings highlight systematic bias in virtual human rendering, emphasizing the need for enhanced global illumination compensation and comprehensive skin sampling.

Quantitative Analysis of Skin Tone Fidelity and Bias in Photographic-to-Virtual Human Pipelines

Introduction

The paper "True to Tone? Quantifying Skin Tone Fidelity and Bias in Photographic-to-Virtual Human Pipelines" (2604.02055) presents a rigorous quantitative framework for assessing the fidelity and fairness of skin tone representation in contemporary virtual human (VH) pipelines. The study targets the reproducibility of facial skin tones when translating non-calibrated photographic images into 3D avatar renderings, acknowledging the critical implications for realism, identity preservation, and algorithmic bias in immersive applications.

Methodological Framework

The authors propose a fully automated, scalable evaluation pipeline comprising several key stages: (1) albedo extraction via four distinct methods (Cheek, MMM, T-Cheek, T-MMM), (2) illumination compensation leveraging the TRUST framework, (3) texture recolorization and application to MetaHuman assets, (4) real-time rendering in Unreal Engine under multiple lighting conditions, and (5) quantitative analysis of color reproduction using CIELAB $\Delta E$ and Individual Typology Angle (ITA) metrics.

The source imagery comes from the Chicago Face Database (CFD), including its multiracial and Indian extension subsets (total $N=827$ images), which provides demographic coverage and controlled variability. From these images the pipeline produces 19,848 rendered instances, enough to support statistical analysis of how the pipeline stages interact.
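
The reported render count is consistent with a full factorial design. The check below is only a sketch: the four extraction methods and three lighting setups are named in this summary, whereas treating the two recoloring variants (normalization-based and variation-map) as a separate factor of two is an assumption made here because it reproduces the stated total.

```python
from itertools import product

N_IMAGES = 827  # CFD images, as stated in the text
EXTRACTION = ["Cheek", "MMM", "T-Cheek", "T-MMM"]
LIGHTING = ["CFD-matched", "Frontal", "Paramount"]
# Assumption (not confirmed by the source): both recoloring variants
# are rendered for every image, method, and lighting setup.
RECOLORING = ["normalization", "variation-map"]

conditions = list(product(EXTRACTION, LIGHTING, RECOLORING))
total_renders = N_IMAGES * len(conditions)

print(len(conditions), total_renders)  # 24 19848
```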

Figure 1: Schematic illustration of the methodology spanning input image extraction, albedo extraction, texture recolorization, lighting variation, and quantitative evaluation.

Skin Color Extraction and Rendering

The paper evaluates four primary color extraction methods:

Cheek: sRGB averaging over cheek regions.
MMM: Multidimensional masking using CIELAB clustering of full-face pixels.
T-Cheek and T-MMM: Variants applying TRUST intrinsic image decomposition to the Cheek and MMM routines, decoupling illumination from albedo estimation.
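
As a rough illustration of the simplest of these strategies, the sketch below averages sRGB values over a cheek region and converts the mean to CIELAB with the standard sRGB (D65) formulas. The function names and pixel values are hypothetical, and cheek-region detection, the MMM clustering, and the TRUST decomposition are not shown; this is not the authors' implementation.

```python
def srgb_to_lab(rgb):
    """Convert one sRGB triple (0-255 scale) to CIELAB, D65 white."""
    def linearize(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearize(c) for c in rgb)
    # Linear sRGB -> CIE XYZ (D65).
    x = 0.4124564 * r + 0.3575761 * g + 0.1804375 * b
    y = 0.2126729 * r + 0.7151522 * g + 0.0721750 * b
    z = 0.0193339 * r + 0.1191920 * g + 0.9503041 * b

    def f(t):
        delta = 6.0 / 29.0
        if t > delta ** 3:
            return t ** (1.0 / 3.0)
        return t / (3 * delta ** 2) + 4.0 / 29.0

    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def mean_srgb(pixels):
    """Channel-wise mean of a list of (R, G, B) pixels."""
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))


# Hypothetical cheek-region pixels; real input would come from a face mask.
cheek_pixels = [(198, 150, 125), (202, 154, 129), (194, 146, 121)]
tone = mean_srgb(cheek_pixels)
lab = srgb_to_lab(tone)
print(tone, lab)
```

Averaging in sRGB follows the description of the Cheek method above; averaging in a linear or perceptual space would be a different design choice and can give a slightly different tone.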

After color extraction, target skin tones are applied to MetaHuman base textures via normalization-based and variation-map recoloring, then rendered in Unreal Engine under three lighting regimes: CFD-matched, Frontal, and Paramount.

Figure 2: Renderings from the CFD dataset, comparing lighter (ITA Class 1) and darker (ITA Class 6) skin phenotypes across extraction methods and lighting setups.

Quantitative Evaluation: $\Delta E$ and ITA

Color fidelity is assessed via $\Delta E$ (CIE76) and ITA error, with the analysis stratified by six dermatologically motivated ITA skin tone classes (I–VI). Colorimetric reproduction error rises markedly for darker skin tones, with the following key findings:

Extraction Method Impact: Approaches without illumination compensation (Cheek, MMM) yield higher median $\Delta E$ and ITA error than TRUST-based methods. For example, Cheek and MMM produce median ITA errors of 29.32 and 24.89, respectively, versus 15.08 (T-Cheek) and 12.83 (T-MMM).
Lighting Impact: CFD-matched illumination substantially outperforms Frontal and Paramount, especially for darker phenotypes, with statistically significant error escalation under studio lighting.
Phenotype Sensitivity: ITA error rises steeply with ITA class (toward darker phototypes); for example, medians grow from 12.12 (ITA I) to 49.43 (ITA VI).

Figure 3: Distribution of $\Delta E$ and ITA errors across lighting and extraction methods, with larger errors observed in non-TRUST methods and under studio lighting.

Figure 4: ITA error breakdown by skin phenotype and extraction method, highlighting disproportionate degradation for darker phototypes.
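
The two metrics can be sketched in a few lines. CIE76 $\Delta E$ is the Euclidean distance in CIELAB, and ITA is $\arctan((L^* - 50)/b^*)$ expressed in degrees. The sample values are made up, and defining ITA error as the absolute difference between source and rendered ITA is an assumption; this summary does not spell out the exact definition.

```python
import math


def delta_e76(lab1, lab2):
    """CIE76 color difference: Euclidean distance in CIELAB."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(lab1, lab2)))


def ita_degrees(lab):
    """Individual Typology Angle in degrees.

    atan2 equals arctan((L - 50) / b) whenever b > 0, which is the
    usual case for skin, and avoids a division by zero at b == 0.
    """
    L, _, b = lab
    return math.degrees(math.atan2(L - 50.0, b))


# Made-up example: the render is lighter and less yellow than the source.
source = (62.0, 14.0, 18.0)
render = (65.0, 14.0, 14.0)

print(delta_e76(source, render))  # 5.0
# Assumed definition of ITA error: absolute difference in degrees.
ita_error = abs(ita_degrees(render) - ita_degrees(source))
print(round(ita_error, 2))
```

A higher ITA corresponds to lighter skin, so a render whose ITA exceeds the source ITA has been lightened.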

Qualitative Assessment and Confusion Analysis

Visualizations of the largest color differences corroborate the quantitative results: darker skin tones are systematically lightened and desaturated in all but the TRUST-based methods. Confusion matrix analysis between ground truth and rendered ITA classes shows that many darker phenotypes are reclassified into lighter categories, indicating that the bias persists through the pipeline.

Figure 5: Maximal $\Delta E$ samples illustrating the dominant failure modes across extraction and lighting setups for each ITA class.

Figure 6: Confusion matrix between ground truth and rendered ITA classes, evidencing substantial misclassification towards lighter phototypes.
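
A confusion matrix of this kind can be built by binning source and rendered ITA values into six classes and counting the pairs. The sketch below uses the commonly cited ITA thresholds (55°, 41°, 28°, 10°, −30°); whether the paper uses exactly these boundaries is an assumption, and the sample angles are invented.

```python
def ita_class(ita):
    """Map an ITA angle in degrees to a class from 1 (lightest) to 6 (darkest)."""
    # Commonly used lower bounds for classes 1..5; assumed, not from the paper.
    thresholds = [55.0, 41.0, 28.0, 10.0, -30.0]
    for index, lower in enumerate(thresholds, start=1):
        if ita > lower:
            return index
    return 6


def confusion_matrix(true_itas, rendered_itas):
    """Rows are ground-truth classes, columns are rendered classes."""
    matrix = [[0] * 6 for _ in range(6)]
    for t, r in zip(true_itas, rendered_itas):
        matrix[ita_class(t) - 1][ita_class(r) - 1] += 1
    return matrix


# Invented example angles (degrees) for four faces.
true_itas = [60.0, 20.0, -35.0, -40.0]
rendered_itas = [58.0, 30.0, -10.0, -32.0]

cm = confusion_matrix(true_itas, rendered_itas)
for row in cm:
    print(row)
```

Counts to the left of the diagonal mark faces rendered in a lighter class than their source, which is the failure mode the figure describes.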

Statistical Analysis

Non-parametric statistical tests (Kruskal-Wallis with Dunn post hoc comparisons) support the main findings:

Extraction method and lighting condition both exert highly significant influences on colorimetric outcome ($p < 0.001$).
Interaction effects between extraction method, lighting, and phenotype amplify disparities for darker phototypes.
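
For readers unfamiliar with the test, the following is a minimal pure-Python sketch of the Kruskal-Wallis H statistic for three groups, such as errors under the three lighting setups. It omits the tie correction and the Dunn post hoc step, the closed-form p-value holds only for three groups (two degrees of freedom), and the data are invented. In practice a statistics library would be the safer choice.

```python
import math
from itertools import chain


def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H statistic without tie correction."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


# Invented error samples for three lighting conditions.
groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
h = kruskal_wallis_h(groups)
# Chi-square survival function with 2 degrees of freedom is exp(-h / 2).
p = math.exp(-h / 2)
print(round(h, 3), round(p, 4))  # 7.2 0.0273
```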

Implications and Future Directions

The results indicate that the evaluated photographic-to-VH pipeline exhibits significant, phenotype-dependent errors when reproducing diverse skin tones from uncalibrated photography, with a systematic bias against accurate depiction of darker phenotypes. TRUST-based, illumination-compensated extraction methods partially mitigate, but do not eliminate, this discrepancy.

Key practical implications include:

Necessity for global illumination compensation as a standard preprocessing stage in VH pipelines to reduce phenotype-dependent error.
Importance of holistic, full-face skin sampling to avoid regional artifacts or over-lightening.
Limitations of current real-time renderers in reproducing darker skin tones faithfully, especially under studio lighting.

On the theoretical side, the findings suggest that pipeline stages should not be assessed in isolation: errors introduced at extraction, lighting, and rendering interact and compound for darker phototypes rather than remaining independent. The methodology also provides a rigorous, scalable benchmark for future photographic-to-virtual evaluations.

Prospects for Further Research

The authors recommend the integration of subjective perceptual studies to correlate objective error metrics with human judgments of realism, comfort, and fairness. Additionally, automation via explainable ML models may enable real-time detection and correction of pipeline-induced skin tone distortions, contributing to more equitable avatar generation in diverse applications.

Conclusion

This work systematically quantifies the complex, phenotype-dependent propagation of skin tone errors and bias in end-to-end VH pipelines. By dissecting the contribution of extraction, illumination, and rendering submodules through large-scale, statistically rigorous analysis, it exposes the persistent technical and ethical challenges in current practice. The open, parameterized evaluation framework offers a robust platform for ongoing auditing, standardization, and improvement of fairness in computer graphics and AI-driven virtual human synthesis.
