Cross-Modal Perceptionist

Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices?
CVPR 2022

Cho-Ying Wu
USC, CGIT Lab
Chin-Cheng Hsu
USC
Ulrich Neumann
USC, CGIT Lab

Introduction

This work digs into a root question in human perception: can face geometry be gleaned from one's voice?

Previous works that study this question operate only on image representations, which inevitably include irrelevant factors such as hair, cosmetics, and background; these can change arbitrarily while the person's voice stays the same. Further, the correlation between skin color and ethnicity is controversial (see the feature visualization from Speech2Face). To study the correlation between voice and geometry clearly, we instead work in 3D, use a mesh representation, and focus on geometry only.

We propose two frameworks to validate the correlation: a supervised method and an unsupervised method. The former is used when paired voice and 3D face data exist. The latter is used when such datasets are unavailable or of low fidelity.

Supervised setting

The supervised framework is shown as follows. This setting serves the ideal case in which paired voice and 3D face data exist. The supervised framework directly learns the 3D face reconstruction pipeline from paired voices and 3D faces.

Voxceleb-3D

We propose a dataset, Voxceleb-3D. We fetch voice banks from Voxceleb and face images from VGGFace for celebrities who appear in both datasets.

To obtain 3D meshes, we first extract facial landmarks from the images. Then we optimize 3DMM parameters to fit the landmarks. A 3DMM (3D Morphable Model) is a parametric method that reconstructs meshes from a few controllable parameters. In our work, we use the popular BFM (Basel Face Model) faces and a 62-dim vector to control the face reconstruction. In this way, we obtain paired voice and 3D face data.
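As a rough illustration of the 3DMM idea described above, the following sketch builds a mesh from a 62-dim vector with a linear model and then recovers the coefficients from 2D landmarks by least squares. Several things here are assumptions and not taken from this page: the split of the 62 values into 12 pose, 40 shape, and 10 expression entries, the random toy bases standing in for BFM, the 68 landmark indices, the orthographic projection, and the function names `reconstruct` and `fit_to_landmarks`. The actual fitting procedure may differ, for example by also optimizing the pose.

```python
import numpy as np

# Assumed split of the 62-dim vector (not stated on this page):
# 12 pose values (a 3x4 affine matrix), 40 shape and 10 expression coefficients.
N_POSE, N_SHAPE, N_EXP = 12, 40, 10


def reconstruct(params, mean, shape_basis, exp_basis):
    """Return (V, 3) vertices from a 62-dim parameter vector."""
    pose = params[:N_POSE].reshape(3, 4)
    alpha_shape = params[N_POSE:N_POSE + N_SHAPE]
    alpha_exp = params[N_POSE + N_SHAPE:]
    # Linear 3DMM: mean face plus weighted shape and expression offsets.
    flat = mean + shape_basis @ alpha_shape + exp_basis @ alpha_exp
    verts = flat.reshape(-1, 3)
    # Apply the 3x3 linear part and the translation column of the pose.
    return verts @ pose[:, :3].T + pose[:, 3]


def fit_to_landmarks(landmarks_2d, landmark_idx, mean, shape_basis, exp_basis):
    """Least-squares fit of the 50 shape/expression coefficients to 2D landmarks.

    Simplification: identity pose and orthographic projection (keep x, y only).
    """
    # Rows of the flattened model that hold the x and y of each landmark vertex.
    rows = (landmark_idx[:, None] * 3 + np.arange(2)).ravel()
    basis = np.hstack([shape_basis, exp_basis])[rows]
    residual = landmarks_2d.ravel() - mean[rows]
    coeffs, *_ = np.linalg.lstsq(basis, residual, rcond=None)
    return coeffs


# Toy data: random bases stand in for a real face model such as BFM.
rng = np.random.default_rng(0)
n_verts = 200
mean = rng.normal(size=3 * n_verts)
shape_basis = rng.normal(size=(3 * n_verts, N_SHAPE))
exp_basis = rng.normal(size=(3 * n_verts, N_EXP))

identity_pose = np.hstack([np.eye(3), np.zeros((3, 1))]).ravel()
true_params = np.concatenate([identity_pose, rng.normal(size=N_SHAPE + N_EXP)])

mesh = reconstruct(true_params, mean, shape_basis, exp_basis)
landmark_idx = rng.choice(n_verts, size=68, replace=False)
landmarks_2d = mesh[landmark_idx, :2]
recovered = fit_to_landmarks(landmarks_2d, landmark_idx, mean, shape_basis, exp_basis)
```

In this toy setup the 68 landmarks give 136 equations for 50 unknowns, so the coefficients are recovered up to numerical precision; with real images the landmarks are noisy and the fit is only approximate.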

We show the fitting in the following figure. Top: images from VGGFace; bottom: meshes overlaid on the images.

Unsupervised setting

The unsupervised framework is shown as follows. This setting serves a more realistic purpose, since it is very hard to obtain large-scale paired voice and 3D face data. The unsupervised framework uses knowledge distillation (KD) to transfer knowledge from an image-to-3D-face expert and thereby facilitate unsupervised end-to-end training. Images here are a bridge representation that connects voice and 3D mesh.
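The distillation idea can be sketched with a toy example, assuming the expert's output is used as a pseudo-target for a student that maps voice directly to face parameters. Everything below is hypothetical and only for illustration: the real components are neural networks, whereas here fixed random linear maps (`voice_to_image`, `image_to_params`) stand in for the voice-to-image step and the image-to-3D-face expert, and the student is a single matrix trained by plain gradient descent on a mean-squared-error loss.

```python
import numpy as np

rng = np.random.default_rng(0)
PARAM_DIM = 62  # size of the face parameter vector

# Frozen stand-ins (hypothetical): voice embedding -> image features -> parameters.
voice_to_image = rng.normal(size=(16, 24))
image_to_params = rng.normal(size=(24, PARAM_DIM))


def teacher_targets(voice_emb):
    """Pseudo-targets: go through the image 'bridge', then apply the expert."""
    image_feat = voice_emb @ voice_to_image
    return image_feat @ image_to_params


def distillation_loss(student_out, teacher_out):
    """Mean squared error between student and teacher parameters."""
    return float(np.mean((student_out - teacher_out) ** 2))


voice_emb = rng.normal(size=(32, 16))   # a batch of toy voice embeddings
targets = teacher_targets(voice_emb)    # no ground-truth 3D faces are used

student_w = np.zeros((16, PARAM_DIM))   # student maps voice straight to parameters
learning_rate = 5.0
losses = []
for _ in range(100):
    pred = voice_emb @ student_w
    losses.append(distillation_loss(pred, targets))
    # Gradient of the mean squared error with respect to student_w.
    grad = 2.0 * voice_emb.T @ (pred - targets) / pred.size
    student_w -= learning_rate * grad
```

The point of the sketch is only that the student never sees paired voice and 3D face data: its supervision comes entirely from the expert applied to the intermediate image representation.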

Results

Q1: Is it feasible to predict visually reasonable face meshes from voice?

Results from supervised learning

Results from unsupervised learning

Q2: How stable is the mesh prediction from different utterances of the same person?

Results from supervised learning

Results from unsupervised learning

Q3: Compared with face meshes produced by baselines, can the joint training flow improve performance? How much?

We use the following baseline: a direct cascade of pretrained voice-to-image and image-to-mesh models.

Results from supervised learning

Results from unsupervised learning

Quantitative comparison

Q4: What is the major improvement that voice information can bring in the joint training flow?

From the quantitative comparison, we find that the major improvement comes from the ear-to-ear ratio (ER), which corresponds to the overall width of the face. Therefore, whether a face is relatively wide or thin is the property that voice can indicate. This matches our experience: when someone starts to talk, even before we see the face, we can roughly imagine whether it is wider or thinner, but we cannot imagine fine-grained details such as wrinkles or bumps.
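To make the ER measure concrete, here is a minimal sketch of how such a ratio could be computed from mesh landmarks. The exact definition is not given on this page; the normalization by the distance between the outer eye corners, the landmark coordinates, and the function name `ear_to_ear_ratio` are all assumptions chosen for illustration.

```python
import math


def ear_to_ear_ratio(left_ear, right_ear, left_eye_outer, right_eye_outer):
    """Ear-to-ear distance divided by the outer eye-corner distance.

    The normalizer is an assumption; it makes the ratio independent of mesh scale.
    """
    return math.dist(left_ear, right_ear) / math.dist(left_eye_outer, right_eye_outer)


# Hypothetical landmark coordinates (arbitrary units) for a reference and a predicted mesh.
eyes = ((-4.5, 3.0, 8.0), (4.5, 3.0, 8.0))
er_reference = ear_to_ear_ratio((-7.0, 0.0, 0.0), (7.0, 0.0, 0.0), *eyes)
er_predicted = ear_to_ear_ratio((-8.0, 0.0, 0.0), (8.0, 0.0, 0.0), *eyes)

# A simple error between the two ratios: the predicted face is wider than the reference.
er_error = abs(er_predicted - er_reference)
```

A larger ER means a wider face relative to the eye spacing, so an error on this ratio directly measures how well the predicted mesh matches the relative width of the reference face.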

Subjective comparison

Impact and Ethics

There are arguably implicit factors that affect the voice; for example, voices after smoking or drinking might be different. The Voxceleb data contains speech from interviews, where interviewees usually speak in their normal voices. More implicit and subtle factors, such as drug use or health conditions, might also affect voices, but studying them requires clinical studies and validation from a physiological perspective.

The results shown in this work only aim to point out that a correlation between voice and face (skull) structure exists. They make no assumptions about race or ethnic origin, and this work does not indicate a relation between race and voice or between race and face structure. As mentioned in the Introduction, correlations involving race/ethnicity cannot be easily resolved. Besides, the reconstructed meshes do not contain skin color, facial textures, or hairstyles that can explicitly correspond to one’s true identity, and thus anonymity can be preserved.

The website template was borrowed from Michaël Gharbi