
Paper Review: Seeing Voices and Hearing Faces: Cross-modal Biometric Matching


code: https://github.com/a-nagrani/SVHF-Net

project URL: http://www.robots.ox.ac.uk/~vgg/research/CMBiometrics

Summary

The author wants to investigate how we associate a voice with a face. The work in this paper builds on the VGGFace and VoxCeleb datasets. Its main contributions can be summarized as follows:

  1. Introduces CNN architectures for binary and multi-way cross-modal matching between faces and audio.
  2. Compares dynamic testing (video information is available, but the audio does not come from the same video) with static testing (only a still image is available).
  3. Shows that the CNN matches human performance on easy examples (the two faces differ in gender) but exceeds human performance on more challenging examples (faces with the same gender, age, and nationality).

Abstract

We introduce a seemingly impossible task: given only an audio clip of someone speaking, decide which of two face images is the speaker. In this paper we study this, and a number of related cross-modal tasks, aimed at answering the question: how much can we infer about a face from a voice, and vice versa? We study this task "in the wild", using publicly available datasets for face recognition from static images (VGGFace) and speaker identification from audio (VoxCeleb). These provide training and testing scenarios for both static and dynamic cross-modal matching. We make the following contributions: (i) we introduce CNN architectures for both binary and multi-way cross-modal face and audio matching, (ii) we compare dynamic testing (where video information is available, but the audio does not come from the same video) with static testing (where only a single still image is available), and (iii) we use human testing as a baseline to calibrate the difficulty of the task. We show that a CNN can indeed be trained to solve this task in both static and dynamic scenarios, and even performs well above chance on 10-way classification of the face given the voice. The CNN matches human performance on easy examples (e.g. the two faces differ in gender) but exceeds human performance on more challenging examples (e.g. faces with the same gender, age, and nationality).


Research Objective

The author aims to explore whether we can identify a person from audio alone: given only an audio clip of a voice, determine which of two or more face images or videos it corresponds to. Note that the voice and the face video are not acquired simultaneously, so active-speaker-detection methods that rely on synchronisation of the audio and lip motion, e.g. [11], cannot be employed here.

Background and Problems

  • Background

    • Age, gender, and ethnicity/accent influence both facial appearance and voice.
    • Besides the above static properties, Sheffert and Olson [40] suggested that visual information about a person’s particular idiosyncratic speaking style is related to the speaker’s auditory attributes.
  • Previous methods (brief introduction)

    • Not stated; this is perhaps because the work is among the first on this task.
  • Problem Statement

    • Not explicitly stated in the introduction.

Related work

  • Human Perception Studies:

    • The broad consensus of research exploring cross-modal matching of faces and voices with human participants is that matching is only possible when dynamic visual information about articulation patterns is available [19, 26, 37].
  • Problem Statement

    • It is worth noting that the difficulty of the task is highly dependent on the specific stimuli sets provided.
  • Face Recognition and Speaker Identification:

    • we note that the recent advent of deep CNNs with large datasets has considerably advanced the state-of-the-art in both face recognition [21, 36, 46, 47] and speaker recognition [14, 33, 39, 45].
  • Problem Statement

  • Unfortunately, while these recognition models have proven remarkably effective at representation learning from a single modality, the alignment of learned representations across the modalities is less developed.
  • Cross-modal Matching

    • Cross-modal matching has received considerable attention using visual data and text (natural language). Methods have been developed to establish mappings from images [16, 20, 23, 25, 50] and videos [49] to textual descriptions (e.g. captioning), generating visual models from text [51, 57] and solving visual question answering problems [1, 29, 31].
  • Problem Statement

    • In cross-modal matching between video and audio however, work is limited, particularly in the field of biometrics (person or speaker recognition).

Summary: only one prior study has done relevant work [38], but it did not use a large dataset and did not use still face images.

Method(s)

  • Methods
    • (1) The static 3-stream CNN architecture, consisting of two face sub-networks and one voice sub-network (a minimal sketch follows this list).
    • (2) A 5-stream dynamic-fusion architecture with two extra streams serving as dynamic-feature sub-networks.
    • (3) The N-way classification architecture, which can handle any number of face inputs at test time thanks to the concept of query pooling.
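Below is a minimal PyTorch sketch of the 3-stream idea. It is not the authors' exact SVHF-Net: the stand-in backbone, the embedding size, the shared weights between the two face streams, and the classifier head are all assumptions for illustration (the real streams are VGG-M, as noted under Architectures).

```python
import torch
import torch.nn as nn

def make_backbone(in_ch, out_dim):
    # Stand-in for a VGG-M-style stream [10]; any conv net that ends in a
    # fixed-size feature vector works for this sketch.
    return nn.Sequential(
        nn.Conv2d(in_ch, 96, kernel_size=7, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(96, out_dim), nn.ReLU())

class StaticThreeStream(nn.Module):
    """Two face streams (shown with shared weights, an assumption) plus one
    voice stream; features are concatenated and classified as 'face 1 matches
    the voice' vs 'face 2 matches the voice'."""
    def __init__(self, embed_dim=1024):
        super().__init__()
        self.face_net = make_backbone(in_ch=3, out_dim=embed_dim)   # 224x224 RGB face
        self.voice_net = make_backbone(in_ch=1, out_dim=embed_dim)  # 512x300 spectrogram
        self.classifier = nn.Sequential(
            nn.Linear(3 * embed_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2))  # which of the two faces matches the voice

    def forward(self, face1, face2, voice):
        f1, f2 = self.face_net(face1), self.face_net(face2)
        v = self.voice_net(voice)
        return self.classifier(torch.cat([f1, f2, v], dim=1))
```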


  • Input:
    • Voices: a 512 × 300 spectrogram for three seconds of speech (one possible extraction pipeline is sketched after this list).

    • Static faces: an RGB image, cropped from the source image to contain only the region surrounding a face; the size is 224 × 224.

    • Dynamic faces: candidate dynamic representations include 3D convolutions [18], optical flow [41], and dynamic images [6], which have proven particularly effective in human action recognition.
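A sketch of one way to produce the 512 × 300 voice input. The review only records the final size, so the STFT parameters here (16 kHz audio, 25 ms Hamming window, 10 ms hop, and an FFT size chosen to yield 512 frequency bins) and the per-frequency normalisation are assumptions.

```python
import numpy as np
import librosa

def voice_spectrogram(wav_path, sr=16000, duration=3.0):
    # Load exactly three seconds of mono audio, padding or trimming as needed.
    y, _ = librosa.load(wav_path, sr=sr, mono=True, duration=duration)
    y = librosa.util.fix_length(y, size=int(sr * duration))
    # Magnitude STFT: n_fft=1022 gives 1022 / 2 + 1 = 512 frequency bins;
    # a 10 ms hop over 3 s gives ~301 frames, cropped to 300 below.
    S = np.abs(librosa.stft(y, n_fft=1022,
                            hop_length=int(0.010 * sr),
                            win_length=int(0.025 * sr),
                            window="hamming"))
    # Per-frequency mean/variance normalisation (a common choice).
    S = (S - S.mean(axis=1, keepdims=True)) / (S.std(axis=1, keepdims=True) + 1e-8)
    return S[:, :300]  # shape (512, 300)
```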

  • Architectures
    • Static architecture: the base architecture comprises two face sub-networks and one voice sub-network. Both the face and voice streams use the VGG-M architecture [10].
    • Dynamic-fusion architecture: the features computed for each face (RGB + dynamic) are combined after the final fully connected layer in each stream through summation.
    • N-way classification architecture: one approach is to concatenate the voice features to each face stream separately. In addition, the author adds a mean-pooling layer to each face stream that computes the 'mean face' of all the faces in a particular query, thereby making each stream context-aware (a minimal sketch follows).
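A minimal sketch of the query-pooling ('mean face') idea, assuming each candidate face has already been embedded into a feature vector; the concatenation scheme is illustrative, not the authors' exact wiring.

```python
import torch
import torch.nn as nn

class QueryPooling(nn.Module):
    """Make each face stream context-aware by concatenating its own embedding
    with the mean embedding ('mean face') over all faces in the query."""
    def forward(self, face_feats):                      # (n_faces, dim)
        mean_face = face_feats.mean(dim=0, keepdim=True)
        context = mean_face.expand_as(face_feats)       # broadcast to each stream
        return torch.cat([face_feats, context], dim=1)  # (n_faces, 2 * dim)

# Usage: the same module handles any number of candidate faces per query.
pool = QueryPooling()
faces = torch.randn(10, 1024)   # a 10-way query of face embeddings
context_aware = pool(faces)     # (10, 2048)
```

Because the pooled 'mean face' has the same dimensionality regardless of how many faces are in the query, the downstream layers need not change with N.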

Evaluation and Experiment

  • Dataset distribution:
  • VGGFace
  • VoxCeleb
  • Train/Test Split:
    All speakers whose names start with ‘A’ or ‘B’ are reserved for validation, while speakers with names starting with ‘C’, ‘D’, ‘E’ are reserved for testing.
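Since the split is keyed only on the first letter of the speaker's name, it is identity-disjoint by construction. A trivial helper (my own; sending all remaining letters to training is an assumption):

```python
def split_for(speaker_name: str) -> str:
    # Identity-disjoint split by the first letter of the speaker's name.
    initial = speaker_name.strip()[0].upper()
    if initial in "AB":
        return "val"
    if initial in "CDE":
        return "test"
    return "train"  # assumption: all other speakers are used for training
```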

  • Gender, Nationality and Age (GNA) Variation:

    • These labels are used to construct a more challenging test set, wherein each triplet contains speakers of the same gender, broad age bracket, and nationality.
  • Training Protocol

    • Batch size and optimizer settings follow standard practice (details in the paper).
    • Pre-trained weights are taken from the VGGFace and VoxCeleb models.
    • Image augmentation follows the techniques used for the ImageNet classification task by [42] (i.e. random cropping, flipping, colour shift). For the audio segments, the speed of each segment is changed by a random ratio between 0.95 and 1.05 (sketched after this list).
    • Networks are trained for 10 epochs, or until validation error stops decreasing, whichever is sooner.
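A sketch of the audio speed change. Implementing 'speed' as plain resampling (which alters duration and pitch together) is an assumption; the review does not say how the ratio is applied.

```python
import random
import numpy as np
from scipy.signal import resample

def speed_perturb(y: np.ndarray, low: float = 0.95, high: float = 1.05) -> np.ndarray:
    # Pick a random speed ratio: rate > 1 shortens (speeds up) the clip,
    # rate < 1 lengthens (slows down) it.
    rate = random.uniform(low, high)
    return resample(y, int(len(y) / rate))
```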
  • Method

    • Static matching: training makes use of both still images from VGGFace and frames extracted from the videos in the VoxCeleb dataset. When processing frames extracted from VoxCeleb videos, the author ensures that the audio segments and frames in a single triplet are not sourced from the same video.
    • Dynamic matching: the author experiments with different methods for extracting dynamic information from a face track.
    • N-way classification: skipped in this review.
  • Metrics: the author defines two metrics to evaluate performance, identification accuracy and marginal accuracy (a tentative reconstruction of both follows).

    • Identification accuracy (formula figure omitted)
    • Marginal accuracy (formula figure omitted)
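The formula images did not survive extraction, so here is a tentative reconstruction in my own notation, not the paper's: identification accuracy as the fraction of test queries answered correctly, and marginal accuracy as accuracy conditioned on a particular identity pair, which is what lets the analysis below compare how discriminative individual face-voice combinations are.

```latex
% My notation, reconstructed from the metric names only.
\[
\mathrm{Acc}_{\mathrm{id}} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i = y_i\right],
\qquad
\mathrm{Acc}_{\mathrm{marg}}(A,B) =
  \frac{\left|\{\, i : \hat{y}_i = y_i,\ (a_i, b_i) = (A,B) \,\}\right|}
       {\left|\{\, i : (a_i, b_i) = (A,B) \,\}\right|}
\]
% \hat{y}_i: the model's choice on test item i; y_i: the ground truth;
% (a_i, b_i): the pair of identities appearing in item i.
```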
  • Baselines: there are no prior baselines to compare against, so the author uses human performance as the reference.

  • Results:

    • Static and dynamic matching: results for the static and dynamic cases are given in Table 2. The results for the dynamic task are better than those for the static task (by more than 3% for the V-F case).
      • N-way classification: reported as a figure in the paper (omitted here).
  • Analysis:

    • Comparison to the human benchmark: on the more challenging test set with GNA variation removed (faces share gender, age, and nationality), human performance is significantly lower.

    • Marginal accuracies: some face-voice combinations are significantly more discriminative than others.

  • Ablation analysis: the author experiments with three different methods of incorporating dynamic features into the architecture.

    • As seen in Figure 5, it is harder to discern latent variables like age, gender, and ethnicity in dynamic images, while mouth motion is clearly encoded. Using these dynamic images alone, the network still achieves an accuracy of 77%, suggesting that it may be able to exploit dynamic cross-modal biometrics (a sketch of how a dynamic image is computed follows).
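For reference, a sketch of how a face track can be collapsed into a single 'dynamic image' via approximate rank pooling, using the fixed frame weights from Bilen et al. [6]; treating frames as a plain float array and the final 8-bit rescaling are illustrative choices.

```python
import numpy as np

def dynamic_image(frames):
    """Collapse a list of HxWx3 frames into one 'dynamic image' using the
    approximate rank-pooling weights alpha_t from Bilen et al. [6]."""
    T = len(frames)
    # Harmonic numbers H_0 .. H_T, with H_0 = 0.
    H = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, T + 1))))
    # alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}), applied to raw frames.
    alphas = np.array([2 * (T - t + 1) - (T + 1) * (H[T] - H[t - 1])
                       for t in range(1, T + 1)])
    di = np.tensordot(alphas, np.stack(frames).astype(np.float32), axes=1)
    # Rescale to a displayable 8-bit image.
    di = (di - di.min()) / (di.max() - di.min() + 1e-8)
    return (255 * di).astype(np.uint8)
```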

Conclusion

  • Main contributions
  1. The paper introduces the novel task of cross-modal matching between faces and voices, and proposes a corresponding CNN architecture to address it.
  2. The experimental results strongly suggest the existence of cross-modal biometric information, leading to the conclusion that perhaps our faces are more similar to our voices than we think.
  • Weak points
  1. Not discussed in the paper.
  • Future work
  1. Not discussed in the paper.

Reference (optional)

A blog post explaining this paper in Chinese.

Inspiration for me

  • This paper is grounded in careful, fundamental experiments; I need to study it so that our own experiments are more soundly designed.
  • This area has little previous work, and the dataset is not large (maybe 30 GB), so I could reproduce it if the authors' published code runs.
  • Newer research has since compared against this work and achieved a better score on the cross-modal matching task; an introduction to that paper is available in Chinese.