
Paper Review: Seeing Voices and Hearing Faces: Cross-modal Biometric Matching


code: https://github.com/a-nagrani/SVHF-Net

project URL: http://www.robots.ox.ac.uk/~vgg/research/CMBiometrics

Summary

The author wants to investigate how we associate a voice with a face. The work in this paper builds on the VGGFace and VoxCeleb datasets. Its main contributions can be summarized as follows:

  1. Introduces CNN architectures for binary and multi-way cross-modal matching between faces and audio.
  2. Compares dynamic testing (video information is available, but the audio does not come from the same video) with static testing (only a still image is available).
  3. Shows that the CNN matches human performance on easy examples (the two faces differ in gender) but exceeds human performance on more challenging examples (faces with the same gender, age, and nationality).

Abstract

We introduce a seemingly impossible task: given only an audio clip of someone speaking, decide which of two face images is the speaker. In this paper we study this, and a number of related cross-modal tasks, aimed at answering the question: how much can we infer about a face from a voice, and vice versa? We study this task "in the wild", using publicly available datasets for face recognition from static images (VGGFace) and speaker identification from audio (VoxCeleb). These provide training and testing scenarios for both static and dynamic cross-modal matching. We make the following contributions: (i) we introduce CNN architectures for both binary and multi-way cross-modal face and audio matching, (ii) we compare dynamic testing (where video information is available, but the audio does not come from the same video) with static testing (where only a single still image is available), and (iii) we use human testing as a baseline to calibrate the difficulty of the task. We show that a CNN can indeed be trained to solve this task in both static and dynamic scenarios, and even performs well above chance on 10-way classification of the face given the voice. The CNN matches human performance on easy examples (e.g. the two faces differ in gender) but exceeds human performance on more challenging examples (e.g. faces with the same gender, age, and nationality).


Research Objective

The author aims to explore whether we can identify a person from audio alone: given only an audio clip of a voice, determine which of two or more face images or videos it corresponds to. Note that the voice and the face video are not acquired simultaneously, so active-speaker-detection methods that rely on synchronisation of the audio and lip motion, e.g. [11], cannot be employed here.

Background and Problems

  • Background

    • Age, gender, and ethnicity/accent influence both facial appearance and voice.
    • Besides the above static properties, Sheffert and Olson [40] suggested that visual information about a person’s particular idiosyncratic speaking style is related to the speaker’s auditory attributes.
  • Previous methods (brief introduction)

    • Not stated; this is perhaps because the work is among the first on this task.
  • Problem Statement

    • Not explicitly stated in the introduction.

Related work

  • Human Perception Studies:

    • The broad consensus of research exploring cross-modal matching of faces and voices with human participants is that matching is only possible when dynamic visual information about articulation patterns is available [19, 26, 37].
  • Problem Statement

    • It is worth noting that the difficulty of the task is highly dependent on the specific stimuli sets provided.
  • Face Recognition and Speaker Identification:

    • we note that the recent advent of deep CNNs with large datasets has considerably advanced the state-of-the-art in both face recognition [21, 36, 46, 47] and speaker recognition [14, 33, 39, 45].
  • Problem Statement

  • Unfortunately, while these recognition models have proven remarkably effective at representation learning from a single modality, the alignment of learned representations across the modalities is less developed.
  • Cross-modal Matching

    • Cross-modal matching has received considerable attention using visual data and text (natural language). Methods have been developed to establish mappings from images [16, 20, 23, 25, 50] and videos [49] to textual descriptions (e.g. captioning), generating visual models from text [51, 57] and solving visual question answering problems [1, 29, 31].
  • Problem Statement

    • In cross-modal matching between video and audio however, work is limited, particularly in the field of biometrics (person or speaker recognition).

Summary: only one prior study has done relevant work [38], but it did not use a large dataset and did not use still face images.

Method(s)

  • Methods
    • (1) The static 3-stream CNN architecture, consisting of two face sub-networks and one voice sub-network (a minimal sketch follows this list).
    • (2) A 5-stream dynamic-fusion architecture with two extra streams serving as dynamic-feature sub-networks.
    • (3) The N-way classification architecture, which can handle any number of face inputs at test time thanks to the concept of query pooling.
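Below is a minimal PyTorch sketch of the 3-stream idea. It is not the authors' exact SVHF-Net: the stand-in backbone, the embedding size, the shared weights between the two face streams, and the classifier head are all assumptions for illustration (the real streams are VGG-M, as noted under Architectures).

```python
import torch
import torch.nn as nn

def make_backbone(in_ch, out_dim):
    # Stand-in for a VGG-M-style stream [10]; any conv net that ends in a
    # fixed-size feature vector works for this sketch.
    return nn.Sequential(
        nn.Conv2d(in_ch, 96, kernel_size=7, stride=2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(96, out_dim), nn.ReLU())

class StaticThreeStream(nn.Module):
    """Two face streams (shown with shared weights, an assumption) plus one
    voice stream; features are concatenated and classified as 'face 1 matches
    the voice' vs 'face 2 matches the voice'."""
    def __init__(self, embed_dim=1024):
        super().__init__()
        self.face_net = make_backbone(in_ch=3, out_dim=embed_dim)   # 224x224 RGB face
        self.voice_net = make_backbone(in_ch=1, out_dim=embed_dim)  # 512x300 spectrogram
        self.classifier = nn.Sequential(
            nn.Linear(3 * embed_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 2))  # which of the two faces matches the voice

    def forward(self, face1, face2, voice):
        f1, f2 = self.face_net(face1), self.face_net(face2)
        v = self.voice_net(voice)
        return self.classifier(torch.cat([f1, f2, v], dim=1))
```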


  • Input:
    • Voices: a 512 × 300 spectrogram for three seconds of speech (one possible extraction pipeline is sketched after this list).

    • Static faces: an RGB image, cropped from the source image to contain only the region surrounding a face; the size is 224 × 224.

    • Dynamic faces: candidate dynamic representations include 3D convolutions [18], optical flow [41], and dynamic images [6], which have proven particularly effective in human action recognition.
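A sketch of one way to produce the 512 × 300 voice input. The review only records the final size, so the STFT parameters here (16 kHz audio, 25 ms Hamming window, 10 ms hop, and an FFT size chosen to yield 512 frequency bins) and the per-frequency normalisation are assumptions.

```python
import numpy as np
import librosa

def voice_spectrogram(wav_path, sr=16000, duration=3.0):
    # Load exactly three seconds of mono audio, padding or trimming as needed.
    y, _ = librosa.load(wav_path, sr=sr, mono=True, duration=duration)
    y = librosa.util.fix_length(y, size=int(sr * duration))
    # Magnitude STFT: n_fft=1022 gives 1022 / 2 + 1 = 512 frequency bins;
    # a 10 ms hop over 3 s gives ~301 frames, cropped to 300 below.
    S = np.abs(librosa.stft(y, n_fft=1022,
                            hop_length=int(0.010 * sr),
                            win_length=int(0.025 * sr),
                            window="hamming"))
    # Per-frequency mean/variance normalisation (a common choice).
    S = (S - S.mean(axis=1, keepdims=True)) / (S.std(axis=1, keepdims=True) + 1e-8)
    return S[:, :300]  # shape (512, 300)
```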

  • Architectures
    • Static architecture: the base architecture comprises two face sub-networks and one voice sub-network. Both the face and voice streams use the VGG-M architecture [10].
    • Dynamic-fusion architecture: the features computed for each face (RGB + dynamic) are combined after the final fully connected layer in each stream through summation.
    • N-way classification architecture: one approach is to concatenate the voice features to each face stream separately. In addition, the author adds a mean-pooling layer to each face stream that computes the 'mean face' of all the faces in a particular query, thereby making each stream context-aware (a minimal sketch follows).
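A minimal sketch of the query-pooling ('mean face') idea, assuming each candidate face has already been embedded into a feature vector; the concatenation scheme is illustrative, not the authors' exact wiring.

```python
import torch
import torch.nn as nn

class QueryPooling(nn.Module):
    """Make each face stream context-aware by concatenating its own embedding
    with the mean embedding ('mean face') over all faces in the query."""
    def forward(self, face_feats):                      # (n_faces, dim)
        mean_face = face_feats.mean(dim=0, keepdim=True)
        context = mean_face.expand_as(face_feats)       # broadcast to each stream
        return torch.cat([face_feats, context], dim=1)  # (n_faces, 2 * dim)

# Usage: the same module handles any number of candidate faces per query.
pool = QueryPooling()
faces = torch.randn(10, 1024)   # a 10-way query of face embeddings
context_aware = pool(faces)     # (10, 2048)
```

Because the pooled 'mean face' has the same dimensionality regardless of how many faces are in the query, the downstream layers need not change with N.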

Evaluation and Experiment

  • Dataset distribution:
  • VGGFace
  • VoxCeleb
  • Train/Test Split:
    All speakers whose names start with ‘A’ or ‘B’ are reserved for validation, while speakers with names starting with ‘C’, ‘D’, ‘E’ are reserved for testing.
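Since the split is keyed only on the first letter of the speaker's name, it is identity-disjoint by construction. A trivial helper (my own; sending all remaining letters to training is an assumption):

```python
def split_for(speaker_name: str) -> str:
    # Identity-disjoint split by the first letter of the speaker's name.
    initial = speaker_name.strip()[0].upper()
    if initial in "AB":
        return "val"
    if initial in "CDE":
        return "test"
    return "train"  # assumption: all other speakers are used for training
```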

  • Gender, Nationality and Age (GNA) Variation:

    • These labels are used to construct a more challenging test set, wherein each triplet contains speakers of the same gender, broad age bracket, and nationality.
  • Training Protocol

    • Batch size and optimizer settings follow standard practice (details in the paper).
    • Pre-trained weights are taken from the VGGFace and VoxCeleb models.
    • Image augmentation follows the techniques used for the ImageNet classification task by [42] (i.e. random cropping, flipping, colour shift). For the audio segments, the speed of each segment is changed by a random ratio between 0.95 and 1.05 (sketched after this list).
    • Networks are trained for 10 epochs, or until validation error stops decreasing, whichever is sooner.
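A sketch of the audio speed change. Implementing 'speed' as plain resampling (which alters duration and pitch together) is an assumption; the review does not say how the ratio is applied.

```python
import random
import numpy as np
from scipy.signal import resample

def speed_perturb(y: np.ndarray, low: float = 0.95, high: float = 1.05) -> np.ndarray:
    # Pick a random speed ratio: rate > 1 shortens (speeds up) the clip,
    # rate < 1 lengthens (slows down) it.
    rate = random.uniform(low, high)
    return resample(y, int(len(y) / rate))
```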
  • Method

    • Static matching: training makes use of both still images from VGGFace and frames extracted from the videos in the VoxCeleb dataset. When processing frames extracted from VoxCeleb videos, the author ensures that the audio segments and frames in a single triplet are not sourced from the same video.
    • Dynamic matching: the author experiments with different methods for extracting dynamic information from a face track.
    • N-way classification: skipped in this review.
  • Metrics: the author defines two metrics to evaluate performance, identification accuracy and marginal accuracy (a tentative reconstruction of both follows).

    • Identification accuracy (formula figure omitted)
    • Marginal accuracy (formula figure omitted)
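The formula images did not survive extraction, so here is a tentative reconstruction in my own notation, not the paper's: identification accuracy as the fraction of test queries answered correctly, and marginal accuracy as accuracy conditioned on a particular identity pair, which is what lets the analysis below compare how discriminative individual face-voice combinations are.

```latex
% My notation, reconstructed from the metric names only.
\[
\mathrm{Acc}_{\mathrm{id}} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i = y_i\right],
\qquad
\mathrm{Acc}_{\mathrm{marg}}(A,B) =
  \frac{\left|\{\, i : \hat{y}_i = y_i,\ (a_i, b_i) = (A,B) \,\}\right|}
       {\left|\{\, i : (a_i, b_i) = (A,B) \,\}\right|}
\]
% \hat{y}_i: the model's choice on test item i; y_i: the ground truth;
% (a_i, b_i): the pair of identities appearing in item i.
```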
  • Baselines: there are no prior baselines to compare against, so the author uses human performance as the reference.

  • Results:

    • Static and dynamic matching: results for the static and dynamic cases are given in Table 2. The results for the dynamic task are better than those for the static task (by more than 3% for the V-F case).
      • N-way classification: reported as a figure in the paper (omitted here).
  • Analysis:

    • Comparison to the human benchmark: on the more challenging test set with GNA variation removed (faces share gender, age, and nationality), human performance is significantly lower.

    • Marginal accuracies: some face-voice combinations are significantly more discriminative than others.

  • Ablation analysis: the author experiments with three different methods of incorporating dynamic features into the architecture.

    • As seen in Figure 5, it is harder to discern latent variables like age, gender, and ethnicity in dynamic images, while mouth motion is clearly encoded. Using these dynamic images alone, the network still achieves an accuracy of 77%, suggesting that it may be able to exploit dynamic cross-modal biometrics (a sketch of how a dynamic image is computed follows).
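For reference, a sketch of how a face track can be collapsed into a single 'dynamic image' via approximate rank pooling, using the fixed frame weights from Bilen et al. [6]; treating frames as a plain float array and the final 8-bit rescaling are illustrative choices.

```python
import numpy as np

def dynamic_image(frames):
    """Collapse a list of HxWx3 frames into one 'dynamic image' using the
    approximate rank-pooling weights alpha_t from Bilen et al. [6]."""
    T = len(frames)
    # Harmonic numbers H_0 .. H_T, with H_0 = 0.
    H = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, T + 1))))
    # alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}), applied to raw frames.
    alphas = np.array([2 * (T - t + 1) - (T + 1) * (H[T] - H[t - 1])
                       for t in range(1, T + 1)])
    di = np.tensordot(alphas, np.stack(frames).astype(np.float32), axes=1)
    # Rescale to a displayable 8-bit image.
    di = (di - di.min()) / (di.max() - di.min() + 1e-8)
    return (255 * di).astype(np.uint8)
```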

Conclusion

  • Main contributions
  1. The paper introduces the novel task of cross-modal matching between faces and voices, and proposes a corresponding CNN architecture to address it.
  2. The experimental results strongly suggest the existence of cross-modal biometric information, leading to the conclusion that perhaps our faces are more similar to our voices than we think.
  • Weak points
  1. Not discussed in the paper.
  • Future work
  1. Not discussed in the paper.

Reference (optional)

A blog post explaining this paper in Chinese.

Inspiration for me

  • This paper is grounded in careful, fundamental experiments; I need to study it so that our own experiments are more soundly designed.
  • This area has little previous work, and the dataset is not large (maybe 30 GB), so I could reproduce it if the authors' published code runs.
  • Newer research has since compared against this work and achieved a better score on the cross-modal matching task; an introduction to that paper is available in Chinese.