【Music】Video Scoring | Multimodal Retrieval: Notes on Content-Based Video–Music Retrieval (CBVMR) Using Soft Intra-Modal Structure Constraint

2018 ICMR
Content-Based Video–Music Retrieval Using Soft Intra-Modal Structure Constraint

Introduction

bidirectional retrieval

Challenges

  • Design a cross-modal model that makes no demands on metadata
  • Matched video–music pairs are hard to obtain, and the matching criterion between video and music is more ambiguous than in other cross-modal tasks (e.g., image-to-text retrieval)

Contributions

  • Content-based, cross-modal embedding network
    • introduce VM-NET, a two-branch neural network that infers the latent alignment between videos and music tracks using only their contents
    • train the network via inter-modal ranking loss
      such that videos and music with similar semantics end up close together in the embedding space

However, if only the inter-modal ranking constraint for embedding is considered, modality-specific characteristics (e.g., rhythm or tempo for music and texture or color for image) may be lost.

  • devise a novel soft intra-modal structure constraint
    that takes advantage of the relative distance relationships of samples within each modality
    and does not require ground-truth pair information within an individual modality

Large-scale video–music pair dataset

  • Hong–Im Music–Video 200K (HIMV-200K)
    composed of 200,500 video–music pairs.

Evaluation

Related work

A. Video–Music Related Tasks

conventional approaches can be divided into three categories according to the task:

  • generation
  • classification
  • matching

Most existing methods use metadata (e.g., keywords, mood tags, and associated descriptions).

B. Two-Branch Neural Networks

Model relationships between different modalities,
e.g., associating images with text.

Music–video emotion tags:
Tunesensor: A semantic-driven music recommendation service for digital photo albums (ISWC 2011)

Method

A. Music Feature Extraction

  1. decompose an audio signal into harmonic and percussive components

  2. apply log-amplitude scaling to each component
    to avoid numerical underflow

  3. slice the components into shorter segments called local frames (or windowed excerpts) and extract multiple features from each component of each frame.
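A minimal librosa sketch of these three steps (the paper does not publish its extraction code; the file path and parameters here are placeholder assumptions):

```python
import librosa
import numpy as np

# Load a music track (path and sample rate are placeholders).
y, sr = librosa.load("track.mp3", sr=22050)

# 1. Harmonic / percussive source separation (HPSS).
y_harmonic, y_percussive = librosa.effects.hpss(y)

# 2. Log-amplitude scaling of each component's spectrogram
#    to avoid numerical underflow.
S_h = np.abs(librosa.stft(y_harmonic))
S_p = np.abs(librosa.stft(y_percussive))
log_S_h = librosa.amplitude_to_db(S_h, ref=np.max)
log_S_p = librosa.amplitude_to_db(S_p, ref=np.max)

# 3. The STFT already slices the signal into windowed excerpts
#    ("local frames"); each column is one frame.
print(log_S_h.shape)  # (n_freq_bins, n_frames)
```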

Frame-level features (a combined code sketch follows at the end of this list).

  1. Spectral features
    The first type of audio features is derived from spectral analyses.
  • first apply the fast Fourier transform and the discrete wavelet transform to the windowed signal in each local frame
  • from the magnitude spectral results,
    compute summary features including the spectral centroid, the spectral bandwidth, the spectral rolloff, and the first- and second-order polynomial features of a spectrogram
  2. Mel-scale features
  • compute the Mel-scale spectrogram of each frame as well as the Mel-frequency cepstral coefficients (MFCC)
    to extract more meaningful features
  • use delta-MFCC features (the first- and second-order differences of the MFCC features over time)
    to capture variations of timbre over time

  3. Chroma features
  • use the chroma short-time Fourier transform as well as chroma energy normalized statistics (CENS)
    While Mel-scaled representations efficiently capture timbre, they provide poor resolution of pitches and pitch classes.
  4. Etc.
  • use the number of time-domain zero-crossings as an audio feature
    in order to detect the amount of noise in the audio signal
  • use the root-mean-square (RMS) energy of each frame
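A hedged librosa sketch covering the four groups of frame-level descriptors above (parameters such as n_mfcc are illustrative assumptions, not the paper's exact settings):

```python
import librosa
import numpy as np

y, sr = librosa.load("track.mp3", sr=22050)
S = np.abs(librosa.stft(y))

# 1. Spectral summary features, one value (or row) per frame.
centroid  = librosa.feature.spectral_centroid(S=S, sr=sr)
bandwidth = librosa.feature.spectral_bandwidth(S=S, sr=sr)
rolloff   = librosa.feature.spectral_rolloff(S=S, sr=sr)
poly      = librosa.feature.poly_features(S=S, sr=sr, order=2)  # 1st/2nd-order fits

# 2. Mel-scale features: MFCC plus first/second-order deltas.
mfcc        = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
mfcc_delta  = librosa.feature.delta(mfcc)
mfcc_delta2 = librosa.feature.delta(mfcc, order=2)

# 3. Chroma features: chroma-STFT and chroma energy normalized (CENS).
chroma_stft = librosa.feature.chroma_stft(y=y, sr=sr)
chroma_cens = librosa.feature.chroma_cens(y=y, sr=sr)

# 4. Zero-crossing rate (noise indicator) and RMS energy per frame.
zcr = librosa.feature.zero_crossing_rate(y)
rms = librosa.feature.rms(S=S)

# Concatenate everything into one descriptor vector per frame.
frame_features = np.concatenate(
    [centroid, bandwidth, rolloff, poly, mfcc, mfcc_delta,
     mfcc_delta2, chroma_stft, chroma_cens, zcr, rms], axis=0)
print(frame_features.shape)  # (n_feature_dims, n_frames)
```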

B. Video Feature Extraction

Frame-level features

  • The HIMV-200K dataset contains a large amount of data, so training a CNN from scratch would take too long;
    instead, an Inception network pretrained on ImageNet is used to extract frame-level features

  • whitened principal component analysis (WPCA) is then applied,
    so that the normalized features are approximately multivariate Gaussian with zero mean and identity covariance
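A PyTorch sketch of this frame-level extraction (the model variant, preprocessing, and the scikit-learn whitening are my assumptions of how the described WPCA could be applied, not the authors' exact pipeline):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.decomposition import PCA

# Inception pretrained on ImageNet, used as a fixed feature extractor.
inception = models.inception_v3(weights="IMAGENET1K_V1")
inception.fc = torch.nn.Identity()   # keep the 2048-d pooled features
inception.eval()

preprocess = T.Compose([
    T.Resize(299), T.CenterCrop(299), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def frame_features(frames):            # frames: list of PIL images
    batch = torch.stack([preprocess(f) for f in frames])
    return inception(batch)            # (n_frames, 2048)

# WPCA: whitening PCA fit once on a large sample of frame features, so
# projected features are approximately zero-mean with identity covariance.
wpca = PCA(n_components=1024, whiten=True)
# wpca.fit(sampled_frame_features)     # fit on held-out frames
# reduced = wpca.transform(feats)      # (n_frames, 1024)
```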

Video-level features

  • concatenation of statistics of the frame-level features (analogous to the music-level features)
  • a global normalization process (subtracting the mean vector from all features)
  • principal component analysis (PCA)
  • L2 normalization
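A numpy/scikit-learn sketch of this video-level pipeline (the specific statistics, mean, standard deviation, and top-5 ordinal statistics, are taken from the implementation details later in this note; dimensions are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

def video_level_feature(frame_feats):
    """frame_feats: (n_frames, d) frame-level features of one video."""
    # Per-dimension statistics across frames: mean, std, and the
    # top-5 ordinal statistics (5 largest values per dimension).
    top5 = np.sort(frame_feats, axis=0)[-5:]             # (5, d)
    return np.concatenate([frame_feats.mean(axis=0),
                           frame_feats.std(axis=0),
                           top5.reshape(-1)])            # (7 * d,)

def postprocess(X, n_components=1024):
    """X: (n_videos, 7 * d) stacked video-level statistics."""
    X = X - X.mean(axis=0, keepdims=True)                # global normalization
    X = PCA(n_components=n_components).fit_transform(X)  # PCA reduction
    return X / np.linalg.norm(X, axis=1, keepdims=True)  # L2 normalization
```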

C. Multimodal Embedding

The final step is to embed the separately extracted features of the heterogeneous music and video modalities into a shared embedding space.

The two-branch neural network is built from fully connected (FC) layers with ReLU activations.

  • video features are extracted from a pretrained CNN
  • music features are merely a simple concatenation of low-level audio feature statistics

To compensate for the relatively low-level audio features, the audio branch of the network is made deeper than the video branch.

The final outputs of the two branches are L2-normalized to facilitate computing cosine similarity, which is used as the distance metric in this method.
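A minimal PyTorch sketch of such a two-branch network (layer counts and widths are illustrative assumptions; the text only fixes that the music branch is deeper and that both outputs are L2-normalized):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VMNet(nn.Module):
    """Two-branch embedding network: deeper music branch, shared space."""
    def __init__(self, video_dim=1024, music_dim=380, embed_dim=512):
        super().__init__()
        self.video_branch = nn.Sequential(            # shallower branch
            nn.Linear(video_dim, 1024), nn.ReLU(),
            nn.Linear(1024, embed_dim),
        )
        self.music_branch = nn.Sequential(            # deeper branch to
            nn.Linear(music_dim, 1024), nn.ReLU(),    # compensate for the
            nn.Linear(1024, 1024), nn.ReLU(),         # low-level audio input
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, embed_dim),
        )

    def forward(self, video, music):
        v = F.normalize(self.video_branch(video), dim=1)  # L2-normalized, so
        m = F.normalize(self.music_branch(music), dim=1)  # dot product = cosine
        return v, m
```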

Inter-modal ranking constraint

Inspired by the triplet ranking loss.

  • a positive cross-modal sample
    the ground-truth pair item separated from the same music video
  • a negative sample
    an item not paired with the anchor

Loss

  • vi (anchor)
    video of the i-th music video
  • mi (positive sample)
    music of the i-th music video
  • mj (negative sample)
    the music feature obtained from the j-th music video
  • d(v,m)
    distance (e.g., Euclidean distance)
  • e
    a margin constant
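With the definitions above, the video-anchored term of the ranking loss takes the standard triplet hinge form (written out here from those definitions; the paper's exact summation details may differ):

$$L_{inter}^{(v)} = \sum_{i}\sum_{j \neq i} \max\bigl(0,\ d(v_i, m_i) - d(v_i, m_j) + e\bigr)$$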

The same hinge is applied in both directions: with the video as the anchor (video input, as above) and with the music as the anchor (music input).

Triplet selection

During triplet selection, computing the loss over all possible triplets would require a prohibitive amount of computation, so the method mines only the:

  • top Q most violated cross-modal matches in each mini-batch

i.e., selecting a maximum of Q violating negative matches that lie closer to the anchor than the positive pair (the ground-truth video–music pair) in the embedding space, as sketched below.
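A PyTorch sketch of this top-Q mining for the video-anchored direction (function and variable names are mine; assumes one ground-truth pair per row of the mini-batch):

```python
import torch

def topq_violations(v, m, margin, Q):
    """v, m: (B, dim) L2-normalized embeddings of a mini-batch of
    ground-truth video-music pairs (row i of v pairs with row i of m)."""
    dist = torch.cdist(v, m)                       # (B, B) pairwise distances
    pos = dist.diagonal().unsqueeze(1)             # d(v_i, m_i)
    viol = (pos - dist + margin).clamp(min=0)      # hinge per candidate
    viol.fill_diagonal_(0)                         # exclude the positive pair
    # Keep only the Q most violated negatives per anchor.
    topq, _ = viol.topk(min(Q, viol.size(1) - 1), dim=1)
    return topq.sum()
```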

Soft intra-modal structure constraint

If only the inter-modal ranking constraint is used,
the inherent characteristics within each modality (i.e., modality-specific characteristics) may be lost.

The modality-specific characteristics:

  • in music
    rhythm, tempo, or timbre
  • in videos
    brightness, color, or texture

To address this collapse of the structure within each modality, a soft intra-modal structure constraint is devised.

The constraint is applied separately to the video input and the music input: for example, one music feature should lie closer to another in the multimodal embedding space if it is also closer in the original feature space before embedding.

Unlike the inter-modal ranking constraint, it does not use the margin constant.

Embedding network loss

  • inter-modal ranking constraint
    two types of triplets: $(v_i, m_i, m_j)$ and $(m_i, v_i, v_j)$

  • soft intra-modal structure constraint
    two types of triplets: $(v_i, v_j, v_k)$ and $(m_i, m_j, m_k)$

$$\mathrm{sign}(x)=\begin{cases}1, & x>0\\ 0, & x=0\\ -1, & x<0\end{cases}$$
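The overall embedding network loss combines the two constraints, weighted by λ1 and λ2 (discussed in the experiments): $L = \lambda_1 L_{inter} + \lambda_2 L_{intra}$. As a hedged reconstruction of the soft intra-modal term for the music triplets (my reading of the description, not copied from the paper), with $\delta$ the distance between features before embedding and $d$ the distance in the embedding space:

$$L_{intra}^{(m)} = \sum_{i,j,k}\max\Bigl(0,\ \mathrm{sign}\bigl(\delta(m_i,m_j)-\delta(m_i,m_k)\bigr)\cdot\bigl(d(m_i,m_k)-d(m_i,m_j)\bigr)\Bigr)$$

An analogous term is applied to the video triplets $(v_i, v_j, v_k)$. Because only the relative ordering is enforced, no margin constant appears.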

Dataset and implementation details

A. Construction of the Dataset

Our Hong–Im Music–Video 200K (HIMV-200K) benchmark dataset contains 200,500 video–music pairs.
We obtained these pairs from YouTube-8M, a large-scale labeled video dataset containing millions of YouTube video IDs and associated tags.

All videos in YouTube-8M are annotated with a vocabulary of 4,800 visual entities, covering a variety of activities (e.g., sports, games, hobbies), objects (e.g., cars, food, products), and events (e.g., concerts, festivals, weddings).

Among the videos associated with these thousands of entities, we first downloaded those related to "music video", including official music videos, parody music videos, and user-generated videos with background music.

After downloading all videos tagged "music video", we used FFmpeg to separate each into its video and audio streams, as sketched below. In the end we obtained 205,000 video–music pairs, with 200K, 4K, and 1K pairs used for training, validation, and testing, respectively.
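A sketch of this demuxing step (the authors' exact FFmpeg invocation is not given; these are standard stream-copy flags):

```python
import subprocess

def split_music_video(src):
    """Separate a downloaded music video into a silent video file
    and an audio-only file using stream copy (no re-encoding)."""
    subprocess.run(["ffmpeg", "-i", src, "-an", "-c:v", "copy",
                    src.replace(".mp4", "_video.mp4")], check=True)
    subprocess.run(["ffmpeg", "-i", src, "-vn", "-c:a", "copy",
                    src.replace(".mp4", "_audio.m4a")], check=True)
```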

To release our HIMV-200K dataset publicly without infringing copyright, we provide the URLs of the YouTube videos under the "online videos" category, together with feature extraction code for the videos and music tracks, in our online repository. (https://github.com/csehong/VM-NET)

B. Implementation Details

The audio signals were trimmed to 29.12 s at the center of each song and downsampled from 22.05 kHz to 12 kHz, following [36].

  • the audio signal is decomposed into harmonic and percussive components, and a large set of audio features is extracted frame by frame,
    yielding a 380-dimensional vector per frame
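A librosa sketch of the trimming and resampling described above (the loading rate and helper name are assumptions):

```python
import librosa

def load_center_clip(path, clip_sec=29.12, sr_out=12000):
    y, sr = librosa.load(path, sr=22050)            # original rate
    clip_len = int(clip_sec * sr)
    start = max(0, (len(y) - clip_len) // 2)        # center of the song
    y = y[start:start + clip_len]                   # trim to 29.12 s
    return librosa.resample(y, orig_sr=sr, target_sr=sr_out)  # 12 kHz
```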

For the video features used in video–music retrieval, we followed the implementation details in [40].

Each video was first decoded at 1 frame per second, up to the first 360 seconds.

Frame-level features of 2,048 dimensions were extracted with the Inception network [4], and WPCA was applied to reduce the feature dimensionality to 1,024. Given the frame-level features, we aggregated them using the mean, the standard deviation, and the top-5 ordinal statistics, followed by global normalization.

Experimental results

A. The Recall@K Metric

Evaluated on the 1K test set.

Most existing methods mainly perform subjective user evaluations.

To address this, we apply Recall@K, a standard protocol for cross-modal retrieval, especially image-to-text retrieval [30], [33], to the bidirectional CBVMR task.

For a given K, it measures the percentage of queries in the test set for which at least one correct ground-truth match is ranked among the top K results. For example, for a video query requesting suitable music, Recall@10 tells us the percentage of video queries whose top ten results contain the ground-truth music match.
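A numpy sketch of Recall@K in this bidirectional setting (assumes row i of each embedding matrix is the ground-truth match of row i of the other):

```python
import numpy as np

def recall_at_k(v_emb, m_emb, k=10):
    """v_emb, m_emb: (N, d) L2-normalized embeddings; pair i <-> i."""
    sim = v_emb @ m_emb.T                          # cosine similarities
    # Rank of the ground-truth match for each video query.
    ranks = (sim > sim.diagonal()[:, None]).sum(axis=1)
    return float((ranks < k).mean())               # fraction with GT in top-K

# Music-to-video retrieval is measured the same way with sim transposed.
```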

[30] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
[33] L. Wang, Y. Li, and S. Lazebnik, “Learning deep structure-preserving image-text embeddings,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5005–5013.

Giving λ1 relatively more weight than λ2 generally improves performance; however, we empirically confirmed that setting λ1 to 5 or greater does not improve Recall@K further.

B. A Human Preference Test

Conclusion

A two-branch deep network that associates videos with music, taking both inter-modal and intra-modal relationships into account.

The model can learn characteristics such as the genre of the music and the gender or nationality of the singer.