【Music】Video Soundtracks | Multimodal Retrieval: Content-Based Video–Music Retrieval (CBVMR) Using Soft Intra-Modal Structure Constraint (Notes)
2018 ICMR
Content-Based Video–Music Retrieval Using Soft Intra-Modal Structure Constraint
Introduction
bidirectional retrieval
Challenges
- design a cross-modal model that requires no metadata
- matched video–music pairs are hard to obtain, and the matching criteria between video and music are more ambiguous than in other cross-modal tasks (e.g., image-to-text retrieval)
Contributions
- Content-based, cross-modal embedding network
- introduce VM-NET, a two-branch neural network that infers the latent alignment between videos and music tracks using only their contents
- train the network via an inter-modal ranking loss, such that videos and music with similar semantics end up close together in the embedding space
However, if only the inter-modal ranking constraint for embedding is considered, modality-specific characteristics (e.g., rhythm or tempo for music and texture or color for image) may be lost.
- devise a novel soft intra-modal structure constraint
takes advantage of the relative distance relationships of samples within each modality
does not require ground-truth pair information within an individual modality.
Large-scale video–music pair dataset
- Hong-Im Music-Video 200K (HIMV-200K)
composed of 200,500 video–music pairs.
Evaluation
- Recall@K
- subjective user evaluation
Related work
A. Video–Music Related Tasks
conventional approaches can be divided into three categories according to the task:
- generation
- classification
- matching
Most existing methods use metadata (e.g., keywords, mood tags, and related descriptions).
B. Two-branch Neural Networks
Two-branch networks have been used to model relationships between different modalities, e.g.:
- associating images with text
- emotion tagging of music videos
Tunesensor: A semantic-driven music recommendation service for digital photo albums (ISWC 2011)
Method
A. Music Feature Extraction
- decompose the audio signal into harmonic and percussive components
- apply log-amplitude scaling to each component to avoid numerical underflow
- slice the components into shorter segments called local frames (or windowed excerpts) and extract multiple features from each component of each frame (see the sketch below)
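A minimal sketch of this preprocessing pipeline using librosa; the file name, FFT size, and hop length are illustrative assumptions, not the paper's settings:

```python
import numpy as np
import librosa

# Load the track as a mono signal; 12 kHz matches the downsampling rate
# mentioned later in the implementation details (file name is hypothetical).
y, sr = librosa.load("song.mp3", sr=12000)

# 1. Separate the signal into harmonic and percussive components (HPSS).
y_harm, y_perc = librosa.effects.hpss(y)

# 2. Magnitude spectrograms per component; each STFT column corresponds to
#    one windowed excerpt ("local frame").
S_harm = np.abs(librosa.stft(y_harm, n_fft=2048, hop_length=512))
S_perc = np.abs(librosa.stft(y_perc, n_fft=2048, hop_length=512))

# 3. Log-amplitude scaling, with a small floor to avoid numerical underflow.
log_harm = np.log(S_harm + 1e-10)
log_perc = np.log(S_perc + 1e-10)
```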
Frame-level features
- Spectral features
The first type of audio features is derived from spectral analyses:
  - first apply the fast Fourier transform and the discrete wavelet transform to the windowed signal in each local frame
  - from the magnitude spectra, compute summary features including the spectral centroid, the spectral bandwidth, the spectral rolloff, and the first- and second-order polynomial features of a spectrogram
- Mel-scale features
  - compute the Mel-scale spectrogram of each frame as well as the Mel-frequency cepstral coefficients (MFCC) to extract more meaningful features
  - use delta-MFCC features (the first- and second-order differences of MFCC over time) to capture variations of timbre over time
- Chroma features
  - use the chroma short-time Fourier transform as well as chroma energy normalized statistics
  - while Mel-scaled representations efficiently capture timbre, they provide poor resolution of pitches and pitch classes
- Etc.
  - use the number of time-domain zero-crossings as an audio feature, in order to detect the amount of noise in the audio signal
  - use the root-mean-square (RMS) energy of each frame
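All of the frame-level descriptors above have standard counterparts in librosa; a sketch under assumed window parameters (not necessarily those used in the paper):

```python
import numpy as np
import librosa

y, sr = librosa.load("song.mp3", sr=12000)   # hypothetical input
n_fft, hop = 2048, 512

# Spectral features (one value or vector per frame).
centroid  = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
rolloff   = librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
poly      = librosa.feature.poly_features(S=S, sr=sr, order=2)  # 1st/2nd-order fits

# Mel-scale features: Mel spectrogram, MFCC, and delta-MFCC.
mel  = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
d1   = librosa.feature.delta(mfcc, order=1)   # timbre variation over time
d2   = librosa.feature.delta(mfcc, order=2)

# Chroma features: chroma-STFT and chroma energy normalized (CENS).
chroma_stft = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=n_fft, hop_length=hop)
chroma_cens = librosa.feature.chroma_cens(y=y, sr=sr)

# Etc.: zero-crossing rate (a noise proxy) and RMS energy per frame.
zcr = librosa.feature.zero_crossing_rate(y, frame_length=n_fft, hop_length=hop)
rms = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)
```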
B. Video Feature Extraction
Frame-level features
- the HIMV-200K dataset is large, so training a CNN from scratch would take too long; instead, frame-level features are extracted with an Inception network pretrained on ImageNet
- whitened principal component analysis (WPCA) is then applied so that the normalized features are approximately multivariate Gaussian with zero mean and identity covariance (see the sketch below)
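A minimal NumPy sketch of WPCA, assuming X holds the Inception frame-level features; shapes and the output dimensionality are illustrative:

```python
import numpy as np

def wpca(X, out_dim=1024, eps=1e-8):
    """Project X to out_dim dimensions so the result is approximately
    zero-mean with identity covariance (whitened)."""
    Xc = X - X.mean(axis=0)                        # zero mean
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Divide each principal direction by its standard deviation to whiten.
    W = Vt[:out_dim].T / (S[:out_dim] / np.sqrt(len(X) - 1) + eps)
    return Xc @ W
```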
Video-level features
- concatenation of the aggregated frame-level features
- a global normalization process (subtracting the mean vector from all features)
- principal component analysis (PCA)
- L2 normalization
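A sketch of this video-level aggregation, using the statistics named later in the implementation details (mean, standard deviation, top-5 ordinal statistics); dimensions are illustrative:

```python
import numpy as np

def video_level_feature(frames):                  # frames: (n_frames, d)
    """Aggregate frame-level features into a single video-level vector."""
    return np.concatenate([
        frames.mean(axis=0),
        frames.std(axis=0),
        np.sort(frames, axis=0)[-5:].ravel(),     # top-5 ordinal statistics
    ])

def postprocess(F, pca_dim=1024):                 # F: (n_videos, D)
    """Global normalization, PCA, then L2 normalization."""
    F = F - F.mean(axis=0)                        # subtract the global mean
    _, _, Vt = np.linalg.svd(F, full_matrices=False)
    F = F @ Vt[:pca_dim].T                        # PCA projection
    return F / (np.linalg.norm(F, axis=1, keepdims=True) + 1e-8)
```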
C. Multimodal Embedding
The final step is to embed the separately extracted features of the heterogeneous music and video modalities into a shared embedding space.
The two-branch neural network consists of fully connected (FC) layers with ReLU activations.
- video features are extracted from a pretrained CNN
- music features are merely a simple concatenation of low-level audio feature statistics
- to compensate for the relatively low-level audio features, the audio branch of the network is made deeper than the video branch
- the final outputs of both branches are L2-normalized, making it easy to compute cosine similarity, which is used as the distance metric in this method (see the sketch below)
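A minimal PyTorch sketch of such a two-branch network; the layer widths and depths are assumptions that merely respect the constraint that the audio branch is deeper than the video branch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchNet(nn.Module):
    def __init__(self, video_dim=1024, music_dim=380, embed_dim=512):
        super().__init__()
        self.video_branch = nn.Sequential(         # shallower: CNN features
            nn.Linear(video_dim, 1024), nn.ReLU(),
            nn.Linear(1024, embed_dim),
        )
        self.music_branch = nn.Sequential(         # deeper: low-level audio stats
            nn.Linear(music_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, embed_dim),
        )

    def forward(self, v, m):
        # L2-normalize both outputs so cosine similarity is a dot product.
        v = F.normalize(self.video_branch(v), dim=1)
        m = F.normalize(self.music_branch(m), dim=1)
        return v, m
```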
Inter-modal ranking constraint
Inspired by the triplet ranking loss:
- a positive cross-modal sample: the ground-truth paired item separated from the same music video
- a negative sample: an item not paired with the anchor
Loss
- $v_i$ (anchor): the video of the i-th music video
- $m_i$ (positive sample): the music of the i-th music video
- $m_j$ (negative sample): the music feature obtained from the j-th music video
- $d(v, m)$: a distance (e.g., Euclidean distance)
- $\epsilon$: a margin constant

With the video as the anchor (video input querying music):
$$L_{inter}^{v}=\sum_{i,j}\max\big(0,\;\epsilon+d(v_i,m_i)-d(v_i,m_j)\big)$$
and symmetrically with the music as the anchor (music input querying video):
$$L_{inter}^{m}=\sum_{i,j}\max\big(0,\;\epsilon+d(m_i,v_i)-d(m_i,v_j)\big)$$
Triplet selection
- computing the loss over all possible triplets would require a large amount of computation
- instead, take the top Q most violated cross-modal matches in each mini-batch, i.e., select at most Q violating negatives that lie closer to the anchor than the positive (ground-truth video–music) pair in the embedding space (see the sketch below)
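A sketch of the bidirectional ranking loss with top-Q violating negatives, assuming embeddings v, m of shape (B, d) where (v[i], m[i]) are ground-truth pairs; Q and the margin eps are hyperparameters:

```python
import torch

def inter_modal_loss(v, m, eps=0.2, Q=10):
    d = torch.cdist(v, m)                          # d[i, j] = ||v_i - m_j||
    pos = d.diagonal().unsqueeze(1)                # d(v_i, m_i), shape (B, 1)
    mask = ~torch.eye(d.size(0), dtype=torch.bool, device=d.device)

    def hardest(dist):
        viol = (eps + pos - dist).clamp(min=0) * mask  # hinge, true pair excluded
        q = min(Q, dist.size(1) - 1)
        return viol.topk(q, dim=1).values.sum()        # keep top-Q violators only

    return hardest(d) + hardest(d.t())             # video anchor + music anchor
```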
Soft intra-modal structure constraint
If only the inter-modal ranking constraint is used, the inherent characteristics within each modality (i.e., the modality-specific characteristics) may be lost:
- in music: rhythm, tempo, or timbre
- in videos: brightness, color, or texture
To address this collapse of structure within each modality, a soft intra-modal structure constraint is devised: the distance ordering among embedded features in the multimodal space should follow the ordering among the corresponding features before embedding. For the music side, with $\bar{m}$ denoting music features before embedding and $m$ the embedded music features:
$$L_{intra}^{m}=\sum_{i,j,k}\max\Big(0,\;\operatorname{sign}\big(d(\bar{m}_i,\bar{m}_j)-d(\bar{m}_i,\bar{m}_k)\big)\cdot\big(d(m_i,m_k)-d(m_i,m_j)\big)\Big)$$
The video-side term is defined analogously, and no margin constant is used (see the sketch below).
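A sketch of the intra-modal term for one modality, assuming x_raw are features before embedding and x_emb their embeddings; triplets are sampled randomly here, and, per the notes, no margin is used:

```python
import torch

def soft_intra_modal_loss(x_raw, x_emb, n_triplets=128):
    B = x_emb.size(0)
    i, j, k = (torch.randint(0, B, (n_triplets,)) for _ in range(3))
    # Sign of the distance ordering before embedding (-1, 0, or +1).
    s = torch.sign((x_raw[i] - x_raw[j]).norm(dim=1)
                   - (x_raw[i] - x_raw[k]).norm(dim=1))
    # Penalize embedded distances whose ordering disagrees with that sign.
    diff = ((x_emb[i] - x_emb[k]).norm(dim=1)
            - (x_emb[i] - x_emb[j]).norm(dim=1))
    return torch.clamp(s * diff, min=0).sum()
```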
Embedding network loss
- inter-modal ranking constraint: two types of triplets, $(v_i, m_i, m_j)$ and $(m_i, v_i, v_j)$
- soft intra-modal structure constraint: two types of triplets, $(v_i, v_j, v_k)$ and $(m_i, m_j, m_k)$
The total loss combines the two constraints, weighted by λ1 (inter-modal) and λ2 (intra-modal).
$$\operatorname{sign}(x)=\begin{cases}1, & x>0\\ 0, & x=0\\ -1, & x<0\end{cases}$$
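Putting the pieces together: a sketch of the combined objective, reusing the loss sketches above; λ1 and λ2 are the weights discussed in the experiments, with purely illustrative default values:

```python
def embedding_loss(v_raw, m_raw, v_emb, m_emb, lam1=3.0, lam2=1.0):
    # lam1 weights the inter-modal ranking term, lam2 the intra-modal terms.
    return (lam1 * inter_modal_loss(v_emb, m_emb)
            + lam2 * (soft_intra_modal_loss(v_raw, v_emb)
                      + soft_intra_modal_loss(m_raw, m_emb)))
```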
Dataset and implementation details
A. Construction of the Dataset
Our Hong-Im Music-Video 200K (HIMV-200K) benchmark dataset contains 200,500 video–music pairs.
We obtained these video–music pairs from YouTube-8M, a large-scale labeled video dataset containing millions of YouTube video IDs and their associated tags.
All videos in YouTube-8M are annotated with a vocabulary of 4,800 visual entities, covering various activities (e.g., sports, games, hobbies), objects (e.g., cars, food, products), and events (e.g., concerts, festivals, weddings).
Among the videos associated with thousands of entities, we first downloaded those related to "music video", including official music videos, parody music videos, and user-generated videos with background music.
After downloading all videos tagged "music video", we used FFmpeg to split each into its video and audio tracks. In the end we obtained 205,000 video–music pairs; the training, validation, and test splits contain 200K, 4K, and 1K pairs, respectively.
To release HIMV-200K publicly without infringing copyright, we provide the URLs of the YouTube videos, along with feature-extraction code for the videos and the music tracks, in our online repository (https://github.com/csehong/VM-NET).
B. Implementation Details
Following [36], the audio signals are trimmed to 29.12 s at the center of each song and downsampled from 22.05 kHz to 12 kHz.
- the audio signal is decomposed into harmonic and percussive components, and a large number of audio features are extracted frame by frame
- this yields a 380-dimensional feature vector per frame
For the video side of video–music retrieval, we followed the implementation details in [40]:
- each video is first decoded at 1 frame per second, up to the first 360 seconds
- frame-level features of 2,048 dimensions are extracted with the Inception network [4], and WPCA reduces the dimensionality to 1,024
- given the frame-level features, they are aggregated using the mean, the standard deviation, and the top-5 ordinal statistics, followed by global normalization
Experimental results
A. The Recall@K Metric
Evaluated on the 1K test set.
Most prior methods rely mainly on subjective user evaluation. To address this, we apply Recall@K, a standard protocol for cross-modal retrieval (especially image-text retrieval [30], [33]), to the bidirectional CBVMR task.
For a given K, Recall@K measures the percentage of queries in the query set for which at least one correct ground-truth match is ranked among the top K results. For example, for a video query seeking suitable music, Recall@10 tells us the percentage of video queries whose top-10 results contain the ground-truth music match (see the sketch below).
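A short sketch of Recall@K for this bidirectional setup, assuming L2-normalized embeddings where (v[i], m[i]) is the ground-truth pair:

```python
import torch

def recall_at_k(v, m, k=10):
    sim = v @ m.t()                                # cosine similarity matrix
    topk = sim.topk(k, dim=1).indices              # top-K music per video query
    gt = torch.arange(v.size(0)).unsqueeze(1)      # index of the true match
    return (topk == gt).any(dim=1).float().mean().item()

# recall_at_k(m, v) gives the music-to-video direction.
```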
[30] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
[33] L. Wang, Y. Li, and S. Lazebnik, “Learning deep structure-preserving image-text embeddings,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5005–5013.
Giving λ1 relatively more weight than λ2 generally improves performance; however, we empirically confirmed that setting λ1 to 5 or greater does not further improve Recall@K.
B. A Human Preference Test
Conclusion
A two-branch deep network that relates videos and music, taking both inter-modal and intra-modal relationships into account.
The model can capture characteristics such as the genre of the music and the gender or nationality of the singer.