短時傅裏葉變換(Short Time Fourier Transform)原理及 Python 實現
原理
短時傅裏葉變換(Short Time Fourier Transform, STFT) 是一個用於語音信號處理的通用工具.它定義了一個非常有用的時間和頻率分布類, 其指定了任意信號隨時間和頻率變化的復數幅度. 實際上,計算短時傅裏葉變換的過程是把一個較長的時間信號分成相同長度的更短的段, 在每個更短的段上計算傅裏葉變換, 即傅裏葉頻譜.
短時傅裏葉變換通常的數學定義如下:
其中,
DTFT (Decrete Time Fourier Transform) 為離散時間傅裏葉變換. 其數學公式, 如下所示:
其中, x(n) 為在采樣數 n 處的信號幅度. ω~ 的定義如下:
實現時, 短時傅裏葉變換被計算為一系列加窗數據幀的快速傅裏葉變換 (Fast Fourier Transform, FFT),其中窗口隨時間 “滑動” (slide) 或“跳躍” (hop) 。
Python 實現
在程序中, frame_size 為將信號分為較短的幀的大小, 在語音處理中, 通常幀大小在 20ms 到 40ms 之間. 這裏設置為 25ms, 即 frame_size = 0.025;
frame_stride 為相鄰幀的滑動尺寸或跳躍尺寸, 通常幀的滑動尺寸在 10ms 到 20ms 之間, 這裏設置為 10ms, 即 frame_stride = 0.01. 此時, 相鄰幀的交疊大小為 15ms;
窗函數采用漢明窗函數 (Hamming Function) ;
在每一幀, 進行 512 點快速傅裏葉變換, 即 NFFT = 512. 具體程序如下:
# -*- coding: utf8 -*-
import numpy as np
def calc_stft(signal, sample_rate=16000, frame_size=0.025, frame_stride=0.01, winfunc=np.hamming, NFFT=512):
# Calculate the number of frames from the signal
frame_length = frame_size * sample_rate
frame_step = frame_stride * sample_rate
signal_length = len(signal)
frame_length = int(round(frame_length))
frame_step = int(round(frame_step))
num_frames = 1 + int(np.ceil(float(np.abs(signal_length - frame_length)) / frame_step))
# zero padding
pad_signal_length = num_frames * frame_step + frame_length
z = np.zeros((pad_signal_length - signal_length))
# Pad signal to make sure that all frames have equal number of samples
# without truncating any samples from the original signal
pad_signal = np.append(signal, z)
# Slice the signal into frames from indices
indices = np.tile(np.arange(0, frame_length), (num_frames, 1)) + np.tile(np.arange(0, num_frames * frame_step, frame_step), (frame_length, 1)).T
frames = pad_signal[indices.astype(np.int32, copy=False)]
# Get windowed frames
frames *= winfunc(frame_length)
# Compute the one-dimensional n-point discrete Fourier Transform(DFT) of
# a real-valued array by means of an efficient algorithm called Fast Fourier Transform (FFT)
mag_frames = np.absolute(np.fft.rfft(frames, NFFT))
# Compute power spectrum
pow_frames = (1.0 / NFFT) * ((mag_frames) ** 2)
return pow_frames
if __name__ == ‘__main__‘:
import scipy.io.wavfile
import matplotlib.pyplot as plt
# Read wav file
# "OSR_us_000_0010_8k.wav" is downloaded from http://www.voiptroubleshooter.com/open_speech/american.html
sample_rate, signal = scipy.io.wavfile.read("OSR_us_000_0010_8k.wav")
# Get speech data in the first 2 seconds
signal = signal[0:int(2. * sample_rate)]
# Calculate the short time fourier transform
pow_spec = calc_stft(signal, sample_rate)
plt.imshow(pow_spec)
plt.tight_layout()
plt.show()
參考資料
1. DISCRETE TIME FOURIER TRANSFORM (DTFT). https://www.dsprelated.com/freebooks/mdft/Discrete_Time_Fourier_Transform.html
2. THE SHORT-TIME FOURIER TRANSFORM. https://www.dsprelated.com/freebooks/sasp/Short_Time_Fourier_Transform.html
3. Short-time Fourier transform. https://en.wikipedia.org/wiki/Short-time_Fourier_transform
4. Speech Processing for Machine Learning: Filter banks, Mel-Frequency Cepstral Coefficients (MFCCs) and What‘s In-Between. https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html
短時傅裏葉變換(Short Time Fourier Transform)原理及 Python 實現