Hands-on Experience with Speech-to-Text APIs from the Major Vendors

阿新 • Published: 2021-06-17

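I recently found that audiobooks help me fall asleep a lot. Every episode starts with an opening announcement that I want to cut off, but there are several different openings, so I need speech recognition to work out which one a file has before trimming it.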
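For the cutting step itself, one option is ffmpeg. The sketch below only builds the command line and runs it on request. It assumes ffmpeg is installed and can read the input format; `build_trim_command`, `trim_intro`, and the file names are hypothetical names used for illustration.

```python
import subprocess
from typing import List


def build_trim_command(src: str, dst: str, intro_seconds: float) -> List[str]:
    """Build an ffmpeg command that drops the first `intro_seconds` of `src`.

    Assumption: ffmpeg is on PATH. `-c copy` copies the stream without
    re-encoding, so the cut point may be approximate rather than exact.
    """
    return [
        "ffmpeg",
        "-y",                        # overwrite dst if it already exists
        "-ss", str(intro_seconds),   # start reading at this offset (seconds)
        "-i", src,
        "-c", "copy",                # stream copy, no re-encode
        dst,
    ]


def trim_intro(src: str, dst: str, intro_seconds: float) -> None:
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_trim_command(src, dst, intro_seconds), check=True)
```

Calling `trim_intro("in.mp3", "out.mp3", 5)` would write a copy of the file that starts 5 seconds in. How long each opening actually is has to come from the recognition step.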

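A couple of years ago I set up a local recognition environment, but its accuracy wasn't good enough, so for now I'm falling back on APIs; I'll revisit a local setup when I have time. Below are services from a few major vendors. From my own use, the ranking is Google > iFLYTEK (訊飛) > IBM.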

Oracle

Its Always Free plan won me over, but the transcription service it offers doesn't support Chinese, so pass.

IBM

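Pros: a certain amount of recurring free quota
Cons: accuracy is not good enough; the official site is somewhat slow to load
A quick-and-dirty example: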

#coding:utf-8
'''
@version: python3.8
@author: ‘eric‘
@license: Apache Licence
@contact: [email protected]
@software: PyCharm
@file: ibm.py
@time: 2021/6/16 23:05
'''
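from __future__ import print_function

import os
import re
import traceback

# Legacy SDK package name; newer SDK releases are published as `ibm_watson`
# and authenticate differently, so this import may need adjusting.
from watson_developer_cloud import SpeechToTextV1

# Fill in the API key and service URL from the service credentials
apikey = ''
url = ''

service = SpeechToTextV1(
    iam_apikey=apikey,
    url=url)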

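# Root directory of the source files
base_dir = r'36041981'

# The 'clip' subdirectory holds files already cut down to 5 s, with the .x2m suffix
# (Ximalaya cache files from the Android app). I suspect they are really an
# ordinary audio format with the extension renamed.
cliped_dir = os.listdir(os.path.join(base_dir, 'clip'))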
for each in cliped_dir:
    try:
        filename = re.findall(r"(.*?)\.x2m", each)  # capture the file name without its .x2m suffix
        if filename:
            filename[0] += '.x2m'
            with open(os.path.join(base_dir, 'clip', filename[0]),
                      'rb') as audio_file:
                recognize_result = service.recognize(
                    audio=audio_file,
                    content_type='audio/mp3',
                    timestamps=False,
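                    # Chinese models; zh-CN_BroadbandModel is a bit more accurate
                    model='zh-CN_NarrowbandModel',
                    # model='zh-CN_BroadbandModel',

                    # I expected these two parameters to pull the recognized text
                    # closer to the supplied characters, but in testing they made no
                    # difference. That may be because `keywords` is meant for spotting
                    # keywords in the results rather than biasing the transcript.
                    # keywords=list(set([x for x in '曲曲于山川歷史為解之謎拓展人生的長度廣度人生的長度廣度和深度由喜馬拉雅聯合大理石獨家推出探祕類大家好歡迎大家訂閱歷史未解之謎全記錄'])),
                    # keywords_threshold=0.1,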
                    word_confidence=True).get_result()
                if len(recognize_result['results'])==0:
                    with open('result-1.txt', 'a', encoding='utf-8') as f:
                        f.write('%s-%s\n' % (filename[0], '-'))
                        continue
                final_result = recognize_result['results'][0]['alternatives'][0]['transcript'].replace(' ', '')
                with open('result-1.txt', 'a',encoding='utf-8') as f:
                    f.write('%s-%s\n' % (filename[0], final_result))
    except Exception:
        # Log the failure and the offending file, then move on to the next clip
        traceback.print_exc()
        print(each)
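The script above writes one `filename-transcript` line per clip to result-1.txt, with `-` as the transcript when nothing was recognized. A minimal sketch of the next step, deciding which clips contain an opening, could use fuzzy string matching from the standard library. The phrase list, the 0.6 threshold, and the function names here are assumptions for illustration and would need tuning against real transcripts.

```python
from difflib import SequenceMatcher

# Hypothetical list of known opening phrases; extend it with the real ones.
KNOWN_INTROS = ["歡迎大家訂閱歷史未解之謎全記錄"]


def parse_result_line(line, suffix=".x2m"):
    """Split a 'name.x2m-transcript' line into (file name, transcript).

    Returns None when the line does not contain the expected separator.
    """
    name, sep, transcript = line.rstrip("\n").partition(suffix + "-")
    if not sep:
        return None
    return name + suffix, transcript


def best_intro_match(transcript, intros=KNOWN_INTROS):
    """Return (phrase, similarity) for the known phrase closest to the transcript."""
    best_phrase, best_score = None, 0.0
    for phrase in intros:
        score = SequenceMatcher(None, phrase, transcript).ratio()
        if score > best_score:
            best_phrase, best_score = phrase, score
    return best_phrase, best_score


def is_intro(transcript, threshold=0.6):
    """True when the transcript is similar enough to some known opening."""
    return best_intro_match(transcript)[1] >= threshold
```

Fuzzy matching is used instead of exact comparison because recognition errors of a character or two are common, so an exact match would miss most openings.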

Google

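Pros: a certain amount of recurring free quota; recognition is fast and accurate
Cons: has to be accessed through a proxy
Docs: "Quickstart: Using client libraries". For local audio files, don't use the code from the docs; refer to mine below instead.
A quick-and-dirty example: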

# coding:utf-8
from os import path

AUDIO_FILE = path.join(path.dirname(path.realpath(__file__)), "268675557.mp3")


def transcribe_file(speech_file):
    """Transcribe the given audio file."""
    from google.cloud import speech
    import io

    client = speech.SpeechClient()

    with io.open(speech_file, "rb") as audio_file:
        content = audio_file.read()

    audio = speech.RecognitionAudio(content=content)
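    config = speech.RecognitionConfig(
        # encoding and sample_rate_hertz must describe the actual audio.
        # LINEAR16 is uncompressed 16-bit PCM (e.g. WAV), so adjust these
        # values if the input is encoded differently.
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="zh-CN",
    )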

    response = client.recognize(config=config, audio=audio)

    # Each result is for a consecutive portion of the audio. Iterate through
    # them to get the transcripts for the entire audio file.
    for result in response.results:
        # The first alternative is the most likely one for this portion.
        print(u"Transcript: {}".format(result.alternatives[0].transcript))


if __name__ == '__main__':
    transcribe_file(AUDIO_FILE)
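The example above prints one transcript per result. To compare against a known opening it is handier to have a single string, so here is a small sketch that joins the top alternative of each result. `join_transcripts` and `demo_response` are hypothetical helpers; the latter only builds a stand-in object with the same attribute layout so the helper can be tried without calling the API.

```python
from types import SimpleNamespace


def join_transcripts(response):
    """Concatenate the most likely transcript of every result, in order.

    Works on any object shaped like the recognize() response used above:
    response.results[i].alternatives[0].transcript
    """
    parts = []
    for result in response.results:
        if result.alternatives:  # skip results with no alternatives
            parts.append(result.alternatives[0].transcript)
    return "".join(parts)


def demo_response():
    """Stand-in for an API response, for local experiments only."""
    def make_result(text):
        return SimpleNamespace(alternatives=[SimpleNamespace(transcript=text)])

    return SimpleNamespace(results=[
        make_result("大家好"),
        SimpleNamespace(alternatives=[]),
        make_result("歡迎大家訂閱"),
    ])
```

The pieces are joined without a separator because Chinese text has no spaces between words; for other languages a space may be more appropriate.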