An Introduction to the Cornell Movie-Dialogs Corpus

阿新 • Published: 2019-01-01

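This public resource is referenced by many open-source projects and papers related to natural language processing (NLP), so I read its README carefully and noted the key points here.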

All files use " +++$+++ " as the field separator.
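As a minimal sketch of what that separator means in practice, a line can be split into its fields with a plain string split. The names `SEPARATOR` and `split_fields` are illustrative, not part of the corpus or its README; the sample row is one of the `movie_lines.txt` rows quoted in this post.

```python
# Field separator used by every file in the corpus (spaces included).
SEPARATOR = " +++$+++ "


def split_fields(line):
    """Split one raw corpus line into a list of field strings."""
    # Drop only the trailing newline so that an empty last field survives.
    return line.rstrip("\n").split(SEPARATOR)


sample = "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!"
fields = split_fields(sample)
print(fields)
```

Every field comes back as a string; converting numeric fields is left to the caller.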

- movie_titles_metadata.txt
   - contains information about each movie title
   - fields:
       - movieID
       - movie title
       - movie year
       - IMDB rating
       - no. IMDB votes
       - genres in the format ['genre1','genre2',...,'genreN']
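The field list above could be parsed roughly as follows. This is a sketch under assumptions: `MovieTitle` and `parse_title_line` are hypothetical names, the sample row is made up for illustration rather than copied from the file, and the year is kept as a string in case it is not always a plain integer. The genre list is written like a Python list literal, so `ast.literal_eval` can read it.

```python
import ast
from typing import List, NamedTuple

SEPARATOR = " +++$+++ "


class MovieTitle(NamedTuple):
    movie_id: str
    title: str
    year: str            # kept as text; not assumed to be a clean integer
    imdb_rating: float
    imdb_votes: int
    genres: List[str]


def parse_title_line(line):
    """Parse one row of movie_titles_metadata.txt into a MovieTitle."""
    movie_id, title, year, rating, votes, genres = (
        line.rstrip("\n").split(SEPARATOR)
    )
    return MovieTitle(
        movie_id=movie_id,
        title=title,
        year=year,
        imdb_rating=float(rating),
        imdb_votes=int(votes),
        # The genres field looks like ['genre1', 'genre2'], a valid list literal.
        genres=ast.literal_eval(genres),
    )


# Made-up row that follows the documented field order.
example_row = (
    "m0 +++$+++ example movie +++$+++ 1999 +++$+++ 6.90"
    " +++$+++ 62847 +++$+++ ['comedy', 'romance']"
)
movie = parse_title_line(example_row)
print(movie.title, movie.genres)
```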

- movie_characters_metadata.txt
   - contains information about each movie character
   - fields:
       - characterID
       - character name
       - movieID
       - movie title
       - gender ("?" for unlabeled cases)
       - position in credits ("?" for unlabeled cases)
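One way to handle the "?" placeholders is to map them to `None` while parsing. The sketch below assumes that the credits position is otherwise a plain integer; `parse_character_line` is a hypothetical helper, and the sample row is made up (only the IDs and the character name come from the samples in this post).

```python
SEPARATOR = " +++$+++ "


def parse_character_line(line):
    """Parse one row of movie_characters_metadata.txt into a dict."""
    character_id, name, movie_id, title, gender, position = (
        line.rstrip("\n").split(SEPARATOR)
    )
    return {
        "character_id": character_id,
        "name": name,
        "movie_id": movie_id,
        "movie_title": title,
        # "?" marks an unlabeled case, so store it as None.
        "gender": None if gender == "?" else gender,
        # Assumption: a labeled position is a plain non-negative integer.
        "credits_position": int(position) if position.isdigit() else None,
    }


# Made-up row that follows the documented field order.
example_row = "u0 +++$+++ BIANCA +++$+++ m0 +++$+++ example movie +++$+++ ? +++$+++ 4"
character = parse_character_line(example_row)
print(character)
```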

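The two files below are the key ones: one contains all the text, and the other describes how the pieces of text relate to each other.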

- movie_lines.txt
   - contains the actual text of each utterance
   - fields:
       - lineID
       - characterID (who uttered this phrase)
       - movieID
       - character name
       - text of the utterance

The first 5 samples:

L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!
L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!
L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.
L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?
L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.
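Rows like these can be indexed by lineID so that utterances are easy to look up later. The following is a sketch, not the corpus authors' tooling: `load_lines` is a hypothetical helper that accepts any iterable of rows, so it works on the five samples above or on an open file. Reading the real file with `encoding="iso-8859-1"` is an assumption based on common practice, not something stated in the README.

```python
SEPARATOR = " +++$+++ "


def load_lines(rows):
    """Build a {lineID: record} dict from rows of movie_lines.txt."""
    lines = {}
    for row in rows:
        # maxsplit=4 keeps the utterance text in one piece.
        parts = row.rstrip("\n").split(SEPARATOR, 4)
        if len(parts) != 5:
            continue  # skip rows that do not have all five fields
        line_id, character_id, movie_id, character_name, text = parts
        lines[line_id] = {
            "character_id": character_id,
            "movie_id": movie_id,
            "character_name": character_name,
            "text": text,
        }
    return lines


sample_rows = [
    "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!",
    "L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!",
    "L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.",
    "L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?",
    "L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.",
]
lines = load_lines(sample_rows)
print(lines["L1045"]["text"])

# With the real file (encoding is an assumption):
# with open("movie_lines.txt", encoding="iso-8859-1") as f:
#     lines = load_lines(f)
```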

- movie_conversations.txt
   - the structure of the conversations
   - fields:
       - characterID of the first character involved in the conversation
       - characterID of the second character involved in the conversation
       - movieID of the movie in which the conversation occurred
       - list of the utterances that make the conversation, in chronological
         order: ['lineID1','lineID2',...,'lineIDN']
         has to be matched with movie_lines.txt to reconstruct the actual content

The first 5 samples:

u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L204', 'L205', 'L206']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L207', 'L208']
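Matching these lineID lists against `movie_lines.txt` might look like the sketch below. `parse_conversation` and `reconstruct` are hypothetical names, and the utterance texts in `line_texts` are placeholders, because the actual text of L198 and L199 is not quoted in this post. The conversation row itself is one of the samples above.

```python
import ast

SEPARATOR = " +++$+++ "


def parse_conversation(row):
    """Split one movie_conversations.txt row into its four fields."""
    first_id, second_id, movie_id, line_ids = row.rstrip("\n").split(SEPARATOR)
    # The last field is written like a Python list literal of lineIDs.
    return first_id, second_id, movie_id, ast.literal_eval(line_ids)


def reconstruct(row, line_texts):
    """Return the utterance texts of a conversation in chronological order."""
    _, _, _, line_ids = parse_conversation(row)
    # Skip any lineID that is missing from the lookup table.
    return [line_texts[line_id] for line_id in line_ids if line_id in line_texts]


conversation_row = "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']"

# Placeholder texts; in practice this mapping is built from movie_lines.txt.
line_texts = {
    "L198": "<text of L198>",
    "L199": "<text of L199>",
}

dialogue = reconstruct(conversation_row, line_texts)
print(dialogue)
```

Consecutive items in the returned list can then be paired up, for example as (prompt, reply) pairs for a dialogue model.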

- raw_script_urls.txt
   - the URLs from which the raw sources were retrieved

Source:

http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
