An introduction to the Cornell Movie-Dialogs Corpus
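This public resource is cited by many open-source projects and papers related to natural language processing (NLP),
so I read the readme carefully and noted the key points here.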
All files use " +++$+++ " as the field separator.
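As a minimal sketch of the separator in use, a line can be split into its fields in Python as shown here. The names `SEP` and `split_fields` are illustrative and are not part of the corpus.

```python
# Field separator used by every file in the corpus (note the surrounding spaces).
SEP = " +++$+++ "

def split_fields(line: str) -> list[str]:
    """Split one corpus line into its fields (hypothetical helper)."""
    # Strip only the trailing newline so that field contents stay untouched.
    return line.rstrip("\n").split(SEP)

fields = split_fields("L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.\n")
print(fields)
```

Splitting on the full separator, spaces included, avoids having to trim each field afterwards.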
- movie_titles_metadata.txt
- contains title information for each movie
- fields:
- movieID,
- movie title,
- movie year,
- IMDB rating,
- no. IMDB votes,
- genres in the format ['genre1','genre2',...,'genreN']
- movie_characters_metadata.txt
- contains information about each movie character
- fields:
- characterID
- character name
- movieID
- movie title
- gender ("?" for unlabeled cases)
- position in credits ("?" for unlabeled cases)
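The field lists above can be turned into a small parser. The sketch below handles one line of movie_titles_metadata.txt and the "?" placeholder used in movie_characters_metadata.txt. The helper names and the sample line are made up for illustration. Treating the rating and the vote count as numbers, and keeping the year as a string, are assumptions rather than something the readme states.

```python
import ast

SEP = " +++$+++ "

def parse_title_line(line: str) -> dict:
    """Parse one line of movie_titles_metadata.txt (hypothetical helper)."""
    movie_id, title, year, rating, votes, genres = line.rstrip("\n").split(SEP)
    return {
        "movieID": movie_id,
        "title": title,
        # Kept as a string on the assumption that it may not be a plain integer.
        "year": year,
        # Assumed to be numeric fields.
        "rating": float(rating),
        "votes": int(votes),
        # The genres field is written like a Python list literal.
        "genres": ast.literal_eval(genres),
    }

def parse_optional(field: str):
    """Map the "?" placeholder used for unlabeled cases to None."""
    return None if field == "?" else field

# Made-up line that only follows the documented field layout.
sample = "m999 +++$+++ example movie +++$+++ 2000 +++$+++ 7.50 +++$+++ 1234 +++$+++ ['comedy', 'romance']"
movie = parse_title_line(sample)
print(movie["genres"])
print(parse_optional("?"))
```

`ast.literal_eval` is used instead of `eval` because it only accepts Python literals, which is all the genres field should contain.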
The two files below are the key ones: one holds all the text, and the other holds the relationships between the pieces of text.
- movie_lines.txt
- contains the actual text of each utterance
- fields:
- lineID
- characterID (who uttered this phrase)
- movieID
- character name
- text of the utterance
The first 5 samples:
L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!
L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!
L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.
L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?
L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.
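A rough sketch of loading lines in this format into a lookup table keyed by lineID, using the five samples above as input. The function name `load_lines` and the dictionary layout are illustrative choices, not something the corpus defines.

```python
SEP = " +++$+++ "

# The five sample lines quoted above.
sample_lines = [
    "L1045 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ They do not!",
    "L1044 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ They do to!",
    "L985 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ I hope so.",
    "L984 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ She okay?",
    "L925 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Let's go.",
]

def load_lines(raw_lines):
    """Map lineID -> record for lines in the movie_lines.txt format."""
    lines = {}
    for raw in raw_lines:
        # maxsplit=4 keeps any separator-like text inside the utterance intact.
        parts = raw.rstrip("\n").split(SEP, 4)
        if len(parts) != 5:
            continue  # skip malformed lines rather than guessing at their fields
        line_id, character_id, movie_id, name, text = parts
        lines[line_id] = {
            "characterID": character_id,
            "movieID": movie_id,
            "name": name,
            "text": text,
        }
    return lines

lines = load_lines(sample_lines)
print(lines["L984"]["text"])
```

When reading the real file, the same function can take an open file object. The right `encoding` argument for `open` is not stated in these notes and would need to be checked.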
- movie_conversations.txt
- the structure of the conversations
- fields:
- characterID of the first character involved in the conversation
- characterID of the second character involved in the conversation
- movieID of the movie in which the conversation occurred
- list of the utterances that make the conversation, in chronological
order: ['lineID1','lineID2',...,'lineIDN']
has to be matched with movie_lines.txt to reconstruct the actual content
The first 5 samples:
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L200', 'L201', 'L202', 'L203']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L204', 'L205', 'L206']
u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L207', 'L208']
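The matching step against movie_lines.txt might look like the sketch below. It parses two of the sample conversations above and resolves each lineID through a lookup table. The utterance texts here are placeholders, because the real text for these IDs is not quoted in these notes. Turning each conversation into consecutive (utterance, reply) pairs is one possible use of the result, not something the readme prescribes. All function names are illustrative.

```python
import ast

SEP = " +++$+++ "

# Two of the sample conversations quoted above.
sample_conversations = [
    "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196', 'L197']",
    "u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L198', 'L199']",
]

# Placeholder utterances: the real text for these IDs lives in movie_lines.txt.
placeholder_text = {f"L{n}": f"<text of L{n}>" for n in range(194, 200)}

def parse_conversation(raw):
    """Split one movie_conversations.txt line into its four fields."""
    first_id, second_id, movie_id, line_ids = raw.rstrip("\n").split(SEP)
    # The last field is written like a Python list literal of lineIDs.
    return first_id, second_id, movie_id, ast.literal_eval(line_ids)

def to_pairs(line_ids, id_to_text):
    """Turn one conversation into consecutive (utterance, reply) text pairs."""
    texts = [id_to_text[line_id] for line_id in line_ids]
    return list(zip(texts, texts[1:]))

pairs = []
for raw in sample_conversations:
    *_, line_ids = parse_conversation(raw)
    pairs.extend(to_pairs(line_ids, placeholder_text))

print(len(pairs))
print(pairs[0])
```

A conversation with N utterances yields N-1 pairs, and pairs never cross a conversation boundary because each conversation is processed on its own.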
- raw_script_urls.txt
- the URLs from which the raw sources were retrieved
Source:
http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html