1. 程式人生 > >Kaggle:Bag of Words Meets Bags of Popcorn 學習記錄

Kaggle:Bag of Words Meets Bags of Popcorn 學習記錄

Kaggle案例分析:Bag of Words Meets Bags of Popcorn-part1*

專案表述: 這是一個關於情感分析的話題。Google的Word2Vec(文字深度表示模型)是一個由深度學習驅動的方法。旨在獲取words內部的含義。Word2Vec試圖理解單詞之間的含義與語義關係, 它類似於recurrent neural nets(遞迴神經網路)或者深度神經網路, 但是計算效率更高.情感分析是機器學習領域的一個具有挑戰性的任務。
part1:基礎的自然語言處理。
part2:講述如何用Word2Vec來訓練一個模型以及如何使用得到的詞向量來做情感分析。本教程所採用的資料集為IMDB情感分析資料集,它包含了10萬條電影評論。
本文處理流程主要包含以下幾個模組:利用pd.read_csv讀取資料 –> 利用BeautifulSoup包去除評論中的HTML標籤 –> 用正則化re去除評論中的標點符號 –> 將評論中所有大寫字母換成小寫 –>

import pandas as pd
train  = pd.read_csv ('labeledTrainData.tsv',header= 0, delimiter= '\t',quoting= 3)
print(train.review[0])
"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."

讀取的資料一般都是.csv或者.tsv檔案,均可以使用pandas模組中的reas_csv()函式來進行讀取,此函式的第一個引數為檔名,其餘還有很多引數可供選擇
來實現不同的功能,header可以是一個list,列表裡的值指定了行數,這些行的資料將會被忽略,比如header=0表示資料的第一行就是屬性值,delimiter的含義是分隔符,指定檔案裡面的元素是用什麼分割的,這裡delimiter=”\t”代表這個檔案是使用製表符(Tab)來分隔的. 返回值train是一個DataFrame型別的資料, 呼叫train的shape屬性可以檢視資料大小(25000x3), 從下面程式可以看出, 訓練集中有25000條資料. DataFrame型別的每列資料用屬性名來標識, 可以通過列屬性名來提取資料。

從資料的review看出,評論中除了包含正常的單詞,還帶有HTML標籤
,單詞縮寫,各種標點符號。

print(train.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
id           25000 non-null object
sentiment    25000 non-null int64
review       25000 non-null object
dtypes: int64(1), object(2)
memory usage: 586.0+ KB
None
print(train.describe())
         sentiment
count  25000.00000
mean       0.50000
std        0.50001
min        0.00000
25%        0.00000
50%        0.50000
75%        1.00000
max        1.00000
train['id'][0:5]
0    "5814_8"
1    "2381_9"
2    "7759_3"
3    "3630_4"
4    "9495_8"
Name: id, dtype: object
train.shape
(25000, 3)
train.columns.values
array(['id', 'sentiment', 'review'], dtype=object)

資料的清洗和文字預處理
首先,使用BeautifulSoup包去除文字中的HTML標籤。

import pandas as pd
from bs4 import BeautifulSoup
train  = pd.read_csv ('labeledTrainData.tsv',header= 0, delimiter= '\t',quoting= 3)
example = BeautifulSoup(train.review[0],"lxml")
print(example.get_text () )
"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."
import re
letters_only = re.sub('[^a-zA-Z]',' ',example.get_text ())
print(letters_only )
 With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord  Why he wants MJ dead so bad is beyond me  Because MJ overheard his plans  Nah  Joe Pesci s character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno  maybe he just hates MJ s music Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence  Also  the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene Bottom line  this movie is for people who like MJ on one level or another  which i think is most people   If not  then stay away  It does try and give off a wholesome message and ironically MJ s bestest buddy in this movie is a girl  Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty  Well  with all the attention i ve gave this subject    hmmm well i don t know because people can be different behind closed doors  i know this for a fact  He is either an extremely nice but stupid guy or one of the most sickest liars  I hope he is not the latter  

返回的內容,所有的除了a_z,A-Z,空格之外的字元,比如數字,標點符號都被去除了。

lower_case=letters_only.lower()
words = lower_case .split()
print(words )
['with', 'all', 'this', 'stuff', 'going', 'down', 'at', 'the', 'moment', 'with', 'mj', 'i', 've', 'started', 'listening', 'to', 'his', 'music', 'watching', 'the', 'odd', 'documentary', 'here', 'and', 'there', 'watched', 'the', 'wiz', 'and', 'watched', 'moonwalker', 'again', 'maybe', 'i', 'just', 'want', 'to', 'get', 'a', 'certain', 'insight', 'into', 'this', 'guy', 'who', 'i', 'thought', 'was', 'really', 'cool', 'in', 'the', 'eighties', 'just', 'to', 'maybe', 'make', 'up', 'my', 'mind', 'whether', 'he', 'is', 'guilty', 'or', 'innocent', 'moonwalker', 'is', 'part', 'biography', 'part', 'feature', 'film', 'which', 'i', 'remember', 'going', 'to', 'see', 'at', 'the', 'cinema', 'when', 'it', 'was', 'originally', 'released', 'some', 'of', 'it', 'has', 'subtle', 'messages', 'about', 'mj', 's', 'feeling', 'towards', 'the', 'press', 'and', 'also', 'the', 'obvious', 'message', 'of', 'drugs', 'are', 'bad', 'm', 'kay', 'visually', 'impressive', 'but', 'of', 'course', 'this', 'is', 'all', 'about', 'michael', 'jackson', 'so', 'unless', 'you', 'remotely', 'like', 'mj', 'in', 'anyway', 'then', 'you', 'are', 'going', 'to', 'hate', 'this', 'and', 'find', 'it', 'boring', 'some', 'may', 'call', 'mj', 'an', 'egotist', 'for', 'consenting', 'to', 'the', 'making', 'of', 'this', 'movie', 'but', 'mj', 'and', 'most', 'of', 'his', 'fans', 'would', 'say', 'that', 'he', 'made', 'it', 'for', 'the', 'fans', 'which', 'if', 'true', 'is', 'really', 'nice', 'of', 'him', 'the', 'actual', 'feature', 'film', 'bit', 'when', 'it', 'finally', 'starts', 'is', 'only', 'on', 'for', 'minutes', 'or', 'so', 'excluding', 'the', 'smooth', 'criminal', 'sequence', 'and', 'joe', 'pesci', 'is', 'convincing', 'as', 'a', 'psychopathic', 'all', 'powerful', 'drug', 'lord', 'why', 'he', 'wants', 'mj', 'dead', 'so', 'bad', 'is', 'beyond', 'me', 'because', 'mj', 'overheard', 'his', 'plans', 'nah', 'joe', 'pesci', 's', 'character', 'ranted', 'that', 'he', 'wanted', 'people', 'to', 'know', 'it', 'is', 'he', 'who', 'is', 'supplying', 'drugs', 'etc', 'so', 'i', 'dunno', 'maybe', 'he', 'just', 'hates', 'mj', 's', 'music', 'lots', 'of', 'cool', 'things', 'in', 'this', 'like', 'mj', 'turning', 'into', 'a', 'car', 'and', 'a', 'robot', 'and', 'the', 'whole', 'speed', 'demon', 'sequence', 'also', 'the', 'director', 'must', 'have', 'had', 'the', 'patience', 'of', 'a', 'saint', 'when', 'it', 'came', 'to', 'filming', 'the', 'kiddy', 'bad', 'sequence', 'as', 'usually', 'directors', 'hate', 'working', 'with', 'one', 'kid', 'let', 'alone', 'a', 'whole', 'bunch', 'of', 'them', 'performing', 'a', 'complex', 'dance', 'scene', 'bottom', 'line', 'this', 'movie', 'is', 'for', 'people', 'who', 'like', 'mj', 'on', 'one', 'level', 'or', 'another', 'which', 'i', 'think', 'is', 'most', 'people', 'if', 'not', 'then', 'stay', 'away', 'it', 'does', 'try', 'and', 'give', 'off', 'a', 'wholesome', 'message', 'and', 'ironically', 'mj', 's', 'bestest', 'buddy', 'in', 'this', 'movie', 'is', 'a', 'girl', 'michael', 'jackson', 'is', 'truly', 'one', 'of', 'the', 'most', 'talented', 'people', 'ever', 'to', 'grace', 'this', 'planet', 'but', 'is', 'he', 'guilty', 'well', 'with', 'all', 'the', 'attention', 'i', 've', 'gave', 'this', 'subject', 'hmmm', 'well', 'i', 'don', 't', 'know', 'because', 'people', 'can', 'be', 'different', 'behind', 'closed', 'doors', 'i', 'know', 'this', 'for', 'a', 'fact', 'he', 'is', 'either', 'an', 'extremely', 'nice', 'but', 'stupid', 'guy', 'or', 'one', 'of', 'the', 'most', 'sickest', 'liars', 'i', 'hope', 'he', 'is', 'not', 'the', 'latter']

最後, 我們需要考慮如何處理那些出現頻率高,但是卻沒有多大意義的單詞, 如a, and, the, is等. 這類單詞稱之為”stop words”,儘管我們說stop words是一種語言中最常見到的單詞,但是卻沒有任何一個統一的stop words列表被所有的自然語言處理工具所使用,有時候,一個工具甚至會使用多個stop words列表.NLTK包(Natural Language Toolkit)裡面包含了stop words的列表,安裝好nltk後,要用.download()來安裝資料包,執行命令後,出現的介面如下所示,可能會下載較長的時間.

import nltk
nltk.download()#會下載資料包,此處已經下載好了。
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml





True
from nltk.corpus import stopwords
print(stopwords.words('english'))
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']

停用詞是指在資訊檢索中,為節省儲存空間和提高搜尋效率,在處理自然語言資料(或文字)之前或之後會自動過濾掉某些字或詞,這些字或詞即被稱為Stop Words(停用詞)。這些停用詞都是人工輸入、非自動化生成的,生成後的停用詞會形成一個停用詞表。但是,並沒有一個明確的停用詞表能夠適用於所有的工具。甚至有一些工具是明確地避免使用停用詞來支援短語搜尋的。這些停用詞將會從我們的電影評論中刪除。

words=[w for w in words if not w in stopwords.words("english")]
print(words)
['stuff', 'going', 'moment', 'mj', 'started', 'listening', 'music', 'watching', 'odd', 'documentary', 'watched', 'wiz', 'watched', 'moonwalker', 'maybe', 'want', 'get', 'certain', 'insight', 'guy', 'thought', 'really', 'cool', 'eighties', 'maybe', 'make', 'mind', 'whether', 'guilty', 'innocent', 'moonwalker', 'part', 'biography', 'part', 'feature', 'film', 'remember', 'going', 'see', 'cinema', 'originally', 'released', 'subtle', 'messages', 'mj', 'feeling', 'towards', 'press', 'also', 'obvious', 'message', 'drugs', 'bad', 'kay', 'visually', 'impressive', 'course', 'michael', 'jackson', 'unless', 'remotely', 'like', 'mj', 'anyway', 'going', 'hate', 'find', 'boring', 'may', 'call', 'mj', 'egotist', 'consenting', 'making', 'movie', 'mj', 'fans', 'would', 'say', 'made', 'fans', 'true', 'really', 'nice', 'actual', 'feature', 'film', 'bit', 'finally', 'starts', 'minutes', 'excluding', 'smooth', 'criminal', 'sequence', 'joe', 'pesci', 'convincing', 'psychopathic', 'powerful', 'drug', 'lord', 'wants', 'mj', 'dead', 'bad', 'beyond', 'mj', 'overheard', 'plans', 'nah', 'joe', 'pesci', 'character', 'ranted', 'wanted', 'people', 'know', 'supplying', 'drugs', 'etc', 'dunno', 'maybe', 'hates', 'mj', 'music', 'lots', 'cool', 'things', 'like', 'mj', 'turning', 'car', 'robot', 'whole', 'speed', 'demon', 'sequence', 'also', 'director', 'must', 'patience', 'saint', 'came', 'filming', 'kiddy', 'bad', 'sequence', 'usually', 'directors', 'hate', 'working', 'one', 'kid', 'let', 'alone', 'whole', 'bunch', 'performing', 'complex', 'dance', 'scene', 'bottom', 'line', 'movie', 'people', 'like', 'mj', 'one', 'level', 'another', 'think', 'people', 'stay', 'away', 'try', 'give', 'wholesome', 'message', 'ironically', 'mj', 'bestest', 'buddy', 'movie', 'girl', 'michael', 'jackson', 'truly', 'one', 'talented', 'people', 'ever', 'grace', 'planet', 'guilty', 'well', 'attention', 'gave', 'subject', 'hmmm', 'well', 'know', 'people', 'different', 'behind', 'closed', 'doors', 'know', 'fact', 'either', 'extremely', 'nice', 'stupid', 'guy', 'one', 'sickest', 'liars', 'hope', 'latter']

現在我們就來處理所有的電影評論資料,25000條訓練資料。

def review_to_words(raw_review):
    #這個函式的功能就是將原始的資料經過預處理變成一系列的詞。
    #輸入是原始的資料(一條電影評論)。
    #輸出是一系列的詞(經過預處理的評論)
    review_text = BeautifulSoup(raw_review,'lxml').get_text()
    letters_only = re.sub('[^a-zA-Z]',' ',review_text)
    words=letters_only.lower().split()
    stops=set(stopwords.words('english'))
    meaningful_words = [w for w in words if not w in stops]
    return (" ".join(meaningful_words))
print(clearn_review)
stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate working one kid let alone whole bunch performing complex dance scene bottom line movie people like mj one level another think people stay away try give wholesome message ironically mj bestest buddy movie girl michael jackson truly one talented people ever grace planet guilty well attention gave subject hmmm well know people different behind closed doors know fact either extremely nice stupid guy one sickest liars hope latter

在上述函式中,我們把停用詞轉化為set集合,目的是為了提高程式碼的執行速度,因為我們在後面的程式碼中會呼叫這個函式數萬次,它需要擁有很快的執行速度,

clearn_review = review_to_words(train['review'][0])
print(clearn_review)
stuff going moment mj started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mj feeling towards press also obvious message drugs bad kay visually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice actual feature film bit finally starts minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pesci character ranted wanted people know supplying drugs etc dunno maybe hates mj music lots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually directors hate working one kid let alone whole bunch performing complex dance scene bottom line movie people like mj one level another think people stay away try give wholesome message ironically mj bestest buddy movie girl michael jackson truly one talented people ever grace planet guilty well attention gave subject hmmm well know people different behind closed doors know fact either extremely nice stupid guy one sickest liars hope latter
num_reviews = train['review'].size
print(num_reviews)
25000
clean_train_reviews = []
for i in range(0, num_reviews):
    clean_train_reviews.append(review_to_words(train['review'][i]))

此段程式碼可能執行的時間比較長,程式設計者又不知道程式執行到什麼程度,我們可以更改程式碼,讓python處理過程中每1000評論就會列印狀態。

clean_train_reviews = []
for i in range(0, num_reviews):
    if ((i+1)%1000==0):
        print('Review %d of %d \n' % (i+1,num_reviews))
    clean_train_reviews.append(review_to_words(train['review'][i]))
Review 1000 of 25000 

Review 2000 of 25000 

Review 3000 of 25000 

Review 4000 of 25000 

Review 5000 of 25000 

Review 6000 of 25000 

Review 7000 of 25000 

Review 8000 of 25000 

Review 9000 of 25000 

Review 10000 of 25000 

Review 11000 of 25000 

Review 12000 of 25000 

Review 13000 of 25000 

Review 14000 of 25000 

Review 15000 of 25000 

Review 16000 of 25000 

Review 17000 of 25000 

Review 18000 of 25000 

Review 19000 of 25000 

Review 20000 of 25000 

Review 21000 of 25000 

Review 22000 of 25000 

Review 23000 of 25000 

Review 24000 of 25000 

Review 25000 of 25000 

利用scikit-learn第三方機器學習的包進行特徵提取:
現在我們需要對評論資料做一下處理,如何將文字轉化為機器學習中的數字表示,一種常見的方法叫做詞袋,詞彙袋是從所有的文件中學習詞彙,然後通過計算每個單詞出現的次數來對每個文件建模,例如:
Sentence 1: “The cat sat on the hat”
Sentence 2: “The dog ate the cat and the hat”
從兩句話來看,我們的彙總詞彙為:
{ the, cat, sat, on, hat, dog, ate, and }
那麼句子表示為:Sentence 1: { 2, 1, 1, 1, 1, 0, 0, 0 },2表示’the’在句1中出現2次,1表示’cat’,’sat’,’on’,’hat’在句1中均出現1次,0表示’dog’,’ate’,’and’在句1中沒有出現。
同理Sentence 2 are: { 3, 1, 0, 0, 1, 1, 1, 1}

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer = “word”, \
tokenizer = None, \
preprocessor = None, \
stop_words = None, \
max_features = 5000)
train_data_features = vectorizer.fit_transform(clean_train_reviews)
print(train_data_features.dtype)
train_data_features = train_data_features.toarray()
print(train_data_features.dtype)
print(train_data_features.shape)

vocab = vectorizer.get_feature_names()
print(vocab)
['abandoned', 'abc', 'abilities', 'ability', 'able', 'abraham', 'absence', 'absent', 'absolute', 'absolutely', 'absurd', 'abuse', 'abusive', 'abysmal', 'academy', 'accent', 'accents', 'accept', 'acceptable', 'accepted', 'access', 'accident', 'accidentally', 'accompanied', 'accomplished', 'according', 'account', 'accuracy', 'accurate', 'accused', 'achieve', 'achieved', 'achievement', 'acid', 'across', 'act', 'acted', 'acting', 'action', 'actions', 'activities', 'actor', 'actors', 'actress', 'actresses', 'acts', 'actual', 'actually', 'ad', 'adam', 'adams', 'adaptation', 'adaptations', 'adapted', 'add', 'added', 'adding', 'addition', 'adds', 'adequate', 'admire', 'admit', 'admittedly', 'adorable', 'adult', 'adults', 'advance', 'advanced', 'advantage', 'adventure', 'adventures', 'advertising', 'advice', 'advise', 'affair', 'affect', 'affected', 'afford', 'aforementioned', 'afraid', 'africa', 'african', 'afternoon', 'afterwards', 'age', 'aged', 'agent', 'agents', 'ages', 'aging', 'ago', 'agree', 'agreed', 'agrees', 'ah', 'ahead', 'aid', 'aids', 'aim', 'aimed', 'air', 'aired', 'airplane', 'airport', 'aka', 'akshay', 'al', 'alan', 'alas', 'albeit', 'albert', 'album', 'alcohol', 'alcoholic', 'alec', 'alert', 'alex', 'alexander', 'alfred', 'alice', 'alicia', 'alien', 'aliens', 'alike', 'alison', 'alive', 'allen', 'allow', 'allowed', 'allowing', 'allows', 'almost', 'alone', 'along', 'alongside', 'already', 'alright', 'also', 'alternate', 'alternative', 'although', 'altman', 'altogether', 'always', 'amanda', 'amateur', 'amateurish', 'amazed', 'amazing', 'amazingly', 'ambiguous', 'ambitious', 'america', 'american', 'americans', 'amitabh', 'among', 'amongst', 'amount', 'amounts', 'amused', 'amusing', 'amy', 'analysis', 'ancient', 'anderson', 'andre', 'andrew', 'andrews', 'andy', 'angel', 'angela', 'angeles', 'angels', 'anger', 'angle', 'angles', 'angry', 'animal', 'animals', 'animated', 'animation', 'anime', 'ann', 'anna', 'anne', 'annie', 'annoyed', 'annoying', 'another', 'answer', 'answers', 'anthony', 'anti', 'antics', 'antonioni', 'antwone', 'anybody', 'anymore', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apart', 'apartment', 'ape', 'apes', 'appalling', 'apparent', 'apparently', 'appeal', 'appealing', 'appear', 'appearance', 'appearances', 'appeared', 'appearing', 'appears', 'appreciate', 'appreciated', 'appreciation', 'approach', 'appropriate', 'april', 'area', 'areas', 'arguably', 'argue', 'argument', 'arm', 'armed', 'arms', 'army', 'arnold', 'around', 'arrested', 'arrival', 'arrive', 'arrived', 'arrives', 'arrogant', 'art', 'arthur', 'artificial', 'artist', 'artistic', 'artists', 'arts', 'ashamed', 'ashley', 'asian', 'aside', 'ask', 'asked', 'asking', 'asks', 'asleep', 'aspect', 'aspects', 'ass', 'assassin', 'assault', 'assigned', 'assistant', 'associated', 'assume', 'assumed', 'astaire', 'astonishing', 'atlantis', 'atmosphere', 'atmospheric', 'atrocious', 'attached', 'attack', 'attacked', 'attacks', 'attempt', 'attempted', 'attempting', 'attempts', 'attend', 'attention', 'attitude', 'attitudes', 'attorney', 'attracted', 'attraction', 'attractive', 'audience', 'audiences', 'audio', 'aunt', 'austen', 'austin', 'australia', 'australian', 'authentic', 'author', 'authority', 'available', 'average', 'avoid', 'avoided', 'awake', 'award', 'awards', 'aware', 'away', 'awe', 'awesome', 'awful', 'awfully', 'awhile', 'awkward', 'babe', 'baby', 'bacall', 'back', 'backdrop', 'background', 'backgrounds', 'bad', 'badly', 'bag', 'baker', 'bakshi', 'balance', 'baldwin', 'ball', 'ballet', 'balls', 'band', 'bands', 'bang', 'bank', 'banned', 'bar', 'barbara', 'bare', 'barely', 'bargain', 'barry', 'barrymore', 'base', 'baseball', 'based', 'basement', 'basic', 'basically', 'basis', 'basketball', 'bat', 'bath', 'bathroom', 'batman', 'battle', 'battles', 'bay', 'bbc', 'beach', 'bear', 'bears', 'beast', 'beat', 'beaten', 'beating', 'beats', 'beatty', 'beautiful', 'beautifully', 'beauty', 'became', 'become', 'becomes', 'becoming', 'bed', 'bedroom', 'beer', 'began', 'begin', 'beginning', 'begins', 'behave', 'behavior', 'behind', 'beings', 'bela', 'belief', 'beliefs', 'believable', 'believe', 'believed', 'believes', 'believing', 'bell', 'belong', 'belongs', 'beloved', 'belushi', 'ben', 'beneath', 'benefit', 'bergman', 'berlin', 'besides', 'best', 'bet', 'bette', 'better', 'bettie', 'betty', 'beyond', 'bible', 'big', 'bigger', 'biggest', 'biko', 'bill', 'billed', 'billy', 'bin', 'biography', 'bird', 'birds', 'birth', 'birthday', 'bit', 'bite', 'bits', 'bitter', 'bizarre', 'black', 'blade', 'blah', 'blair', 'blake', 'blame', 'bland', 'blank', 'blast', 'blatant', 'bleak', 'blend', 'blew', 'blind', 'blob', 'block', 'blockbuster', 'blond', 'blonde', 'blood', 'bloody', 'blow', 'blowing', 'blown', 'blows', 'blue', 'blues', 'blunt', 'bo', 'board', 'boat', 'bob', 'bobby', 'bodies', 'body', 'bold', 'boll', 'bollywood', 'bomb', 'bond', 'bone', 'bonus', 'book', 'books', 'boom', 'boot', 'border', 'bore', 'bored', 'boredom', 'boring', 'born', 'borrowed', 'boss', 'bother', 'bothered', 'bottle', 'bottom', 'bought', 'bound', 'bourne', 'box', 'boxing', 'boy', 'boyfriend', 'boyle', 'boys', 'brad', 'brady', 'brain', 'brains', 'branagh', 'brand', 'brando', 'brave', 'brazil', 'break', 'breaking', 'breaks', 'breasts', 'breath', 'breathtaking', 'brenda', 'brian', 'bride', 'bridge', 'brief', 'briefly', 'bright', 'brilliance', 'brilliant', 'brilliantly', 'bring', 'bringing', 'brings', 'britain', 'british', 'broad', 'broadcast', 'broadway', 'broke', 'broken', 'brooklyn', 'brooks', 'brosnan', 'brother', 'brothers', 'brought', 'brown', 'bruce', 'brutal', 'brutality', 'brutally', 'buck', 'bucks', 'bud', 'buddies', 'buddy', 'budget', 'buff', 'buffalo', 'buffs', 'bug', 'bugs', 'build', 'building', 'buildings', 'builds', 'built', 'bull', 'bullet', 'bullets', 'bumbling', 'bunch', 'bunny', 'buried', 'burn', 'burned', 'burning', 'burns', 'burt', 'burton', 'bus', 'bush', 'business', 'businessman', 'buster', 'busy', 'butler', 'butt', 'button', 'buy', 'buying', 'cabin', 'cable', 'cage', 'cagney', 'caine', 'cake', 'caliber', 'california', 'call', 'called', 'calling', 'calls', 'calm', 'came', 'cameo', 'cameos', 'camera', 'cameras', 'cameron', 'camp', 'campbell', 'campy', 'canada', 'canadian', 'candy', 'cannibal', 'cannot', 'cant', 'capable', 'capital', 'captain', 'captivating', 'capture', 'captured', 'captures', 'capturing', 'car', 'card', 'cardboard', 'cards', 'care', 'cared', 'career', 'careers', 'careful', 'carefully', 'carell', 'cares', 'caring', 'carl', 'carla', 'carol', 'carpenter', 'carradine', 'carrey', 'carrie', 'carried', 'carries', 'carry', 'carrying', 'cars', 'carter', 'cartoon', 'cartoons', 'cary', 'case', 'cases', 'cash', 'cassidy', 'cast', 'casting', 'castle', 'cat', 'catch', 'catches', 'catching', 'catchy', 'category', 'catherine', 'catholic', 'cats', 'caught', 'cause', 'caused', 'causes', 'causing', 'cave', 'cd', 'celebrity', 'cell', 'celluloid', 'center', 'centered', 'centers', 'central', 'century', 'certain', 'certainly', 'cg', 'cgi', 'chain', 'chair', 'challenge', 'challenging', 'championship', 'chan', 'chance', 'chances', 'change', 'changed', 'changes', 'changing', 'channel', 'channels', 'chaos', 'chaplin', 'chapter', 'character', 'characterization', 'characters', 'charge', 'charisma', 'charismatic', 'charles', 'charlie', 'charlotte', 'charm', 'charming', 'chase', 'chased', 'chases', 'chasing', 'che', 'cheap', 'cheated', 'cheating', 'check', 'checked', 'checking', 'cheek', 'cheese', 'cheesy', 'chemistry', 'chess', 'chest', 'chicago', 'chick', 'chicken', 'chicks', 'chief', 'child', 'childhood', 'childish', 'children', 'chilling', 'china', 'chinese', 'choice', 'choices', 'choose', 'chooses', 'choreographed', 'choreography', 'chorus', 'chose', 'chosen', 'chris', 'christ', 'christian', 'christianity', 'christians', 'christmas', 'christopher', 'christy', 'chuck', 'church', 'cia', 'cinderella', 'cinema', 'cinematic', 'cinematographer', 'cinematography', 'circle', 'circumstances', 'cities', 'citizen', 'city', 'civil', 'civilization', 'claim', 'claimed', 'claims', 'claire', 'clark', 'class', 'classes', 'classic', 'classical', 'classics', 'claus', 'clean', 'clear', 'clearly', 'clever', 'cleverly', 'clich', 'cliche', 'cliff', 'climactic', 'climax', 'clint', 'clip', 'clips', 'clock', 'close', 'closed', 'closely', 'closer', 'closest', 'closet', 'closing', 'clothes', 'clothing', 'clown', 'club', 'clue', 'clues', 'clumsy', 'co', 'coach', 'coast', 'code', 'coffee', 'coherent', 'cold', 'cole', 'collection', 'college', 'colonel', 'color', 'colorful', 'colors', 'colour', 'columbo', 'com', 'combat', 'combination', 'combine', 'combined', 'come', 'comedian', 'comedic', 'comedies', 'comedy', 'comes', 'comfort', 'comfortable', 'comic', 'comical', 'comics', 'coming', 'command', 'comment', 'commentary', 'commented', 'comments', 'commercial', 'commercials', 'commit', 'committed', 'common', 'communist', 'community', 'companies', 'companion', 'company', 'compare', 'compared', 'comparing', 'comparison', 'compassion', 'compelled', 'compelling', 'competent', 'competition', 'complain', 'complaint', 'complete', 'completely', 'complex', 'complexity', 'complicated', 'composed', 'composer', 'computer', 'con', 'conceived', 'concept', 'concern', 'concerned', 'concerning', 'concerns', 'concert', 'conclusion', 'condition', 'conditions', 'confess', 'confidence', 'conflict', 'conflicts', 'confrontation', 'confused', 'confusing', 'confusion', 'connect', 'connected', 'connection', 'connery', 'conscious', 'consequences', 'conservative', 'consider', 'considerable', 'considered', 'considering', 'consistent', 'consistently', 'consists', 'conspiracy', 'constant', 'constantly', 'constructed', 'construction', 'contact', 'contain', 'contained', 'contains', 'contemporary', 'content', 'contest', 'context', 'continue', 'continued', 'continues', 'continuity', 'contract', 'contrary', 'contrast', 'contrived', 'control', 'controversial', 'conventional', 'conversation', 'conversations', 'convey', 'convince', 'convinced', 'convincing', 'convincingly', 'convoluted', 'cook', 'cool', 'cooper', 'cop', 'copies', 'cops', 'copy', 'core', 'corner', 'corny', 'corporate', 'corpse', 'correct', 'correctly', 'corrupt', 'corruption', 'cost', 'costs', 'costume', 'costumes', 'could', 'count', 'counter', 'countless', 'countries', 'country', 'countryside', 'couple', 'couples', 'courage', 'course', 'court', 'cousin', 'cover', 'covered', 'covers', 'cowboy', 'cox', 'crack', 'cracking', 'craft', 'crafted', 'craig', 'crap', 'crappy', 'crash', 'craven', 'crawford', 'crazed', 'crazy', 'cream', 'create', 'created', 'creates', 'creating', 'creation', 'creative', 'creativity', 'creator', 'creators', 'creature', 'creatures', 'credibility', 'credible', 'credit', 'credits', 'creep', 'creepy', 'crew', 'cried', 'crime', 'crimes', 'criminal', 'criminals', 'cringe', 'crisis', 'critic', 'critical', 'criticism', 'critics', 'crocodile', 'cross', 'crowd', 'crucial', 'crude', 'cruel', 'cruise', 'crush', 'cry', 'crying', 'crystal', 'cuba', 'cube', 'cult', 'cultural', 'culture', 'cup', 'cure', 'curiosity', 'curious', 'current', 'currently', 'curse', 'curtis', 'cusack', 'cut', 'cute', 'cuts', 'cutting', 'cynical', 'da', 'dad', 'daddy', 'daily', 'dalton', 'damage', 'damme', 'damn', 'damon', 'dan', 'dana', 'dance', 'dancer', 'dancers', 'dances', 'dancing', 'danes', 'danger', 'dangerous', 'daniel', 'danny', 'dare', 'daring', 'dark', 'darker', 'darkness', 'darren', 'date', 'dated', 'dating', 'daughter', 'daughters', 'dave', 'david', 'davies', 'davis', 'dawn', 'dawson', 'day', 'days', 'de', 'dead', 'deadly', 'deaf', 'deal', 'dealing', 'deals', 'dealt', 'dean', 'dear', 'death', 'deaths', 'debut', 'decade', 'decades', 'deceased', 'decent', 'decide', 'decided', 'decides', 'decision', 'decisions', 'dedicated', 'dee', 'deep', 'deeper', 'deeply', 'defeat', 'defend', 'defense', 'defined', 'definite', 'definitely', 'definition', 'degree', 'del', 'deliberately', 'delight', 'delightful', 'deliver', 'delivered', 'delivering', 'delivers', 'delivery', 'demand', 'demands', 'demented', 'demise', 'demon', 'demons', 'deniro', 'dennis', 'dentist', 'denzel', 'department', 'depicted', 'depicting', 'depiction', 'depicts', 'depressed', 'depressing', 'depression', 'depth', 'der', 'derek', 'descent', 'describe', 'described', 'describes', 'description', 'desert', 'deserve', 'deserved', 'deserves', 'design', 'designed', 'designs', 'desire', 'desired', 'despair', 'desperate', 'desperately', 'desperation', 'despite', 'destiny', 'destroy', 'destroyed', 'destroying', 'destruction', 'detail', 'detailed', 'details', 'detective', 'determined', 'develop', 'developed', 'developing', 'development', 'develops', 'device', 'devil', 'devoid', 'devoted', 'dialog', 'dialogs', 'dialogue', 'dialogues', 'diamond', 'diana', 'diane', 'dick', 'dickens', 'die', 'died', 'dies', 'difference', 'differences', 'different', 'difficult', 'dig', 'digital', 'dignity', 'dimension', 'dimensional', 'din', 'dinner', 'dinosaur', 'dinosaurs', 'dire', 'direct', 'directed', 'directing', 'direction', 'directions', 'directly', 'director', 'directorial', 'directors', 'directs', 'dirty', 'disagree', 'disappear', 'disappeared', 'disappoint', 'disappointed', 'disappointing', 'disappointment', 'disaster', 'disbelief', 'disc', 'discover', 'discovered', 'discovers', 'discovery', 'discuss', 'discussion', 'disease', 'disgusting', 'disjointed', 'dislike', 'disliked', 'disney', 'display', 'displayed', 'displays', 'distance', 'distant', 'distinct', 'distracting', 'distribution', 'disturbed', 'disturbing', 'divorce', 'dixon', 'doc', 'doctor', 'documentaries', 'documentary', 'dog', 'dogs', 'doll', 'dollar', 'dollars', 'dolls', 'dolph', 'domestic', 'domino', 'donald', 'done', 'donna', 'doo', 'doom', 'doomed', 'door', 'doors', 'dorothy', 'double', 'doubt', 'doubts', 'douglas', 'downey', 'downhill', 'downright', 'dozen', 'dozens', 'dr', 'dracula', 'drag', 'dragged', 'dragon', 'drags', 'drake', 'drama', 'dramas', 'dramatic', 'draw', 'drawing', 'drawn', 'draws', 'dreadful', 'dream', 'dreams', 'dreary', 'dreck', 'dress', 'dressed', 'dressing', 'drew', 'drink', 'drinking', 'drive', 'drivel', 'driven', 'driver', 'drives', 'driving', 'drop', 'dropped', 'dropping', 'drops', 'drug', 'drugs', 'drunk', 'drunken', 'dry', 'dub', 'dubbed', 'dubbing', 'dud', 'dude', 'due', 'duke', 'dull', 'dumb', 'duo', 'dust', 'dutch', 'duty', 'dvd', 'dying', 'dynamic', 'eager', 'ear', 'earl', 'earlier', 'early', 'earned', 'ears', 'earth', 'ease', 'easier', 'easily', 'east', 'eastern', 'eastwood', 'easy', 'eat', 'eaten', 'eating', 'eccentric', 'ed', 'eddie', 'edgar', 'edge', 'edgy', 'edie', 'edited', 'editing', 'edition', 'editor', 'education', 'educational', 'edward', 'eerie', 'effect', 'effective', 'effectively', 'effects', 'effort', 'efforts', 'ego', 'eight', 'eighties', 'either', 'elaborate', 'elderly', 'elegant', 'element', 'elements', 'elephant', 'elizabeth', 'ellen', 'elm', 'else', 'elsewhere', 'elvira', 'elvis', 'em', 'embarrassed', 'embarrassing', 'embarrassment', 'emily', 'emma', 'emotion', 'emotional', 'emotionally', 'emotions', 'empathy', 'emperor', 'emphasis', 'empire', 'empty', 'en', 'encounter', 'encounters', 'end', 'endearing', 'ended', 'ending', 'endings', 'endless', 'ends', 'endure', 'enemies', 'enemy', 'energy', 'engage', 'engaged', 'engaging', 'england', 'english', 'enjoy', 'enjoyable', 'enjoyed', 'enjoying', 'enjoyment', 'enjoys', 'enormous', 'enough', 'ensemble', 'ensues', 'enter', 'enterprise', 'enters', 'entertain', 'entertained', 'entertaining', 'entertainment', 'enthusiasm', 'entire', 'entirely', 'entry', 'environment', 'epic', 'episode', 'episodes', 'equal', 'equally', 'equipment', 'equivalent', 'er', 'era', 'eric', 'erotic', 'errors', 'escape', 'escaped', 'escapes', 'especially', 'essence', 'essential', 'essentially', 'established', 'estate', 'et', 'etc', 'ethan', 'eugene', 'europe', 'european', 'eva', 'eve', 'even', 'evening', 'event', 'events', 'eventually', 'ever', 'every', 'everybody', 'everyday', 'everyone', 'everything', 'everywhere', 'evidence', 'evident', 'evil', 'ex', 'exact', 'exactly', 'exaggerated', 'examination', 'example', 'examples', 'excellent', 'except', 'exception', 'exceptional', 'exceptionally', 'excessive', 'exchange', 'excited', 'excitement', 'exciting', 'excuse', 'executed', 'execution', 'executive', 'exercise', 'exist', 'existed', 'existence', 'existent', 'exists', 'exotic', 'expect', 'expectations', 'expected', 'expecting', 'expedition', 'expensive', 'experience', 'experienced', 'experiences', 'experiment', 'experimental', 'experiments', 'expert', 'explain', 'explained', 'explaining', 'explains', 'explanation', 'explicit', 'exploitation', 'exploration', 'explore', 'explored', 'explosion', 'explosions', 'exposed', 'exposure', 'express', 'expressed', 'expression', 'expressions', 'extended', 'extent', 'extra', 'extraordinary', 'extras', 'extreme', 'extremely', 'eye', 'eyed', 'eyes', 'eyre', 'fabulous', 'face', 'faced', 'faces', 'facial', 'facing', 'fact', 'factor', 'factory', 'facts', 'fail', 'failed', 'failing', 'fails', 'failure', 'fair', 'fairly', 'fairy', 'faith', 'faithful', 'fake', 'falk', 'fall', 'fallen', 'falling', 'falls', 'false', 'fame', 'familiar', 'families', 'family', 'famous', 'fan', 'fancy', 'fans', 'fantastic', 'fantasy', 'far', 'farce', 'fare', 'farm', 'farrell', 'fascinated', 'fascinating', 'fashion', 'fashioned', 'fast', 'faster', 'fat', 'fatal', 'fate', 'father', 'fault', 'faults', 'favor', 'favorite', 'favorites', 'favourite', 'fay', 'fbi', 'fear', 'fears', 'feature', 'featured', 'features', 'featuring', 'fed', 'feed', 'feel', 'feeling', 'feelings', 'feels', 'feet', 'felix', 'fell', 'fellow', 'felt', 'female', 'feminist', 'femme', 'fest', 'festival', 'fetched', 'fever', 'fi', 'fianc', 'fiction', 'fictional', 'fido', 'field', 'fields', 'fifteen', 'fight', 'fighter', 'fighting', 'fights', 'figure', 'figured', 'figures', 'files', 'fill', 'filled', 'film', 'filmed', 'filming', 'filmmaker', 'filmmakers', 'films', 'final', 'finale', 'finally', 'financial', 'find', 'finding', 'finds', 'fine', 'finest', 'finger', 'finish', 'finished', 'fire', 'fired', 'firm', 'first', 'firstly', 'fish', 'fisher', 'fishing', 'fit', 'fits', 'fitting', 'five', 'fix', 'flair', 'flash', 'flashback', 'flashbacks', 'flat', 'flaw', 'flawed', 'flawless', 'flaws', 'flesh', 'flick', 'flicks', 'flies', 'flight', 'floating', 'floor', 'flop', 'florida', 'flow', 'fly', 'flying', 'flynn', 'focus', 'focused', 'focuses', 'focusing', 'folk', 'folks', 'follow', 'followed', 'following', 'follows', 'fond', 'fonda', 'food', 'fool', 'fooled', 'foot', 'footage', 'football', 'forbidden', 'force', 'forced', 'forces', 'ford', 'foreign', 'forest', 'forever', 'forget', 'forgettable', 'forgive', 'forgot', 'forgotten', 'form', 'format', 'former', 'forms', 'formula', 'formulaic', 'forth', 'fortunately', 'fortune', 'forty', 'forward', 'foster', 'fought', 'foul', 'found', 'four', 'fourth', 'fox', 'frame', 'france', 'franchise', 'francis', 'francisco', 'franco', 'frank', 'frankenstein', 'frankie', 'frankly', 'freak', 'fred', 'freddy', 'free', 'freedom', 'freeman', 'french', 'frequent', 'frequently', 'fresh', 'friday', 'friend', 'friendly', 'friends', 'friendship', 'frightening', 'front', 'frustrated', 'frustrating', 'frustration', 'fu', 'fulci', 'full', 'fuller', 'fully', 'fun', 'funeral', 'funnier', 'funniest', 'funny', 'furious', 'furthermore', 'fury', 'future', 'futuristic', 'fx', 'gabriel', 'gadget', 'gag', 'gags', 'gain', 'game', 'games', 'gandhi', 'gang', 'gangster', 'gangsters', 'garbage', 'garbo', 'garden', 'gary', 'gas', 'gave', 'gay', 'gem', 'gender', 'gene', 'general', 'generally', 'generated', 'generation', 'generations', 'generic', 'generous', 'genius', 'genre', 'genres', 'gentle', 'gentleman', 'genuine', 'genuinely', 'george', 'gerard', 'german', 'germans', 'germany', 'get', 'gets', 'getting', 'ghost', 'ghosts', 'giallo', 'giant', 'gift', 'gifted', 'ginger', 'girl', 'girlfriend', 'girls', 'give', 'given', 'gives', 'giving', 'glad', 'glass', 'glasses', 'glenn', 'glimpse', 'global', 'glorious', 'glory', 'glover', 'go', 'goal', 'god', 'godfather', 'godzilla', 'goes', 'going', 'gold', 'goldberg', 'golden', 'gone', 'gonna', 'good', 'goodness', 'goofy', 'gordon', 'gore', 'gorgeous', 'gory', 'got', 'gothic', 'gotta', 'gotten', 'government', 'grab', 'grabs', 'grace', 'grade', 'gradually', 'graham', 'grand', 'grandfather', 'grandmother', 'grant', 'granted', 'graphic', 'graphics', 'grasp', 'gratuitous', 'grave', 'gray', 'grayson', 'great', 'greater', 'greatest', 'greatly', 'greatness', 'greed', 'greedy', 'greek', 'green', 'greg', 'gregory', 'grew', 'grey', 'grief', 'griffith', 'grim', 'grinch', 'gripping', 'gritty', 'gross', 'ground', 'group', 'groups', 'grow', 'growing', 'grown', 'grows', 'gruesome', 'guarantee', 'guard', 'guess', 'guessed', 'guessing', 'guest', 'guide', 'guilt', 'guilty', 'gun', 'gundam', 'guns', 'guts', 'guy', 'guys', 'ha', 'hair', 'hal', 'half', 'halfway', 'hall', 'halloween', 'ham', 'hamilton', 'hamlet', 'hammer', 'hand', 'handed', 'handful', 'handle', 'handled', 'hands', 'handsome', 'hang', 'hanging', 'hank', 'hanks', 'happen', 'happened', 'happening', 'happens', 'happily', 'happiness', 'happy', 'hard', 'hardcore', 'harder', 'hardly', 'hardy', 'harm', 'harris', 'harry', 'harsh', 'hart', 'hartley', 'harvey', 'hat', 'hate', 'hated', 'hates', 'hatred', 'haunted', 'haunting', 'hawke', 'hbo', 'head', 'headed', 'heads', 'health', 'hear', 'heard', 'hearing', 'heart', 'hearted', 'hearts', 'heat', 'heaven', 'heavily', 'heavy', 'heck', 'heights', 'held', 'helen', 'helicopter', 'hell', 'hello', 'help', 'helped', 'helping', 'helps', 'hence', 'henry', 'hero', 'heroes', 'heroic', 'heroine', 'heston', 'hey', 'hidden', 'hide', 'hideous', 'hiding', 'high', 'higher', 'highest', 'highlight', 'highlights', 'highly', 'hilarious', 'hilariously', 'hill', 'hills', 'hint', 'hints', 'hip', 'hippie', 'hire', 'hired', 'historical', 'historically', 'history', 'hit', 'hitchcock', 'hitler', 'hits', 'hitting', 'ho', 'hoffman', 'hold', 'holding', 'holds', 'hole', 'holes', 'holiday', 'hollow', 'holly', 'hollywood', 'holmes', 'holy', 'homage', 'home', 'homeless', 'homer', 'homosexual', 'honest', 'honestly', 'honesty', 'hong', 'honor', 'hood', 'hook', 'hooked', 'hop', 'hope', 'hoped', 'hopefully', 'hopeless', 'hopes', 'hoping', 'hopper', 'horrendous', 'horrible', 'horribly', 'horrid', 'horrific', 'horrifying', 'horror', 'horrors', 'horse', 'horses', 'hospital', 'host', 'hot', 'hotel', 'hour', 'hours', 'house', 'household', 'houses', 'howard', 'however', 'hudson', 'huge', 'hugh', 'huh', 'human', 'humanity', 'humans', 'humble', 'humor', 'humorous', 'humour', 'hundred', 'hundreds', 'hung', 'hunt', 'hunter', 'hunters', 'hunting', 'hurt', 'hurts', 'husband', 'husbands', 'hyde', 'hype', 'hysterical', 'ian', 'ice', 'icon', 'idea', 'ideal', 'ideas', 'identify', 'identity', 'idiot', 'idiotic', 'idiots', 'ignorant', 'ignore', 'ignored', 'ii', 'iii', 'ill', 'illegal', 'illness', 'illogical', 'im', 'image', 'imagery', 'images', 'imagination', 'imaginative', 'imagine', 'imagined', 'imdb', 'imitation', 'immediate', 'immediately', 'immensely', 'impact', 'implausible', 'importance', 'important', 'importantly', 'impossible', 'impress', 'impressed', 'impression', 'impressive', 'improve', 'improved', 'improvement', 'inability', 'inane', 'inappropriate', 'incident', 'include', 'included', 'includes', 'including', 'incoherent', 'incompetent', 'incomprehensible', 'increasingly', 'incredible', 'incredibly', 'indeed', 'independent', 'india', 'indian', 'indians', 'indie', 'individual', 'individuals', 'inducing', 'indulgent', 'industry', 'inept', 'inevitable', 'inevitably', 'infamous', 'inferior', 'influence', 'influenced', 'information', 'ingredients', 'initial', 'initially', 'inner', 'innocence', 'innocent', 'innovative', 'insane', 'inside', 'insight', 'inspector', 'inspiration', 'inspired', 'inspiring', 'installment', 'instance', 'instant', 'instantly', 'instead', 'instinct', 'insult', 'insulting', 'integrity', 'intellectual', 'intelligence', 'intelligent', 'intended', 'intense', 'intensity', 'intent', 'intention', 'intentionally', 'intentions', 'interaction', 'interest', 'interested', 'interesting', 'interests', 'international', 'internet', 'interpretation', 'interview', 'interviews', 'intimate', 'intrigue', 'intrigued', 'intriguing', 'introduce', 'introduced', 'introduces', 'introduction', 'invasion', 'invented', 'inventive', 'investigate', 'investigation', 'invisible', 'involve', 'involved', 'involvement', 'involves', 'involving', 'iran', 'iraq', 'ireland', 'irish', 'iron', 'ironic', 'ironically', 'irony', 'irrelevant', 'irritating', 'island', 'isolated', 'israel', 'issue', 'issues', 'italian', 'italy', 'jack', 'jackie', 'jackson', 'jail', 'jake', 'james', 'jamie', 'jane', 'japan', 'japanese', 'jason', 'jaw', 'jaws', 'jay', 'jazz', 'jealous', 'jean', 'jeff', 'jeffrey', 'jennifer', 'jenny', 'jeremy', 'jerk', 'jerry', 'jesse', 'jessica', 'jesus', 'jet', 'jewish', 'jim', 'jimmy', 'joan', 'job', 'jobs', 'joe', 'joel', 'joey', 'john', 'johnny', 'johnson', 'join', 'joined', 'joke', 'jokes', 'jon', 'jonathan', 'jones', 'joseph', 'josh', 'journalist', 'journey', 'joy', 'jr', 'judge', 'judging', 'judy', 'julia', 'julian', 'julie', 'jump', 'jumped', 'jumping', 'jumps', 'june', 'jungle', 'junior', 'junk', 'justice', 'justify', 'justin', 'juvenile', 'kane', 'kansas', 'kapoor', 'karen', 'karloff', 'kate', 'kay', 'keaton', 'keep', 'keeping', 'keeps', 'keith', 'kelly', 'ken', 'kennedy', 'kenneth', 'kept', 'kevin', 'key', 'khan', 'kick', 'kicked', 'kicking', 'kicks', 'kid', 'kidding', 'kidnapped', 'kids', 'kill', 'killed', 'killer', 'killers', 'killing', 'killings', 'kills', 'kim', 'kind', 'kinda', 'kinds', 'king', 'kingdom', 'kirk', 'kiss', 'kissing', 'kitchen', 'knew', 'knife', 'knock', 'know', 'knowing', 'knowledge', 'known', 'knows', 'kolchak', 'kong', 'korean', 'kubrick', 'kudos', 'kumar', 'kung', 'kurosawa', 'kurt', 'kyle', 'la', 'lab', 'lack', 'lacked', 'lacking', 'lackluster', 'lacks', 'ladies', 'lady', 'laid', 'lake', 'lame', 'land', 'landing', 'landscape', 'landscapes', 'lane', 'language', 'large', 'largely', 'larger', 'larry', 'last', 'lasted', 'late', 'lately', 'later', 'latest', 'latin', 'latter', 'laugh', 'laughable', 'laughably', 'laughed', 'laughing', 'laughs', 'laughter', 'laura', 'laurel', 'law', 'lawrence', 'laws', 'lawyer', 'lay', 'lazy', 'le', 'lead', 'leader', 'leading', 'leads', 'league', 'learn', 'learned', 'learning', 'learns', 'least', 'leave', 'leaves', 'leaving', 'led', 'lee', 'left', 'leg', 'legacy', 'legal', 'legend', 'legendary', 'legs', 'lemmon', 'lena', 'length', 'lengthy', 'leo', 'leon', 'leonard', 'les', 'lesbian', 'leslie', 'less', 'lesser', 'lesson', 'lessons', 'let', 'lets', 'letter', 'letters', 'letting', 'level', 'levels', 'lewis', 'li', 'liberal', 'library', 'lie', 'lies', 'life', 'lifestyle', 'lifetime', 'light', 'lighting', 'lights', 'likable', 'like', 'liked', 'likely', 'likes', 'likewise', 'liking', 'lily', 'limited', 'limits', 'lincoln', 'linda', 'line', 'liners', 'lines', 'link', 'lion', 'lips', 'lisa', 'list', 'listed', 'listen', 'listening', 'lit', 'literally', 'literature', 'little', 'live', 'lived', 'lively', 'lives', 'living', 'lloyd', 'load', 'loaded', 'loads', 'local', 'location', 'locations', 'locked', 'logan', 'logic', 'logical', 'lol', 'london', 'lone', 'lonely', 'long', 'longer', 'look', 'looked', 'looking', 'looks', 'loose', 'loosely', 'lord', 'los', 'lose', 'loser', 'losers', 'loses', 'losing', 'loss', 'lost', 'lot', 'lots', 'lou', 'loud', 'louis', 'louise', 'lousy', 'lovable', 'love', 'loved', 'lovely', 'lover', 'lovers', 'loves', 'loving', 'low', 'lower', 'lowest', 'loyal', 'loyalty', 'lucas', 'luck', 'luckily', 'lucky', 'lucy', 'ludicrous', 'lugosi', 'luke', 'lumet', 'lundgren', 'lust', 'lying', 'lynch', 'lyrics', 'macarthur', 'machine', 'machines', 'macy', 'mad', 'made', 'madness', 'madonna', 'mafia', 'magazine', 'maggie', 'magic', 'magical', 'magnificent', 'maid', 'mail', 'main', 'mainly', 'mainstream', 'maintain', 'major', 'majority', 'make', 'maker', 'makers', 'makes', 'makeup', 'making', 'male', 'mall', 'malone', 'man', 'manage', 'managed', 'manager', 'manages', 'manhattan', 'maniac', 'manipulative', 'mankind', 'mann', 'manner', 'mansion', 'many', 'map', 'marc', 'march', 'margaret', 'maria', 'marie', 'mario', 'marion', 'mark', 'market', 'marketing', 'marks', 'marriage', 'married', 'marry', 'mars', 'marshall', 'martial', 'martin', 'marty', 'marvelous', 'mary', 'mask', 'masks', 'mass', 'massacre', 'masses', 'massive', 'master', 'masterful', 'masterpiece', 'masterpieces', 'masters', 'match', 'matched', 'matches', 'mate', 'material', 'matrix', 'matt', 'matter', 'matters', 'matthau', 'matthew', 'mature', 'max', 'may', 'maybe', 'mayor', 'mclaglen', 'mean', 'meaning', 'meaningful', 'meaningless', 'means', 'meant', 'meanwhile', 'measure', 'meat', 'mechanical', 'media', 'medical', 'mediocre', 'medium', 'meet', 'meeting', 'meets', 'mel', 'melodrama', 'melodramatic', 'melting', 'member', 'members', 'memorable', 'memories', 'memory', 'men', 'menace', 'menacing', 'mental', 'mentally', 'mention', 'mentioned', 'mentioning', 'mentions', 'mere', 'merely', 'merit', 'merits', 'meryl', 'mess', 'message', 'messages', 'messed', 'met', 'metal', 'metaphor', 'method', 'methods', 'mexican', 'mexico', 'mgm', 'michael', 'michelle', 'mickey', 'mid', 'middle', 'midnight', 'might', 'mighty', 'miike', 'mike', 'mild', 'mildly', 'mildred', 'mile', 'miles', 'military', 'milk', 'mill', 'miller', 'million', 'millionaire', 'millions', 'min', 'mind', 'minded', 'mindless', 'minds', 'mine', 'mini', 'minimal', 'minimum', 'minor', 'minute', 'minutes', 'miracle', 'mirror', 'miscast', 'miserable', 'miserably', 'misery', 'miss', 'missed', 'misses', 'missing', 'mission', 'mistake', 'mistaken', 'mistakes', 'mistress', 'mitchell', 'mix', 'mixed', 'mixture', 'miyazaki', 'mm', 'mob', 'model', 'models', 'modern', 'modesty', 'molly', 'mom', 'moment', 'moments', 'mon', 'money', 'monk', 'monkey', 'monkeys', 'monster', 'monsters', 'montage', 'montana', 'month', 'months', 'mood', 'moody', 'moon', 'moore', 'moral', 'morality', 'morgan', 'morning', 'moronic', 'morris', 'mostly', 'mother', 'motion', 'motivation', 'motivations', 'motives', 'mountain', 'mountains', 'mouse', 'mouth', 'move', 'moved', 'movement', 'movements', 'moves', 'movie', 'movies', 'moving', 'mr', 'mrs', 'ms', 'mst', 'mtv', 'much', 'multi', 'multiple', 'mummy', 'mundane', 'murder', 'murdered', 'murderer', 'murderous', 'murders', 'murphy', 'murray', 'museum', 'music', 'musical', 'musicals', 'muslim', 'must', 'myers', 'mysteries', 'mysterious', 'mystery', 'nail', 'naive', 'naked', 'name', 'named', 'namely', 'names', 'nancy', 'narration', 'narrative', 'narrator', 'nasty', 'nathan', 'nation', 'national', 'native', 'natural', 'naturally', 'nature', 'navy', 'nazi', 'nazis', 'nd', 'near', 'nearby', 'nearly', 'neat', 'necessarily', 'necessary', 'neck', 'ned', 'need', 'needed', 'needless', 'needs', 'negative', 'neighbor', 'neighborhood', 'neighbors', 'neil', 'neither', 'nelson', 'neo', 'nephew', 'nerd', 'nervous', 'network', 'never', 'nevertheless', 'new', 'newly', 'newman', 'news', 'newspaper', 'next', 'nice', 'nicely', 'nicholas', 'nicholson', 'nick', 'nicole', 'night', 'nightmare', 'nightmares', 'nights', 'nine', 'ninja', 'niro', 'noble', 'nobody', 'noir', 'noise', 'nominated', 'nomination', 'non', 'none', 'nonetheless', 'nonsense', 'nonsensical', 'normal', 'normally', 'norman', 'north', 'nose', 'nostalgia', 'nostalgic', 'notable', 'notably', 'notch', 'note', 'noted', 'notes', 'nothing', 'notice', 'noticed', 'notion', 'notorious', 'novak', 'novel', 'novels', 'nowadays', 'nowhere', 'nuclear', 'nude', 'nudity', 'number', 'numbers', 'numerous', 'nurse', 'nuts', 'nyc', 'object', 'objective', 'obnoxious', 'obscure', 'obsessed', 'obsession', 'obvious', 'obviously', 'occasion', 'occasional', 'occasionally', 'occur', 'occurred', 'occurs', 'ocean', 'odd', 'oddly', 'odds', 'offended', 'offensive', 'offer', 'offered', 'offering', 'offers', 'office', 'officer', 'officers', 'official', 'often', 'oh', 'oil', 'ok', 'okay', 'old', 'older', 'oliver', 'olivier', 'ollie', 'omen', 'one', 'ones', 'online', 'onto', 'open', 'opened', 'opening', 'opens', 'opera', 'operation', 'opinion', 'opinions', 'opportunities', 'opportunity', 'opposed', 'opposite', 'orange', 'order', 'ordered', 'orders', 'ordinary', 'original', 'originality', 'originally', 'orleans', 'orson', 'oscar', 'oscars', 'othello', 'others', 'otherwise', 'ought', 'outcome', 'outer', 'outfit', 'outrageous', 'outside', 'outstanding', 'overacting', 'overall', 'overcome', 'overdone', 'overlook', 'overlooked', 'overly', 'overrated', 'overwhelming', 'owen', 'owner', 'oz', 'pace', 'paced', 'pacing', 'pacino', 'pack', 'package', 'packed', 'page', 'paid', 'pain', 'painful', 'painfully', 'paint', 'painted', 'painting', 'pair', 'pal', 'palace', 'palance', 'palma', 'paltrow', 'pamela', 'pan', 'panic', 'pants', 'paper', 'par', 'parallel', 'paranoia', 'parent', 'parents', 'paris', 'park', 'parker', 'parody', 'part', 'particular', 'particularly', 'parties', 'partly', 'partner', 'parts', 'party', 'pass', 'passable', 'passed', 'passes', 'passing', 'passion', 'passionate', 'past', 'pat', 'path', 'pathetic', 'patience', 'patient', 'patients', 'patrick', 'paul', 'paulie', 'pay', 'paying', 'pays', 'peace', 'peak', 'pearl', 'people', 'peoples', 'per', 'perfect', 'perfection', 'perfectly', 'perform', 'performance', 'performances', 'performed', 'performer', 'performers', 'performing', 'performs', 'perhaps', 'period', 'perry', 'person', 'persona', 'personal', 'personalities', 'personality', 'personally', 'persons', 'perspective', 'pet', 'pete', 'peter', 'peters', 'petty', 'pg', 'phantom', 'phil', 'philip', 'philosophical', 'philosophy', 'phone', 'phony', 'photo', 'photographed', 'photographer', 'photography', 'photos', 'physical', 'physically', 'piano', 'pick', 'picked', 'picking', 'picks', 'picture', 'pictures', 'pie', 'piece', 'pieces', 'pierce', 'pig', 'pile', 'pilot', 'pin', 'pink', 'pit', 'pitch', 'pitt', 'pity', 'place', 'placed', 'places', 'plague', 'plain', 'plan', 'plane', 'planet', 'planned', 'planning', 'plans', 'plant', 'plastic', 'plausible', 'play', 'played', 'player', 'players', 'playing', 'plays', 'pleasant', 'pleasantly', 'please', 'pleased', 'pleasure', 'plenty', 'plight', 'plot', 'plots', 'plus', 'poem', 'poetic', 'poetry', 'poignant', 'point', 'pointed', 'pointless', 'points', 'pokemon', 'polanski', 'police', 'polished', 'political', 'politically', 'politics', 'pool', 'poor', 'poorly', 'pop', 'popcorn', 'pops', 'popular', 'popularity', 'population', 'porn', 'porno', 'portion', 'portrait', 'portray', 'portrayal', 'portrayed', 'portraying', 'portrays', 'position', 'positive', 'possessed', 'possibilities', 'possibility', 'possible', 'possibly', 'post', 'poster', 'pot', 'potential', 'potentially', 'poverty', 'powell', 'power', 'powerful', 'powers', 'practically', 'practice', 'praise', 'pre', 'precious', 'predictable', 'prefer', 'pregnant', 'premiere', 'premise', 'prepared', 'prequel', 'presence', 'present', 'presentation', 'presented', 'presents', 'president', 'press', 'pressure', 'preston', 'presumably', 'pretend', 'pretending', 'pretentious', 'pretty', 'prevent', 'preview', 'previous', 'previously', 'prey', 'price', 'priceless', 'pride', 'priest', 'primarily', 'primary', 'prime', 'prince', 'princess', 'principal', 'print', 'prior', 'prison', 'prisoner', 'prisoners', 'private', 'prize', 'pro', 'probably', 'problem', 'problems', 'proceedings', 'proceeds', 'process', 'produce', 'produced', 'producer', 'producers', 'producing', 'product', 'production', 'productions', 'professional', 'professor', 'profound', 'program', 'progress', 'progresses', 'project', 'projects', 'prom', 'promise', 'promised', 'promises', 'promising', 'proof', 'propaganda', 'proper', 'properly', 'property', 'props', 'prostitute', 'protagonist', 'protagonists', 'protect', 'proud', 'prove', 'proved', 'proves', 'provide', 'provided', 'provides', 'providing', 'provoking', 'pseudo', 'psychiatrist', 'psychic', 'psycho', 'psychological', 'psychotic', 'public', 'pull', 'pulled', 'pulling', 'pulls', 'pulp', 'punch', 'punishment', 'punk', 'puppet', 'purchase', 'purc