MovieLens 1M: A Python Data Analysis Exercise

Code:

import pandas as pd
uname=['user_id','gender','age','occupation','zip']
users=pd.read_table(r'D:\demo1\ml-1m\users.dat',sep='::',header=None,names=uname,engine = 'python')
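# Notes on the parameters used above (from the pandas documentation):
#
# sep : str, default ',' for read_csv ('\t' for read_table)
#     The delimiter. A separator longer than one character and different from '\s+'
#     is interpreted as a regular expression and forces the Python parsing engine.
#     Regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'
#
# header : int or list of ints, default 'infer'
#     Row number(s) to use as the column names, and the start of the data.
#
# names : array-like, default None
#     List of column names to use. If the file has no header row, pass header=None.
#
# engine : parser engine to use. The C engine is faster, while the Python engine is
#     currently more feature-complete. Passing engine='python' explicitly avoids the
#     warning about falling back from the C engine.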
rnames=['user_id','movie_id','rating','timestamp']
ratings=pd.read_table(r'D:\demo1\ml-1m\ratings.dat',sep='::',header=None,names=rnames,engine = 'python')
mname=['movie_id','title','genres']
movies=pd.read_table(r'D:\demo1\ml-1m\movies.dat',sep='::',header=None,names=mname,engine = 'python')
# Merge movies with ratings on movie_id, then with users on user_id
data=pd.merge(pd.merge(movies,ratings),users)
print(data.loc[0])  # .ix[0] is deprecated; use .loc[0]

result:

movie_id                                1
title                    Toy Story (1995)
genres        Animation|Children's|Comedy
user_id                                 1
rating                                  5
timestamp                       978824268
gender                                  F
age                                     1
occupation                             10
zip                                 48067
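
The loading and merging steps can be tried without the dataset files. The following is a minimal sketch, assuming a few made-up rows in the '::' layout; the sample values and the read_dat helper are invented for illustration and are not part of the original script or of MovieLens. It also names the merge keys explicitly instead of relying on pd.merge picking the overlapping column names.

```python
import io
import pandas as pd

# Hypothetical sample rows in the '::'-separated layout (not real MovieLens data)
users_txt = "1::F::1::10::48067\n2::M::56::16::70072\n"
ratings_txt = "1::10::5::978300760\n2::10::3::978302109\n2::20::4::978301968\n"
movies_txt = "10::Movie A (1990)::Comedy\n20::Movie B (1995)::Drama|Thriller\n"

def read_dat(text, names):
    # Hypothetical helper: a multi-character separator is treated as a regex
    # and needs the python engine
    return pd.read_csv(io.StringIO(text), sep="::", header=None,
                       names=names, engine="python")

users = read_dat(users_txt, ["user_id", "gender", "age", "occupation", "zip"])
ratings = read_dat(ratings_txt, ["user_id", "movie_id", "rating", "timestamp"])
movies = read_dat(movies_txt, ["movie_id", "title", "genres"])

# Same joins as pd.merge(pd.merge(movies, ratings), users), with the keys spelled out
data = movies.merge(ratings, on="movie_id").merge(users, on="user_id")
print(data.loc[0])
```

Spelling out on="movie_id" and on="user_id" makes the join columns visible to the reader and guards against an accidental join on some other shared column name.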
# Pivot table signature:
# pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean',
#                    fill_value=None, margins=False, dropna=True, margins_name='All')
mean_ratings=data.pivot_table('rating',index='title',columns='gender',aggfunc='mean')
print(mean_ratings[:5])

result:

gender                                F         M
title                                            
$1,000,000 Duck (1971)         3.375000  2.761905
'Night Mother (1986)           3.388889  3.352941
'Til There Was You (1997)      2.675676  2.733333
'burbs, The (1989)             2.793478  2.962085
...And Justice for All (1979)  3.828571  3.689024
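
To see what pivot_table is doing here, it can help to run it on a tiny table. The sketch below assumes a made-up five-row DataFrame (not MovieLens data) and builds the same title-by-gender table of means in two ways: with pivot_table, and with groupby followed by unstack.

```python
import pandas as pd

# Hypothetical toy ratings (not MovieLens data)
toy = pd.DataFrame({
    "title":  ["A", "A", "A", "B", "B"],
    "gender": ["F", "F", "M", "F", "M"],
    "rating": [4, 2, 5, 3, 1],
})

# One row per title, one column per gender, each cell the mean rating
by_pivot = toy.pivot_table("rating", index="title", columns="gender", aggfunc="mean")

# The same table built step by step: group, average, then move gender into columns
by_groupby = toy.groupby(["title", "gender"])["rating"].mean().unstack()

print(by_pivot)
```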
# Filter out movies with fewer than 200 ratings: first count the ratings per title
ratings_groupby_title=data.groupby('title').size()
print(ratings_groupby_title[:5])

result:

title
$1,000,000 Duck (1971)            37
'Night Mother (1986)              70
'Til There Was You (1997)         52
'burbs, The (1989)               303
...And Justice for All (1979)    199
dtype: int64

# Keep only the titles with at least 200 ratings
active_titles=ratings_groupby_title.index[ratings_groupby_title>=200]
print(active_titles)

result:

Index([u''burbs, The (1989)', u'10 Things I Hate About You (1999)',
       u'101 Dalmatians (1961)', u'101 Dalmatians (1996)',
       u'12 Angry Men (1957)', u'13th Warrior, The (1999)',
       u'2 Days in the Valley (1996)', u'20,000 Leagues Under the Sea (1954)',
       u'2001: A Space Odyssey (1968)', u'2010 (1984)',
       ...
       u'Year of Living Dangerously (1982)', u'Yellow Submarine (1968)',
       u'Yojimbo (1961)', u'You've Got Mail (1998)',
       u'Young Frankenstein (1974)', u'Young Guns (1988)',
       u'Young Guns II (1990)', u'Young Sherlock Holmes (1985)',
       u'Zero Effect (1998)', u'eXistenZ (1999)'],
      dtype='object', name=u'title', length=1426)
mean_ratings=mean_ratings.loc[active_titles]
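# Sort by the F column in descending order
top_female_rating=mean_ratings.sort_values(by='F',ascending=False)
print(top_female_rating[:10])

result (the listing below came from the original call with ascending='False'; that non-empty string was treated as true, so it shows the ten lowest female means rather than the ten highest):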

gender                                                     F         M
title                                                                 
Battlefield Earth (2000)                            1.574468  1.616949
Barb Wire (1996)                                    1.585366  2.100386
Showgirls (1995)                                    1.709091  2.166667
Jaws 3-D (1983)                                     1.863636  1.851064
Rocky V (1990)                                      1.878788  2.132780
Speed 2: Cruise Control (1997)                      1.906667  1.863014
Avengers, The (1998)                                1.915254  2.017467
Anaconda (1997)                                     2.000000  2.248447
Nightmare on Elm Street 5: The Dream Child, A (...  2.052632  1.981481
Howard the Duck (1986)                              2.074627  2.103542
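
The filter-then-sort step can be checked on a small example. This is a minimal sketch, assuming made-up means and rating counts for three titles; none of the numbers come from MovieLens. Note that sort_values expects a boolean for ascending, not a string.

```python
import pandas as pd

# Hypothetical mean ratings and rating counts for three titles
mean_ratings = pd.DataFrame({"F": [3.5, 4.2, 1.9], "M": [3.0, 4.0, 2.5]},
                            index=pd.Index(["A", "B", "C"], name="title"))
counts = pd.Series([250, 120, 400], index=mean_ratings.index)

# A boolean mask on the counts selects the titles with enough ratings
active_titles = counts.index[counts >= 200]
active = mean_ratings.loc[active_titles]

# ascending takes a bool; False puts the highest female mean first
top_female = active.sort_values(by="F", ascending=False)
print(top_female)
```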

Measuring rating disagreement

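# diff = male mean minus female mean; negative values are titles women rated higher
mean_ratings['diff']=mean_ratings['M']-mean_ratings['F']
# Ascending sort puts the titles most preferred by women first
sorted_by_diff=mean_ratings.sort_values(by='diff')
print(sorted_by_diff[:5])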

result:

gender                                                     F         M      diff
title                                                                           
Dirty Dancing (1987)                                3.790378  2.959596 -0.830782
To Wong Foo, Thanks for Everything! Julie Newma...  3.486842  2.795276 -0.691567
Jumpin' Jack Flash (1986)                           3.254717  2.578358 -0.676359
Grease (1978)                                       3.975265  3.367041 -0.608224
Relic, The (1997)                                   3.309524  2.723077 -0.586447
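
The sign convention of diff is easy to get backwards, so here is a small sketch with made-up means for three titles (not MovieLens data). Sorting ascending lists the titles women rated higher first; reversing the sorted frame lists the titles men rated higher first.

```python
import pandas as pd

# Hypothetical per-gender means (not MovieLens data)
means = pd.DataFrame({"F": [3.8, 3.0, 2.5], "M": [3.0, 3.1, 3.4]},
                     index=pd.Index(["A", "B", "C"], name="title"))

means["diff"] = means["M"] - means["F"]
sorted_by_diff = means.sort_values(by="diff")

# Most negative diff first: the title rated higher by women
print(sorted_by_diff.head(1))
# Reversing the row order gives the title rated higher by men
print(sorted_by_diff[::-1].head(1))
```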

A side note: replacing text in a txt file with a script

# Replace text in a file:
# replace every "::" in file3.txt with "," and save the result to file4.txt
with open("file3.txt","r") as fp3, open("file4.txt","w") as fp4:
    for s in fp3:  # read the input line by line
        fp4.write(s.replace("::",","))  # replace, then write out
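
The same replacement can be wrapped in a function and tried on a throwaway file. This is only a sketch: replace_in_file is a hypothetical helper name, and the sample line is made up.

```python
import os
import tempfile

def replace_in_file(src_path, dst_path, old, new):
    """Copy src_path to dst_path, replacing every occurrence of old with new."""
    with open(src_path, "r") as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(line.replace(old, new))

# Try it on a temporary file so no real data is touched
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "file3.txt")
dst_path = os.path.join(tmp_dir, "file4.txt")
with open(src_path, "w") as f:
    f.write("1::F::1::10::48067\n")

replace_in_file(src_path, dst_path, "::", ",")
with open(dst_path) as f:
    converted = f.read()
print(converted)
```

One caveat when applying this to the movie file: titles such as 'burbs, The (1989) already contain commas, so turning '::' into ',' there would produce rows with an ambiguous number of fields.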