MovieLens 1M: a Python data analysis exercise
阿新 • Published: 2019-02-18
Code:
import pandas as pd
uname=['user_id','gender','age','occupation','zip']
users=pd.read_table(r'D:\demo1\ml-1m\users.dat',sep='::',header=None,names=uname,engine = 'python')
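r'''
sep : str, default ',' for read_csv (read_table defaults to '\t')
Delimiter to use. A separator longer than one character (other than '\s+') is interpreted
as a regular expression and requires the python parsing engine; commas in the data are then not treated as delimiters. Regex example: '\r\t'
header : int or list of ints, default 'infer'. Row number(s) to use as the column names; the data starts after them.
names : array-like, default None
List of column names for the result. If the data file has no header row, pass header=None.
engine : parser engine to use. The C engine is faster, while the python engine is currently more feature-complete. Passing engine='python' explicitly suppresses the parser warning.
'''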
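To try these parameters without the dataset on disk, here is a minimal sketch that parses two made-up rows in the users.dat layout from an in-memory buffer. The sample values and the name `users_demo` are assumptions for illustration only; `pd.read_csv` is used in place of `read_table`, which makes no difference once `sep` is given.

```python
import io
import pandas as pd

# Two invented rows in the users.dat layout (not taken from the real file).
sample = io.StringIO("1::F::25::10::11111\n2::M::56::16::22222\n")
uname = ['user_id', 'gender', 'age', 'occupation', 'zip']

# A two-character separator is handled as a regular expression, which needs
# the python engine; naming the engine explicitly avoids the parser warning.
users_demo = pd.read_csv(sample, sep='::', header=None, names=uname, engine='python')
print(users_demo)
```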
rnames=['user_id','movie_id','rating','timestamp']
ratings=pd.read_table(r'D:\demo1\ml-1m\ratings.dat',sep='::',header=None,names=rnames,engine = 'python')
mname=['movie_id','title','genres']
movies=pd.read_table(r'D:\demo1\ml-1m\movies.dat',sep='::',header=None,names=mname,engine = 'python')
data=pd.merge(pd.merge(movies,ratings),users)
print(data.loc[0])  # ix[0] is deprecated; use loc instead
Result:
movie_id 1
title Toy Story (1995)
genres Animation|Children's|Comedy
user_id 1
rating 5
timestamp 978824268
gender F
age 1
occupation 10
zip 48067
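The nested `pd.merge` call names no join keys; pandas joins on the column names the two frames share (`movie_id` first, then `user_id`). A small sketch with the keys spelled out, on invented frames (all `*_demo` names and values are hypothetical):

```python
import pandas as pd

# Tiny stand-ins for the three tables; the values are made up.
movies_demo = pd.DataFrame({'movie_id': [1, 2], 'title': ['A (1990)', 'B (1991)']})
ratings_demo = pd.DataFrame({'user_id': [10, 11, 10],
                             'movie_id': [1, 1, 2],
                             'rating': [5, 3, 4]})
users_demo = pd.DataFrame({'user_id': [10, 11], 'gender': ['F', 'M']})

# Same joins as pd.merge(pd.merge(movies, ratings), users), with explicit keys.
merged = (movies_demo.merge(ratings_demo, on='movie_id')
                     .merge(users_demo, on='user_id'))
print(merged)
```

Writing `on=` explicitly guards against an accidental join on some other column that happens to share a name.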
'''
Pivot table: pandas.pivot_table(data, values=None,
index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
'''
mean_ratings=data.pivot_table('rating',index='title',columns='gender',aggfunc='mean')
print(mean_ratings[:5])
result:
gender F M
title
$1,000,000 Duck (1971) 3.375000 2.761905
'Night Mother (1986) 3.388889 3.352941
'Til There Was You (1997) 2.675676 2.733333
'burbs, The (1989) 2.793478 2.962085
...And Justice for All (1979) 3.828571 3.689024
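What `pivot_table` does here is easier to see on a handful of rows. A sketch on invented ratings (the frame `demo` and its values are assumptions), together with the equivalent `groupby` plus `unstack`:

```python
import pandas as pd

# Five made-up ratings for two titles.
demo = pd.DataFrame({
    'title':  ['A', 'A', 'A', 'B', 'B'],
    'gender': ['F', 'F', 'M', 'F', 'M'],
    'rating': [4, 2, 5, 3, 1],
})

# One row per title, one column per gender, each cell the mean rating.
mean_demo = demo.pivot_table('rating', index='title', columns='gender', aggfunc='mean')
print(mean_demo)

# The same table built by grouping on both keys and unstacking gender.
alt = demo.groupby(['title', 'gender'])['rating'].mean().unstack()
print(alt)
```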
# Filter out movies with fewer than 200 ratings
ratings_groupby_title=data.groupby('title').size()
print(ratings_groupby_title[:5])
result:
title
$1,000,000 Duck (1971) 37
'Night Mother (1986) 70
'Til There Was You (1997) 52
'burbs, The (1989) 303
...And Justice for All (1979) 199
dtype: int64
active_titles=ratings_groupby_title.index[ratings_groupby_title>=200]  # reuse the counts computed above
print(active_titles)
result:
Index([u"'burbs, The (1989)", u'10 Things I Hate About You (1999)',
u'101 Dalmatians (1961)', u'101 Dalmatians (1996)',
u'12 Angry Men (1957)', u'13th Warrior, The (1999)',
u'2 Days in the Valley (1996)', u'20,000 Leagues Under the Sea (1954)',
u'2001: A Space Odyssey (1968)', u'2010 (1984)',
...
u'Year of Living Dangerously (1982)', u'Yellow Submarine (1968)',
u'Yojimbo (1961)', u"You've Got Mail (1998)",
u'Young Frankenstein (1974)', u'Young Guns (1988)',
u'Young Guns II (1990)', u'Young Sherlock Holmes (1985)',
u'Zero Effect (1998)', u'eXistenZ (1999)'],
dtype='object', name=u'title', length=1426)
mean_ratings=mean_ratings.loc[active_titles]
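The filtering step boils down to: count rows per title, keep the index labels whose count passes the threshold, then select those labels with `.loc`. A sketch on invented data, where a threshold of 2 stands in for 200 and all names are hypothetical:

```python
import pandas as pd

# Six made-up ratings across three titles.
demo = pd.DataFrame({'title': ['A', 'A', 'A', 'B', 'C', 'C'],
                     'rating': [5, 4, 3, 2, 1, 5]})

counts = demo.groupby('title').size()   # number of ratings per title
active = counts.index[counts >= 2]      # boolean mask applied to the index
means = demo.groupby('title')['rating'].mean()

# Only titles with enough ratings survive the selection.
print(means.loc[active])
```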
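# Sort by column F in descending order (ascending must be the boolean False, not the string 'False')
top_female_rating=mean_ratings.sort_values(by='F',ascending=False)
print(top_female_rating[:10])
result (from the original run, which sorted in ascending order, so these are the 10 lowest female averages):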
gender F M
title
Battlefield Earth (2000) 1.574468 1.616949
Barb Wire (1996) 1.585366 2.100386
Showgirls (1995) 1.709091 2.166667
Jaws 3-D (1983) 1.863636 1.851064
Rocky V (1990) 1.878788 2.132780
Speed 2: Cruise Control (1997) 1.906667 1.863014
Avengers, The (1998) 1.915254 2.017467
Anaconda (1997) 2.000000 2.248447
Nightmare on Elm Street 5: The Dream Child, A (... 2.052632 1.981481
Howard the Duck (1986) 2.074627 2.103542
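The listing above starts from the lowest female average, which is what a truthy `ascending` value gives. A short sketch of the pitfall on invented scores (the frame `scores` is an assumption): any non-empty string, including `'False'`, is truthy in Python, so only the boolean `False` requests a descending sort.

```python
import pandas as pd

# Made-up mean ratings for three titles.
scores = pd.DataFrame({'F': [1.5, 4.6, 3.2], 'M': [1.6, 4.4, 3.0]},
                      index=['X', 'Y', 'Z'])

print(bool('False'))  # True: a non-empty string is truthy

# Pass the boolean to get the highest female averages first.
top = scores.sort_values(by='F', ascending=False)
print(top)
```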
Computing the rating disagreement:
mean_ratings['diff']=mean_ratings['M']-mean_ratings['F']
sorted_by_diff=mean_ratings.sort_values(by='diff')
print(sorted_by_diff[:5])
result:
gender F M
title
Dirty Dancing (1987) 3.790378 2.959596
To Wong Foo, Thanks for Everything! Julie Newma... 3.486842 2.795276
Jumpin' Jack Flash (1986) 3.254717 2.578358
Grease (1978) 3.975265 3.367041
Relic, The (1997) 3.309524 2.723077
gender diff
title
Dirty Dancing (1987) -0.830782
To Wong Foo, Thanks for Everything! Julie Newma... -0.691567
Jumpin' Jack Flash (1986) -0.676359
Grease (1978) -0.608224
Relic, The (1997) -0.586447
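Since `diff` is M minus F, the most negative values mark titles women rated higher, and reading the same sorted frame from the other end gives the titles men rated higher. A sketch on invented means (the frame `means` and its values are assumptions):

```python
import pandas as pd

# Made-up mean ratings for three titles.
means = pd.DataFrame({'F': [3.8, 3.0, 2.5], 'M': [3.0, 3.0, 3.5]},
                     index=['P', 'Q', 'R'])
means['diff'] = means['M'] - means['F']

by_diff = means.sort_values(by='diff')
print(by_diff.head(1))        # most female-preferred title
print(by_diff[::-1].head(1))  # reversed order: most male-preferred title
```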
A side note: replacing text in a txt file with a script
# Replace the contents of a file:
# replace every "::" in file3.txt with "," and save the result to file4.txt
with open("file3.txt","r") as fp3, open("file4.txt","w") as fp4:
    for s in fp3:  # read line by line
        fp4.write(s.replace("::",","))  # replace and write
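The same replacement can be wrapped in a small function and tried on a throwaway file. This is only a sketch: the helper name `replace_in_file` and the sample content are hypothetical, and the files are created in a temporary directory rather than next to the script.

```python
import os
import tempfile

def replace_in_file(src_path, dst_path, old, new):
    """Copy src_path to dst_path, replacing old with new on every line."""
    with open(src_path, 'r') as src, open(dst_path, 'w') as dst:
        for line in src:  # iterate lazily instead of loading all lines at once
            dst.write(line.replace(old, new))

# Build a small made-up input file in a temporary directory.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, 'file3.txt')
dst_path = os.path.join(tmp_dir, 'file4.txt')
with open(src_path, 'w') as f:
    f.write('1::F::25\n2::M::56\n')

replace_in_file(src_path, dst_path, '::', ',')
with open(dst_path) as f:
    converted = f.read()
print(converted)
```

Keep in mind that turning "::" into "," is only safe when the fields themselves contain no commas; MovieLens titles such as "'burbs, The (1989)" do.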