Introduction to the Amazon Review Dataset
阿新 • Published: 2021-01-30
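The Amazon Review Dataset records user reviews of products sold on Amazon and is a classic dataset for recommender systems. It has been refreshed several times, and by release date it comes in three versions: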
- 2013 version: http://snap.stanford.edu/data/web-Amazon-links.html
- 2014 version: http://jmcauley.ucsd.edu/data/amazon/index_2014.html
- 2018 version: https://nijianmo.github.io/amazon/index.html
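The dataset can be split by product category into subsets such as Books, Electronics, Movies and TV, and CDs and Vinyl. Each subset contains two kinds of information, illustrated here with the 2014 version: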
- Product metadata
  - asin: product ID
  - title: product name
  - price: price
  - imUrl: URL of the product image
  - related: related products (also bought, also viewed, bought together)
  - salesRank: sales rank within a category
  - brand: brand
  - categories: list of category paths the product belongs to

  Official example:

  { "asin": "0000031852", "title": "Girls Ballet Tutu Zebra Hot Pink", "price": 3.17, "imUrl": "http://ecx.images-amazon.com/images/I/51fAmVkTbyL._SY300_.jpg", "related": { "also_bought": ["B00JHONN1S", "B002BZX8Z6"], "also_viewed": ["B002BZX8Z6", "B00JHONN1S"], "bought_together": ["B002BZX8Z6"] }, "salesRank": {"Toys & Games": 211836}, "brand": "Coxlures", "categories": [["Sports & Outdoors", "Other Sports", "Dance"]] }

- User rating records
  - reviewerID: user ID
  - asin: product ID
  - reviewerName: user name
  - helpful: helpfulness rating of the review, e.g. [2, 3] means 2 of 3 voters found it helpful (2/3)
  - reviewText: text of the review
  - overall: rating
  - summary: summary of the review
  - unixReviewTime: review time as a Unix timestamp
  - reviewTime: review time as a raw date string

  Official example:

  { "reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", "reviewerName": "J. McDonald", "helpful": [2, 3], "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!", "overall": 5.0, "summary": "Heavenly Highway Hymns", "unixReviewTime": 1252800000, "reviewTime": "09 13, 2009" }
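As a small illustration of the review fields listed above, the sketch below parses one record and derives two values from it: the helpfulness ratio from the helpful field and a calendar date from unixReviewTime. The record is the official 2014 example trimmed to a few fields; the helper name helpful_ratio is hypothetical, and treating a review with no votes as None is an assumption of this sketch, not something the dataset prescribes.

```python
import json
from datetime import datetime, timezone

# Official 2014 example record, trimmed to the fields used here.
record_line = (
    '{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", '
    '"helpful": [2, 3], "overall": 5.0, '
    '"unixReviewTime": 1252800000, "reviewTime": "09 13, 2009"}'
)

def helpful_ratio(review):
    """Return helpful votes / total votes, or None when nobody voted (assumed convention)."""
    up, total = review.get("helpful", [0, 0])
    return up / total if total else None

review = json.loads(record_line)
ratio = helpful_ratio(review)  # 2 helpful votes out of 3

# unixReviewTime is in seconds; interpret it as UTC to get the calendar date.
reviewed_on = datetime.fromtimestamp(review["unixReviewTime"], tz=timezone.utc).date()

print(review["asin"], ratio, reviewed_on.isoformat())
```

For this record the derived date matches the reviewTime string (September 13, 2009), which is a quick sanity check that the timestamp is being read in the right unit.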
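Reading the Amazon dataset:
The downloaded data comes as JSON files with one record per line, which are awkward to work with directly, so this section shows how to convert them to CSV. The 2014 Amazon Electronics subset is used as the example:
Reading product metadata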
import ast
import pandas as pd

file_path = 'meta_Electronics.json'
useless_col = ['imUrl', 'salesRank', 'related', 'title', 'description']  # columns to drop

rows = []
with open(file_path, 'r') as fin:
    for line in fin:
        # Each line is a dict literal; ast.literal_eval parses it like eval
        # did, but without executing arbitrary code from the file.
        d = ast.literal_eval(line)
        for s in useless_col:
            d.pop(s, None)  # drop the field if present
        rows.append(d)

df = pd.DataFrame(rows)
df.to_csv('meta_Electronics.csv', index=False)
Reading user rating records
import ast
import pandas as pd

file_path = 'Electronics_10.json'
useless_col = ['reviewerName', 'reviewText', 'unixReviewTime', 'summary']  # columns to drop

rows = []
with open(file_path, 'r') as fin:
    for line in fin:
        # Parse one review record per line without using eval.
        d = ast.literal_eval(line)
        for s in useless_col:
            d.pop(s, None)  # drop the field if present
        rows.append(d)

df = pd.DataFrame(rows)
df.to_csv('Electronics_10.csv', index=False)
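The two conversions above follow the same pattern, so they can be folded into one helper. The following is a minimal sketch, not the original author's method: the function name records_to_csv and its keep_cols parameter are hypothetical, and it swaps pandas for the standard-library csv module so that records are written as they are read instead of being held in memory. It also lists the columns to keep rather than the columns to drop, because csv.DictWriter needs the header up front.

```python
import ast
import csv

def records_to_csv(src_path, dst_path, keep_cols):
    """Stream a one-record-per-line file into a CSV containing only keep_cols.

    Returns the number of records written. Fields missing from a record
    are written as empty cells; fields not in keep_cols are ignored.
    """
    count = 0
    with open(src_path, 'r', encoding='utf-8') as fin, \
         open(dst_path, 'w', newline='', encoding='utf-8') as fout:
        writer = csv.DictWriter(fout, fieldnames=keep_cols,
                                extrasaction='ignore', restval='')
        writer.writeheader()
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Parse the dict literal on this line without executing code.
            record = ast.literal_eval(line)
            writer.writerow(record)
            count += 1
    return count
```

Under these assumptions, the review conversion would look like records_to_csv('Electronics_10.json', 'Electronics_10.csv', ['reviewerID', 'asin', 'helpful', 'overall', 'reviewTime']). Nested fields such as helpful are written as their Python string form, so they need to be parsed again when the CSV is loaded.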