Movie-Viewing Big Data Analysis (Part 2)

阿新 • Published: 2022-12-09

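5 Data analysis

5.1 Why

We want to explore the factors that influence box-office revenue. The analysis looks at market trends, the genres audiences prefer, directors, release time, ratings, and keywords, so that it can offer practical advice to people working in the industry.

5.2 What

5.2.1 Movie genres: define a set and collect all movie genres

Note that the resulting set contains one superfluous element, the empty string '', which has to be removed.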

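# Define a set and collect every movie genre
genre = set()
for i in df['genres'].str.split(','):  # split each string on the comma separator to get the individual genres
    genre = set().union(i, genre)  # set union
    # genre.update(i)  # alternatively, use the update method

genre.discard('')  # remove the superfluous empty string
print(genre)

# Convert genre to a list
genre_list = list(genre)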

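# Create the movie-genre DataFrame
genre_df = pd.DataFrame()

# One-hot encode the movie genres
for i in genre_list:
    # 1 if the movie's genre string contains genre i, otherwise 0
    # (regex=False matches i literally; na=False treats missing values as "not present")
    genre_df[i] = df['genres'].str.contains(i, regex=False, na=False).astype(int)

# Use the release year as the DataFrame index
genre_df.index = df['release_year']
genre_df.head(5)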
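As a point of comparison, the same one-hot table can be sketched with pandas' built-in `Series.str.get_dummies`. This matches whole genre names rather than substrings and drops the empty string on its own. The small `movies` frame is made-up illustration data, not the article's dataset.

```python
import pandas as pd

# Hypothetical stand-in for df: comma-separated genres plus a release year
movies = pd.DataFrame({
    "genres": ["Action,Adventure", "Drama", "", "Action,Drama,Romance"],
    "release_year": [2009, 2012, 2015, 2015],
})

# One column per genre, 1 where the movie has that genre, 0 otherwise
genre_flags = movies["genres"].str.get_dummies(sep=",")

# Index the table by release year, as the article does
genre_flags.index = movies["release_year"]

print(genre_flags)
print(genre_flags.sum().sort_values(ascending=False))
```

The movie with an empty genre string ends up as a row of zeros, so no separate `discard('')` step is needed in this variant.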

5.2.1.1 Number of movies per genre (bar chart)

# Total number of movies of each genre, in descending order
grnre_sum = genre_df.sum().sort_values(ascending=False)

# Visualization
colors = ['tomato', 'C0']
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese labels
grnre_sum.plot(kind='bar', label='genres', color=colors, figsize=(12, 9))
plt.title('Total number of movies per genre', fontsize=20)
plt.xticks(rotation=60)
plt.xlabel('Genre', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.grid(False)
plt.savefig("genre_counts_bar.png", dpi=300)  # call plt.savefig() before plt.show()
plt.show()

5.2.1.2 Share of each genre (pie chart)

gen_shares = grnre_sum / grnre_sum.sum()

# Define an "others" class: every genre whose share is below 1% is merged into it
others = 0.01
gen_pie = gen_shares[gen_shares >= others].copy()
gen_pie['others'] = gen_shares[gen_shares < others].sum()

colors = ['tomato', 'lightskyblue', 'goldenrod', 'wheat', 'y', 'tomato', 'lightskyblue', 'goldenrod', 'wheat', 'y',
          'tomato', 'lightskyblue', 'goldenrod', 'wheat', 'y', 'tomato', 'lightskyblue', 'goldenrod', 'wheat', 'y',
          'lightskyblue']

# Set the explode attribute:
# slices with a share of 2% or less are pushed outward by 10% of the radius
explode = (gen_pie <= 0.02) / 10

# Draw the pie chart
gen_pie.plot(kind='pie', label='', colors=colors, explode=explode, startangle=0,
             shadow=False, autopct='%3.1f%%', figsize=(8, 8))
plt.title('Share of each movie genre', fontsize=20)
plt.savefig("genre_shares_pie.png", dpi=300)
plt.show()
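The "merge small slices into others" step can be isolated into a small helper. The sketch below is one possible way to write it; the function name `group_small_shares` and the sample counts are hypothetical and only serve as illustration.

```python
import pandas as pd

def group_small_shares(counts, threshold=0.01, label="others"):
    """Convert counts to shares and merge every share below `threshold` into one entry."""
    shares = counts / counts.sum()
    kept = shares[shares >= threshold].copy()
    small_total = shares[shares < threshold].sum()
    if small_total > 0:
        kept[label] = small_total
    return kept

# Hypothetical genre counts
counts = pd.Series({"Drama": 500, "Comedy": 300, "Action": 195, "TV Movie": 3, "Foreign": 2})
pie_shares = group_small_shares(counts, threshold=0.01)
print(pie_shares)
```

Because the merged entry carries the combined share of everything that was dropped, the returned shares still sum to 1, which is what the pie chart needs.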

5.2.1.3 Genre trends over time (line chart)

# How the number of movies in each genre changes over time
gen_year_sum = genre_df.sort_index(ascending=False).groupby('release_year').sum()
gen_year_sum_sub = gen_year_sum[['Action','Adventure','Crime','Romance','Science Fiction','Drama','Comedy','Thriller']]
gen_year_sum_sub.plot(figsize=(12, 9))
plt.legend(gen_year_sum_sub.columns)
plt.xticks(range(1915, 2018, 10))
plt.xlabel('Year', fontsize=16)
plt.ylabel('Count', fontsize=16)
plt.title('Genre trends over time', fontsize=20)
plt.grid(False)
plt.savefig("genre_counts_line_2.png", dpi=600)
plt.show()
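The aggregation behind the line chart is a group-by on the year index followed by a sum of the 0/1 flags. A minimal sketch on made-up data (the `genre_flags` frame is hypothetical) shows what the yearly table looks like:

```python
import pandas as pd

# Hypothetical one-hot genre table indexed by release year (one row per movie)
genre_flags = pd.DataFrame(
    {"Action": [1, 0, 1, 1], "Drama": [0, 1, 1, 0]},
    index=pd.Index([2014, 2014, 2015, 2016], name="release_year"),
)

# Summing the flags within each year gives the number of movies per genre per year
per_year = genre_flags.groupby(level="release_year").sum()
print(per_year)
```

Each row of `per_year` is one year and each column is one genre, which is the shape `DataFrame.plot` expects for drawing one line per genre.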

5.2.1.4 Budget and profit by genre (combined chart)

# Compute the profit of each genre
# Step 1: build the profit DataFrame (assumes df has the default 0..n-1 index so the rows line up)
df['profit'] = df['revenue'] - df['budget']
profit_df = pd.concat([genre_df.reset_index(), df['profit']], axis=1)
df.info()

# Step 2: create profit_s, a Series indexed by genre
profit_s = pd.Series(index=genre_list, dtype=float)

# Step 3: mean profit of the movies that belong to each genre
for i in genre_list:
    profit_s.loc[i] = profit_df.loc[profit_df[i] == 1, 'profit'].mean()
profit_s = profit_s.sort_values(ascending=True)
profit_s

# Compute the budget of each genre
# Step 1: build the budget DataFrame
budget_df = pd.concat([genre_df.reset_index(), df['budget']], axis=1)

# Step 2: create budget_s, a Series indexed by genre
budget_s = pd.Series(index=genre_list, dtype=float)

# Step 3: mean budget of the movies that belong to each genre
for j in genre_list:
    budget_s.loc[j] = budget_df.loc[budget_df[j] == 1, 'budget'].mean()
budget_s

# Next, concatenate profit_s and budget_s side by side
profit_budget = pd.concat([profit_s, budget_s], axis=1)
profit_budget.columns = ['profit', 'budget']

# Add a profit-rate column: mean profit as a percentage of mean budget
profit_budget['rate'] = (profit_budget['profit'] / profit_budget['budget']) * 100

# Sort by budget in descending order
profit_budget_sort = profit_budget.sort_values(by='budget', ascending=False)
profit_budget_sort.head(2)
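The per-genre means can also be computed without an explicit loop. The sketch below is one way to do it, assuming the one-hot table and the money columns share the same row order and contain no missing values. The helper `mean_by_genre` and the sample numbers are hypothetical.

```python
import pandas as pd

# Hypothetical data: three movies, two genres
genre_flags = pd.DataFrame({"Action": [1, 0, 1], "Drama": [0, 1, 1]})
money = pd.DataFrame({"budget": [100.0, 20.0, 60.0], "revenue": [300.0, 30.0, 90.0]})
money["profit"] = money["revenue"] - money["budget"]

def mean_by_genre(flags, values):
    """Mean of `values` over the movies flagged 1 in each genre column (no NaN assumed)."""
    totals = flags.mul(values, axis=0).sum()  # per-genre sum of the values
    return totals / flags.sum()               # divided by the number of movies in the genre

summary = pd.DataFrame({
    "profit": mean_by_genre(genre_flags, money["profit"]),
    "budget": mean_by_genre(genre_flags, money["budget"]),
})
summary["rate"] = summary["profit"] / summary["budget"] * 100  # percent
print(summary.sort_values(by="budget", ascending=False))
```

Note that `rate` here, as in the article, is the ratio of two means (mean profit over mean budget), which is not the same as the mean of each movie's individual profit rate.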

 

# Plot the mean budget and the profit rate of each genre (combined chart)
import matplotlib.ticker as mtick

x = profit_budget_sort.index
y1 = profit_budget_sort.budget
y2 = profit_budget_sort.rate
# Number of rows in profit_budget_sort
length = profit_budget_sort.shape[0]

fig = plt.figure(figsize=(12, 9))

# Left axis
ax1 = fig.add_subplot(1, 1, 1)
ax1.bar(range(0, length), y1, color='C4', label='Mean budget')
plt.xticks(range(0, length), x, rotation=90, fontsize=12)  # label the ticks with the genre names
ax1.set_xlabel('Genre')                  # the x-axis shows genres, not years
ax1.set_ylabel('Mean budget', fontsize=16)
ax1.legend(loc=2, fontsize=12)

# Right axis: a secondary y-axis that shares the x-axis
ax2 = ax1.twinx()
ax2.plot(range(0, length), y2, 'ro-.', label='Mean profit rate')  # a label is required for the legend entry
ax2.set_ylabel('Mean profit rate', fontsize=16)
ax2.legend(loc=1, fontsize=12)

# Show the profit-rate axis as percentages
fmt = '%.1f%%'
yticks = mtick.FormatStrFormatter(fmt)
ax2.yaxis.set_major_formatter(yticks)

# Figure title
ax1.set_title('Mean budget and profit rate by genre', fontsize=20)
ax1.grid(False)
ax2.grid(False)
plt.savefig("genre_mean_budget_profit_rate.png", dpi=300)
plt.show()
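The bar-plus-line layout boils down to two axes that share one x-axis. The following is a minimal, self-contained sketch of that pattern with made-up numbers. It uses `PercentFormatter` for the right-hand axis and a single combined legend, both of which are choices of this sketch rather than the article's code.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter

# Hypothetical per-genre figures, for illustration only
genres = ["Action", "Drama", "Comedy"]
mean_budget = [80.0, 40.0, 30.0]
profit_rate = [143.75, 50.0, 120.0]  # already expressed in percent

fig, ax1 = plt.subplots(figsize=(8, 5))
positions = list(range(len(genres)))

# Left axis: bars for the mean budget
bars = ax1.bar(positions, mean_budget, color="C4", label="Mean budget")
ax1.set_xticks(positions)
ax1.set_xticklabels(genres, rotation=90)
ax1.set_ylabel("Mean budget")

# Right axis: line for the profit rate, sharing the x-axis
ax2 = ax1.twinx()
(line,) = ax2.plot(positions, profit_rate, "ro-.", label="Mean profit rate")
ax2.set_ylabel("Mean profit rate")
ax2.yaxis.set_major_formatter(PercentFormatter(xmax=100, decimals=1))

# One legend that covers the artists from both axes
ax1.legend(handles=[bars, line], loc="upper left")
fig.tight_layout()
fig.savefig("twin_axis_sketch.png", dpi=100)
```

Calling the plotting methods on `ax1` and `ax2` directly, instead of through `plt`, makes it explicit which axis each artist belongs to.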

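5.2.2 Movie keywords (keyword analysis with a word cloud)

from wordcloud import STOPWORDS
from wordcloud import WordCloud

# Collect the keyword string of every movie
keywords_list = []
for i in df['keywords']:
    keywords_list.append(i)

# Join the list of strings into one long string;
# the space separator keeps the last keyword of one movie from fusing with the first keyword of the next
lis = ' '.join(keywords_list)
lis = lis.replace('\'s', '')  # str.replace returns a new string, so the result must be assigned

# Set the stop words
stopwords = set(STOPWORDS)
stopwords.add('film')

wordcloud = WordCloud(
                background_color='black',
                random_state=3,
                stopwords=stopwords,
                max_words=3000,
                scale=1).generate(lis)

plt.figure(figsize=(10, 6))
plt.imshow(wordcloud)
plt.axis('off')
plt.savefig('word_cloud.png', dpi=300)
plt.show()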
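A word cloud is essentially a picture of term frequencies, so it can help to look at the counts directly. The sketch below counts whole keyword phrases with the standard library. It assumes each row is a comma-separated keyword string (an assumption about the data format), and the sample rows are made up.

```python
from collections import Counter

# Hypothetical keyword strings, one per movie; None stands in for a missing value
keyword_rows = [
    "culture clash,future,space war",
    "future,based on novel",
    None,
    "woman director,future",
]
stopwords = {"film"}

counts = Counter()
for row in keyword_rows:
    if not isinstance(row, str):  # skip missing values
        continue
    for keyword in row.split(","):
        keyword = keyword.strip().lower()
        if keyword and keyword not in stopwords:
            counts[keyword] += 1

top_keywords = counts.most_common(3)
print(top_keywords)
```

Unlike `WordCloud.generate`, which tokenizes the text into words, this counts each keyword phrase as one unit. A frequency mapping like `counts` could be handed to `WordCloud.generate_from_frequencies` if phrase-level sizing is preferred.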

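5.3 When

Inspecting runtime shows that its dtype is object, i.e. the values are strings, so the column is converted to numeric first.

5.3.1 Movie runtime (histogram)

df.runtime.head(5)
df.runtime = pd.to_numeric(df.runtime, errors='coerce')  # values that cannot be parsed become NaN
df.runtime.describe()

import seaborn as sns
sns.set_style('dark')
# distplot is deprecated in recent seaborn releases; histplot with kde=True and stat='density' is a close replacement
sns.histplot(df.runtime.dropna(), bins=30, kde=True, stat='density')
sns.despine(left=True)  # despine() removes the top and right spines by default; left=True also removes the left one
plt.xticks(range(50, 360, 20))
plt.savefig('runtime_histogram.png', dpi=300)
plt.show()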
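The effect of `errors='coerce'` is easiest to see on a tiny example. The values below are made up and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical raw runtime column stored as strings, with some unusable entries
raw_runtime = pd.Series(["120", "95.0", "unknown", None])

# Parse what can be parsed; everything else becomes NaN instead of raising an error
runtime = pd.to_numeric(raw_runtime, errors="coerce")

print(runtime)
print(runtime.describe())
```

After the conversion the column has a floating-point dtype, and `describe()` reports statistics over the non-missing values only.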

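5.3.2 Release time (number of movies and mean revenue per movie, by month)

fig = plt.figure(figsize=(12, 7))
x = list(range(1, 13))
y1 = df.groupby('release_month').revenue.size()  # number of movies released in each month
y2 = df.groupby('release_month').revenue.mean()  # mean revenue per movie in each month

# Left axis
ax1 = fig.add_subplot(1, 1, 1)
ax1.bar(x, y1, color='C6', label='Number of movies')
ax1.grid(False)
ax1.set_xlabel('Month')
ax1.set_ylabel('Number of movies', fontsize=16)
ax1.legend(loc=2, fontsize=12)

# Right axis
ax2 = ax1.twinx()
ax2.plot(x, y2, 'bo--', label='Mean revenue per movie, by month')
ax2.set_ylabel('Mean revenue per movie, by month', fontsize=16)
ax2.legend(loc=1, fontsize=12)

plt.savefig('monthly_count_and_mean_revenue.png', dpi=300)
plt.show()

5.4 Where

This dataset covers movies from the United States, so production companies and production countries are not analyzed in this case study.

5.5 Who

5.5.1 Box-office distribution and the top 10 directors by box office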

Reference: MathorCup University Mathematical Modeling Challenge: Movie-Viewing Big Data, by 把糖還給我, cnblogs (cnblogs.com)
