Drawing a stacked bar chart with seaborn

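Problem description:
In the source table part-00000.csv, the column name holds two groups of values: rmb and cis. The goal is a stacked bar chart grouped by the values in name.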
1. Raw data format
(Image: screenshot of the raw data, not reproduced here)
Download link: http://download.csdn.net/download/zhousishuo/9902909
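Because the screenshot is missing, here is a minimal sketch of the long-format layout that the processing code expects. The column names are taken from that code; every value is invented purely for illustration and is not from the real file.

```python
import pandas as pd

# Hypothetical rows: the column names match those used in the processing
# code, but all values here are made up for illustration.
sample = pd.DataFrame({
    "logTime": ["2017-07-01 10:00", "2017-07-01 10:00",
                "2017-07-01 11:00", "2017-07-01 11:00"],
    "name": ["rmb", "cis", "rmb", "cis"],
    "total_cores": [120, 80, 130, 75],
    "total_allocatedMEM": [512, 256, 540, 240],
})

# Each logTime appears once per group, so every timestamp has two rows.
rows_per_group = sample.groupby("name").size()
print(sample)
print(rows_per_group)
```

The processing step then turns these two-rows-per-timestamp records into one row per logTime with separate rmb and cis columns.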
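2. Data processing
I processed the data in two ways, one with pandas and one with pyspark.sql. The two turn out to be very similar in both approach and code, so once you know one of them, the other is mostly a matter of translating the same idea (which is how I did it, haha).
Method 1: processing with pandas
This requires the pandas package, of course. I worked in Anaconda3, which bundles many Python packages and is convenient to use.
The processing code is as follows: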

import numpy as np
import pandas as pd
data = pd.read_csv("./part-00000.csv")
# Select the rmb rows
df_rmb = data[data.name.isin(['rmb'])]
# Select the cis rows
df_cis = data[data.name.isin(['cis'])]
# Rename the df_rmb columns and keep only the ones needed
df_rmb = df_rmb.rename(columns={'total_cores': 'rmb_cores', 'total_allocatedMEM': 'rmb_mem'})[["logTime", "rmb_cores", "rmb_mem"]]
# Rename the df_cis columns and keep only the ones needed
df_cis = df_cis.rename(columns={'total_cores': 'cis_cores', 'total_allocatedMEM': 'cis_mem'})[["logTime", "cis_cores", "cis_mem"]]
# Merge df_rmb and df_cis where logTime is equal, then sort by logTime ascending
# (sort_values replaces sort_index(by=...), which current pandas no longer accepts)
result = pd.merge(df_rmb, df_cis, on='logTime', how='inner').sort_values(by="logTime", ascending=True)
# reset_index can be used to rebuild the index
# result.reset_index(inplace=True, drop=True)
# Drop the index that result was created with (the sort above scrambles it and
# it is not needed for plotting), then export the data
result.to_csv("./py_trans.csv", index=False)
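The split, rename, and merge steps above can also be written as a single reshape. The following is only an alternative sketch, not the approach used in this post. It assumes each (logTime, name) pair occurs at most once, and the sample values are invented.

```python
import pandas as pd

# Hypothetical long-format input; all values are invented.
data = pd.DataFrame({
    "logTime": ["t2", "t2", "t1", "t1"],
    "name": ["rmb", "cis", "rmb", "cis"],
    "total_cores": [130, 75, 120, 80],
    "total_allocatedMEM": [540, 240, 512, 256],
})

# One row per logTime, one column per (metric, group) pair.
wide = data.pivot(index="logTime", columns="name",
                  values=["total_cores", "total_allocatedMEM"])

# Flatten the two-level column index into rmb_cores, cis_mem, and so on.
short = {"total_cores": "cores", "total_allocatedMEM": "mem"}
wide.columns = [f"{group}_{short[metric]}" for metric, group in wide.columns]

# dropna mirrors the inner join: keep only timestamps present in both groups.
wide = wide.dropna().sort_index().reset_index()
print(wide)
```

If a (logTime, name) pair is duplicated, `pivot` raises a ValueError. In that case `pivot_table` with an explicit aggregation would be needed instead.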

Method 2: processing with pyspark.sql
This approach requires a pyspark environment, which means Spark must be installed.
The processing code is as follows:

# Read the data, which is stored on HDFS
# (assumes an existing SparkSession named spark, as provided by the pyspark shell)
df = spark.read.csv("hdfs://master_ip:8020/user/mart_cis/zhousishuo/part-00000.csv", encoding='UTF-8', header='true')
# Select the rmb rows
df_rmb = df.filter("name=='rmb'").selectExpr("logTime", "total_cores as rmb_cores", "total_allocatedMEM as rmb_mem")
# Select the cis rows
df_cis = df.filter("name=='cis'").selectExpr("logTime", "total_cores as cis_cores", "total_allocatedMEM as cis_mem")
# Join the rmb and cis data on logTime
df = df_rmb.join(df_cis, "logTime").select(df_rmb.logTime, "rmb_cores", "cis_cores", "rmb_mem", "cis_mem").orderBy("logTime")
# Write the final data out as csv
df.write.csv(path="hdfs://master_ip:8020/user/mart_cis/zhousishuo/part-data.csv", mode="overwrite", sep=",", header="true")
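Note that `df.write.csv` produces a directory of part files rather than a single CSV file, so the output has to be combined before pandas can plot it. Here is a minimal sketch, assuming the output directory has already been copied from HDFS to a local folder. The helper name `load_spark_csv_dir` is hypothetical and not part of any library.

```python
import glob
import os

import pandas as pd


def load_spark_csv_dir(local_dir):
    """Concatenate the part-*.csv files of a Spark CSV output directory.

    Hypothetical helper: assumes the directory was copied to local disk
    and that each non-empty part file carries its own header row.
    """
    part_files = sorted(glob.glob(os.path.join(local_dir, "part-*.csv")))
    # Skip zero-byte files, which pandas cannot parse.
    part_files = [p for p in part_files if os.path.getsize(p) > 0]
    if not part_files:
        raise FileNotFoundError(f"no non-empty part-*.csv files in {local_dir}")
    frames = [pd.read_csv(path) for path in part_files]
    combined = pd.concat(frames, ignore_index=True)
    # Re-sort, since the order across part files is not relied upon here.
    return combined.sort_values("logTime").reset_index(drop=True)
```

The returned DataFrame can be saved with `to_csv(..., index=False)` to get a single file for the plotting step.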

3. Plotting
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib as mpl
import seaborn as sns
%matplotlib inline

#Read in data & create total column
stacked_bar_data = pd.read_csv("D:/jupyter/matplot/new-part.csv")
stacked_bar_data["total_cores"] = stacked_bar_data.rmb_cores + stacked_bar_data.cis_cores

#Set general plot properties
sns.set_style("white")
sns.set_context({"figure.figsize": (24, 10)})

#Plot 1 - background - "total" (top) series
sns.barplot(x = stacked_bar_data.logTime, y = stacked_bar_data.total_cores, color = "red")

#Plot 2 - overlay - "bottom" series
bottom_plot = sns.barplot(x = stacked_bar_data.logTime, y = stacked_bar_data.cis_cores, color = "#0000A3")

topbar = plt.Rectangle((0,0),1,1,fc="red", edgecolor = 'none')
bottombar = plt.Rectangle((0,0),1,1,fc='#0000A3',  edgecolor = 'none')
l = plt.legend([bottombar, topbar], ['cis total_cores', 'rmb total_cores'], loc=1, ncol = 2, prop={'size':16})
l.draw_frame(False)

#Optional code - Make plot look nicer
sns.despine(left=True)
bottom_plot.set_ylabel("total_cores")
bottom_plot.set_xlabel("logTime")
bottom_plot.set_xticklabels(stacked_bar_data.logTime, rotation=30, fontsize='small')
plt.show()
# #Set fonts to a consistent size (8pt here)
# for item in ([bottom_plot.xaxis.label, bottom_plot.yaxis.label] +
#              bottom_plot.get_xticklabels() + bottom_plot.get_yticklabels()):
#     item.set_fontsize(8)
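
The code above gets the stacked look by drawing the total as a background bar and overlaying the cis bar on top of it. As an alternative sketch (not the method used in this post), matplotlib's `bar` accepts a `bottom` argument that places the second series directly on top of the first. The data values and the output file name below are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical wide-format data with the same column names as the plot above.
df = pd.DataFrame({
    "logTime": ["t1", "t2", "t3"],
    "rmb_cores": [120, 130, 110],
    "cis_cores": [80, 75, 90],
})

fig, ax = plt.subplots(figsize=(8, 4))
positions = list(range(len(df)))

# Bottom series: cis starts at zero.
ax.bar(positions, df["cis_cores"], color="#0000A3", label="cis total_cores")
# Top series: rmb starts where cis ends.
ax.bar(positions, df["rmb_cores"], bottom=df["cis_cores"],
       color="red", label="rmb total_cores")

ax.set_xticks(positions)
ax.set_xticklabels(df["logTime"], rotation=30, fontsize="small")
ax.set_xlabel("logTime")
ax.set_ylabel("total_cores")
ax.legend(frameon=False)
fig.savefig("stacked_cores.png")  # hypothetical output file name
```

With `bottom`, no precomputed total column is needed, and the legend entries come straight from the `label` arguments instead of hand-built rectangles.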