Learning Pandas Profiling

Introduction

Being a data scientist today is an exciting and rewarding career. With the explosion of technology and the immense amount of data and content created daily, data scientists continually need to learn new ways of analysing this data efficiently. One of the most crucial parts of any new data project is the exploratory data analysis phase. This phase allows you to familiarise yourself with the data at hand: where it was collected from, any gaps in it, any potential outliers and the range of data types used. One tool that has become a staple among data scientists is Pandas Profiling, an open-source tool written in Python that generates interactive HTML reports. The reports detail the types of data within the dataset, highlight missing values, provide descriptive statistics including mean, standard deviation and skewness, create histograms and surface any potential correlations.

Installing Pandas Profiling

For this article, we are using PyCharm which is an integrated development environment created by JetBrains. PyCharm is an excellent tool to use as it handles tasks including creating a virtual environment for the project and the installation of packages referenced in your code.

To get started, open PyCharm and select File > New Project. You will be presented with a dialogue where you can name the project and create an associated virtual environment. Virtual environments allow you to install specific Python packages that your project can reference without having to install the packages globally on your machine. This is handy when you have multiple projects that require different versions of the same package.

Once the default packages have been installed in the virtual environment, we need to install Pandas Profiling. To do this, navigate to File > Settings > Project > Project Interpreter, select the + button in the top right, search for pandas-profiling and then press Install Package.
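
If you would rather confirm the installation from code than from the PyCharm dialogue, a small check along the following lines may help. It is only a sketch: the helper name `is_installed` is made up for this example, and it assumes the package is importable as `pandas_profiling` (newer releases of the project are published as ydata-profiling, imported as `ydata_profiling`).

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the active interpreter can locate the module."""
    return importlib.util.find_spec(module_name) is not None


# Check pandas plus both import names the profiling package has used.
for name in ("pandas", "pandas_profiling", "ydata_profiling"):
    status = "found" if is_installed(name) else "not found"
    print(f"{name}: {status}")
```

Run it with the project's virtual environment selected as the interpreter, otherwise it reports on a different set of packages.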

Installing Pandas Profiling using PyCharm's Project Interpreter.

Getting Started

For this example, we have created a simple Python script that you can use to get started. If this is your first time using Python please read Getting Started — Python Pandas where we explain the code within the script below.

A Python script that generates an HTML Pandas Profiling report using fake data.
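
The original script is not reproduced in this copy of the article, so what follows is a minimal sketch of what such a script could look like, not the authors' code. It builds fake employee records with the standard library and passes them to `ProfileReport`. The name, email, city and employee number columns follow those mentioned later in the article; the helper names, the salary column, the sample values and the roughly 5% of missing cities are assumptions made for illustration.

```python
import random

FIRST_NAMES = ["Ava", "Ben", "Chloe", "Dan", "Eli", "Fay", "Gus", "Hana"]
LAST_NAMES = ["Ng", "Smith", "Patel", "Jones", "Kim", "Brown", "Silva", "Moore"]
CITIES = ["Sydney", "Melbourne", "Brisbane", "Perth", "Adelaide"]


def make_fake_employees(n, seed=42):
    """Build n fake employee records as a list of dictionaries."""
    rng = random.Random(seed)  # seeded so every run produces the same data
    rows = []
    for number in range(1, n + 1):
        first, last = rng.choice(FIRST_NAMES), rng.choice(LAST_NAMES)
        rows.append({
            "employee_number": number,
            "name": f"{first} {last}",
            "email": f"{first}.{last}{number}@example.com".lower(),
            # Leave roughly 5% of cities empty so the report has gaps to show.
            "city": None if rng.random() < 0.05 else rng.choice(CITIES),
            "salary": round(rng.gauss(70_000, 12_000), 2),
        })
    return rows


def write_profile_report(rows, output_path="pandas_profile_text.html"):
    """Turn the records into a DataFrame and write the HTML report."""
    # Imported here so the data generation above runs without these packages.
    import pandas as pd
    from pandas_profiling import ProfileReport

    df = pd.DataFrame(rows)
    report = ProfileReport(df, title="Pandas Profiling Report")
    report.to_file(output_path)
    return df


rows = make_fake_employees(500)
print(rows[0])
```

With pandas and pandas-profiling installed, calling `write_profile_report(rows)` should write the HTML file to the working directory.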

After executing the script a new HTML file called pandas_profile_text.html will be created in your project root directory. To view the report right-click on the HTML file and select Open in Browser > Default.

Pandas Profiling Report

Overview

Overview section within the Pandas Profiling Report

The Overview section, the first section within the Pandas Profiling Report, shows summarised statistics for the dataset as a whole. It returns the number of variables, which is the number of columns in the passed DataFrame, and the number of observations, which is the number of rows. The Overview also provides the number of missing cells and duplicate rows, each with the percentage of the total that was affected. These two statistics are quite important to a data scientist, as they may indicate broader data quality issues or issues with the code used to extract the data. The Overview section also reports the size of the dataset in memory, the average record size in memory and the data types that were recognised.
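
To make these figures concrete, the sketch below derives the same kind of counts by hand from a small list of records. It uses plain Python rather than Pandas Profiling itself; the sample records and the use of `None` to mark a missing cell are assumptions made for illustration.

```python
# Four sample records; None marks a missing cell, and the last row repeats the first.
records = [
    {"name": "Ava Ng", "city": "Perth", "salary": 70000},
    {"name": "Ben Kim", "city": None, "salary": 82000},
    {"name": "Chloe Jones", "city": "Sydney", "salary": None},
    {"name": "Ava Ng", "city": "Perth", "salary": 70000},
]

n_variables = len(records[0])   # columns
n_observations = len(records)   # rows
n_cells = n_variables * n_observations

# A missing cell is any value that is None.
missing_cells = sum(value is None for row in records for value in row.values())

# A duplicate row is any row identical to one seen earlier.
distinct_rows = {tuple(row.items()) for row in records}
duplicate_rows = n_observations - len(distinct_rows)

print(f"Variables: {n_variables}, observations: {n_observations}")
print(f"Missing cells: {missing_cells} ({missing_cells / n_cells:.1%})")
print(f"Duplicate rows: {duplicate_rows} ({duplicate_rows / n_observations:.1%})")
```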

Under the Warnings tab within the Overview section, you can find collated warnings for any of the variables within the dataset. In this example, we received a high cardinality warning for name, email and city. In this context, high cardinality means that the flagged columns contain a very high number of distinct values. In the real world, you would expect this for columns such as employee number and email.
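
A rough way to see why a column gets flagged is to count its distinct values. The sketch below does this in plain Python; the sample columns are invented, and the threshold of 3 is an arbitrary value chosen to suit the tiny sample, not the library's setting.

```python
columns = {
    "email": ["ava@example.com", "ben@example.com", "chloe@example.com",
              "dan@example.com", "eli@example.com", "fay@example.com"],
    "department": ["Sales", "Sales", "IT", "IT", "Sales", "IT"],
}

THRESHOLD = 3  # assumed cut-off for this example only


def distinct_count(values):
    """Number of distinct values in a column."""
    return len(set(values))


high_cardinality = [
    name for name, values in columns.items()
    if distinct_count(values) > THRESHOLD
]

for name, values in columns.items():
    print(f"{name}: {distinct_count(values)} distinct of {len(values)} values")
print("Flagged as high cardinality:", high_cardinality)
```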

Variables: Categorical

Pandas Profiling Report results for a categorical variable

The Variables section within the Pandas Profiling report analyses the columns within the passed DataFrame. A categorical variable is a column that contains data that represents a Python string type.

A typical metric returned for categorical variables is the length of the strings within the column. To view the generated histogram select Toggle Details then navigate to the Length tab. The length tab also contains statistics regarding the maximum, median, mean and minimum values of the string length.
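
The length statistics can be reproduced by hand for a small column, which shows what the Length tab is summarising. This is an illustrative sketch using the standard library, with made-up city names as the sample data.

```python
import statistics

cities = ["Perth", "Sydney", "Melbourne", "Brisbane", "Adelaide"]

# Length of every string in the column.
lengths = [len(city) for city in cities]

length_stats = {
    "max": max(lengths),
    "median": statistics.median(lengths),
    "mean": statistics.mean(lengths),
    "min": min(lengths),
}
print(length_stats)
```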

Variables: Numerical

Pandas Profiling Report results for a numerical variable

Pandas Profiling offers an in-depth analysis of numerical variables covering quantile and descriptive statistics. It returns the minimum and maximum values within the dataset and the range between them. It also displays the quartile values, which describe the distribution of the ordered values above and below the median by dividing the set into four bins. When considering the quartile values, if the distance between quartile one and the median is greater than the distance between the median and quartile three, we interpret this as the smaller values being more scattered than the larger values. The interquartile range is simply quartile three minus quartile one.
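
As a worked example of the quantile statistics, the sketch below computes them for nine made-up values using Python's `statistics` module. The data and the choice of the "inclusive" interpolation method are assumptions; the report may interpolate differently, so its figures can differ slightly on real data.

```python
import statistics

values = [2, 4, 10, 18, 20, 21, 22, 23, 24]

minimum, maximum = min(values), max(values)
value_range = maximum - minimum

# quantiles(n=4) returns the three cut points: Q1, the median and Q3.
# The "inclusive" method treats the data as the whole population.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(f"min={minimum}, max={maximum}, range={value_range}")
print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")

# Q1 sits much further from the median than Q3 does, so the lower half
# of the data is more spread out than the upper half.
lower_gap, upper_gap = median - q1, q3 - median
print(f"median-Q1={lower_gap}, Q3-median={upper_gap}")
```

Here quartile one is 10, the median is 20 and quartile three is 22, giving an interquartile range of 12. The gap below the median (10) is far larger than the gap above it (2), which matches the interpretation described above.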

Standard deviation reflects how the values in the dataset are distributed around the mean. A low standard deviation implies that the values are close to the mean, whereas a higher standard deviation implies that they are spread over a greater range. The coefficient of variation, also known as relative standard deviation, is the ratio of the standard deviation to the mean. Kurtosis describes the shape of the data by measuring how heavy the tails of the distribution are, so its value depends on the distribution of the data and the presence of extreme outliers. The median absolute deviation is another measure of spread, taken around the median rather than the mean, and it is more robust when an extreme outlier is present. Skewness reflects how far the distribution departs from a symmetric bell shape. Positive skewness means the distribution has a longer tail to the right, and negative skewness means it has a longer tail to the left.
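
The sketch below computes these measures for two small lists, one symmetric and one with an extreme outlier, to show how each reacts. It uses simple population formulas and reports excess kurtosis (kurtosis minus 3); these are assumptions made to keep the example short, and the report may apply sample corrections that give slightly different numbers.

```python
import statistics


def describe(values):
    """Spread and shape measures using simple population formulas."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    median = statistics.median(values)
    # Median absolute deviation: the median distance from the median.
    mad = statistics.median([abs(v - median) for v in values])
    # Standardised third and fourth moments.
    skewness = sum((v - mean) ** 3 for v in values) / n / std ** 3
    kurtosis = sum((v - mean) ** 4 for v in values) / n / std ** 4 - 3
    return {
        "std": std,
        "cv": std / mean,
        "mad": mad,
        "skewness": skewness,
        "excess_kurtosis": kurtosis,
    }


symmetric = describe([1, 2, 3, 4, 5])
with_outlier = describe([1, 2, 3, 4, 50])

for label, stats in (("symmetric", symmetric), ("with outlier", with_outlier)):
    print(label, {key: round(value, 3) for key, value in stats.items()})
```

The median absolute deviation is 1 for both lists, while the standard deviation grows from about 1.41 to about 19.03 once the outlier is added. The skewness is zero for the symmetric list and positive for the list with the large value on the right.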

Interaction and Correlations

Interaction graph from the Pandas Profiling Report.

The Interaction and Correlations sections are where Pandas Profiling really sets itself apart from other exploratory tools. It analyses all the variables as pairs and highlights any highly correlated variables using the Pearson, Spearman, Kendall and Phik measures. It provides a powerful, easy-to-understand visual representation of any data that correlates strongly. As a data scientist, this is a great starting point for asking why these data pairs may correlate.
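
To illustrate why the report offers more than one measure, the sketch below implements Pearson and Spearman by hand and applies them to a relationship that always increases but is not a straight line. The function names and data are invented for this example, the ranking helper assumes there are no tied values, and Kendall and Phik are left out for brevity.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: strength of the linear relationship."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 upwards (assumes there are no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


x = [1, 2, 3, 4, 5]
y = [value ** 3 for value in x]  # rises with x, but not in a straight line

print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

Spearman reports a perfect correlation because the ordering of the two lists matches exactly, whereas Pearson comes out a little below 1 because the relationship is curved.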

Missing Values

Missing values bar chart from the Pandas Profiling Report

The Missing Values section builds on the missing cells metric from the Overview section. It visually represents where the missing values are occurring against all the columns within the DataFrame. This section may highlight data quality issues and may require missing data to be mapped to a default value which we will cover in a later article.
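
The same per-column view can be approximated in a few lines of plain Python. This sketch counts missing cells per column and prints a crude text bar of populated versus missing cells; the records and the use of `None` for a missing value are assumptions made for illustration.

```python
records = [
    {"name": "Ava Ng", "city": "Perth", "salary": 70000},
    {"name": "Ben Kim", "city": None, "salary": 82000},
    {"name": "Chloe Jones", "city": None, "salary": None},
    {"name": "Dan Moore", "city": "Sydney", "salary": 64000},
]

# Count the None values in each column.
missing_by_column = {
    column: sum(row[column] is None for row in records)
    for column in records[0]
}

# One '#' per populated cell and one '.' per missing cell.
for column, missing in missing_by_column.items():
    present = len(records) - missing
    print(f"{column:<8}{'#' * present}{'.' * missing}  ({missing} missing)")
```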

Sample Section

The sample section displays a snapshot of results from the head and tail of the dataset. If the dataset is ordered on a particular column you can use this section to gain an understanding of what type of records the minimum and maximum column values are associated with.
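
The idea can be sketched with a sorted list of records: once the data is ordered on a column, the first rows hold its smallest values and the last rows its largest. The records and the sample size of 2 are assumptions chosen for illustration.

```python
records = [
    {"name": "Ava Ng", "salary": 70000},
    {"name": "Ben Kim", "salary": 82000},
    {"name": "Chloe Jones", "salary": 55000},
    {"name": "Dan Moore", "salary": 64000},
    {"name": "Eli Patel", "salary": 91000},
]

# Order on one column, then take the first and last rows.
ordered = sorted(records, key=lambda row: row["salary"])
sample_size = 2
head, tail = ordered[:sample_size], ordered[-sample_size:]

print("Head (smallest salaries):", head)
print("Tail (largest salaries):", tail)
```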

Summary

Pandas Profiling is an incredible open-source tool that every data scientist should consider adding to their toolbox for the data exploration phase of any project. It offers an efficient way to digest and analyse an unfamiliar dataset by providing in-depth descriptive statistics, visual distribution graphs and a powerful set of correlation tools.

Thank you for taking the time to read our article. We hope you have found it valuable.

Translated from: https://towardsdatascience.com/learning-pandas-profiling-fc533336edc7
