Pandas Type Conversion: Finding the Maximum Value and Its Index

阿新 • Published: 2020-12-10

Tags: pandas, data analysis, python

Introduction

II. Which SKU sold the most units in 2019

1. Using pivot_table

2. Using groupby

Summary

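Introduction

In data analysis it is important to know the exact type of each column; otherwise odd problems surface partway through the analysis.
When analyzing data with pandas you will sooner or later need to convert between data types. This article covers the basic pandas data types (dtype) and how to convert between them, then uses the cleaned data to find a maximum value and its index.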

Pandas Data Type

Pandas dtype	Python type	NumPy type	Usage
object	str or mixed	string, unicode, mixed types	Text or mixed numeric and non-numeric values
int64	int	int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64	Integer numbers
float64	float	float_, float16, float32, float64	Floating point numbers
bool	bool	bool_	True/False values
datetime64[ns]	NA	datetime64[ns]	Date and time values
timedelta64[ns]	NA	NA	Differences between two datetimes
category	NA	NA	Finite list of text values

Why care about dtype

Check the data before going any further with pandas.
The wrong dtype can raise errors or, worse, silently produce wrong results.
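As a minimal sketch of how this goes wrong, the snippet below reads two tiny in-memory CSVs that differ in a single cell. The data and the names `clean` and `dirty` are made up for illustration; only `pd.read_csv` and `pd.api.types.is_numeric_dtype` are real pandas calls.

```python
import io

import pandas as pd

# Two hypothetical CSVs: identical except for one non-numeric cell in Sold.
clean = pd.read_csv(io.StringIO("Sku,Sold\n1,10\n2,20\n"))
dirty = pd.read_csv(io.StringIO("Sku,Sold\n1,10\n2,無\n"))

# A fully numeric column is parsed as an integer dtype ...
print(clean.dtypes)
# ... but a single stray text value makes pandas keep the whole column as text.
print(dirty.dtypes)

sold_is_numeric = pd.api.types.is_numeric_dtype(dirty["Sold"])
print("dirty Sold is numeric:", sold_is_numeric)
```

Checking `dtypes` right after loading is a cheap way to catch such a column before any arithmetic is run on it.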

The rest of this article uses the following CSV file:

# data_type.csv
Sku,Views,Month,Day,Year,Sold,Reviews,Active
212039,20,2,2,2019,10,2,Y
212038,21,2,2,2018,10,2,Y
212037,22,2,2,2019,10,2,Y
212036,23,2,2,2019,10,2,Y
212035,24,2,2,2019,10,2,Y
212034,25,2,2,2019,10,2,Y
212033,26,2,2,2019,10,2,Y
212032,27,2,2,2019,10,2,Y
212031,28,2,2,2019,10,2,N
212030,29,2,2,2019,10,2,N
212039,20,3,3,2019,100,50,Y
212038,21,3,3,2019,90,48,Y
212037,22,3,3,2019,80,46,Y
212036,23,3,3,2019,70,44,Y
212035,無,3,3,2019,無,0,Y

import pandas as pd
import numpy as np
df = pd.read_csv("../datas/data_type.csv")

df

	Sku	Views	Month	Day	Year	Sold	Reviews	Active
0	212039	20	2	2	2019	10	2	Y
1	212038	21	2	2	2018	10	2	Y
2	212037	22	2	2	2019	10	2	Y
3	212036	23	2	2	2019	10	2	Y
4	212035	24	2	2	2019	10	2	Y
5	212034	25	2	2	2019	10	2	Y
6	212033	26	2	2	2019	10	2	Y
7	212032	27	2	2	2019	10	2	Y
8	212031	28	2	2	2019	10	2	N
9	212030	29	2	2	2019	10	2	N
10	212039	20	3	3	2019	100	50	Y
11	212038	21	3	3	2019	90	48	Y
12	212037	22	3	3	2019	80	46	Y
13	212036	23	3	3	2019	70	44	Y
14	212035	無	3	3	2019	無	0	Y

df.dtypes

Sku         int64
Views      object
Month       int64
Day         int64
Year        int64
Sold       object
Reviews     int64
Active     object
dtype: object

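I. astype and apply

This section introduces two functions, astype and apply. For details, run help(df['Active'].astype).

astype: converts the values to the specified pandas data type
apply: calls a function on each value and collects the return values in a new Series

First, convert the Active column to bool and see what happens: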

df['Active'].astype('bool')

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
12    True
13    True
14    True
Name: Active, dtype: bool

Every value is True, including rows 8 and 9, where we expected False.
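A minimal sketch of why this happens, assuming an object-dtype column of strings like Active: the cast to bool tests Python truthiness rather than parsing the text, and every non-empty string is truthy. The `flags` Series is made-up sample data.

```python
import pandas as pd

# Hypothetical sample; dtype=object mirrors the Active column shown above.
flags = pd.Series(["Y", "N"], dtype=object)

# Plain Python truthiness: only the empty string is falsy.
truthiness = [bool(value) for value in ["Y", "N", ""]]
print(truthiness)

# The cast therefore maps both 'Y' and 'N' to True.
as_bool = flags.astype("bool")
print(as_bool.tolist())
```

So astype('bool') cannot tell 'Y' from 'N'; the mapping has to be stated explicitly.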

How do we get the expected result?

Option 1: write a conversion function
Option 2: use a lambda
Option 3: use np.where

Option 1

# Option 1
def convert_bool(val):
    """
    Convert the string value to bool
     - if Y, then return True
     - if N, then return False
    """

    if val == 'Y':
        return True
    return False

df['Active'].apply(convert_bool)

0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8     False
9     False
10     True
11     True
12     True
13     True
14     True
Name: Active, dtype: bool

Option 2

# Option 2
df["Active"].apply(lambda item: True if item=='Y' else False)

0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8     False
9     False
10     True
11     True
12     True
13     True
14     True
Name: Active, dtype: bool

Option 3

# Option 3
np.where(df["Active"] == "Y", True, False)
# df['Active'] = np.where(df["Active"] == "Y", True, False)

array([ True,  True,  True,  True,  True,  True,  True,  True, False,
       False,  True,  True,  True,  True,  True])
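Two further variants are worth sketching; neither appears in the original article. A plain comparison already yields a boolean Series, and `Series.map` with a dictionary makes the mapping explicit. The `active` Series here is made-up sample data.

```python
import pandas as pd

# Hypothetical sample of the Active column.
active = pd.Series(["Y", "N", "Y"], name="Active")

# A comparison is already vectorized and returns a boolean Series.
by_compare = active == "Y"

# map() with a dict spells out both cases; any value missing from
# the dict would become NaN instead of silently turning into False.
by_map = active.map({"Y": True, "N": False})

print(by_compare.tolist())
print(by_map.tolist())
```

The comparison is the shortest form. The map variant is useful when unexpected values such as 'y' or '' should stand out rather than be counted as False.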

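II. Which SKU sold the most units in 2019

Step 1: sum the units sold in 2019 for each SKU
Step 2: pick the SKU with the largest total

1. Using pivot_table

Call pivot_table with
Sku and Year as the index and
Sold as the values to aggregate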

# Keep only the 2019 rows; copy so later assignments do not touch df
newdf = df[df["Year"] == 2019].copy()
pd.pivot_table(newdf, index=['Sku','Year'], values=['Sold'], aggfunc=np.sum)

		Sold
Sku	Year
212030	2019	10
212031	2019	10
212032	2019	10
212033	2019	10
212034	2019	10
212035	2019	10無
212036	2019	1070
212037	2019	1080
212038	2019	90
212039	2019	10100

The Sold totals are not what we expected: the values look like concatenated strings. Let's find out what happened.

The first thing to check is the data types:

newdf.dtypes

Sku         int64
Views      object
Month       int64
Day         int64
Year        int64
Sold       object
Reviews     int64
Active     object
dtype: object

Sold has dtype object rather than an integer type, so np.sum concatenates the strings instead of adding numbers.
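The effect can be reproduced in isolation. This is a minimal sketch with a made-up two-value Series; the values mirror the two Sold entries for Sku 212039.

```python
import pandas as pd

# Hypothetical sample: digits stored as strings in an object column.
sold = pd.Series(["10", "100"], dtype=object)

# Summing strings uses '+', which concatenates them.
text_total = sold.sum()
print(repr(text_total))

# After casting to int, the same call adds numbers.
int_total = sold.astype(int).sum()
print(int_total)
```

This matches the 10100 shown for Sku 212039 in the pivot table above, where the numeric total would be 110.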

So can we simply cast it to int?

newdf['Sold'].astype(int)
# Raises:
# ValueError: invalid literal for int() with base 10: '無'

As expected, it fails. The data has to be cleaned first so that invalid values are dealt with.

pandas has a handy function for this:

pd.to_numeric(arg, errors='coerce', downcast=None); run help(pd.to_numeric) for the full details.
With errors='coerce', any value that cannot be parsed is set to NaN.

# Replace the NaN values produced by coercion with 0
pd.to_numeric(newdf['Sold'], errors='coerce').fillna(0)

0      10.0
2      10.0
3      10.0
4      10.0
5      10.0
6      10.0
7      10.0
8      10.0
9      10.0
10    100.0
11     90.0
12     80.0
13     70.0
14      0.0
Name: Sold, dtype: float64
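The result is float64 because NaN is a float value. As a hedged sketch on made-up data, the snippet below shows the intermediate NaN and two ways to finish the cleanup: counting an unparseable value as 0, which is the choice this article makes, or dropping it. Which one is right depends on what '無' means in the data.

```python
import pandas as pd

# Hypothetical sample with one unparseable value.
raw = pd.Series(["10", "無", "70"], dtype=object)

# Unparseable entries become NaN, so the result is float64.
coerced = pd.to_numeric(raw, errors="coerce")

# Choice 1: treat the bad value as 0 and go back to integers.
as_zero = coerced.fillna(0).astype("int64")

# Choice 2: discard the bad value instead.
dropped = coerced.dropna().astype("int64")

print(coerced.tolist())
print(as_zero.tolist())
print(dropped.tolist())
```

For a sum, both choices give the same total. They differ for statistics such as the mean or the row count.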

# Overwrite newdf['Sold']
# The '無' in Sold for Sku 212035 (row 14) is now 0.0
newdf['Sold'] = pd.to_numeric(newdf['Sold'], errors='coerce').fillna(0)
newdf

	Sku	Views	Month	Day	Year	Sold	Reviews	Active
0	212039	20	2	2	2019	10.0	2	Y
2	212037	22	2	2	2019	10.0	2	Y
3	212036	23	2	2	2019	10.0	2	Y
4	212035	24	2	2	2019	10.0	2	Y
5	212034	25	2	2	2019	10.0	2	Y
6	212033	26	2	2	2019	10.0	2	Y
7	212032	27	2	2	2019	10.0	2	Y
8	212031	28	2	2	2019	10.0	2	N
9	212030	29	2	2	2019	10.0	2	N
10	212039	20	3	3	2019	100.0	50	Y
11	212038	21	3	3	2019	90.0	48	Y
12	212037	22	3	3	2019	80.0	46	Y
13	212036	23	3	3	2019	70.0	44	Y
14	212035	無	3	3	2019	0.0	0	Y

Run pivot_table again:

frame = pd.pivot_table(newdf, index=['Sku'], values=['Sold'], aggfunc=[np.sum])
frame

	sum
	Sold
Sku
212030	10.0
212031	10.0
212032	10.0
212033	10.0
212034	10.0
212035	10.0
212036	80.0
212037	90.0
212038	90.0
212039	110.0

Get the maximum value

# Option 1
max_sold_nums = frame[('sum','Sold')].max()
# Get the index label (the Sku) of the maximum
max_sold_idx = frame[('sum','Sold')].idxmax()
# Select that row
max_sold_infos = frame.loc[max_sold_idx]
print('Max sold numbers: \n', max_sold_nums)
print('Max sold sku details: \n', max_sold_infos)

Max sold numbers: 
 110.0
Max sold sku details: 
 sum  Sold    110.0
Name: 212039, dtype: float64

# Option 2
# Flatten the column MultiIndex with stack
frame.columns

MultiIndex([('sum', 'Sold')],
           )

frame.stack().reset_index()

	Sku	level_1	sum
0	212030	Sold	10.0
1	212031	Sold	10.0
2	212032	Sold	10.0
3	212033	Sold	10.0
4	212034	Sold	10.0
5	212035	Sold	10.0
6	212036	Sold	80.0
7	212037	Sold	90.0
8	212038	Sold	90.0
9	212039	Sold	110.0

single_frame = frame.stack().reset_index()
max_sold_nums = single_frame['sum'].max()
max_sold_idx = single_frame['sum'].idxmax()
max_sold_infos = single_frame.loc[max_sold_idx]
print('Max sold numbers: \n', max_sold_nums)
print('Max sold sku details: \n', max_sold_infos)

Max sold numbers: 
 110.0
Max sold sku details: 
 Sku        212039
level_1      Sold
sum           110
Name: 9, dtype: object
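Both options work around the MultiIndex columns that appear when aggfunc is passed as a list. A minimal sketch of a third route, not taken from the original article, is to pass aggfunc as a plain string, which leaves a single-level Sold column. The `sales` frame is a small made-up sample using the same column names.

```python
import pandas as pd

# Hypothetical sample with the article's Sku and Sold columns.
sales = pd.DataFrame({
    "Sku": [212039, 212038, 212039, 212038],
    "Sold": [10.0, 10.0, 100.0, 90.0],
})

# A string aggfunc (not a list) keeps the columns single-level.
flat = pd.pivot_table(sales, index="Sku", values="Sold", aggfunc="sum")
print(flat)

best_sku = flat["Sold"].idxmax()    # index label of the largest total
best_total = flat["Sold"].max()     # the largest total itself
print(best_sku, best_total)
```

With single-level columns, `flat["Sold"]` replaces the tuple key `('sum', 'Sold')` and no stack step is needed.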

2. Using groupby


max_sold_nums = newdf.groupby(['Sku'])['Sold'].sum().max()
max_sold_idx = newdf.groupby(['Sku'])['Sold'].sum().idxmax()

print('Max sold numbers: \n', max_sold_nums)
print('Max sold sku: \n', max_sold_idx)

Max sold numbers: 
 110.0
Max sold sku: 
 212039
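The snippet above runs the same groupby twice. As a sketch under assumed data, the whole task (filter the year, clean Sold, total per SKU, find the best seller) can be written as one pipeline that aggregates once. The `sales` frame and the variable names are illustrative, not part of the original article.

```python
import pandas as pd

# Hypothetical sample with the article's columns, including a bad '無' value.
sales = pd.DataFrame({
    "Sku": [212039, 212038, 212039, 212038, 212035],
    "Year": [2019, 2018, 2019, 2019, 2019],
    "Sold": ["10", "10", "100", "90", "無"],
})

totals = (
    sales[sales["Year"] == 2019]                      # keep 2019 only
    .assign(Sold=lambda d: pd.to_numeric(d["Sold"], errors="coerce").fillna(0))
    .groupby("Sku")["Sold"]
    .sum()                                            # one aggregation, reused below
)

best_sku = totals.idxmax()
best_total = totals.max()
print("Max sold sku:", best_sku)
print("Max sold numbers:", best_total)
```

Keeping `totals` in a variable avoids recomputing the aggregation and keeps the per-SKU sums available for other queries, such as `totals.nlargest(3)`.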

Summary

The pandas data types and how to convert between them
How to use max, idxmax, and loc
Basic use of pivot_table
Basic use of groupby
