Introduction to Machine Learning: Basic Usage of the pandas Library

阿新 • Published: 2022-04-07

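Roles of numpy and pandas

numpy mainly provides numerical functions that you call through its API
pandas is the main tool, used for data analysis
pandas covers:
Data processing
Operators: map, filter, groupby, apply
Data slicing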

pandas

1. Official website

[https://pandas.pydata.org/](https://pandas.pydata.org/)

2. Overview

pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
[In short, pandas is a data analysis tool]

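3. Master the core programming model [core API] [data types]

1. Series

1. Introduction

        1. is a one-dimensional labeled array
        [A Series is a one-dimensional labeled array]
        2. capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.)
        [It can hold values of any data type]
        3. The axis labels are collectively referred to as the index.
        [The labels are the index; you can use them to look up elements of the Series]

2. Instantiation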

import numpy as np
import pandas as pd
if __name__ == '__main__':

    # Create a Series without specifying an index (a default 0..n-1 index is used)
    n1_arr = np.random.randint(1, 10, 3)
    pd_series = pd.Series(data=n1_arr)
    print(pd_series)
    print(type(pd_series))
    print(pd_series.index)
    # Create a Series with an explicit index
    print('**'*20)
    n1_arr2 = np.random.randint(1, 10, 3)
    pd_series2 = pd.Series(data=n1_arr2,index=['a','b','c'])
    print(pd_series2)
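
Since the labels act as the index, it may help to see label-based and position-based lookup side by side. The following is a minimal sketch; the variable names and the values in `scores` are made up for illustration and are not from the original listing.

```python
import pandas as pd

# A Series with explicit string labels as its index (illustrative values).
scores = pd.Series(data=[90, 75, 60], index=["a", "b", "c"])

# .loc looks up by label, .iloc looks up by integer position.
by_label = scores.loc["b"]
by_position = scores.iloc[0]

# Slicing by label includes both endpoints, unlike positional slicing.
label_slice = scores.loc["a":"b"]

print(by_label, by_position)
print(label_slice)
```

Using `.loc` and `.iloc` explicitly avoids ambiguity when the index itself consists of integers.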

2. DataFrame

1. Introduction

        1. DataFrame is a 2-dimensional labeled data structure with columns of potentially different types
            [A DataFrame is a two-dimensional labeled array whose columns may have different data types]
            [index (row labels) and columns (column labels)]
            [Think of a DataFrame as a table with rows and columns]
        2. How to create one
            Dict of 1D ndarrays, lists, dicts, or Series
            2-D numpy.ndarray
            Structured or record ndarray
            A Series
            Another DataFrame
            [A DataFrame can be built from many kinds of data]
        3. a dict of Series objects
        df => table => rows and columns
        DataFrame = DataFrame[Series], that is, a DataFrame is made up of Series

2. Instantiation

import numpy as np
import pandas as pd
if __name__ == '__main__':
    # 1. From a 1-D array: default row index and a single column named 0
    data = np.random.randint(1, 10, 3)
    df = pd.DataFrame(data=data)
    print(df)
    print(df.index)
    print(df.columns)

    # 2. Same data with explicit row labels and a column name
    data1 = np.random.randint(1, 10, 3)
    df = pd.DataFrame(data=data1,index=['a','b','c'],columns=['num'])
    print(df)
    print(df.index)
    print(df.columns)
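
The "dict of Series objects" view mentioned in the introduction can be sketched as follows. This is only an illustration: the column names `name` and `age` and their values are assumptions, chosen to resemble the columns used later in the post.

```python
import pandas as pd

# Two Series sharing the same index (illustrative data).
names = pd.Series(["zs", "ls", "ww"], index=["a", "b", "c"])
ages = pd.Series([18, 25, 30], index=["a", "b", "c"])

# The dict keys become the column names; the Series are aligned on their index.
people = pd.DataFrame({"name": names, "age": ages})

# Selecting one column gives back a Series.
age_column = people["age"]

print(people)
print(type(age_column))
```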

3. Summary:

    1. A df is a table
    2. A Series is like a table with only one column and no column name
    3. df = DataFrame[Series + column names]

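4. Think about it:

What does a data analyst usually do with a dataset? [The same goes for big data engineers]
1. Load the data: source
2. Analyze the data: transform
3. Output the data: sink
Basic df operations:
1. Data loading
File formats to handle:
csv json text
Structured data:
csv: fields separated by commas => can be opened in Excel
json: key-value (kv)
Unstructured data:
txt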
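
The source, transform, sink flow above can be tried without any files on disk. The sketch below reads CSV and JSON from in-memory strings; the sample records and the derived column `age_next_year` are invented for illustration.

```python
import io
import pandas as pd

# Illustrative in-memory data standing in for files on disk.
csv_text = "id,name,age\n1,zs,18\n2,ls,25\n"
json_text = '[{"id": 1, "name": "zs"}, {"id": 2, "name": "ls"}]'

# source: read_csv and read_json accept file paths or file-like objects.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

# transform: a trivial derived column.
df_csv["age_next_year"] = df_csv["age"] + 1

# sink: with no path given, to_csv returns the CSV text as a string.
csv_out = df_csv.to_csv(index=False)

print(df_csv)
print(df_json)
print(csv_out)
```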

5. Data operations

import numpy as np
import pandas as pd
if __name__ == '__main__':
    # 1. Load the data
    df = pd.read_csv(r"D:\ \python-sk\0407data\1.csv")
    # print(df)

    # 2. Data operations todo...
    print(df.head()) # first five rows
    print('*'*20)
    print(df.tail()) # last five rows

    # Inspect df attributes (good to know)
    print(df.shape)
    print('*' * 20)
    print(df.index)
    print('*' * 20)
    print(df.columns)
    print('*' * 20)
    print(df.ndim)
    print('*' * 20)
    df.info() # info() prints its summary itself and returns None

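6. Joins

import numpy as np
import pandas as pd
if __name__ == '__main__':
    # 1. Load the data
    df = pd.read_csv(r"D:\ \python-sk\0407data\1.csv")

    print(df.head())

    # Combining data
    '''
    table
        union [stack two tables into one larger table]
        union all
        join [match rows on shared fields]
        
        A common interview question:
        the difference between union and union all
        union removes duplicate rows from the combined result sets
        union all returns every row, duplicates included
    '''
    # union all => concat, which stacks rows by default and keeps duplicates;
    # chain .drop_duplicates() to get union semantics
    df_union = pd.concat([df, df])
    print(df_union.head(10))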
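    '''
    a,b
    
    select 
    xxx
    from a left join b
    on a.id = b.id and a.name = b.name
    
    join : inner       intersection of the keys
           left        every row of the left table
           right       every row of the right table
           full outer  every row of both tables
    '''

    # join => merge (how defaults to "inner")
    df_join = pd.merge(df, df, on="id",suffixes=("_l", "_r"))
    print(df_join.head())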

    pd_a = pd.DataFrame({'name': ['a', 'c'], 'id': [1, 2]})
    pd_b = pd.DataFrame({'name': ['a', 'b'], 'address': ['Dalian', 'Shenyang']})

    print(pd_a.head())
    print(pd_b.head())

    # Left join: unmatched rows get NaN
    df_join = pd.merge(pd_a,pd_b,how="left",on="name")
    print(df_join.head())

    # Right join: unmatched rows get NaN
    df_join = pd.merge(pd_a, pd_b, how="right", on="name")
    print(df_join.head())

    # Full outer join: keeps rows from both sides, unmatched values are NaN
    df_join = pd.merge(pd_a, pd_b, how="outer", on="name")
    print(df_join.head())

    emp = pd.read_csv(r"D:\ \python-sk\0407data\emp.csv")
    dept = pd.read_csv(r"D:\ \python-sk\0407data\dept.csv")

    # Show every column [otherwise some are replaced by "..."]; wide output may wrap
    pd.set_option("display.max_columns",None)
    df_join = pd.merge(emp, dept, how="right", on="deptno")
    print(df_join.head())
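
The union / union all distinction and the two-key join condition from the SQL notes above can be checked on small tables. This is a sketch under assumptions: the frames `left` and `right` and their contents are illustrative, not the author's CSV files.

```python
import pandas as pd

# Illustrative tables.
left = pd.DataFrame({"id": [1, 2], "name": ["a", "c"]})
right = pd.DataFrame({"id": [1, 3], "name": ["a", "b"],
                      "address": ["Dalian", "Shenyang"]})

# UNION ALL: concat keeps every row, duplicates included.
union_all = pd.concat([left, left], ignore_index=True)

# UNION: drop the duplicate rows afterwards.
union = union_all.drop_duplicates().reset_index(drop=True)

# LEFT JOIN ... ON a.id = b.id AND a.name = b.name: pass both keys to `on`.
# indicator=True adds a _merge column showing where each row came from.
joined = pd.merge(left, right, how="left", on=["id", "name"], indicator=True)

print(union_all)
print(union)
print(joined)
```

The `_merge` column is a quick way to see which rows failed to match before deciding how to handle the resulting NaN values.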

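7. Slicing

import numpy as np
import pandas as pd
if __name__ == '__main__':
    # 1. Load the data
    df = pd.read_csv(r"D:\ \python-sk\0407data\1.csv")
    print(df.head())

    # Slicing
    # 1. Get a column without its column header (returns a Series); of limited use
    print(df['name'].head()) # Series
    # print(df['name'].values)
    # 1. Get a column with its column header (returns a DataFrame); of limited use
    print(df[['name']].head()) # DataFrame

    # 2. Get rows: 1. slice by row position
    print(df[0:2]) # good to know, no need to master
    print("**"*20)
    # 2. boolean [select rows by condition]
    # the list needs one True/False per row, so this line assumes df has exactly 3 rows
    print(df[[True, False,True]])
    print("**" * 20)
    age = df['age'] > 20
    print(age)
    print("**" * 20)
    print(df[df['age'] > 20]) # returns a DataFrame; worth a look
    print("**" * 20)
    print(df[(df['age'] > 18) & (df['name'] == 'ls')]) # returns a DataFrame; multiple conditions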
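
Boolean selection is the part of slicing that matters most in practice, so here is a small self-contained sketch. The data is made up, and the use of `.loc` to pick rows and a column in one step goes slightly beyond the original listing.

```python
import pandas as pd

# Illustrative data with the same column names as the listing above.
people = pd.DataFrame({"name": ["zs", "ls", "ww"], "age": [17, 25, 30]})

# Each comparison yields a boolean Series; combine them with & (and), | (or), ~ (not).
# The parentheses are required because & binds tighter than > and ==.
mask = (people["age"] > 18) & (people["name"] == "ls")
adults_named_ls = people[mask]

# .loc selects rows by mask and a column by label in one step.
names_over_18 = people.loc[people["age"] > 18, "name"]

# Negating a condition.
not_ls = people[~(people["name"] == "ls")]

print(adults_named_ls)
print(names_over_18)
print(not_ls)
```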

8. Key topic [operators]: data cleaning, data computation, data analysis

import numpy as np
import pandas as pd
if __name__ == '__main__':
    # 1. Load the data
    df = pd.read_csv(r"D:\ \python-sk\0407data\emp.csv")
    print(df.head())

    # todo... data analysis
    '''
        data cleaning =>
        data transformation =>
    '''
    # map: one-to-one mapping, y = f(x)
    # With an anonymous function (lambda)
    # df['sal_add'] = df['sal'].map(lambda sal:sal+200)
    # print('-'*20)
    # print(df.head())

    # With a named function
    def sal_level(sal):
        sal_f = np.float64(sal)
        if sal_f > 100:
            return "high"
        elif sal_f > 10:
            return "low"
        # values <= 10 (and NaN) fall through and return None
    # map passes each element as the single argument, so pass the function itself without calling it
    df['sal_level'] = df['sal'].map(sal_level)
    print('--'*10)
    print(df.head())

    df['ename'] = df['ename'].map(lambda x: x.upper())

    print('--' * 10)
    print(df.head())

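    # applymap => applies the function to every element of the whole table
    # (deprecated since pandas 2.1 in favor of DataFrame.map)
    df2 = df.applymap(lambda x: x*2)
    print('--' * 10)
    print(df2.head())

    # apply with axis=1 passes each row to the function, so it can read any column;
    # it is more general than map
    df['ename01'] = df.apply(lambda x:x['ename']+'abc',axis=1)
    print('--' * 10)
    print(df.head())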

    # Select columns
    df_filter = df.filter(items=['ename','sal','deptno'],axis=1)
    print(df_filter.head())

    # like with axis=0 matches against the row index labels
    # (keeps rows whose label contains '2') [not needed for now]
    df_filter2 = df.filter(like='2', axis=0)
    print(df_filter2.head())

    # where: conditional filter; the shape is kept and rows that fail the condition become NaN
    df_where = df_filter.where(df_filter['sal'] > 100)
    print('--' * 10)
    print(df_where.head())

    # where with multiple conditions
    df_where2 = df_filter.where((df_filter['sal'] > 100) & (df_filter['deptno'] == 20))
    print('--' * 10)
    print(df_where2.head())
    # print(df_where2.info())
    # print(df_filter.info())

    # Handling null values
    # name is null
    #      is not null
    # dropna: how='any' drops a row if any value is null; how='all' drops it only if every value is null
    print(df_where.head())
    df_etl = df_where.dropna(how='all')
    print('--'*10)
    print(df_etl.head())

    # fillna fills null values; value sets what they are replaced with
    df_etl1 = df_where.fillna(value=0)
    print('--'*10)
    print(df_etl1.head())

    # fillna with a dict: a different fill value per column
    values = {'ename':'--','sal':0}
    df_etl1 = df_where.fillna(value=values)
    print('--'*10)
    print(df_etl1.head())

    # Sorting: order by / sort by
    # sort_values
    # ascending=True sorts ascending, ascending=False sorts descending
    print('--'*20)
    print(df_filter.head())
    values = df_filter.sort_values(by=['sal'], ascending=True)
    print(values)

    # One flag per key: sal descending; rows with the same sal are ordered by ename ascending
    print('--'*20)
    print(df_filter.head())
    values2 = df_filter.sort_values(by=['sal','ename'], ascending=[False,True])
    print(values2)
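
To compare map with the vectorized alternatives and to check a multi-key sort without the emp.csv file, the sketch below uses a tiny hand-made table. The rows and the 1000 threshold in `sal_level` are assumptions for illustration only.

```python
import pandas as pd

# Illustrative stand-in for emp.csv.
emp = pd.DataFrame({
    "ename": ["smith", "allen", "ward"],
    "sal": [800, 1600, 1250],
    "deptno": [20, 30, 30],
})

# Simple element-wise work is often clearer with vectorized operations than with map.
emp["ename_upper"] = emp["ename"].str.upper()
emp["sal_add"] = emp["sal"] + 200


def sal_level(sal):
    """Classify a salary; the 1000 cutoff is a hypothetical threshold."""
    return "high" if sal > 1000 else "low"


# map is still useful when the logic does not vectorize easily.
emp["sal_level"] = emp["sal"].map(sal_level)

# Sort by deptno ascending, then by sal descending within each deptno.
ranked = emp.sort_values(by=["deptno", "sal"], ascending=[True, False])

print(ranked)
```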

9.filter

select
    xxx
    xxx
    from xxx
    where 
        xxx like ''

    # Select columns
    df_filter = df.filter(items=['ename','sal','deptno'],axis=1)
    print(df_filter.head())

    # like with axis=0 matches against the row index labels
    # (keeps rows whose label contains '2') [not needed for now]
    df_filter2 = df.filter(like='2', axis=0)
    print(df_filter2.head())

    # where: conditional filter; the shape is kept and rows that fail the condition become NaN
    df_where = df_filter.where(df_filter['sal'] > 100)
    print('--' * 10)
    print(df_where.head())

    # where with multiple conditions
    df_where2 = df_filter.where((df_filter['sal'] > 100) & (df_filter['deptno'] == 20))
    print('--' * 10)
    print(df_where2.head())
    # print(df_where2.info())
    # print(df_filter.info())
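
One point that is easy to miss: `DataFrame.where` does not remove rows the way a SQL WHERE clause does; it keeps the shape and blanks out the non-matching rows. The sketch below contrasts it with boolean indexing and shows one possible counterpart of `LIKE 'A%'`. The data and the 1000 threshold are illustrative assumptions.

```python
import pandas as pd

# Illustrative stand-in for the filtered emp table.
emp = pd.DataFrame({
    "ename": ["SMITH", "ALLEN", "WARD"],
    "sal": [800, 1600, 1250],
    "deptno": [20, 30, 30],
})

# where keeps all three rows; the row failing the condition becomes all NaN.
kept_shape = emp.where(emp["sal"] > 1000)

# Boolean indexing actually drops the failing rows, like SQL WHERE.
filtered = emp[emp["sal"] > 1000]

# where followed by dropna(how="all") selects the same rows as boolean indexing.
same_rows = kept_shape.dropna(how="all")

# A rough counterpart of `ename LIKE 'A%'`, using the string accessor.
starts_with_a = emp[emp["ename"].str.startswith("A")]

print(kept_shape)
print(filtered)
print(starts_with_a)
```

Note that after `where` the numeric columns are converted to float, because NaN cannot be stored in an integer column.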

If your company already has a well-written in-house API for this, it is most likely more capable.

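10. Grouped aggregation

Grouped aggregation
    Statistical metrics
        dimension, metric
        group by
        grouping + aggregate function => metric
        
        groupby:
            
        Data:
            10,a
            10,b
            20,c
            30,d
        groupby => group by key: "rows with the same key are put together" =>
        aggregate function => does the work on each group [computes the metric]
        
        group by =>
        10 , <a,b>
        20 , <c>
        30 , <d>
        aggregate function => count()
        10,2
        20,1
        30,1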
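
The walkthrough above can be reproduced directly with the same four rows. The column names `key` and `value` are assumptions, since the original data has no header.

```python
import pandas as pd

# The four rows from the walkthrough; the column names are hypothetical.
rows = pd.DataFrame({"key": [10, 10, 20, 30], "value": ["a", "b", "c", "d"]})

# Grouping step: collect the values that share the same key.
grouped_values = rows.groupby("key")["value"].apply(list)

# Aggregation step: count the values in each group.
counts = rows.groupby("key")["value"].count()

print(grouped_values)
print(counts)
```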

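import numpy as np
import pandas as pd
if __name__ == '__main__':
    # 1. Load the data
    df = pd.read_csv(r"D:\ \python-sk\0407data\emp.csv")
    print(df.head())

    print('--'*10)

    # sum() on its own is blunt: it is applied to every column, so the result is
    # filtered down to sal afterwards (df.groupby('deptno')[['sal']].sum() avoids that)
    df_res = df.groupby(by=['deptno']).sum().filter(items=['sal'])
    print('--' * 10)
    print(df_res.head())

    # Select the columns first, then aggregate; depending on the pandas version,
    # string columns such as ename are either dropped or concatenated by sum()
    df_res2 = df.filter(items=['ename','sal','deptno']).groupby(by=['deptno']).sum()
    print('--' * 10)
    print(df_res2.head())

    pd.set_option("display.max_columns",None)
    # Several aggregations must be passed as one list; keep only the numeric
    # columns first, because 'mean' is not defined for strings
    df_res3 = df.select_dtypes(include='number').groupby(by=['deptno']).agg(['sum','min','mean']) # all numeric columns
    print('--' * 10)
    print(df_res3.head())

    # Aggregate only selected columns; add more dict entries separated by commas
    df_res4 = df.groupby(by=['deptno']).agg({'sal':['min','max']})
    print('--' * 10)
    print(df_res4.head())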


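    # Write to a file
    # 1. Rename the columns
    df_res.columns = ["sal"]
    print(df_res)
    # 2. Write the file
    # Caution: this path is the same emp.csv that was loaded above, so the input is overwritten
    df_res.to_csv(r'D:\ \python-sk\0407data\emp.csv')
    print('--')
    print(df_res.filter(items=['sal'], axis=1))
    '''
    select
    name,
    sum(score) as scoreall
    from table
    where xxx
    group by name
    '''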
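
A possible pandas counterpart of the SQL above is sketched below, using named aggregation to get the `scoreall` alias. The table contents and the `score >= 60` condition standing in for `where xxx` are assumptions; the original leaves both unspecified.

```python
import pandas as pd

# Illustrative score table.
scores = pd.DataFrame({
    "name": ["zs", "zs", "ls", "ww"],
    "score": [80, 90, 70, 50],
})

# WHERE: a hypothetical condition standing in for "where xxx".
passed = scores[scores["score"] >= 60]

# GROUP BY name with SUM(score) AS scoreall, via named aggregation.
# as_index=False keeps name as an ordinary column, like a SQL result set.
result = passed.groupby("name", as_index=False).agg(scoreall=("score", "sum"))

print(result)
```

Named aggregation sets the output column name in the same step, so no separate rename of `columns` is needed before writing the result out.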