
Learn Pandas in Half an Hour

We covered NumPy earlier; NumPy helps us work with numerical data. pandas (which is built on NumPy) can handle other kinds of data in addition to numbers, and as before, half an hour is enough to work through the content below.

pandas has two main data structures:

  1. Series: one-dimensional, a labelled array
  2. DataFrame: two-dimensional, a container of Series (see the sketch below)
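
As a quick preview (a minimal sketch; the column names x and y are made up for illustration), each column of a DataFrame is itself a Series:

import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])       # 1-D labelled array
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})   # 2-D table of columns

print(type(df["x"]))   # a single column is a pandas Series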

Part 1: Creating a Series; indexing and slicing

In [1]: import pandas as pd   # import the package

In [2]: t = pd.Series([1,2,3,4,5])   # create a one-dimensional Series

In [3]: t
Out[3]:   # the first column (0-4) is the index labels; you can supply your own at creation time via the index argument
0    1
1    2
2    3
3    4
4    5
dtype: int64

In [4]: type(t)  # the type of t is Series
Out[4]: pandas.core.series.Series

In [5]: t.dtype
Out[5]: dtype('int64')

In [6]: t2 = pd.Series([5,6,3,2,9],index=list("abcde"))  # create a Series with custom labels via the index argument

In [7]: t2
Out[7]:
a    5
b    6
c    3
d    2
e    9
dtype: int64

# create a Series from a dict: the keys become the labels and the values become the elements
In [8]: d = {"name":"Jack","age":20,"sex":"male"}

In [9]: d
Out[9]: {'name': 'Jack', 'age': 20, 'sex': 'male'}
# pass the dict in to build the Series
In [10]: t3 = pd.Series(d)

In [11]: t3
Out[11]:
age       20
name    Jack
sex     male
dtype: object

In [12]: t3.dtype
Out[12]: dtype('O')

# access values by label
In [13]: t3["age"]
Out[13]: 20

In [14]: t3["name"]
Out[14]: 'Jack'

# access by position, starting from 0
In [15]: t3[1]
Out[15]: 'Jack'

# slicing works the same way as for lists and NumPy arrays
In [16]: t3[1:]
Out[16]:
name    Jack
sex     male
dtype: object

In [17]: t3[0:2:1]
Out[17]:
age       20
name    Jack
dtype: object

# take several non-contiguous values by position
In [18]: t3[[0,2]]
Out[18]:
age      20
sex    male
dtype: object

# take several values by label
In [19]: t3[["name","age"]]
Out[19]:
name    Jack
age       20
dtype: object

# get the labels (index) of the Series
In [20]: t3.index
Out[20]: Index(['age', 'name', 'sex'], dtype='object')

# iterate over the labels
In [21]: for i in t3.index: print(i)
age
name
sex
# get the values of the Series
In [22]: t3.values
Out[22]: array([20, 'Jack', 'male'], dtype=object)

In [23]: for i in t3.values: print(i)
20
Jack
male

# get a contiguous slice of the labels
In [24]: t3.index[:2]
Out[24]: Index(['age', 'name'], dtype='object')
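
One detail the transcript does not show: slicing a Series by label includes both endpoints, while slicing by position excludes the stop, just like Python lists. A minimal sketch reusing t2 from In [6]:

import pandas as pd

t2 = pd.Series([5, 6, 3, 2, 9], index=list("abcde"))
print(t2["b":"d"])   # labels b, c and d (the endpoint is included)
print(t2[1:4])       # positions 1, 2 and 3 (the stop is excluded)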

Part 2: Creating a DataFrame; common attributes and methods

In [1]: import pandas as pd

In [2]: import numpy as np

# create a two-dimensional array; its type is DataFrame
In [7]: t = pd.DataFrame(np.arange(5,29).reshape(4,6))

In [8]: t
Out[8]:   # by default there are both row and column indexes; you can also specify your own
    0   1   2   3   4   5
0   5   6   7   8   9  10
1  11  12  13  14  15  16
2  17  18  19  20  21  22
3  23  24  25  26  27  28

# use index and columns to specify the row labels and the column labels
In [9]: t2 = pd.DataFrame(np.arange(12).reshape((3,4)),index=list("abc"),columns=list("WXYZ"))

In [10]: t2
Out[10]:
   W  X   Y   Z
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

# create a DataFrame from a dict; the keys become the column labels
In [11]: d = {"name":["zhangsan","lisi"],"age":[20,18],"address":["Peking","Shanghai"]}

In [12]: d
Out[12]:
{'name': ['zhangsan', 'lisi'],
 'age': [20, 18],
 'address': ['Peking', 'Shanghai']}

In [13]: t3 = pd.DataFrame(d)

In [14]: t3
Out[14]:
    address  age      name
0    Peking   20  zhangsan
1  Shanghai   18      lisi

In [15]: d2 = [{"name":"xiaohong","age":18,"tel":10089},{"name":"xiaohong","tel":10089},
    ...:       {"name":"xiaohong","age":18}]

In [16]: d2
Out[16]:
[{'name': 'xiaohong', 'age': 18, 'tel': 10089},
 {'name': 'xiaohong', 'tel': 10089},
 {'name': 'xiaohong', 'age': 18}]

In [17]: t4 = pd.DataFrame(d2)

In [18]: t4
Out[18]:
    age      name      tel
0  18.0  xiaohong  10089.0
1   NaN  xiaohong  10089.0
2  18.0  xiaohong      NaN

# some DataFrame attributes
In [19]: t3.shape
Out[19]: (2, 3)

In [20]: t3.dtypes
Out[20]:
address    object
age         int64
name       object
dtype: object

In [21]: t3.index
Out[21]: RangeIndex(start=0, stop=2, step=1)

In [22]: t3.columns
Out[22]: Index(['address', 'age', 'name'], dtype='object')

In [23]: t3.values
Out[23]:
array([['Peking', 20, 'zhangsan'],
       ['Shanghai', 18, 'lisi']], dtype=object)

In [25]: t3.head(1)   # get the first row
Out[25]:
  address  age      name
0  Peking   20  zhangsan

In [26]: t3.tail(1)   # get the last row
Out[26]:
    address  age  name
1  Shanghai   18  lisi

In [27]: t3.info()   # overview of the DataFrame's metadata
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
address    2 non-null object
age        2 non-null int64
name       2 non-null object
dtypes: int64(1), object(2)
memory usage: 72.0+ bytes

Part 3: Reading files into a DataFrame; selecting rows and columns

# read a file; here it is a CSV file. pandas also provides readers for other formats such as JSON (see the sketch below)
In [47]: df = pd.read_csv("D:/dogNames2.csv")

# the result is a DataFrame; take the first 4 rows to have a look
In [48]: df.head(4)
Out[48]:
  Row_Labels  Count_AnimalName
0          1                 1
1          2                 2
2      40804                 1
3      90201                 1
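
The comment above mentions that pandas ships readers for other formats besides CSV. A minimal sketch (the .json/.xlsx paths are made up, and read_excel needs an engine such as openpyxl installed):

import pandas as pd

df_json = pd.read_json("D:/dogNames2.json")            # read a JSON file
df_xlsx = pd.read_excel("D:/dogNames2.xlsx")           # read an Excel sheet
df_json.to_csv("D:/dogNames2_copy.csv", index=False)   # write any DataFrame back out as CSV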

# take rows 2-4 (positions 2 up to, but not including, 5). Note: a number or slice inside the square brackets selects rows; a string selects a column by its label
In [49]: df[2:5]
Out[49]:
  Row_Labels  Count_AnimalName
2      40804                 1
3      90201                 1
4      90203                 1

In [50]: df[:5]
Out[50]:
  Row_Labels  Count_AnimalName
0          1                 1
1          2                 2
2      40804                 1
3      90201                 1
4      90203                 1

In [51]: df["Row_Labels"][:5]
Out[51]:
0        1
1        2
2    40804
3    90201
4    90203
Name: Row_Labels, dtype: object

In [52]: df["Row_Labels"].index
Out[52]: RangeIndex(start=0, stop=16220, step=1)

In [53]: df[2:5]["Row_Labels"]
Out[53]:
2    40804
3    90201
4    90203
Name: Row_Labels, dtype: object

In [54]: print(df[2:5]["Row_Labels"])
2    40804
3    90201
4    90203
Name: Row_Labels, dtype: object

In [55]: type(df[2:5]["Row_Labels"])
Out[55]: pandas.core.series.Series

# selecting rows returns a DataFrame
In [56]: type(df[2:5]) 
Out[56]: pandas.core.frame.DataFrame

# selecting a single column returns a Series
In [58]: type(df["Row_Labels"])
Out[58]: pandas.core.series.Series

# pandas also provides optimised selectors:
# df.loc selects rows/columns by label
# df.iloc selects rows/columns by position

In [71]: t3 = pd.DataFrame(np.arange(12).reshape(3,4),index=list("abc"),columns=list("WXYZ"))

In [72]: t3
Out[72]:
   W  X   Y   Z
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

# before the comma is the row selector (a single row, a contiguous slice, or a list of rows); after the comma is the column selector, which works the same way
In [73]: t3.loc["a","W"]  
Out[73]: 0

In [74]: t3.loc[:"c","W":"Y"]
Out[74]:
   W  X   Y
a  0  1   2
b  4  5   6
c  8  9  10

In [75]: t3.loc[["a","c"],["W","Z"]]
Out[75]:
   W   Z
a  0   3
c  8  11

In [76]: t3.iloc[1]
Out[76]:
W    4
X    5
Y    6
Z    7
Name: b, dtype: int32

In [77]: t3.iloc[[0,1]]
Out[77]:
   W  X  Y  Z
a  0  1  2  3
b  4  5  6  7

In [78]: t3.iloc[0,1]
Out[78]: 1

In [79]: t3.iloc[0,[1,2,3]]
Out[79]:
X    1
Y    2
Z    3
Name: a, dtype: int32

In [80]: t3.iloc[:2,0:3]
Out[80]:
   W  X  Y
a  0  1  2
b  4  5  6

# check the type: Series. It is easy to tell when a selection yields a Series and when it yields a DataFrame (see the sketch after these examples)
In [81]: type(t3.iloc[1])
Out[81]: pandas.core.series.Series


In [84]: type(t3.iloc[:,2])
Out[84]: pandas.core.series.Series
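
A small sketch of the rule of thumb behind the two type checks above (reusing the same t3): selecting a single row or column yields a Series, while selecting with a list or a slice yields a DataFrame.

import numpy as np
import pandas as pd

t3 = pd.DataFrame(np.arange(12).reshape(3,4),index=list("abc"),columns=list("WXYZ"))

print(type(t3.iloc[1]))       # one row           -> Series
print(type(t3.iloc[[1]]))     # list of rows      -> DataFrame
print(type(t3["W"]))          # one column        -> Series
print(type(t3[["W", "X"]]))   # list of columns   -> DataFrame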

Part 4: Boolean indexing, string operations, handling missing data, and common statistical methods

# Suppose we want every dog name that was used more than 700 times and whose name string is longer than 4 characters. How do we select those rows?
In [85]: df = pd.read_csv("D:/dogNames2.csv")  # read the csv file

In [89]: df.head(10)  # look at the first 10 rows to see the format
Out[89]:
  Row_Labels  Count_AnimalName
0          1                 1
1          2                 2
2      40804                 1
3      90201                 1
4      90203                 1
5     102201                 1
6    3010271                 1
7      MARCH                 2
8      APRIL                51
9     AUGUST                14

# the answer. Note: & means AND, | means OR
In [90]: df[(df["Row_Labels"].str.len() > 4) & (df["Count_AnimalName"]>700)]
Out[90]:
      Row_Labels  Count_AnimalName
1156       BELLA              1195
2660     CHARLIE               856
8552       LUCKY               723
12368      ROCKY               823
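
This part's heading also promises str operations; besides .str.len() there is a whole family of vectorised string methods. A small sketch reusing the dog-name DataFrame (which substrings actually occur in the file is an assumption):

import pandas as pd

df = pd.read_csv("D:/dogNames2.csv")
names = df["Row_Labels"]

print(names.str.upper().head())                          # uppercase every name
print(names.str.contains("LUCK", na=False).sum())        # how many names contain "LUCK"
print(df[names.str.startswith("BEL", na=False)].head())  # boolean indexing with a str method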

In [91]: t3 = pd.DataFrame(np.arange(12).reshape(3,4),index=list("abc"),columns=list("WXYZ"))

In [92]: t3
Out[92]:
   W  X   Y   Z
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11

In [93]: t3.iloc[1:,2:] = np.nan

In [94]: t3
Out[94]:
   W  X    Y    Z
a  0  1  2.0  3.0
b  4  5  NaN  NaN
c  8  9  NaN  NaN

In [95]: pd.isnull(t3)
Out[95]:
       W      X      Y      Z
a  False  False  False  False
b  False  False   True   True
c  False  False   True   True

In [96]: t3["W"]
Out[96]:
a    0
b    4
c    8
Name: W, dtype: int32

In [97]: t3["Y"]
Out[97]:
a    2.0
b    NaN
c    NaN
Name: Y, dtype: float64

# indexing t3 with a boolean Series keeps only the rows where the value is True
In [98]: t3[pd.notnull(t3["Y"])] 
Out[98]:
   W  X    Y    Z
a  0  1  2.0  3.0

In [99]: pd.notnull(t3["Y"])
Out[99]:
a     True
b    False
c    False
Name: Y, dtype: bool

In [100]: type(pd.notnull(t3["Y"]))
Out[100]: pandas.core.series.Series

In [101]: t3
Out[101]:
   W  X    Y    Z
a  0  1  2.0  3.0
b  4  5  NaN  NaN
c  8  9  NaN  NaN

# drop every row that contains NaN; with axis=1 columns are dropped instead; the default for how is "any"
In [102]: t = t3.dropna(axis=0,how="any")

In [103]: t
Out[103]:
   W  X    Y    Z
a  0  1  2.0  3.0

In [104]: t3   # t3 is unchanged; dropna returns a new DataFrame
Out[104]:
   W  X    Y    Z
a  0  1  2.0  3.0
b  4  5  NaN  NaN
c  8  9  NaN  NaN

In [105]: t3.iloc[2]=np.nan  # now set the entire last row to NaN

In [106]: t3
Out[106]:
     W    X    Y    Z
a  0.0  1.0  2.0  3.0
b  4.0  5.0  NaN  NaN
c  NaN  NaN  NaN  NaN

# drop only the rows in which every value is NaN
In [107]: t2 = t3.dropna(axis=0,how="all")

In [108]: t2
Out[108]:
     W    X    Y    Z
a  0.0  1.0  2.0  3.0
b  4.0  5.0  NaN  NaN

In [109]:  t3.dropna(axis=0,how="all",inplace=True)

In [110]: t3
Out[110]:
     W    X    Y    Z
a  0.0  1.0  2.0  3.0
b  4.0  5.0  NaN  NaN

# note: inplace=True drops in place; no new DataFrame is returned and t3 itself is modified
In [111]:  t3.dropna(axis=0,inplace=True)

In [112]: t3
Out[112]:
     W    X    Y    Z
a  0.0  1.0  2.0  3.0

# now let's handle the NaN values in a DataFrame: replace each one with its column's mean, using pandas' statistics methods
In [115]: d2 = [{"name":"xiaohong","age":18,"tel":100},{"name":"xiaohong","tel":102},
     ...:       {"name":"xiaohong","age":20}]
     ...: t2 = pd.DataFrame(d2)

In [116]: t2
Out[116]:
    age      name    tel
0  18.0  xiaohong  100.0
1   NaN  xiaohong  102.0
2  20.0  xiaohong    NaN

# compute the mean of each numeric column (default axis=0; axis=1 would compute row means); the mean of age is (18 + 20) / 2 = 19
In [117]: t2.mean()
Out[117]:
age     19.0
tel    101.0
dtype: float64

In [118]: type(t2.mean())
Out[118]: pandas.core.series.Series

In [119]: t2
Out[119]:
    age      name    tel
0  18.0  xiaohong  100.0
1   NaN  xiaohong  102.0
2  20.0  xiaohong    NaN

# replace each NaN with its column's mean
In [120]: t2.fillna(t2.mean())
Out[120]:
    age      name    tel
0  18.0  xiaohong  100.0
1  19.0  xiaohong  102.0
2  20.0  xiaohong  101.0

# t2 itself is unchanged; fillna returns a new DataFrame, so assign the result to a variable
In [121]: t2
Out[121]:
    age      name    tel
0  18.0  xiaohong  100.0
1   NaN  xiaohong  102.0
2  20.0  xiaohong    NaN



In [123]: t2 = t2.fillna(t2.mean())

In [124]: t2
Out[124]:
    age      name    tel
0  18.0  xiaohong  100.0
1  19.0  xiaohong  102.0
2  20.0  xiaohong  101.0
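
Often you only want to fill a single column rather than every numeric column at once. A minimal sketch that rebuilds the same t2 as above:

import pandas as pd

d2 = [{"name":"xiaohong","age":18,"tel":100},{"name":"xiaohong","tel":102},
      {"name":"xiaohong","age":20}]
t2 = pd.DataFrame(d2)

# fill NaN only in the age column and leave tel untouched
t2["age"] = t2["age"].fillna(t2["age"].mean())
print(t2)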

Exercise: we have a CSV file of movie data (IMDB-Movie-Data.csv); a screenshot of its contents accompanied the original post and is omitted here.
Let's compute some statistics on it.

# 1. Count the movie genres (the Genre column)

# read the file
In [127]: df = pd.read_csv("D:\IMDB-Movie-Data.csv")
# treat the column as strings and split. Note: split() with no argument splits on whitespace, not commas, so the comma-separated genres stay glued together here; the script below passes "," explicitly. Look at the first 5 rows:
In [132]:  df["Genre"].str.split()[:5]
Out[132]:
0     [Action,Adventure,Sci-Fi]
1    [Adventure,Mystery,Sci-Fi]
2             [Horror,Thriller]
3     [Animation,Comedy,Family]
4    [Action,Adventure,Fantasy]
Name: Genre, dtype: object
# convert to a plain Python list and look at the first five entries
In [133]:  temp_list = df["Genre"].str.split().tolist()

In [134]: temp_list[:5]
Out[134]:
[['Action,Adventure,Sci-Fi'],
 ['Adventure,Mystery,Sci-Fi'],
 ['Horror,Thriller'],
 ['Animation,Comedy,Family'],
 ['Action,Adventure,Fantasy']]
# flatten everything into one list and deduplicate. Because of the whitespace split above, these are genre combinations rather than single genres; the full script below splits on "," so that each genre can be counted on its own
In [135]: ans = list(set([j for i in temp_list for j in i]))
In [137]: ans[:10]
Out[137]:
['Animation,Family,Fantasy',
 'Crime,Drama',
 'Adventure,Drama,Horror',
 'Comedy,Family,Fantasy',
 'Comedy,Horror,Sci-Fi',
 'Action,Horror,Thriller',
 'Biography,Crime,Drama',
 'Mystery,Thriller,Western',
 'Drama,Family,Fantasy',
 'Action,Horror,Mystery']
# the remaining steps are written as a script in an IDE, shown below

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
# read the CSV file; pandas can also read SQL, JSON, etc.
df = pd.read_csv("./IMDB-Movie-Data.csv")
# print(df)

# print(df["Genre"]) #分類

# build the list of distinct genres
temp_list = df["Genre"].str.split(",").tolist()   # a list of lists, one per movie
genre_list = list(set(i for j in temp_list for i in j))

# build an all-zeros DataFrame: one row per movie, one column per genre
zeros_df = pd.DataFrame(np.zeros((df.shape[0],len(genre_list))),columns=genre_list)
# print(zeros_df)

print(df.shape[0])

# for each movie, set the columns of its genres to 1
for i in range(df.shape[0]):
    zeros_df.loc[i,temp_list[i]] = 1
# print(zeros_df.head(3))

# sum each column to get the number of movies in each genre
genre_count = zeros_df.sum(axis = 0)
print(genre_count)

# sort
genre_count = genre_count.sort_values()
_x = genre_count.index
_y = genre_count.values

# plot a bar chart
plt.figure(figsize=(20,8),dpi=80)
plt.bar(range(len(_x)),_y)
plt.xticks(range(len(_x)),_x)
plt.show()

Part 5: Combining data with join and merge

# create two DataFrames to experiment with
In [150]: df1 = pd.DataFrame(np.ones((2,4)),index=["A","B"],columns=list("abcd"))
     ...: df2 = pd.DataFrame(np.zeros((3,3)),index=["A","B","C"],columns=list("xyz"))

In [151]: df1
Out[151]:
     a    b    c    d
A  1.0  1.0  1.0  1.0
B  1.0  1.0  1.0  1.0

In [152]: df2
Out[152]:
     x    y    z
A  0.0  0.0  0.0
B  0.0  0.0  0.0
C  0.0  0.0  0.0

# join df2 onto df1; df1's index is kept, so df2's extra row (C) is discarded and the columns are appended on the right. Note that the row indexes must line up
In [153]: df1.join(df2)
Out[153]:
     a    b    c    d    x    y    z
A  1.0  1.0  1.0  1.0  0.0  0.0  0.0
B  1.0  1.0  1.0  1.0  0.0  0.0  0.0

# join df1 onto df2; df2's index is kept, columns are appended on the right, and missing values are filled with NaN
In [154]: df2.join(df1)
Out[154]:
     x    y    z    a    b    c    d
A  0.0  0.0  0.0  1.0  1.0  1.0  1.0
B  0.0  0.0  0.0  1.0  1.0  1.0  1.0
C  0.0  0.0  0.0  NaN  NaN  NaN  NaN

# join returns a new DataFrame; df1 itself is unchanged
In [155]: df1
Out[155]:
     a    b    c    d
A  1.0  1.0  1.0  1.0
B  1.0  1.0  1.0  1.0

In [161]: df3 = pd.DataFrame(np.arange(9).reshape((3,3)),columns=list("fax"))

In [162]: df3
Out[162]:
   f  a  x
0  0  1  2
1  3  4  5
2  6  7  8
# note: on="a" merges rows where column a is equal, like an inner join in SQL. Why does df3's first row appear twice? Because both 1s in df1's column a match the 1 in column a of df3's first row.

In [163]: df1.merge(df3,on="a")
Out[163]:
   a    b    c    d  f  x
0  1  1.0  1.0  1.0  0  2
1  1  1.0  1.0  1.0  0  2

# change the 1 in df1's first row, column a, to 100; merging again now yields only one row
In [165]: df1.loc["A","a"] = 100

In [166]: df1
Out[166]:
       a    b    c    d
A  100.0  1.0  1.0  1.0
B    1.0  1.0  1.0  1.0

In [167]: df1.merge(df3,on="a")
Out[167]:
   a    b    c    d  f  x
0  1  1.0  1.0  1.0  0  2

# the following three calls correspond to SQL's full outer join, left join and right join
In [168]: df1.merge(df3,on="a",how="outer")
Out[168]:
       a    b    c    d    f    x
0  100.0  1.0  1.0  1.0  NaN  NaN
1    1.0  1.0  1.0  1.0  0.0  2.0
2    4.0  NaN  NaN  NaN  3.0  5.0
3    7.0  NaN  NaN  NaN  6.0  8.0

In [169]: df1.merge(df3,on="a",how="left")
Out[169]:
     a    b    c    d    f    x
0  100  1.0  1.0  1.0  NaN  NaN
1    1  1.0  1.0  1.0  0.0  2.0

In [170]: df1.merge(df3,on="a",how="right")
Out[170]:
     a    b    c    d  f  x
0  1.0  1.0  1.0  1.0  0  2
1  4.0  NaN  NaN  NaN  3  5
2  7.0  NaN  NaN  NaN  6  8
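
One more merge variant worth knowing: when the key columns have different names in the two DataFrames, use left_on/right_on instead of on. A sketch with made-up data:

import pandas as pd

left = pd.DataFrame({"user_id": [1, 2, 3], "score": [90, 80, 70]})
right = pd.DataFrame({"uid": [2, 3, 4], "city": ["Peking", "Shanghai", "Shenzhen"]})

# inner join on user_id == uid
print(left.merge(right, left_on="user_id", right_on="uid", how="inner"))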

Part 6: Grouping and aggregation

We'll illustrate this part with an exercise. There is a CSV file of Starbucks store data (a screenshot of its contents accompanied the original post and is omitted here).
Every row is a Starbucks store; let's count the number of stores in each country.

# read the data
In [178]: df = pd.read_csv("D:/starbucks_store_worldwide.csv")
# group by country
In [179]: g = df.groupby(by="Country")

g is iterable; iterating over it yields (group key, sub-DataFrame) pairs, as shown below.

# the groups can be iterated over
for i, j in g:
     print(i)
     print("-"*100)
     print(j)
     print("*"*100)

Output:

****************************************************************************************************
ZA
----------------------------------------------------------------------------------------------------
           Brand  Store Number      Store Name Ownership Type  \
25597  Starbucks  47608-253804   Rosebank Mall       Licensed   
25598  Starbucks  47640-253809    Menlyn Maine       Licensed   
25599  Starbucks  47609-253286  Mall of Africa       Licensed   

                                          Street Address          City  \
25597          Cnr Tyrwhitt and Cradock Avenue, Rosebank  Johannesburg   
25598  Shop 61B, Central Square, Cnr Aramist & Coroba...        Menlyn   
25599             Shop 2077, Upper Level, Waterfall City       Midrand   

      State/Province Country Postcode Phone Number  \
25597             GT      ZA     2194  27873500159   
25598             GT      ZA      181          NaN   
25599             GT      ZA     1682  27873500215   

                             Timezone  Longitude  Latitude  
25597  GMT+000000 Africa/Johannesburg      28.04    -26.15  
25598  GMT+000000 Africa/Johannesburg      28.28    -25.79  
25599  GMT+000000 Africa/Johannesburg      28.11    -26.02  
****************************************************************************************************
...

Each group is a country code plus its rows, so all we need is the row count per country. For example, ZA has 3 rows, i.e. 3 Starbucks stores.

# the counts per country
In [180]: g["Brand"].count()
Out[180]:
Country
AD        1
AE      144
AR      108
AT       18
AU       22
AW        3
AZ        4
BE       19
BG        5
BH       21
BN        5
...

Now let's count the number of Starbucks stores in each province of China.

In [182]: china_data = df[df["Country"]=="CN"]
In [185]: china_data.groupby(by="State/Province")["Brand"].count()
Out[185]:
State/Province
11    236
12     58
13     24
14      8
15      8
21     57
22     13
23     16
31    551
32    354
33    315
34     26
35     75
36     13
37     75
41     21
42     76
43     35
44    333
45     21
46     16
50     41
51    104
52      9
53     24
61     42
62      3
63      3
64      2
91    162
92     13
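
groupby can also take several keys at once, and aggregate with more than count(). A sketch assuming the same starbucks_store_worldwide.csv as above:

import pandas as pd

df = pd.read_csv("D:/starbucks_store_worldwide.csv")

# store counts per country and per province in one pass (the result carries a compound index)
print(df.groupby(by=["Country", "State/Province"])["Brand"].count().head())

# several aggregations of a numeric column at once
print(df.groupby("Country")["Longitude"].agg(["count", "min", "max"]).head())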

Part 7: Indexes and compound indexes

In [190]: df1 = pd.DataFrame(np.ones((2,4)),index=["A","B"],columns=list("abcd"))

In [191]: df1
Out[191]:
     a    b    c    d
A  1.0  1.0  1.0  1.0
B  1.0  1.0  1.0  1.0

In [192]: df1.index = ["a","b"]  # modify the index

In [193]: df1  
Out[193]:  # "A" and "B" have been replaced by "a" and "b"
     a    b    c    d
a  1.0  1.0  1.0  1.0
b  1.0  1.0  1.0  1.0

In [195]: df1.reindex(["A","T"])   # reindex to labels that no longer exist, so every value becomes NaN
Out[195]:
    a   b   c   d
A NaN NaN NaN NaN
T NaN NaN NaN NaN
	
In [196]: df1 = df1.set_index("a")  # use column a as the row index; column a disappears from the data
In [197]: df1
Out[197]:
       b    c    d
a
1.0  1.0  1.0  1.0
1.0  1.0  1.0  1.0

In [198]: df1 = pd.DataFrame(np.ones((2,4)),index=["A","B"],columns=list("abcd"))

In [199]: df1
Out[199]:
     a    b    c    d
A  1.0  1.0  1.0  1.0
B  1.0  1.0  1.0  1.0

# with drop=False, column a is kept in the data as well
In [200]: df1 = df1.set_index("a",drop=False) 

In [201]: df1
Out[201]:
       a    b    c    d
a
1.0  1.0  1.0  1.0  1.0
1.0  1.0  1.0  1.0  1.0
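
The heading of this part mentions compound indexes: passing a list of columns to set_index creates a hierarchical (MultiIndex) row index. A minimal sketch with made-up data:

import pandas as pd

d = {"Country": ["CN", "CN", "US"],
     "City": ["Beijing", "Shanghai", "Seattle"],
     "Count": [236, 551, 100]}
df = pd.DataFrame(d).set_index(["Country", "City"])   # two-level index

print(df)
print(df.loc["CN"])                 # select by the outer level
print(df.loc[("CN", "Beijing")])    # select one (outer, inner) pair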

Part 8: Time series

# generate a range of dates from start to end; freq is the interval: D means day, 10D means every 10 days, M means month-end, and so on. Note: all three date formats shown below are accepted; don't use arbitrary formats
In [204]: pd.date_range(start="20171201",end="20180101",freq="D")
Out[204]:
DatetimeIndex(['2017-12-01', '2017-12-02', '2017-12-03', '2017-12-04',
               '2017-12-05', '2017-12-06', '2017-12-07', '2017-12-08',
               '2017-12-09', '2017-12-10', '2017-12-11', '2017-12-12',
               '2017-12-13', '2017-12-14', '2017-12-15', '2017-12-16',
               '2017-12-17', '2017-12-18', '2017-12-19', '2017-12-20',
               '2017-12-21', '2017-12-22', '2017-12-23', '2017-12-24',
               '2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28',
               '2017-12-29', '2017-12-30', '2017-12-31', '2018-01-01'],
              dtype='datetime64[ns]', freq='D')

In [205]: pd.date_range(start="20171201",end="20180101",freq="10D")
Out[205]: DatetimeIndex(['2017-12-01', '2017-12-11', '2017-12-21', '2017-12-31'], dtype='datetime64[ns]', freq='10D')

# periods specifies how many dates to generate
In [206]: pd.date_range(start="2017-12-01",periods=10,freq="D")
Out[206]:
DatetimeIndex(['2017-12-01', '2017-12-02', '2017-12-03', '2017-12-04',
               '2017-12-05', '2017-12-06', '2017-12-07', '2017-12-08',
               '2017-12-09', '2017-12-10'],
              dtype='datetime64[ns]', freq='D')

In [207]: pd.date_range(start="2017/12/01",periods=10,freq="M")
Out[207]:
DatetimeIndex(['2017-12-31', '2018-01-31', '2018-02-28', '2018-03-31',
               '2018-04-30', '2018-05-31', '2018-06-30', '2018-07-31',
               '2018-08-31', '2018-09-30'],
              dtype='datetime64[ns]', freq='M')
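
A DatetimeIndex is most useful as the index of a Series or DataFrame, because you can then slice by date strings. A minimal sketch with random data:

import numpy as np
import pandas as pd

index = pd.date_range(start="20171201", periods=10, freq="D")
t = pd.Series(np.random.rand(10), index=index)   # one random value per day

print(t.head())
print(t["2017-12-03":"2017-12-05"])   # slice by date strings (both endpoints included)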