How to read CSV-format data with pandas
阿新 • Published: 2019-01-01
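This article shows how to use pandas to read data in CSV format.
CSV format here means that the elements on each line are separated by commas. The file extension does not matter; it can be .txt, .data, and so on.
For example: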
#x.txt
1,2,3
4,5,6
----------------------------------------------------------
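import pandas as pd
column_name = ['A', 'B', 'C']
# x.txt has no header row, so the column names are supplied through names=
t = pd.read_csv('./x.txt', names=column_name)
print(t)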
>>
   A  B  C
0  1  2  3
1  4  5  6
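The names argument matters in the example above because x.txt has no header row. As a minimal sketch, the snippet below contrasts reading the same content with and without names. It uses an in-memory io.StringIO buffer instead of the file so that it runs without x.txt on disk; the buffer and the variable names are illustrative assumptions, not part of the original example.

```python
import io
import pandas as pd

# Same content as x.txt, held in memory instead of on disk
# (an assumption made so the snippet is self-contained).
raw = "1,2,3\n4,5,6\n"

# Without names=, read_csv treats the first line as the header row,
# so only one data row remains.
no_names = pd.read_csv(io.StringIO(raw))

# With names=, every line is data and the columns get the given labels.
with_names = pd.read_csv(io.StringIO(raw), names=['A', 'B', 'C'])

# header=None also keeps every line as data, but labels the columns 0, 1, 2.
no_header = pd.read_csv(io.StringIO(raw), header=None)

print(no_names.shape)
print(with_names.shape)
print(list(no_header.columns))
```

If the file already starts with a header line, as the breast-cancer files used later appear to, names can simply be left out.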
1. Import the pandas package
import pandas as pd
2. Read the data with the read_csv function
train=pd.read_csv('./Datasets/Breast-Cancer/breast-cancer-train.csv')
test=pd.read_csv('./Datasets/Breast-Cancer/breast-cancer-test.csv')
print(train.shape)  # (rows, columns), the same value np.shape(train) returns
print(type(train))
>> (175,4)
>> <class 'pandas.core.frame.DataFrame'>
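The data that was read is stored in train, but its type is a pandas DataFrame rather than the NumPy array we usually work with. Calling np.array(train) converts it to an array, after which it can be handled with ordinary matrix operations.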
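A minimal sketch of that conversion, using a tiny made-up DataFrame in place of the training file (the data and variable names are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# A tiny made-up DataFrame standing in for the training data.
df = pd.DataFrame({'A': [1, 4], 'B': [2, 5], 'C': [3, 6]})

arr = np.array(df)     # the conversion used in this article
arr2 = df.to_numpy()   # equivalent DataFrame method (pandas 0.24 and later)

print(type(arr))
print(arr.shape)
print(arr[:, 1:3])     # columns 1 and 2, selected by position
```

Note that the column labels are lost in the conversion, so columns of the array can only be addressed by position.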
3. Fit the data
3.1 Convert to an array and process
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression  # import missing from the original listing

train = pd.read_csv('./Datasets/Breast-Cancer/breast-cancer-train.csv')
test = pd.read_csv('./Datasets/Breast-Cancer/breast-cancer-test.csv')

train_data = np.array(train)
test_data = np.array(test)

X_train = train_data[:, 1:3]  # columns 1 and 2 as the training features
y_train = train_data[:, 3]    # column 3 as the label
X_test = test_data[:, 1:3]
y_test = test_data[:, 3]

p_index = np.where(train_data[:, 3] == 1)[0]  # indices of all positive samples
n_index = np.where(train_data[:, 3] == 0)[0]  # indices of all negative samples
positive = X_train[p_index, :]  # all positive samples
negative = X_train[n_index, :]  # all negative samples

# Plot the sample points
plt.scatter(negative[:, 0], negative[:, 1], marker='o', s=200, c='red')
plt.scatter(positive[:, 0], positive[:, 1], marker='x', s=150, c='black')
plt.show()

lr = LogisticRegression()
lr.fit(X_train, y_train)
print(lr.score(X_test, y_test))
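The slicing and np.where steps in 3.1 can be checked on a small array. The sketch below uses made-up rows with four columns; treating column 0 as an id is an assumption, since the article only shows that the features sit in columns 1 and 2 and the label in column 3.

```python
import numpy as np

# Made-up rows in an assumed layout: [id, feature 1, feature 2, label].
data = np.array([
    [10, 1, 1, 0],
    [11, 5, 4, 1],
    [12, 2, 1, 0],
    [13, 8, 7, 1],
])

X = data[:, 1:3]   # columns 1 and 2 (the stop index 3 is excluded)
y = data[:, 3]     # column 3

p_index = np.where(y == 1)[0]   # row indices of the positive samples
positive = X[p_index, :]

# Indexing with the boolean mask directly selects the same rows.
positive_mask = X[y == 1]

print(p_index)
print(positive)
```

Either form works; the boolean mask skips the intermediate index array.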
3.2 Process the DataFrame directly
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression  # import missing from the original listing

train = pd.read_csv('./Datasets/Breast-Cancer/breast-cancer-train.csv')
test = pd.read_csv('./Datasets/Breast-Cancer/breast-cancer-test.csv')

# Select rows by label and keep only the two feature columns
negative = train.loc[train['Type'] == 0, ['Clump Thickness', 'Cell Size']]
positive = train.loc[train['Type'] == 1, ['Clump Thickness', 'Cell Size']]

# Plot the sample points
plt.scatter(negative['Clump Thickness'], negative['Cell Size'],
            marker='o', s=200, c='red')
plt.scatter(positive['Clump Thickness'], positive['Cell Size'],
            marker='x', s=150, c='black')
plt.show()

X_train = train[['Clump Thickness', 'Cell Size']]
y_train = train['Type']
X_test = test[['Clump Thickness', 'Cell Size']]
y_test = test['Type']

lr = LogisticRegression()
lr.fit(X_train, y_train)
print(lr.score(X_test, y_test))
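To make clear what the printed score means, here is a minimal sketch on made-up data that reuses the article's column names. The values are invented, and the model is scored on the same toy data it was fitted on, purely to illustrate that score returns the mean accuracy.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up samples using the same column names as the article's dataset.
toy = pd.DataFrame({
    'Clump Thickness': [1, 2, 1, 3, 8, 9, 7, 10],
    'Cell Size':       [1, 1, 2, 2, 7, 8, 9, 10],
    'Type':            [0, 0, 0, 0, 1, 1, 1, 1],
})

features = ['Clump Thickness', 'Cell Size']
X = toy[features]
y = toy['Type']

# Boolean mask plus column list in a single .loc call.
positive = toy.loc[toy['Type'] == 1, features]

lr = LogisticRegression()
lr.fit(X, y)

# score() returns the mean accuracy, i.e. the fraction of correct predictions.
accuracy = (lr.predict(X) == y).mean()
print(lr.score(X, y), accuracy)
```

On real data the score should be computed on a held-out test set, as the article does with X_test and y_test.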
Reference:
Python機器學習及實踐 (Python Machine Learning and Practice)