Python for Data Analysis: Chapter 6 Notes
Chapter 6 Data Loading, Storage, and File Formats
Reading and Writing Data in Text Format
The most commonly used functions are read_csv and read_table, though in math modeling contests the data is often given as Excel files; who knows what it will be this year.
The book provides a table of some commonly used data-reading methods.
Depending on the form of the experimental data, different read_csv parameters can be chosen to adjust the parsing; a few common ones are sketched below. I think this is best handled case by case for the data at hand, consulting the documentation when needed; there is no need to memorize all the parameters now.
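For reference, a minimal sketch of a few options that come up often, assuming the book's example files ex2.csv and ex4.csv:

import pandas as pd

# no header row: let pandas assign default column names
pd.read_csv('examples/ex2.csv', header=None)
# supply column names and use one of them as the index
pd.read_csv('examples/ex2.csv', names=['a', 'b', 'c', 'd', 'message'], index_col='message')
# skip specific rows (comments, junk lines, and so on)
pd.read_csv('examples/ex4.csv', skiprows=[0, 2, 3])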
Note that read_csv marks NA and NULL values in the data as missing; you can also specify additional sentinel strings through the na_values option:
import pandas as pd

result = pd.read_csv('examples/ex5.csv', na_values=['NULL'])
# sentinel values can also be specified per column with a dict, like this
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
result = pd.read_csv('examples/ex5.csv', na_values=sentinels)
Reading Text Files in Pieces
This section mainly uses the chunksize parameter.
chunker = pd.read_csv('examples/ex6.csv', chunksize=1000)
tot = pd.Series([], dtype='float64')
chunker_list = []
for piece in chunker:
    tot = tot.add(piece['key'].value_counts(), fill_value=0)
    chunker_list.append(piece)
print(chunker_list)
Looking at the output, you can see that chunker is essentially an iterator whose elements are chunks of size chunksize. Note that once chunker has been iterated through it cannot be used again, which is similar to the generators mentioned earlier.
Writing Data to Text Format
This mainly uses the to_csv method; a few of its options are sketched below.
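A minimal sketch of the to_csv options I expect to use most (sep, na_rep, index, and columns are all standard to_csv parameters; the DataFrame here is made up):

import sys
import pandas as pd

data = pd.DataFrame({'a': [1, 2], 'b': [3, None], 'c': [5, 6]})
# write to stdout with a custom delimiter instead of to a file
data.to_csv(sys.stdout, sep='|')
# represent missing values with a sentinel string
data.to_csv(sys.stdout, na_rep='NULL')
# suppress the row labels and write only selected columns
data.to_csv(sys.stdout, index=False, columns=['a', 'c'])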
Working with Delimited Formats
In some cases, however, some manual processing may be necessary. It’s not uncommon to receive a file with one or more malformed lines that trip up read_table.
read_csv and read_table are already powerful enough, but to guard against malformed rows in the data breaking these two methods, consider Python's built-in csv reading facilities:
import csv
f = open('examples/ex7.csv')
reader = csv.reader(f)  # iterating over reader yields a list of strings per line
A simple recipe for reading the data:
import csv

with open('examples/ex7.csv') as f:
    lines = list(csv.reader(f))
# split the header line from the data lines
headers, values = lines[0], lines[1:]
# zip(*values) transposes rows into columns
data_dict = {h: v for h, v in zip(headers, zip(*values))}
print(data_dict)
To customize the delimiter and other aspects of how the data is organized, you can define a subclass of csv.Dialect and set a few class attributes:
class my_dialect(csv.Dialect):
    lineterminator = '\n'
    delimiter = ';'
    quotechar = '"'
    quoting = csv.QUOTE_MINIMAL

with open('examples/ex7.csv') as f:
    reader = csv.reader(f, dialect=my_dialect)
Some of the dialect attributes are listed in a table in the book; a dialect also works for writing, as sketched below.
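A short sketch, continuing from the block above and writing to a hypothetical mydata.csv:

with open('mydata.csv', 'w') as f:
    writer = csv.writer(f, dialect=my_dialect)  # csv.writer accepts the same dialect
    writer.writerow(('one', 'two', 'three'))
    writer.writerow(('1', '2', '3'))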
JSON Data
JSON can be read with Python's built-in json module, or with pd.read_json; note that by default read_json assumes each object in the JSON array corresponds to one row of the DataFrame.
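A minimal sketch of both routes, assuming a hypothetical examples/example.json holding a JSON array of objects:

import json
import pandas as pd

# built-in json: parse into Python objects, then build the DataFrame yourself
with open('examples/example.json') as f:
    data = json.load(f)
frame = pd.DataFrame(data)

# pandas shortcut: by default each object in the array becomes a row
frame = pd.read_json('examples/example.json')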
XML and HTML: Web Scraping
This mainly uses the read_html method; along the way the parsed tables lend themselves to some simple data analysis.
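read_html returns a list of DataFrames, one per table it finds; a minimal sketch, assuming the book's failed-bank example page:

import pandas as pd

tables = pd.read_html('examples/fdic_failed_bank_list.html')
failures = tables[0]  # take the first parsed table
print(failures.head())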
Parsing XML with lxml.objectify
Use lxml.objectify.parse to process an XML data file, then call getroot() to obtain a reference to the root node of the parsed data; accessing a tag name on the root returns an iterable over all elements with that tag.
When you actually run into this, just look it up in the documentation...
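Still, a minimal sketch of the pattern, assuming a hypothetical XML file whose root holds repeated <record> elements:

from lxml import objectify

with open('examples/data.xml') as f:
    parsed = objectify.parse(f)
root = parsed.getroot()  # reference to the outermost tag

rows = []
for elt in root.record:  # accessing a tag name iterates over all <record> elements
    # pyval converts each leaf element to a native Python value
    rows.append({child.tag: child.pyval for child in elt.iterchildren()})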
Binary Data Formats
One of the easiest ways to store data (also known as serialization) efficiently in binary format is using Python’s built-in pickle serialization. pandas objects all have a to_pickle method that writes the data to disk in pickle format.
The author also points out that pickle should be used with caution, recommending it only for short-term data storage:
pickle is only recommended as a short-term storage format. The problem is that it is hard to guarantee that the format will be stable over time; an object pickled today may not unpickle with a later version of a library.
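For completeness, the round trip is just two calls (the DataFrame here is made up):

import pandas as pd

frame = pd.DataFrame({'a': [1, 2, 3]})
frame.to_pickle('frame_pickle')         # serialize to disk in pickle format
frame = pd.read_pickle('frame_pickle')  # read it back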
Using HDF5 Format
HDF5 (hierarchical data format) is well suited to storing large and complex dataset structures; pandas provides a high-level interface, HDFStore, for storing DataFrames and Series.
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': np.random.randn(100)})
store = pd.HDFStore('mydata.h5')  # the store works like a dict
store['obj1'] = frame             # store a DataFrame
store['obj1_col'] = frame['a']    # store a Series
print(store)
HDFStore supports two storage schemas, 'fixed' and 'table' (see the pandas documentation for details); 'table' is slower, but it supports searching and selecting subsets, as sketched below.
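A minimal sketch of the 'table' schema and a where query, continuing with the store from the block above:

store.put('obj2', frame, format='table')  # put is the explicit form of store['obj2'] = frame
store.select('obj2', where=['index >= 10 and index <= 15'])
store.close()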
HDF5 is not a database. It is best suited for write-once, read-many datasets. While data can be added to a file at any time, if multiple writers do so simultaneously, the file can become corrupted.
Reading Microsoft Excel Files
pandas also supports reading tabular data stored in Excel 2003 (and higher) files using either the ExcelFile class or pandas.read_excel function.
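A minimal reading sketch, assuming the book's example file ex1.xlsx with a sheet named Sheet1:

import pandas as pd

# ExcelFile is preferable when reading several sheets from the same file
xlsx = pd.ExcelFile('examples/ex1.xlsx')
frame = pd.read_excel(xlsx, 'Sheet1')

# or pass the path straight to read_excel
frame = pd.read_excel('examples/ex1.xlsx', 'Sheet1')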
To write pandas data to Excel format, you must first create an ExcelWriter, then write data to it using pandas objects’ to_excel method:
writer = pd.ExcelWriter('examples/ex2.xlsx')
frame.to_excel(writer, 'Sheet1')
writer.save()
# alternatively, pass a file path to to_excel and skip the ExcelWriter
frame.to_excel('examples/ex2.xlsx')
Interacting with Web APIs
This part actually has little to do with pandas, so I won't read it in detail.
Interacting with Databases
Same as above.
Conclusion
Getting access to data is frequently the first step in the data analysis process. We have looked at a number of useful tools in this chapter that should help you get started. In the upcoming chapters we will dig deeper into data wrangling, data visualization, time series analysis, and other topics.