Study Notes on "Python for Data Analysis" (Part 1)

阿新 • Published: 2021-06-25

Processing the usa.gov data

Loading the data

   import json
   path = 'usagov_bitly_data2012-03-16-1331923249.txt'
   # The file holds one JSON object per line; the with-block closes it when done
   with open(path) as f:
       records = [json.loads(line) for line in f]

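Counting the time zones

Not every record has a time zone field, so the comprehension needs an if check; without it, a KeyError is raised.

   # time_zones = [rec['tz'] for rec in records]  # KeyError on records without 'tz'
   time_zones = [rec['tz'] for rec in records if 'tz' in rec]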
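To make the missing-key problem concrete, here is a minimal sketch on made-up records (`sample_records` is a hypothetical stand-in for the parsed file, not real data). It compares the `if 'tz' in rec` filter used in these notes with `dict.get`, which keeps every record and substitutes `None`.

```python
# Toy records standing in for the parsed JSON lines (made-up data).
sample_records = [
    {"tz": "America/New_York", "a": "Mozilla/5.0"},
    {"a": "GoogleMaps/RochesterNY"},   # no 'tz' key at all
    {"tz": "", "a": "Mozilla/4.0"},    # 'tz' present but empty
]

# Option 1: skip records that lack the key (the approach in these notes).
present_only = [rec["tz"] for rec in sample_records if "tz" in rec]

# Option 2: keep every record and substitute None for a missing key.
with_default = [rec.get("tz") for rec in sample_records]

print(present_only)  # ['America/New_York', '']
print(with_default)  # ['America/New_York', None, '']
```

Note that an empty string still counts as "present", which is why the empty time zones need separate handling later on.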

Counting functions

   # Method 1
   def get_counts(sequence):
     counts = {}
     for x in sequence:
       if x in counts:
         counts[x] += 1
       else:
         counts[x] = 1
     return counts
   # Return the n most frequent time zones with their counts (sorted ascending by count)
   def top_counts(count_dict, n=10):
     value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
     value_key_pairs.sort()
     return value_key_pairs[-n:]

   counts = get_counts(time_zones)
   print(counts['America/New_York'])
   print(len(time_zones))

   # Method 2
   from collections import defaultdict
  
   def get_counts2(sequence):
     counts = defaultdict(int) # values will initialize to 0
     for x in sequence:
       counts[x] += 1
     return counts

   # Method 3
   from collections import Counter
   counts = Counter(time_zones)
   # Get the 10 most frequent time zones with their counts
   counts.most_common(10)
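The three methods should produce the same counts and differ only in how the top entries are returned. The sketch below checks this on a small made-up list (`sample_zones` is hypothetical); the sort-based approach yields `(count, key)` pairs in ascending order, while `most_common` yields `(key, count)` pairs in descending order.

```python
from collections import Counter, defaultdict

# Made-up time zone values; '' stands for an empty tz field.
sample_zones = ["Asia/Tokyo", "", "America/New_York", "Asia/Tokyo", "", "Asia/Tokyo"]

# Method 1: plain dict
plain = {}
for tz in sample_zones:
    plain[tz] = plain.get(tz, 0) + 1

# Method 2: defaultdict(int)
dd = defaultdict(int)
for tz in sample_zones:
    dd[tz] += 1

# Method 3: Counter
counter = Counter(sample_zones)

same_counts = plain == dict(dd) == dict(counter)

# Sort-based top 2: (count, key) pairs, ascending by count
top2_sorted = sorted((count, tz) for tz, count in plain.items())[-2:]
# Counter top 2: (key, count) pairs, descending by count
top2_common = counter.most_common(2)

print(same_counts)   # True
print(top2_sorted)   # [(2, ''), (3, 'Asia/Tokyo')]
print(top2_common)   # [('Asia/Tokyo', 3), ('', 2)]
```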

Counting the time zones with pandas

   from pandas import DataFrame, Series
   import pandas as pd
   import numpy as np
   frame = DataFrame(records)
   tz_counts = frame['tz'].value_counts()

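All of the counting code above produces the same thing: key-value pairs of time zone and count. In other words, counts and tz_counts both have the form shown in the figure below:

(figure: time zones and their counts)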

Replacing missing values

   clean_tz = frame['tz'].fillna('Missing')
   clean_tz[clean_tz == ''] = 'Unknown'
   tz_counts = clean_tz.value_counts()
   print(tz_counts[:10])

After running the code above, the blank key that has a count of 521 in the figure above is replaced by 'Unknown'.
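The two-step cleanup distinguishes two different cases: a record with no tz value at all (NaN) and a record whose tz is an empty string. A small sketch on a made-up Series (`sample_tz` is hypothetical) shows how each case ends up with its own label.

```python
import numpy as np
import pandas as pd

# Made-up column: NaN means the field was absent, '' means it was empty.
sample_tz = pd.Series(["America/New_York", np.nan, "", "America/New_York", ""])

clean = sample_tz.fillna("Missing")     # NaN -> 'Missing'
clean.loc[clean == ""] = "Unknown"      # ''  -> 'Unknown' via a boolean mask
counts = clean.value_counts()

print(counts)
```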

Plotting the most frequent time zones in the usa.gov sample data

   import matplotlib.pyplot as plt
   plt.figure(1)
   tz_counts[:10].plot(kind='barh',rot=0)
   plt.show()

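Processing the a field of usa.gov with pandas

Extract the first token of each a string (which roughly corresponds to the browser) and count the results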

    print(frame.a[0].split())
    results = Series([x.split()[0] for x in frame.a.dropna()])
    print(results[:5])
    print(results.value_counts()[:8])

Counting Windows and non-Windows users

    import numpy as np
    cframe = frame[frame.a.notnull()]
    operating_system = np.where(cframe['a'].str.contains('Windows'),'Windows','Not Windows')
    print(operating_system[:5])

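Grouping the data by time zone and the newly derived operating system list

    by_tz_os = cframe.groupby(['tz',operating_system])
    # numeric_only=True restricts the mean to numeric columns; recent pandas versions raise on string columns
    print(by_tz_os.mean(numeric_only=True))
    agg_counts = by_tz_os.size().unstack().fillna(0)
    print(agg_counts[:10])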
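The `size().unstack().fillna(0)` chain is easier to follow on a tiny table. The sketch below uses a made-up frame (`toy` and its values are hypothetical): `size()` counts rows per (tz, OS) pair, `unstack()` pivots the OS level into columns, and `fillna(0)` fills the combinations that never occurred.

```python
import numpy as np
import pandas as pd

# Made-up rows: time zone plus a shortened user-agent string.
toy = pd.DataFrame({
    "tz": ["Asia/Tokyo", "Asia/Tokyo", "Europe/London", "Asia/Tokyo"],
    "a": ["Mozilla (Windows NT)", "Mozilla (Macintosh)",
          "Mozilla (Windows NT)", "Mozilla (Windows NT)"],
})

# Array of labels aligned with the rows; groupby accepts it next to a column name.
os_label = np.where(toy["a"].str.contains("Windows"), "Windows", "Not Windows")

# Count rows per (tz, OS), pivot OS into columns, fill absent pairs with 0.
table = toy.groupby(["tz", os_label]).size().unstack().fillna(0)

print(table)
```

Europe/London has no "Not Windows" rows in this toy data, so that cell is NaN after `unstack()` and becomes 0 after `fillna(0)`.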

Selecting the most frequent time zones and plotting them

    # Build an indirect index array that sorts agg_counts by row total (ascending)
    indexer = agg_counts.sum(axis=1).argsort()
    print(indexer)
    count_subset = agg_counts.take(indexer)[-10:]
    print(count_subset)
    # Most frequent time zones, split into Windows and non-Windows users
    count_subset.plot(kind='barh',stacked=True)
    plt.show()
    # The same time zones, shown as the proportion of Windows and non-Windows users
    normed_subset = count_subset.div(count_subset.sum(axis=1),axis=0)
    normed_subset.plot(kind='barh',stacked=True)
    plt.show()
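The indirect sort and the row normalization can be checked by hand on a three-row table. This is only a sketch with made-up numbers (`toy_counts` is hypothetical); it converts the argsort result with `to_numpy()` before calling `take`, which is an extra precaution rather than something the notes require.

```python
import pandas as pd

# Made-up counts per time zone and operating system.
toy_counts = pd.DataFrame(
    {"Not Windows": [1.0, 5.0, 2.0], "Windows": [3.0, 5.0, 0.0]},
    index=["Asia/Tokyo", "America/New_York", "Europe/London"],
)

totals = toy_counts.sum(axis=1)          # Tokyo 4, New_York 10, London 2
order = totals.argsort()                 # row positions that sort totals ascending
top2 = toy_counts.take(order.to_numpy())[-2:]   # the last rows have the largest totals

# Divide each row by its own total so every row sums to 1.
normed = top2.div(top2.sum(axis=1), axis=0)

print(top2.index.tolist())   # ['Asia/Tokyo', 'America/New_York']
print(normed)
```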

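Filling in the gaps

Notes on the functions I was not very familiar with while processing the usa.gov data

1. collections.defaultdict()

defaultdict(default_factory) builds a dict-like object. The keys are whatever you assign or look up; when a missing key is accessed, its value is created by calling default_factory, so the factory determines both the type and the initial value of the defaults. Taking defaultdict(int) from Method 2 as an example: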

    from collections import defaultdict
    counts = defaultdict(int)
    print(counts)
    print(counts['ad'])
    print(counts)

The output is as follows

    defaultdict(<class 'int'>, {})
    0
    defaultdict(<class 'int'>, {'ad': 0})

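Now it is clear: right after initialization counts is empty, and as soon as a key is looked up (even without assigning to it), that key-value pair exists in counts with the default value, which is 0 for int.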
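The factory does not have to be `int`. As a further sketch (the `pairs` data is made up), `defaultdict(list)` groups values by key, and the example also shows that only indexing creates a key, while `in` and `.get` do not.

```python
from collections import defaultdict

# Made-up (time zone, operating system) pairs.
pairs = [("Asia/Tokyo", "Windows"), ("Asia/Tokyo", "Mac"), ("Europe/London", "Windows")]

by_zone = defaultdict(list)      # missing keys start out as an empty list
for tz, os_name in pairs:
    by_zone[tz].append(os_name)

has_paris = "Europe/Paris" in by_zone    # membership test: no key is created
maybe = by_zone.get("Europe/Paris")      # .get: no key is created either
size_before = len(by_zone)

_ = by_zone["Europe/Paris"]              # indexing calls list() and stores the result
size_after = len(by_zone)

print(dict(by_zone))
print(size_before, size_after)   # 2 3
```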

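2. collections.Counter()

The Counter class tracks how many times each value occurs. It is a dict subclass that stores elements as keys and their counts as values. Counter.most_common([n]), used in the code above, returns a list of the n most frequent elements together with their counts, with the most frequent first. http://www.pythoner.com/205.html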
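A few Counter behaviors are worth trying directly. The sketch below uses made-up browser strings: looking up an unseen element returns 0 without adding it, `update` adds further observations, and `most_common()` with no argument returns every element.

```python
from collections import Counter

# Made-up browser tokens.
browsers = Counter(["Mozilla/5.0", "Mozilla/4.0", "Mozilla/5.0"])
browsers.update(["Opera/9.80", "Mozilla/5.0"])   # add more observations

missing = browsers["curl/7.0"]        # unseen element: 0, and no key is added
top1 = browsers.most_common(1)        # the single most frequent element
everything = browsers.most_common()   # all elements, most frequent first

print(missing)      # 0
print(top1)         # [('Mozilla/5.0', 3)]
print(everything)
```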

3. pandas.Series.value_counts

    Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)

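Returns an object containing the counts of the unique values; in effect it is a map-reduce step.

Parameters:

normalize : bool, default False

If True, returns the relative frequencies of the unique values instead of their counts.

sort : bool, default True

Sort by count.

ascending : bool, default False

Sort in ascending order; with the default of False the result is in descending order.

bins : int, optional

Instead of counting individual values, groups them into half-open intervals. This is a convenience for pd.cut and only works with numeric data.

dropna : bool, default True

Do not include counts of NaN.

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html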
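The parameters above can be tried on two small made-up Series (`s` and `nums` are hypothetical). This is a sketch of how the options change the result, not an exhaustive tour.

```python
import numpy as np
import pandas as pd

s = pd.Series(["a", "b", "a", np.nan, "a"])

default = s.value_counts()                # NaN excluded: a -> 3, b -> 1
with_nan = s.value_counts(dropna=False)   # NaN gets its own entry
share = s.value_counts(normalize=True)    # relative frequencies of non-NaN values

# bins only applies to numeric data: the value range is cut into 2 intervals.
nums = pd.Series([1, 2, 3, 10])
binned = nums.value_counts(bins=2, sort=False)

print(default)
print(with_nan)
print(share)
print(binned)
```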

4. Handling missing data

pandas.DataFrame.fillna

    DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

Fills NA/NaN values using the given value or method

    # Replace missing values (NaN) with the string 'Missing'
    clean_tz = frame['tz'].fillna('Missing')

http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.DataFrame.fillna.html

pandas.DataFrame.dropna

    DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Drops data along the given axis according to the given rule and returns the remaining labeled object. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html

pandas.Series.isnull

    Series.isnull()

Returns a boolean object of the same size, where each value indicates whether the corresponding element is null. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isnull.html

pandas.Series.notnull

    Series.notnull()

Returns a boolean object of the same size, where each value indicates whether the corresponding element is not null. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.notnull.html
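The missing-data helpers in this section can be compared side by side on a small made-up frame (`df` is hypothetical). One detail the sketch brings out: an empty string is not null, so `dropna`, `isnull`, and `notnull` all treat it as a regular value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "tz": ["Asia/Tokyo", np.nan, ""],      # '' is an empty string, not a missing value
    "a": ["Mozilla/5.0", np.nan, np.nan],
})

null_mask = df["a"].isnull()          # True where 'a' is missing
kept = df[df["a"].notnull()]          # rows that do have an 'a' value
any_dropped = df.dropna()             # how='any': drop rows with at least one NaN
all_dropped = df.dropna(how="all")    # drop only rows where every value is NaN

print(null_mask.tolist())             # [False, True, True]
print(any_dropped.index.tolist())     # [0]
print(all_dropped.index.tolist())     # [0, 2]
```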

5. numpy.where()

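    numpy.where(condition[, x, y])

Returns elements chosen from x where condition is True and from y elsewhere. If x and y are omitted, the call is equivalent to condition.nonzero().

    # Check whether the a field contains 'Windows'; return 'Windows' if it does, otherwise 'Not Windows'
    operating_system = np.where(cframe['a'].str.contains('Windows'),'Windows','Not Windows')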

https://docs.scipy.org/doc/numpy/reference/generated/numpy.where.html
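Both call forms can be seen on a three-element array. The user-agent strings below are made up, and the boolean mask is built with a plain list comprehension to keep the sketch free of pandas.

```python
import numpy as np

# Made-up user-agent strings.
agents = ["Mozilla (Windows NT 6.1)", "Mozilla (Macintosh)", "Opera (Windows NT 5.1)"]
is_windows = np.array(["Windows" in agent for agent in agents])

# Three-argument form: pick a label per element.
labels = np.where(is_windows, "Windows", "Not Windows")

# One-argument form: a tuple of index arrays, same as is_windows.nonzero().
positions = np.where(is_windows)

print(labels.tolist())         # ['Windows', 'Not Windows', 'Windows']
print(positions[0].tolist())   # [0, 2]
```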

6. str.split()

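    str.split(sep=None, maxsplit=-1)[n]

Parameters:

sep : str

The delimiter. By default the string is split on runs of whitespace; sep cannot be the empty string (''). If the delimiter does not occur in the string, the whole string becomes the single element of the resulting list.

maxsplit : int

The maximum number of splits. If maxsplit is given, the string is split into at most maxsplit+1 substrings, and each substring can be assigned to a new variable.

[n] : int

Selects the n-th piece (counting from 0) of the resulting list.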
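The cases described above can be verified on a shortened, made-up user-agent string (`ua` is hypothetical). The sketch also shows the difference between the default whitespace split and an explicit single-space delimiter.

```python
ua = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11"

pieces = ua.split()              # split on runs of whitespace
first = ua.split()[0]            # the browser-like first token
head, rest = ua.split(" ", 1)    # at most one split -> two substrings

collapsed = "a  b".split()       # default: consecutive spaces act as one delimiter
explicit = "a  b".split(" ")     # explicit ' ': an empty string appears between them
no_sep = "abc".split(",")        # delimiter absent: the whole string is the only element

try:
    "abc".split("")              # an empty delimiter is rejected
    empty_sep_error = False
except ValueError:
    empty_sep_error = True

print(first)       # Mozilla/5.0
print(collapsed)   # ['a', 'b']
print(explicit)    # ['a', '', 'b']
print(no_sep)      # ['abc']
```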
