2021-01-30

Tags: data analysis, python, big data, data mining

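'''
Exercise 1
• Read 北向.csv with trade_date as the row index
• Inspect the basic information, check for missing values, and handle them
• Drop the rows that contain missing values
• Inspect the basic information again to confirm the data is clean
• The column labelled index carries no useful information; delete it
• Check whether the data contains duplicate rows
• Drop the duplicate rows
• Sort the row index in ascending order
• Save the processed data to 北向(副).csv
'''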

import numpy as np
import pandas as pd

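# Load the raw file with the default integer index to see what it contains;
# passing index_col="trade_date" would make trade_date the row index instead.
data = pd.read_csv(r"北向.csv")
print(data)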
'''
	index	trade_date	ggt_ss	ggt_sz	hgt	sgt	north_money	south_money
0	0	20190624	-541.17	792.38	-757.96	-1153.14	-1911.10	251.21
1	1	20190621	-97.40	701.36	3722.36	3608.14	7330.50	603.96
2	2	20190620	660.05	555.23	1914.44	3650.47	5564.91	1215.28
3	3	20190619	-491.58	186.47	2092.51	2831.23	4923.74	-305.11
4	4	20190618	1667.40	832.29	974.92	617.24	1592.16	2499.69
...	...	...	...	...	...	...	...	...
879	295	20190612	2032.73	912.14	1467.34	-181.33	1286.01	2944.87
880	296	20190611	2699.37	1038.56	3774.59	3171.37	6945.96	3737.93
881	297	20190610	1160.59	703.69	4957.98	2939.29	7897.27	1864.28
882	298	20190606	-13.56	-20.15	1500.24	-421.68	1078.56	-33.71
883	299	20190605	218.43	394.27	2276.22	781.60	3057.82	612.70
884 rows × 8 columns
'''
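
The script above only loads and prints the file. A minimal sketch of the remaining Exercise 1 steps follows, run against a tiny made-up sample so that it works without 北向.csv. The helper name `clean_northbound` and the sample values are illustrative assumptions, not part of the original post.

```python
import io
import pandas as pd

# Made-up sample standing in for 北向.csv (same column names, fewer columns).
sample_csv = io.StringIO(
    "index,trade_date,north_money,south_money\n"
    "0,20190624,-1911.10,251.21\n"
    "1,20190621,7330.50,603.96\n"
    "2,20190620,,1215.28\n"
    "3,20190624,-1911.10,251.21\n"
)

def clean_northbound(df):
    """Apply the Exercise 1 cleaning steps to a frame indexed by trade_date."""
    df = df.dropna()               # drop rows with any missing value
    df = df.drop(columns="index")  # the leftover 'index' column is not needed
    # drop_duplicates() compares column values only and ignores the row index,
    # so move trade_date into the columns while comparing, then restore it.
    df = df.reset_index().drop_duplicates().set_index("trade_date")
    return df.sort_index()         # ascending by trade_date

raw = pd.read_csv(sample_csv, index_col="trade_date")
raw.info()                         # non-null counts reveal missing values
cleaned = clean_northbound(raw)
print(cleaned)
# cleaned.to_csv("北向(副).csv")   # final step of the exercise
```

Dropping the `index` column before removing duplicates matters in this sample: the repeated row carries a different `index` value, so it only becomes an exact duplicate once that column is gone.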

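'''
Exercise 2
Read FoodFacts.csv, a dataset of food products from around the world, and work out the average number of additives used in each country.
Steps
• 1. Read the data
• 2. Assess data quality
• 3. Clean the data
• 4. Compute the statistics for each country
    • 1. Clean the country data and collect the list of countries
    • 2. Compute additive usage per country
• 5. Save the results
'''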

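data_food = pd.read_csv(r"FoodFacts.csv")
data_food.info()
data_food.head(10)

# Drop columns in which every value is missing. dropna returns a new frame,
# so inspect the result rather than the untouched data_food.
data_food_clean = data_food.dropna(axis=1, how="all")
data_food_clean.info()
data_food_clean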

data_food.columns
'''
Index(['code', 'url', 'creator', 'created_t', 'created_datetime',
       'last_modified_t', 'last_modified_datetime', 'product_name',
       'generic_name', 'quantity',
       ...
       'caffeine_100g', 'taurine_100g', 'ph_100g',
       'fruits_vegetables_nuts_100g', 'collagen_meat_protein_ratio_100g',
       'cocoa_100g', 'chlorophyl_100g', 'carbon_footprint_100g',
       'nutrition_score_fr_100g', 'nutrition_score_uk_100g'],
      dtype='object', length=159)
'''

# Reload only the two columns the analysis needs
data1 = pd.read_csv(r"FoodFacts.csv", usecols=["countries_en", "additives_n"])
data1.info()
data1.head()

'''
	countries_en	additives_n
0	France	NaN
1	France	NaN
2	France	NaN
3	France	NaN
4	France	NaN
'''
# Keep only rows where both the country and the additive count are present
data1 = data1.dropna()
data1

'''
	countries_en	additives_n
5	United Kingdom	0.0
6	France	0.0
8	France	0.0
10	United Kingdom	5.0
11	United Kingdom	5.0
...	...	...
65480	United States	4.0
65490	France	0.0
65494	France	0.0
65499	France	0.0
65501	France	0.0
43616 rows × 2 columns
'''

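# Entries that list a single country (no comma) supply the set of country names
data_country = data1['countries_en'][~data1['countries_en'].str.contains(',')]
total_countries = data_country.drop_duplicates()
count = total_countries.count()

mean_additive_list = []
for country in total_countries:
    # Literal, case-insensitive substring match: a product sold in several
    # countries counts toward each of them. Note that a name contained in
    # another name would also match the longer one.
    a = data1[data1["countries_en"].str.contains(country, case=False, regex=False)]
    mean_additive = a["additives_n"].mean()
    mean_additive_list.append(mean_additive)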

needed_data = pd.DataFrame({
    "country":total_countries,
    "mean_additive":mean_additive_list
})
needed_data

'''

	country	mean_additive
5	United Kingdom	1.259009
6	France	1.930422
15	Spain	0.930324
22	Germany	0.777923
69	United States	2.180608
...	...	...
62678	Iraq	1.500000
63052	Nederland	0.000000
64087	Singapore	1.000000
64096	Indonesia	2.125000
65403	Burkina Faso	1.666667
84 rows × 2 columns
'''
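
The loop above rescans the whole frame once per country. One possible alternative, sketched here on made-up rows rather than the real FoodFacts.csv, splits the comma-separated country lists and aggregates with `groupby`. It matches country names exactly instead of by substring and also keeps countries that only ever appear in multi-country entries, so its numbers may differ from the table above. The output file name in the last line is a hypothetical placeholder for step 5.

```python
import pandas as pd

# Made-up rows standing in for the two FoodFacts.csv columns used above.
sample = pd.DataFrame({
    "countries_en": ["France", "France,Spain", "Spain", "United Kingdom", None],
    "additives_n": [2.0, 4.0, 0.0, 5.0, 1.0],
})

per_country = (
    sample.dropna()
    # turn "France,Spain" into the list ["France", "Spain"]
    .assign(country=lambda d: d["countries_en"].str.split(","))
    .explode("country")            # one row per (product, country) pair
    .reset_index(drop=True)
    .assign(country=lambda d: d["country"].str.strip())
    .groupby("country")["additives_n"]
    .mean()
    .rename("mean_additive")
    .reset_index()
)
print(per_country)
# per_country.to_csv("mean_additives_by_country.csv", index=False)  # hypothetical name
```

In this sample the product sold in both France and Spain contributes its count of 4.0 to each country, which mirrors how the loop treats multi-country entries.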