Common data cleaning with pandas (Part 1)
阿新 • Published: 2018-11-01
Data source:
https://www.kaggle.com/datasets
1.
Look at some basic stats for the 'imdb_score' column: data.imdb_score.describe()
Select a column: data['movie_title']
Select the first 10 rows of a column: data['duration'][:10]
Select multiple columns: data[['budget', 'gross']]
Select all movies over two hours long: data[data['duration'] > 120]
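The selections above can be tried without downloading anything. The following is a minimal sketch that assumes a tiny hand-made DataFrame in place of the real movie file; the titles and numbers are invented for illustration, and only the column names come from the post.

```python
import pandas as pd

# Hypothetical miniature stand-in for movie_metadata.csv (values are invented).
data = pd.DataFrame({
    "movie_title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "duration": [95, 121, 150, 120],
    "budget": [1.0e6, 2.0e6, 3.0e6, 4.0e6],
    "gross": [2.0e6, 1.5e6, 9.0e6, 4.0e6],
    "imdb_score": [6.1, 7.4, 8.2, 5.0],
})

score_stats = data["imdb_score"].describe()  # count, mean, std, min, quartiles, max
titles = data["movie_title"]                 # a single column, returned as a Series
first_rows = data["duration"][:10]           # at most the first 10 values
money = data[["budget", "gross"]]            # several columns, returned as a DataFrame
long_movies = data[data["duration"] > 120]   # boolean mask keeps rows where it is True

print(long_movies["movie_title"].tolist())   # ['Movie B', 'Movie C']
```

Note that the comparison is strict, so a movie lasting exactly 120 minutes is not selected.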
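Fill missing values in 'country' with an empty string: data.country = data.country.fillna('')
Fill missing values in 'duration' with the column mean: data.duration = data.duration.fillna(data.duration.mean())
Read 'title_year' as a string when loading the file: data = pd.read_csv('movie_metadata.csv', dtype={'title_year': str})
Convert movie titles to upper case: data['movie_title'].str.upper()
Similarly, to get rid of leading and trailing whitespace: data['movie_title'].str.strip()
Rename columns: data = data.rename(columns={'title_year': 'release_date', 'movie_facebook_likes': 'facebook_likes'})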
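As a rough end-to-end illustration of these steps, the sketch below reads a small in-memory CSV instead of the real file. The CSV text and its values are assumptions made up for this example; only the column names follow the post.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for movie_metadata.csv (values are invented).
csv_text = (
    "movie_title,country,duration,title_year,movie_facebook_likes\n"
    "  Movie A ,USA,100,2009,10\n"
    "Movie B,,,2012,20\n"
    "Movie C,UK,140,,30\n"
)

# dtype keys are column names, so they must be quoted strings.
data = pd.read_csv(io.StringIO(csv_text), dtype={"title_year": str})

# Replace missing values: empty string for text, column mean for numbers.
data["country"] = data["country"].fillna("")
data["duration"] = data["duration"].fillna(data["duration"].mean())

# .str methods return a new Series, so assign the result back to keep it.
data["movie_title"] = data["movie_title"].str.strip().str.upper()

# rename also returns a new DataFrame rather than changing it in place.
data = data.rename(columns={"title_year": "release_date",
                            "movie_facebook_likes": "facebook_likes"})

print(data)
```

Calling `data['movie_title'].str.upper()` on its own only returns the converted values; the original column changes only when the result is assigned back, as done here.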
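Drop every row that contains at least one NaN: data.dropna()
Drop rows in which all elements are NaN: data.dropna(how='all')
Drop columns in which all elements are NaN: data.dropna(axis=1, how='all')  # axis=0 means rows, axis=1 means columns
Keep only rows that have at least 3 non-NaN values: data.dropna(thresh=3)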
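To make the difference between these dropna variants concrete, here is a small sketch on a made-up frame. The column names and values are hypothetical, chosen so that one column and one row are entirely NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: column "d" is entirely NaN and row 3 is entirely NaN.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, np.nan],
    "b": [1.0, np.nan, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0, np.nan],
    "d": [np.nan, np.nan, np.nan, np.nan],
})

no_empty_rows = df.dropna(how="all")          # drops only row 3
no_empty_cols = df.dropna(axis=1, how="all")  # drops only column "d"
at_least_3 = df.dropna(thresh=3)              # keeps rows with >= 3 non-NaN values
complete_rows = no_empty_cols.dropna()        # rows with no NaN once "d" is gone

# Every row has a NaN in "d", so a plain dropna() on df removes all rows.
print(len(df.dropna()))               # 0
print(list(no_empty_rows.index))      # [0, 1, 2]
print(list(no_empty_cols.columns))    # ['a', 'b', 'c']
print(list(at_least_3.index))         # [0]
print(list(complete_rows.index))      # [0]
```

Each call returns a new DataFrame and leaves `df` unchanged, so the result has to be assigned to a name to be kept.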