Kaggle-pandas(3)

阿新 • Published: 2020-08-03

Summary-functions-and-maps

Tutorial

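In the previous tutorial, we learned how to select relevant data out of a DataFrame or a Series. As the exercises showed, pulling the right data out of our data representation is critical to getting work done.
However, data does not always come out of memory in the format we want. Sometimes we have to do some more work ourselves to reformat it for the task at hand. This tutorial covers the operations we can apply to our data to get the input "just right".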

Pandas provides many simple "summary functions" (not an official name) that restructure the data in some useful way. For example, consider the describe() method:

reviews.points.describe()

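This method generates a high-level summary of the attributes of the given column. It is type-aware, meaning that its output changes with the data type of the input. For a numeric column such as points, it reports the count, mean, standard deviation, minimum, quartiles, and maximum. Those statistics only make sense for numeric data; for a string column, describe() instead reports the count, the number of unique values, the most frequent value (top), and how often that value occurs (freq).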
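
To make the type-aware behavior concrete, here is a minimal sketch. The tiny reviews table below is a hypothetical stand-in for the wine dataset (its values are made up for illustration); only the column names points and taster_name come from the tutorial.

```python
import pandas as pd

# Hypothetical miniature stand-in for the wine reviews table.
reviews = pd.DataFrame({
    "points": [87, 90, 90, 95],
    "taster_name": ["Roger Voss", "Kerin O'Keefe", "Roger Voss", "Roger Voss"],
})

# Numeric column: count, mean, std, min, 25%, 50%, 75%, max.
numeric_summary = reviews.points.describe()

# String column: count, unique, top, freq.
text_summary = reviews.taster_name.describe()

print(numeric_summary)
print(text_summary)
```

The two summaries have different row labels, which is why it helps to check a column's type before reading the output of describe().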

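If you want a particular simple summary statistic about a column in a DataFrame or a Series, there is usually a helpful pandas function that provides it.
For example, to see the mean of the points allotted (that is, how well an averagely rated wine scores), we can use the mean() function: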

reviews.points.mean()

To see a list of unique values, we can use the unique() function:

reviews.taster_name.unique()

To see a list of unique values and how often each one occurs in the dataset, we can use the value_counts() method:

reviews.taster_name.value_counts()
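
The three calls above can be tried side by side on a small table. This is only a sketch: the DataFrame is a hypothetical stand-in with invented values, not the real wine data.

```python
import pandas as pd

# Hypothetical miniature stand-in for the wine reviews table.
reviews = pd.DataFrame({
    "points": [87, 90, 90, 95],
    "taster_name": ["Roger Voss", "Kerin O'Keefe", "Roger Voss", "Roger Voss"],
})

# A single number: the arithmetic mean of the column.
mean_points = reviews.points.mean()

# An array of distinct values, in order of first appearance.
unique_tasters = reviews.taster_name.unique()

# A Series mapping each distinct value to its count, most frequent first.
taster_counts = reviews.taster_name.value_counts()

print(mean_points)
print(unique_tasters)
print(taster_counts)
```

Note that unique() returns an array while value_counts() returns a Series indexed by the distinct values, so the latter can be looked up by name, for example taster_counts["Roger Voss"].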

Maps

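A map is a term, borrowed from mathematics, for a function that takes one set of values and "maps" them to another set of values. In data science we often need to create new representations from existing data, or to transform data from the format it is in now to the format we want it in later. Maps are what handle this work, which makes them extremely important for getting your work done!
There are two mapping methods that you will use often.
map() is the first, and slightly simpler, one. For example, suppose we want to remean the scores the wines received to 0. We can do this as follows: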

review_points_mean = reviews.points.mean()
reviews.points.map(lambda p: p - review_points_mean)

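The function you pass to map() should expect a single value from the Series (a point value, in the example above) and return a transformed version of that value. map() returns a new Series in which every value has been transformed by your function.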
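
As a small illustration of that contract, the sketch below runs map() on a standalone Series. The four scores are made up; they are not taken from the wine dataset.

```python
import pandas as pd

# Hypothetical scores, standing in for reviews.points.
points = pd.Series([87, 90, 90, 95], name="points")
points_mean = points.mean()

# The lambda receives one value at a time and returns its transformed version.
centered = points.map(lambda p: p - points_mean)

print(centered.tolist())
print(points.tolist())  # the original Series is left untouched
```

The result is a new Series of the same length, and the original points Series keeps its values.
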
apply() is the equivalent method if we want to transform a whole DataFrame by calling a custom method on each row.

For example:

def remean_points(row):
    row.points = row.points - review_points_mean
    return row

reviews.apply(remean_points, axis='columns')

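If we had called reviews.apply() with axis='index', then instead of passing a function to transform each row, we would need to pass a function to transform each column.
Note that map() and apply() return new, transformed Series and DataFrames, respectively. They do not modify the original data they are called on. If we look at the first row of reviews, we can see that it still has its original points value.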
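
The difference between the two axis values, and the fact that the original data is left alone, can be sketched as follows. The table is a hypothetical stand-in with invented values, and center_column is a helper name introduced only for this illustration.

```python
import pandas as pd

# Hypothetical miniature stand-in for the wine reviews table.
reviews = pd.DataFrame({
    "points": [86, 90, 94],
    "price": [10.0, 20.0, 60.0],
})
points_mean = reviews.points.mean()

def remean_points(row):
    # Called once per row; `row` is a Series indexed by column name.
    row = row.copy()  # defensive copy so the input row is never altered
    row["points"] = row["points"] - points_mean
    return row

def center_column(col):
    # Called once per column; `col` is a Series indexed by row label.
    return col - col.mean()

by_row = reviews.apply(remean_points, axis="columns")
by_column = reviews.apply(center_column, axis="index")

print(by_row)
print(by_column)
print(reviews.iloc[0])  # first row still has its original points value
```

With axis="columns" only points is changed, because the row function touches only that field; with axis="index" every column is centered on its own mean.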

Exercises

What is the median of the points column in the reviews DataFrame?

median_points = reviews["points"].median()

# Check your answer
q1.check()

What countries are represented in the dataset? (Your answer should not include any duplicates.)

countries = reviews["country"].unique()

# Check your answer
q2.check()

How often does each country appear in the dataset? Create a Series reviews_per_country mapping countries to the count of reviews of wines from that country.

reviews_per_country = reviews["country"].value_counts()
print(reviews_per_country)
# Check your answer
q3.check()

Create variable centered_price containing a version of the price column with the mean price subtracted.

(Note: this 'centering' transformation is a common preprocessing step before applying various machine learning algorithms.)

# Subtract the column mean from every price.
mean_price = reviews["price"].mean()
centered_price = reviews["price"].map(lambda x: x - mean_price)

# Check your answer
q4.check()
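
As a quick sanity check of the centering step, the sketch below compares the map() version with plain Series arithmetic on a small made-up price column. The values, including the missing entry, are assumptions for illustration; the real column may or may not contain gaps.

```python
import pandas as pd

# Hypothetical prices; None becomes NaN to mimic a missing price.
price = pd.Series([10.0, 20.0, None, 60.0], name="price")

# mean() skips NaN by default, so only the three known prices are averaged.
mean_price = price.mean()

centered_by_map = price.map(lambda x: x - mean_price)
centered_by_subtraction = price - mean_price  # equivalent element-wise form

print(centered_by_map.tolist())
print(centered_by_subtraction.tolist())
```

Both forms give the same Series here, the missing price stays missing, and the mean of the centered values is zero up to floating-point error.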

I'm an economical wine buyer. Which wine is the "best bargain"? Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.

bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']
# Check your answer
q5.check()

There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be "tropical" or "fruity"? Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset.

n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])
print(descriptor_counts)
# Check your answer
q6.check()

We'd like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we'd like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.

Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.

Create a series star_ratings with the number of stars corresponding to each review in the dataset.

def star_rating(row):
    # Canadian wines always get 3 stars, regardless of points.
    if row["country"] == "Canada":
        return 3
    if row["points"] >= 95:
        return 3
    elif row["points"] >= 85:
        return 2
    else:
        return 1

star_ratings = reviews.apply(star_rating, axis='columns')

print(star_ratings)


# Check your answer
q7.check()
