Programming Collective Intelligence -- Providing Filtering

阿新 • Published: 2018-12-27

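# Item-based filtering:
#      First, convert {user1: {itemA: score, itemB: score, ...}} into {itemA: {user1: score, user2: score, ...}}.
#      From the transformed table, compute the similarity between items using either Euclidean distance or the
#      Pearson correlation (for Euclidean distance: the square root of the sum of squared differences between
#      the scores each user gave to both items).
#      Finally, predict a user's score for an item they have not rated as the sum of
#      (similarity to each item they have rated * their rating of that item), divided by the sum of those similarities.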

critics = {'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
                         'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5,
                         'The Night Listener': 3.0},
           'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
                            'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0,
                            'You, Me and Dupree': 3.5},
           'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
                                'Superman Returns': 3.5, 'The Night Listener': 4.0},
           'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
                            'The Night Listener': 4.5, 'Superman Returns': 4.0,
                            'You, Me and Dupree': 2.5},
           'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
                            'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,
                            'You, Me and Dupree': 2.0},
           'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
                             'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
           'Toby': {'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0, 'Superman Returns': 4.0}}

from math import sqrt


# Returns a distance-based similarity score for person1 and person2
def sim_distance(prefs, person1, person2):
    # Get the list of shared_items
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]: si[item] = 1

    # if they have no ratings in common, return 0
    if len(si) == 0: return 0

    # Add up the squares of all the differences
    sum_of_squares = sum([pow(prefs[person1][item] - prefs[person2][item], 2)
                          for item in prefs[person1] if item in prefs[person2]])

    return 1 / (1 + sum_of_squares)


# Returns the Pearson correlation coefficient for p1 and p2
def sim_pearson(prefs, p1, p2):
    # Get the list of mutually rated items
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]: si[item] = 1

    # if there are no ratings in common, return 0
    if len(si) == 0: return 0

    # Sum calculations
    n = len(si)

    # Sums of all the preferences
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])

    # Sums of the squares
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])

    # Sum of the products
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])

    # Calculate r (Pearson score)
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
    if den == 0: return 0

    r = num / den

    return r


# Returns the best matches for person from the prefs dictionary.
# Number of results and similarity function are optional params.
def topMatches(prefs, person, n=5, similarity=sim_pearson):
    scores = [(similarity(prefs, person, other), other)
              for other in prefs if other != person]
    scores.sort()
    scores.reverse()
    return scores[0:n]


# Gets recommendations for a person by using a weighted average
# of every other user's rankings
def getRecommendations(prefs, person, similarity=sim_pearson):
    totals = {}
    simSums = {}
    for other in prefs:
        # don't compare me to myself
        if other == person: continue
        sim = similarity(prefs, person, other)

        # ignore scores of zero or lower
        if sim <= 0: continue
        for item in prefs[other]:

            # only score movies I haven't seen yet
            if item not in prefs[person] or prefs[person][item] == 0:
                # Similarity * Score
                totals.setdefault(item, 0)
                totals[item] += prefs[other][item] * sim
                # Sum of similarities
                simSums.setdefault(item, 0)
                simSums[item] += sim

    # Create the normalized list
    rankings = [(total / simSums[item], item) for item, total in totals.items()]

    # Return the sorted list
    rankings.sort()
    rankings.reverse()
    return rankings


def transformPrefs(prefs):
    result = {}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item, {})

            # Flip item and person
            result[item][person] = prefs[person][item]
    return result


def calculateSimilarItems(prefs, n=10):
    # Create a dictionary of items showing which other items they
    # are most similar to.
    result = {}
    # Invert the preference matrix to be item-centric
    itemPrefs = transformPrefs(prefs)
    c = 0
    for item in itemPrefs:
        # Status updates for large datasets
        c += 1
        if c % 100 == 0: print("%d / %d" % (c, len(itemPrefs)))
        # Find the most similar items to this one
        scores = topMatches(itemPrefs, item, n=n, similarity=sim_distance)
        result[item] = scores
    return result


def getRecommendedItems(prefs, itemMatch, user):
    userRatings = prefs[user]
    scores = {}
    totalSim = {}
    # Loop over items rated by this user
    for (item, rating) in userRatings.items():

        # Loop over items similar to this one
        for (similarity, item2) in itemMatch[item]:

            # Ignore if this user has already rated this item
            if item2 in userRatings: continue
            # Weighted sum of rating times similarity
            scores.setdefault(item2, 0)
            scores[item2] += similarity * rating
            # Sum of all the similarities
            totalSim.setdefault(item2, 0)
            totalSim[item2] += similarity

    # Divide each total score by total weighting to get an average
    # (skip items whose total similarity is 0 to avoid a ZeroDivisionError)
    rankings = [(score / totalSim[item], item) for item, score in scores.items()
                if totalSim[item] != 0]

    # Return the rankings from highest to lowest
    rankings.sort()
    rankings.reverse()
    return rankings


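def loadMovieLens(path='/data/movielens'):
    # Get movie titles (u.item is '|'-separated: id|title|...)
    movies = {}
    with open(path + '/u.item', encoding='iso-8859-15') as f:
        for line in f:
            (movie_id, title) = line.split('|')[0:2]
            movies[movie_id] = title

    # Load data (u.data is tab-separated: user, movie id, rating, timestamp)
    prefs = {}
    with open(path + '/u.data') as f:
        for line in f:
            (user, movieid, rating, ts) = line.split('\t')
            prefs.setdefault(user, {})
            prefs[user][movies[movieid]] = float(rating)
    return prefs


# Print the ratings of user '87' (expects the MovieLens 100k files in ./ml-100k)
print(loadMovieLens('ml-100k')['87'])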
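
To make the three steps in the header comment concrete, here is a minimal self-contained sketch on a tiny made-up dataset. The ratings, the user and item names, and the helper names `invert`, `item_similarity` and `predict` are all assumptions for illustration and are not part of the listing above. Note that it follows the header comment and takes the square root of the summed squares, whereas `sim_distance` in the listing uses the sum of squares directly, so the similarity values differ between the two.

```python
from math import sqrt

# Hypothetical toy ratings, user -> {item: score}; not taken from the post's data.
ratings = {
    'u1': {'A': 4.0, 'B': 3.0, 'C': 5.0},
    'u2': {'A': 2.0, 'B': 1.0, 'C': 4.0},
    'u3': {'A': 5.0, 'B': 4.0},
}


def invert(prefs):
    """Step 1: turn user -> {item: score} into item -> {user: score}."""
    flipped = {}
    for user, items in prefs.items():
        for item, score in items.items():
            flipped.setdefault(item, {})[user] = score
    return flipped


def item_similarity(item_prefs, a, b):
    """Step 2: 1 / (1 + Euclidean distance) over users who rated both items."""
    shared = [u for u in item_prefs[a] if u in item_prefs[b]]
    if not shared:
        return 0.0
    dist = sqrt(sum((item_prefs[a][u] - item_prefs[b][u]) ** 2 for u in shared))
    return 1.0 / (1.0 + dist)


def predict(prefs, item_prefs, user, target):
    """Step 3: similarity-weighted average of the ratings the user already gave."""
    weighted = total = 0.0
    for item, score in prefs[user].items():
        sim = item_similarity(item_prefs, target, item)
        weighted += sim * score
        total += sim
    # None signals that no rated item has any similarity to the target
    return weighted / total if total else None


item_prefs = invert(ratings)
print(item_prefs['C'])                          # users who rated item C
print(predict(ratings, item_prefs, 'u3', 'C'))  # u3's predicted score for C
```

Because the prediction is a weighted average, it always falls between the lowest and highest rating the user has already given; here `u3` rated `A` and `B` with 5.0 and 4.0, so the predicted score for `C` lies between those two values.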
