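u denotes a user. The flow above is the most naive user-based recommendation procedure, but it is far too inefficient in practice. The user-based recommendation flow actually used is as follows:
The key difference is that a set of similar users is found first; the items associated with those similar users then form the candidate set.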
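The neighborhood-first flow described above can be sketched in Python roughly as follows. This is a minimal illustration, not Mahout's implementation: the `ratings` data, the function names `pearson` and `recommend`, and the similarity-weighted average used to score candidates are all assumptions made for this example.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {item: preference}
ratings = {
    "u1": {"A": 5.0, "B": 3.0, "C": 2.5},
    "u2": {"A": 2.0, "B": 2.5, "C": 5.0, "D": 2.0},
    "u3": {"A": 5.0, "C": 3.0, "D": 4.5, "E": 4.0},
    "u4": {"A": 4.0, "B": 3.0, "C": 2.0, "D": 4.0, "E": 3.5},
}


def pearson(prefs, a, b):
    """Pearson correlation over the items both users rated (0.0 if undefined)."""
    common = [item for item in prefs[a] if item in prefs[b]]
    n = len(common)
    if n == 0:
        return 0.0
    xs = [prefs[a][item] for item in common]
    ys = [prefs[b][item] for item in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys))
    return num / den if den else 0.0


def recommend(prefs, user, n_neighbors=2, how_many=3):
    # Step 1: find the most similar users first.
    sims = {other: pearson(prefs, user, other) for other in prefs if other != user}
    neighbors = sorted(sims, key=sims.get, reverse=True)[:n_neighbors]

    # Step 2: the candidate set is every item the neighbors rated
    # that the target user has not rated yet.
    candidates = {item for nb in neighbors for item in prefs[nb] if item not in prefs[user]}

    # Step 3: score only the candidates (assumed scheme: similarity-weighted average).
    scored = []
    for item in candidates:
        num = den = 0.0
        for nb in neighbors:
            if item in prefs[nb] and sims[nb] > 0:
                num += sims[nb] * prefs[nb][item]
                den += sims[nb]
        if den > 0:
            scored.append((item, num / den))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:how_many]


result = recommend(ratings, "u1")
print(result)
```

Because only items rated by the neighbors are scored, the cost depends on the neighborhood size rather than on the full item catalogue, which is the efficiency gain over the naive flow.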
package recommender;

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

class RecommenderIntro {

    private RecommenderIntro() {
    }

    public static void main(String[] args) throws Exception {
        // Use the file named on the command line, falling back to a default path.
        File modelFile = null;
        if (args.length > 0) {
            modelFile = new File(args[0]);
        }
        if (modelFile == null || !modelFile.exists()) {
            modelFile = new File("E:\\hello.txt");
        }
        if (!modelFile.exists()) {
            System.err.println("Please specify the name of the data file, or put the file at E:\\hello.txt!");
            System.exit(1);
        }

        // Load the preference data and measure user similarity with Pearson correlation.
        DataModel model = new FileDataModel(modelFile);
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

        // The neighborhood is the 2 users most similar to the target user.
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        recommender.refresh(null);

        // Recommend up to 3 items for user 1.
        List<RecommendedItem> recommendations = recommender.recommend(1, 3);
        for (RecommendedItem recommendation : recommendations) {
            System.out.println(recommendation);
        }
    }
}
# Pearson correlation between two users' ratings.
from math import sqrt

critics = {
    'bob': {'A': 5.0, 'B': 3.0, 'C': 2.5},
    'alice': {'A': 5.0, 'C': 3.0},
}


def sim_pearson(prefs, p1, p2):
    # Get the list of mutually rated items
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]:
            si[item] = 1
    print(si)

    # If there are no ratings in common, return 0
    if len(si) == 0:
        return 0

    # Number of mutually rated items
    n = len(si)

    # Sums of all the preferences
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])

    # Sums of the squares
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])

    # Sum of the products
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])

    # Calculate r (Pearson score)
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
    if den == 0:
        return 0
    r = num / den
    return r


if __name__ == "__main__":
    print(critics['bob'])
    print(sim_pearson(critics, 'bob', 'alice'))

Mahout has a similar implementation, as follows:
package org.apache.mahout.cf.taste.impl.similarity;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.common.Weighting;
import org.apache.mahout.cf.taste.model.DataModel;

import com.google.common.base.Preconditions;

/**
 * <p>
 * An implementation of the Pearson correlation. For users X and Y, the following values are calculated:
 * </p>
 *
 * <ul>
 * <li>sumX2: sum of the square of all X's preference values</li>
 * <li>sumY2: sum of the square of all Y's preference values</li>
 * <li>sumXY: sum of the product of X and Y's preference value for all items for which both X and Y express a
 * preference</li>
 * </ul>
 *
 * <p>
 * The correlation is then:
 *
 * <p>
 * {@code sumXY / sqrt(sumX2 * sumY2)}
 * </p>
 *
 * <p>
 * Note that this correlation "centers" its data, shifts the user's preference values so that each of their
 * means is 0. This is necessary to achieve expected behavior on all data sets.
 * </p>
 *
 * <p>
 * This correlation implementation is equivalent to the cosine similarity since the data it receives
 * is assumed to be centered -- mean is 0. The correlation may be interpreted as the cosine of the angle
 * between the two vectors defined by the users' preference values.
 * </p>
 *
 * <p>
 * For cosine similarity on uncentered data, see {@link UncenteredCosineSimilarity}.
 * </p>
 */
public final class PearsonCorrelationSimilarity extends AbstractSimilarity {

  /**
   * @throws IllegalArgumentException if {@link DataModel} does not have preference values
   */
  public PearsonCorrelationSimilarity(DataModel dataModel) throws TasteException {
    this(dataModel, Weighting.UNWEIGHTED);
  }

  /**
   * @throws IllegalArgumentException if {@link DataModel} does not have preference values
   */
  public PearsonCorrelationSimilarity(DataModel dataModel, Weighting weighting) throws TasteException {
    super(dataModel, weighting, true);
    Preconditions.checkArgument(dataModel.hasPreferenceValues(), "DataModel doesn't have preference values");
  }

  @Override
  double computeResult(int n, double sumXY, double sumX2, double sumY2, double sumXYdiff2) {
    if (n == 0) {
      return Double.NaN;
    }
    // Note that sum of X and sum of Y don't appear here since they are assumed to be 0;
    // the data is assumed to be centered.
    double denominator = Math.sqrt(sumX2) * Math.sqrt(sumY2);
    if (denominator == 0.0) {
      // One or both parties has -all- the same ratings;
      // can't really say much similarity under this measure
      return Double.NaN;
    }
    return sumXY / denominator;
  }

}
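The Javadoc's claim that Pearson correlation is the cosine similarity of mean-centered data can be checked numerically. The sketch below is plain Python rather than Mahout code; the helper names and the sample ratings are hypothetical, and centering over the co-rated items is an assumption of this example.

```python
from math import sqrt


def pearson_from_sums(xs, ys):
    """Pearson correlation via raw sums, as in the sum-based formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    num = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    den = sqrt((sum(x * x for x in xs) - sum_x ** 2 / n)
               * (sum(y * y for y in ys) - sum_y ** 2 / n))
    return num / den if den else float("nan")


def centered_cosine(xs, ys):
    """Center both vectors to mean 0, then compute sumXY / (sqrt(sumX2) * sqrt(sumY2))."""
    n = len(xs)
    if n == 0:
        return float("nan")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cx = [x - mean_x for x in xs]
    cy = [y - mean_y for y in ys]
    sum_xy = sum(x * y for x, y in zip(cx, cy))
    sum_x2 = sum(x * x for x in cx)
    sum_y2 = sum(y * y for y in cy)
    denominator = sqrt(sum_x2) * sqrt(sum_y2)
    if denominator == 0.0:
        return float("nan")  # one side has identical ratings everywhere
    return sum_xy / denominator


# Hypothetical ratings of two users on three co-rated items.
xs = [5.0, 3.0, 2.5]
ys = [4.0, 3.0, 2.0]
print(pearson_from_sums(xs, ys), centered_cosine(xs, ys))
```

Both functions agree up to floating-point error, which is why the centered implementation needs no sum of X or sum of Y terms in its final step.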
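First, the Pearson correlation coefficient does not take into account the number of items on which two users' preferences overlap, which can be a weakness in a recommendation engine. In the example in the figure above, user1 and user5 expressed similar preferences for three items, yet user1 and user4 have the higher similarity, which is somewhat counterintuitive.
Second, if two users have expressed a preference for only one item in common, the Pearson correlation coefficient cannot be computed for them, as with user1 and user3 in the figure above.
Finally, if user5 rated every item 3.0, the similarity is likewise undefined (see formula 4: the denominator is 0).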
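The three weaknesses above can be reproduced with a small Python sketch. The ratings here are hypothetical and are not the values from the figure referred to above; the `pearson` helper is an illustrative function written for this example, returning NaN where the coefficient is undefined.

```python
from math import sqrt, isnan


def pearson(p, q):
    """Pearson correlation over co-rated items; NaN when it is undefined."""
    common = [item for item in p if item in q]
    n = len(common)
    if n == 0:
        return float("nan")
    xs = [p[item] for item in common]
    ys = [q[item] for item in common]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_x2 = sum((x - mean_x) ** 2 for x in xs)
    sum_y2 = sum((y - mean_y) ** 2 for y in ys)
    den = sqrt(sum_x2 * sum_y2)
    return sum_xy / den if den else float("nan")


# Hypothetical users (item -> preference).
base = {"i1": 5.0, "i2": 3.0, "i3": 2.5}
two_overlap = {"i1": 5.0, "i3": 3.0, "i4": 4.5}    # shares 2 items with base
three_overlap = {"i1": 4.0, "i2": 3.0, "i3": 2.0}  # shares 3 items with base
one_overlap = {"i1": 2.5, "i4": 4.0}               # shares 1 item with base
constant = {"i1": 3.0, "i2": 3.0, "i3": 3.0}       # same rating everywhere

# 1. Overlap size is ignored: two co-rated items can outrank three.
print(pearson(base, two_overlap), pearson(base, three_overlap))

# 2. A single co-rated item leaves no variance, so the result is undefined.
print(pearson(base, one_overlap))

# 3. Constant ratings make the denominator zero, so the result is undefined.
print(pearson(base, constant))
```

With only two co-rated points the coefficient is always exactly 1 or -1 (or undefined), which is why a small overlap can look more similar than a larger, well-matched one.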