Chapter 6: Dimensionality Reduction: Squashing the Data Pancake with PCA

  • Suggestion
    It is best not to apply PCA to raw counts (word counts, music play
    counts, movie viewing counts, etc.).

    The reason is that such counts often contain large outliers. As we know, PCA looks for linear correlations within the features.
    Correlation and variance statistics are very sensitive to large outliers; a single large value can shift them substantially. So, it is a good idea to first trim the data of large values (“Frequency-Based Filtering”), or apply a scaling transform like tf-idf (Chapter 4) or the log transform (“Log Transformation”).
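
To make the sensitivity concrete, here is a minimal sketch in plain Python. It does not run PCA itself; it only computes the correlation and variance statistics that PCA relies on. The play counts, the variable names, and the single "heavy user" row are made up for illustration, and `log1p` stands in for the log transform mentioned above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def variance(xs):
    """Sample variance (n - 1 denominator)."""
    n = len(xs)
    mx = sum(xs) / n
    return sum((x - mx) ** 2 for x in xs) / (n - 1)

# Hypothetical play counts for two songs across eight users (weakly related).
song_a = [1, 2, 3, 4, 5, 6, 7, 8]
song_b = [5, 3, 6, 2, 7, 1, 8, 4]

# Same data plus one hypothetical heavy user with 1000 plays of each song.
song_a_out = song_a + [1000]
song_b_out = song_b + [1000]

r_clean = pearson(song_a, song_b)
r_outlier = pearson(song_a_out, song_b_out)

# log(1 + x) compresses the large value before the statistics are computed.
log_a = [math.log1p(x) for x in song_a_out]
log_b = [math.log1p(x) for x in song_b_out]
r_log = pearson(log_a, log_b)

# How much the single outlier inflates the variance of song_a, raw vs. logged.
raw_inflation = variance(song_a_out) / variance(song_a)
log_inflation = variance(log_a) / variance([math.log1p(x) for x in song_a])

print(f"correlation: clean={r_clean:.3f} outlier={r_outlier:.3f} log={r_log:.3f}")
print(f"variance inflation: raw={raw_inflation:.0f}x log={log_inflation:.1f}x")
```

In this toy data, one extra row moves the raw correlation from weak to nearly 1 and inflates the raw variance by several orders of magnitude. After the log transform the outlier still has an effect, but a much smaller one, which is why trimming or rescaling before PCA is suggested.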