
scikit-learn: 4. Dataset transformations (clean data, reduce dimensionality, expand dimensionality, generate/extract features)


Reference: http://scikit-learn.org/stable/data_transforms.html


This article covers data preprocessing in four parts:

data cleaning, dimensionality reduction (PCA-style methods), dimensionality expansion (kernel-style methods), and extraction of custom-defined features.

Haha. Paying attention to preprocessing really is the reliable way to go.


The key passage, left in the original English: scikit-learn provides a library of transformers, which may clean (see Preprocessing data), reduce (see Unsupervised dimensionality reduction), expand (see Kernel Approximation) or generate (see Feature extraction) feature representations.


The difference between fit, transform, and fit_transform:

fit: learns the model's parameters from the training set (e.g. variance, median; or a vocabulary).

transform: maps training/test data into the representation learned by fit (e.g. scales the test set using the training set's variance or median; vectorizes the test set under the fitted vocabulary).

fit_transform: performs fit and transform in a single call.

Like other estimators, these are represented by classes with a fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modelling and transforming the training data simultaneously.
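The fit/transform distinction above can be sketched with StandardScaler (a minimal illustration; the toy arrays here are made up for the example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy data: parameters must come from the training set only
X_train = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_test = np.array([[2.0, 3.0]])

scaler = StandardScaler()
scaler.fit(X_train)                      # learn per-column mean and std from X_train
X_train_scaled = scaler.transform(X_train)  # apply the learned parameters
X_test_scaled = scaler.transform(X_test)    # test data uses the SAME training parameters

# fit_transform does both steps in one call, on the training data
X_train_scaled2 = StandardScaler().fit_transform(X_train)
```

Note that the test set is transformed with the training set's statistics; calling fit on the test set would leak information and produce inconsistent features.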


Eight sections in total; translations will be updated gradually:

4.1. Pipeline and FeatureUnion: combining estimators

4.1.1. Pipeline: chaining estimators

4.1.2. FeatureUnion: composite feature spaces

Translated article: http://blog.csdn.net/mmc2015/article/details/46991465
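As a rough sketch of what section 4.1 covers, a Pipeline chains transformers with a final estimator (a minimal example; the step names "pca" and "clf" and the use of the iris dataset are arbitrary choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)

# each step's transform output feeds the next step;
# the final step is an ordinary estimator
pipe = Pipeline([
    ("pca", PCA(n_components=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)
accuracy = pipe.score(X, y)
```

Calling fit on the pipeline fits each transformer in turn and the final estimator; calling predict/score runs the data through the same chain.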

4.2. Feature extraction

4.2.3. Text feature extraction

Translated article: http://blog.csdn.net/mmc2015/article/details/46997379

4.2.4. Image feature extraction

Translated article: http://blog.csdn.net/mmc2015/article/details/46992105


4.3. Preprocessing data

Translated article: http://blog.csdn.net/mmc2015/article/details/47016313

4.3.1. Standardization, or mean removal and variance scaling

4.3.2. Normalization

4.3.3. Binarization

4.3.4. Encoding categorical features

4.3.5. Imputation of missing values

4.4. Unsupervised dimensionality reduction

Translated article: http://blog.csdn.net/mmc2015/article/details/47066239

4.4.1. PCA: principal component analysis

4.4.2. Random projections

4.4.3. Feature agglomeration

4.5. Random Projection

Translated article: http://blog.csdn.net/mmc2015/article/details/47067003

4.5.1. The Johnson-Lindenstrauss lemma

4.5.2. Gaussian random projection

4.5.3. Sparse random projection

4.6. Kernel Approximation

Translated article: http://blog.csdn.net/mmc2015/article/details/47068223

4.6.1. Nystroem Method for Kernel Approximation

4.6.2. Radial Basis Function Kernel

4.6.3. Additive Chi Squared Kernel

4.6.4. Skewed Chi Squared Kernel

4.6.5. Mathematical Details

4.7. Pairwise metrics, Affinities and Kernels

Translated article: http://blog.csdn.net/mmc2015/article/details/47068895

4.7.1. Cosine similarity

4.7.2. Linear kernel

4.7.3. Polynomial kernel

4.7.4. Sigmoid kernel

4.7.5. RBF kernel

4.7.6. Chi-squared kernel

4.8. Transforming the prediction target (y)

Translated article: http://blog.csdn.net/mmc2015/article/details/47069869

4.8.1. Label binarization

4.8.2. Label encoding



