
Chapter 7: Nonlinear Featurization via K-Means Model Stacking

This chapter covers two topics: "manifold learning" and "model stacking".

manifold learning

manifold learning: unlike PCA, which performs linear dimensionality reduction, manifold learning is a family of nonlinear dimensionality reduction methods. It can "unroll" a curved, nonlinear feature space and thereby reduce its dimensionality. Manifold learning is mainly used for visualization.
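As a quick illustration (not from the original chapter), here is a minimal sketch of unrolling a nonlinear "Swiss roll" into two dimensions with scikit-learn's Isomap; the dataset and the choice of Isomap and its parameters are illustrative assumptions:

```python
# Minimal sketch: unroll a 3-D "Swiss roll" manifold into 2 dimensions.
# The dataset and the Isomap / n_neighbors choices are illustrative assumptions.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, color = make_swiss_roll(n_samples=1500, noise=0.05, random_state=0)  # 3-D nonlinear manifold
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)     # unrolled 2-D representation
print(embedding.shape)  # (1500, 2), suitable for a 2-D visualization
```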

model stacking

1. Model stacking steps
In this chapter, the base layer of the stack is a clustering model and the top layer is a logistic regression (a code sketch of the whole pipeline follows the steps below).
Step 1: split the data into train_data and test_data.
Step 2: split train_data into two parts, train_data_1 and train_data_2. Use train_data_1 to train the clustering model, then use the trained clustering model to assign each point in train_data_2 to a cluster.
Step 3: feed train_data_2 into the clustering model and use the model's prediction as a new feature for train_data_2, concatenating it with the original features to form the final features of train_data_2. The new feature can take two forms: 1) a one-hot cluster feature: if a data point belongs to cluster j, the new feature is a k-dimensional vector whose j-th entry is 1 and whose other entries are 0; 2) a dense feature formed from the inverse distances between the data point and each cluster centroid; if the number of clusters k is large, one can keep only the p nearest clusters and use their inverse distances as the new feature vector.
Step 4: use the final features of train_data_2 as the input of the top-layer logistic regression and train it.
Step 5: to predict the label of a test point, first feed it into the clustering model to form its final features, then feed those final features into the logistic regression to predict its label.
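A minimal, runnable sketch of these five steps in Python with scikit-learn. The toy two-moons dataset, k = 10 clusters, and all parameter values are illustrative assumptions, not taken from the book:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Step 1: split into train and test data (toy dataset, illustrative only)
X, y = make_moons(n_samples=2000, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 2: split the training data again; fit the clustering model on the first half
X_train_1, X_train_2, _, y_train_2 = train_test_split(X_train, y_train, test_size=0.5, random_state=0)
k = 10
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train_1)

# Step 3: one-hot cluster assignment as the new feature, concatenated with the original features
# (scikit-learn >= 1.2 spelling; older versions use sparse=False instead of sparse_output=False)
encoder = OneHotEncoder(categories=[np.arange(k)], sparse_output=False)
cluster_ids = kmeans.predict(X_train_2).reshape(-1, 1)
X_train_2_final = np.hstack([X_train_2, encoder.fit_transform(cluster_ids)])
# Alternative dense feature from Step 3: inverse distances to each centroid
# inv_dist = 1.0 / (kmeans.transform(X_train_2) + 1e-8)

# Step 4: train the top-layer logistic regression on the final features
clf = LogisticRegression(max_iter=1000).fit(X_train_2_final, y_train_2)

# Step 5: featurize the test data with the same clustering model, then predict
test_ids = kmeans.predict(X_test).reshape(-1, 1)
X_test_final = np.hstack([X_test, encoder.transform(test_ids)])
print("test accuracy:", clf.score(X_test_final, y_test))
```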
Note that during the clustering fit above we do not care about finding the true number of clusters; we only need enough clusters to cover the data. (Unlike in the classic clustering setup, we are not concerned with discovering the "true" number of clusters; we only need to cover them.)

2. Key intuition for model stacking
Model stacking has become an increasingly popular technique in recent years. Nonlinear classifiers are expensive to train and maintain. The key intuition with stacking is to push the nonlinearities into the features and use a very simple, usually linear model as the last layer. The featurizer can be trained offline, which means that one can use expensive models that require more computation power or memory but generate useful features. The simple model at the top level can be quickly adapted to the changing distributions of online data. This is a great trade-off between accuracy and speed, and this strategy is often used in applications like targeted advertising that require fast adaptation to changing data distributions.

A point I found confusing:

If the data is distributed uniformly throughout the space, then picking the right k boils down to a sphere-packing problem. In d dimensions, one could fit roughly 1/r^d spheres of radius r. Each k-means cluster is a sphere, and the radius is the maximum error of representing points in that sphere with the centroid. So, if we are willing to tolerate a maximum approximation error of r per data point, then the number of clusters is O(1/r^d), where d is the dimension of the original feature space of the data.
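One way to restate the counting argument above (my own paraphrase, assuming the data occupies a region of roughly unit volume): the number of clusters needed is about the volume of the data region divided by the volume covered by one ball of radius r, which scales as

```latex
k \;\approx\; \frac{\text{volume of the data region}}{\text{volume of one ball of radius } r}
  \;\propto\; \frac{1}{r^{d}}
  \;=\; O\!\left(\frac{1}{r^{d}}\right)
```

so halving the tolerated error r multiplies the required number of clusters by roughly 2^d.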