Lasso linear model example | Proliferation index | Estimating the proliferation index of single cells
Background: We developed a cell-cycle scoring approach that uses expression data to compute an index for every cell that scores the cell according to its expression of cell-cycle genes. In brief, our approach proceeded through four steps. (A) We reduced dimensionality of the dataset to the cell-cycle relevant genes. (B) In this subspace we performed, as a first approximation, a simple K-means clustering to separate non cycling from cycling cells and (C) we used this clustering as a reference to learn a function that takes the gene expression as the input and returns a cell-cycle score as an output. (D) We used this function to calculate a score for each single cell.
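The data is a gene expression matrix with one profile per cell, and the goal is to compute a proliferation index for every cell from its expression profile (based on cell-cycle genes).
The first thing that comes to mind is a linear model: treat each cell-cycle gene as a variable, output a single number as the proliferation index, and then normalize it to 0~1.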
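Written out, this naive model would look like the following, where $x_{ig}$ is the expression of cell-cycle gene $g$ in cell $i$ and $\beta_g$ are the unknown coefficients. The min-max scaling is only one assumed way to map the raw score to 0~1; the text does not say which normalization is meant.

```latex
s_i = \beta_0 + \sum_{g=1}^{G} \beta_g \, x_{ig},
\qquad
\tilde{s}_i = \frac{s_i - \min_j s_j}{\max_j s_j - \min_j s_j} \in [0, 1]
```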
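The problem is: how do we determine the coefficient in front of each gene? Simply writing down an equation by hand is not feasible, so we need a supervised learning model. But where does the supervised data come from? Our data has no labels.
Here is the method from the paper:
The data we want to score has no labels, so we create labels for it ourselves.
With a simple K-means clustering we can pick out the group of highly proliferating cells, and use that clustering as the training set for a supervised model.
We then apply the trained model back to our data to predict a proliferation index for every cell.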
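The three steps above can be sketched on synthetic data as follows. This is a minimal illustration, not the paper's notebook: the simulated matrix, the choice of two clusters, the rule for deciding which cluster is "cycling", and the min-max rescaling are all assumptions made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic stand-in for a (cells x cell-cycle genes) log-expression matrix.
n_cells, n_genes = 200, 30
is_cycling = rng.random(n_cells) < 0.3        # hidden truth, used only to simulate data
X = rng.normal(0.0, 0.5, size=(n_cells, n_genes))
X[is_cycling, :15] += 3.0                     # cycling cells express the first 15 genes highly
X = X - X.mean(axis=0)                        # center each gene

# Step 1: pseudo-labels from K-means (two clusters is an assumption here).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Assumed rule: the cluster with the higher mean expression is the cycling one.
cluster_means = [X[km.labels_ == k].mean() for k in range(2)]
cycling_cluster = int(np.argmax(cluster_means))
y = (km.labels_ == cycling_cluster).astype(float)   # 1 = cycling, 0 = not cycling

# Step 2: L1-regularized linear regression trained on the pseudo-labels.
model = Lasso(alpha=0.01).fit(X, y)

# Step 3: score every cell, then rescale to 0~1 (min-max scaling is an assumption).
raw_score = model.predict(X)
proliferation_index = (raw_score - raw_score.min()) / (raw_score.max() - raw_score.min())

print("genes with non-zero weight:", int(np.count_nonzero(model.coef_)))
```

Because the model is fitted on cluster labels rather than measured proliferation, the resulting index can only be as good as the initial clustering.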
We started by selecting a wide selection of genes related to cell-cycle and proliferation. We used the PANTHER GO database and selected all the genes that were described by one of the following terms: DNA metabolic process, DNA replication, mitosis, regulation of cell cycle, cell cycle, cytokinesis, histone, DNA-directed DNA polymerase, DNA polymerase processivity factor, centromere DNA-binding protein. We restricted our features to those genes. Genes that were detected at less than 10 molecules in the dataset were removed. We calculated the pairwise correlation coefficient matrix, and selected the genes that were strongly correlated (99th percentile of the matrix) with at least 12 other genes. The genes passing the filters described above were used for clustering cells using K-means (Python scikit-learn implementation, on log-centered data, default parameters) with the rationale that the main axis of variation expected would span across dividing and non-dividing cells. Then a linear regression model with L1-norm regularization was fitted that used a learning function which took expression data of a cell and categorized into two classes, 1 when a cell belongs to the cycling cluster and 0 when it did not. Importantly, to avoid both overfitting the score on the first approximation clusters and also to obtain a more generalizable model, we used a strong regularization (5 times the one determined by cross-validation; alpha = 0.01).
This procedure was used for both the mouse and human embryonic dataset. The function learnt on the human embryonic dataset was also used to determine the proliferation index of the hPSCs.
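The correlation filter in the excerpt (keep genes strongly correlated with at least 12 other genes, where "strongly" means above the 99th percentile of the correlation matrix) could be sketched like this. The function name `select_correlated_genes` is hypothetical, the diagonal is excluded from the percentile as an assumption, and the input is assumed to be already restricted to GO-selected genes with the low-count genes removed.

```python
import numpy as np

def select_correlated_genes(X, percentile=99.0, min_partners=12):
    """Return indices of genes strongly correlated with >= min_partners other genes.

    X is a (cells x genes) expression matrix; constant genes are assumed
    to have been removed beforehand.
    """
    corr = np.corrcoef(X, rowvar=False)              # genes x genes
    off_diag = ~np.eye(corr.shape[0], dtype=bool)    # ignore self-correlation
    threshold = np.percentile(corr[off_diag], percentile)
    strong = (corr >= threshold) & off_diag
    n_partners = strong.sum(axis=1)                  # strong partners per gene
    return np.flatnonzero(n_partners >= min_partners)

# Synthetic demo: 20 co-expressed genes hidden among 180 independent ones.
rng = np.random.default_rng(0)
n_cells, n_genes = 300, 200
shared_factor = rng.normal(size=(n_cells, 1))
X_demo = rng.normal(size=(n_cells, n_genes))
X_demo[:, :20] = shared_factor + 0.5 * rng.normal(size=(n_cells, 20))

selected = select_correlated_genes(X_demo)
print(selected)
```

A gene that correlates with only one or two others is dropped, which is the point of the filter: only genes that move together as a module are kept as features.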
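Of course, the paper's actual procedure is more careful:
1. First select the cell-cycle related genes from the PANTHER GO database;
2. Compute the pairwise correlation between genes and drop the genes that stand alone (not correlated with enough other genes);
3. Run K-means clustering into three clusters to obtain the training data;
4. Fit a linear regression model with L1-norm regularization; to prevent overfitting, the regularization parameter is set fairly strictly.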
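The "strict" setting in step 4 can be illustrated by choosing alpha with cross-validation and then multiplying it by 5, as the excerpt describes. This is a sketch on synthetic data with hypothetical pseudo-labels; whether the quoted alpha = 0.01 is the cross-validated value or the final one is not clear from the text, so no specific value is hard-coded here.

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV

rng = np.random.default_rng(1)
n_cells, n_genes = 150, 40
X = rng.normal(size=(n_cells, n_genes))
# Hypothetical pseudo-labels (1 = cycling cluster), driven by the first 5 genes only.
y = (X[:, :5].sum(axis=1) > 0).astype(float)

# Alpha chosen by 5-fold cross-validation.
cv_model = LassoCV(cv=5).fit(X, y)

# Deliberately stronger penalty: 5 times the cross-validated alpha.
strong_alpha = 5.0 * cv_model.alpha_

weak = Lasso(alpha=cv_model.alpha_).fit(X, y)
strong = Lasso(alpha=strong_alpha).fit(X, y)

print("cross-validated alpha:", cv_model.alpha_)
print("non-zero coefficients at CV alpha:", int(np.count_nonzero(weak.coef_)))
print("non-zero coefficients at 5x alpha:", int(np.count_nonzero(strong.coef_)))
```

The stronger penalty shrinks the coefficients further and typically sets more of them to zero, so the score leans on fewer genes and follows the first-pass clusters less closely.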
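From a machine-learning point of view, this approach gives a rough proliferation index. It should not be badly wrong, but it is probably not very precise either; still, it is good enough for comparing proliferation between cells.
If you want ground truth, you need a more rigorous experimental data source, for example gene expression data from highly proliferating cells and from cells that do not proliferate at all.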
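Code: ipynb-lamanno2016-proliferation.ipynb
The code is already fairly well commented; I will follow up with a summary and analysis, and extend it to other applications.
So this kind of model is fairly general.
For example, you could take genes related to apoptosis and cellular senescence to compute the degree of senescence of each cell.
The core problem is how to choose a suitable gene list! For some indicators it is hard to come up with one.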