
Lasso linear model example | Proliferation index | Scoring the proliferation index of single cells


Background: We developed a cell-cycle scoring approach that uses expression data to compute an index for every cell that scores the cell according to its expression of cell-cycle genes. In brief, our approach proceeded through four steps. (A) We reduced the dimensionality of the dataset to the cell-cycle relevant genes. (B) In this subspace we performed, as a first approximation, a simple K-means clustering to separate non-cycling from cycling cells, and (C) we used this clustering as a reference to learn a function that takes the gene expression as the input and returns a cell-cycle score as an output. (D) We used this function to calculate a score for each single cell.








We started by selecting a broad set of genes related to cell cycle and proliferation. We used the PANTHER GO database and selected all the genes that were described by one of the following terms: DNA metabolic process, DNA replication, mitosis, regulation of cell cycle, cell cycle, cytokinesis, histone, DNA-directed DNA polymerase, DNA polymerase processivity factor, centromere DNA-binding protein. We restricted our features to those genes. Genes that were detected at less than 10 molecules in the dataset were removed. We calculated the pairwise correlation coefficient matrix, and selected the genes that were strongly correlated (99th percentile of the matrix) with at least 12 other genes.

The genes passing the filters described above were used for clustering cells using K-means (Python scikit-learn implementation, on log-centered data, default parameters) with the rationale that the main axis of variation expected would span across dividing and non-dividing cells.

Then a linear regression model with L1-norm regularization was fitted. The learned function took the expression data of a cell and categorized it into two classes: 1 when the cell belonged to the cycling cluster and 0 when it did not. Importantly, to avoid overfitting the score on the first-approximation clusters and to obtain a more generalizable model, we used a strong regularization (5 times the one determined by cross-validation; alpha = 0.01).
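The two gene filters described above can be sketched roughly as follows. This is only an illustration, not the authors' code: the function name `select_correlated_genes` is hypothetical, "detected at less than 10 molecules" is assumed to mean the total count across all cells, and the 99th percentile is assumed to be taken over the off-diagonal entries of the correlation matrix.

```python
import numpy as np


def select_correlated_genes(counts, min_molecules=10, percentile=99.0, min_partners=12):
    """Return column indices of genes that pass both filters.

    counts: (n_cells, n_genes) molecule counts, already restricted to the
    candidate cell-cycle genes (e.g. chosen from GO terms).
    """
    counts = np.asarray(counts, dtype=float)

    # Filter 1: drop genes with too few molecules in the whole dataset.
    # Constant genes are dropped too, since their correlation is undefined
    # (an extra assumption of this sketch).
    expressed = (counts.sum(axis=0) >= min_molecules) & (counts.std(axis=0) > 0)
    kept = np.flatnonzero(expressed)
    if kept.size < 2:
        return kept[:0]

    # Filter 2: pairwise gene-gene correlation matrix.
    corr = np.corrcoef(counts[:, kept], rowvar=False)
    off_diag = ~np.eye(kept.size, dtype=bool)

    # "Strongly correlated" = at or above the chosen percentile of the matrix.
    threshold = np.percentile(corr[off_diag], percentile)

    # Count, for each gene, how many other genes it is strongly correlated with.
    n_partners = ((corr >= threshold) & off_diag).sum(axis=1)
    return kept[n_partners >= min_partners]
```

Because the threshold is a percentile of the whole matrix, the number of genes that survive depends on how many candidate genes go in; with very few candidates, no gene can reach 12 partners.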

This procedure was used for both the mouse and the human embryonic datasets. The function learnt on the human embryonic dataset was also used to determine the proliferation index of the hPSCs.
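Reusing a learned linear score on another dataset mostly comes down to lining up the genes. A minimal sketch, assuming the new data are preprocessed the same way as the training data; the function name `score_new_cells` is hypothetical, and filling missing genes with zero (the mean, for centered data) is an assumption of this sketch rather than something the text states.

```python
import numpy as np


def score_new_cells(coef, intercept, model_genes, expr, expr_genes):
    """Apply a fitted linear score to cells from another dataset.

    coef, intercept: parameters of the fitted linear model.
    model_genes: gene names in the order the model was trained on.
    expr: (n_cells, len(expr_genes)) expression of the new cells.
    """
    expr = np.asarray(expr, dtype=float)
    column = {gene: j for j, gene in enumerate(expr_genes)}

    # Reorder the new data to the model's gene order.
    # Genes absent from the new dataset stay at zero (assumption).
    aligned = np.zeros((expr.shape[0], len(model_genes)))
    for i, gene in enumerate(model_genes):
        if gene in column:
            aligned[:, i] = expr[:, column[gene]]

    return aligned @ np.asarray(coef, dtype=float) + intercept
```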


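1. First, select the cell-cycle related genes from the PANTHER GO database;

2. Compute the pairwise correlations between genes and drop the genes that stand alone, i.e. those not strongly correlated with enough other genes;

3. Run K-means clustering into three clusters to obtain the training labels;

4. Fit a linear regression model with L1-norm regularization; to prevent overfitting, the regularization is set fairly strong.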
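Steps 3 and 4 can be sketched with scikit-learn roughly as below. This is a sketch under stated assumptions, not the original implementation: the function name `fit_cycle_score` is hypothetical, the cycling cluster is assumed to be the one with the highest mean expression of the cell-cycle genes, and `n_clusters` is left as a parameter (default 2) because the quoted methods describe a cycling versus non-cycling split while the notes above mention three clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Lasso, LassoCV


def fit_cycle_score(log_expr, n_clusters=2, alpha_scale=5.0, seed=0):
    """Fit a Lasso-based cell-cycle score.

    log_expr: (n_cells, n_genes) log-transformed, gene-centered expression
    of the filtered cell-cycle genes.
    Returns the fitted Lasso model and the 0/1 labels it was trained on.
    """
    log_expr = np.asarray(log_expr, dtype=float)

    # First approximation: cluster cells in the cell-cycle gene subspace.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(log_expr)

    # Assumption: the cycling cluster is the one with the highest mean
    # expression of the cell-cycle genes; every other cluster is labeled 0.
    cluster_means = [log_expr[km.labels_ == k].mean() for k in range(n_clusters)]
    cycling = int(np.argmax(cluster_means))
    labels = (km.labels_ == cycling).astype(float)

    # Choose alpha by cross-validation, then regularize alpha_scale times
    # more strongly so the score does not overfit the rough cluster labels.
    cv_alpha = LassoCV(cv=5).fit(log_expr, labels).alpha_
    model = Lasso(alpha=alpha_scale * cv_alpha).fit(log_expr, labels)
    return model, labels
```

The score for each cell is then `model.predict(log_expr)`, a continuous value rather than a hard 0/1 label, and the genes with non-zero entries in `model.coef_` are the ones the score actually uses.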


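If you want ground truth, you need a more rigorous experimental data source, for example gene expression data from highly proliferating cells and from cells that do not proliferate at all.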





The core problem is how to choose a suitable gene list! For some indices it is hard to pick one.
