In radiomics, feature selection is absolutely central.
However carefully you refine the images and extract values from them, a poor feature selection step wastes all of that effort.
Many approaches to feature selection have been proposed; here I summarise the most widely used ones.
In omics experiments, one of the ultimate goals is the identification of features (biomarkers) that differ between treatment groups.
A very common problem in omics data is that the sample size is small while the number of features is huge, which can lead to over-fitting.
What can be alternative methods to overcome this problem?
The first paradigm
- LASSO: the least absolute shrinkage and selection operator, a penalised, classification-based approach
- Ridge regression
- Elastic Net feature selection methods
The second paradigm
- using a linear models framework : individual features are modeled separately ignoring the correlation structure among features.
Omics data analysis workflow
⇨ original feature subsets ⇨ classification approach
Pre-screening
1. t-test
2. Hardy-Weinberg equilibrium tests
3. non-statistical biological considerations
⇨ These pre-screening methods aim at efficient classification of samples into groups rather than at feature selection itself.
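As a concrete illustration of t-test pre-screening, here is a minimal sketch on simulated data (not the paper's actual code; the toy matrix, the effect size, and the significance cut-off of 0.05 are all my assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy omics matrix: 20 samples x 50 features; the first 5 features
# are shifted in the treatment group (an assumed, artificial signal).
X = rng.normal(size=(20, 50))
y = np.array([0] * 10 + [1] * 10)
X[y == 1, :5] += 3.0

# Two-sample t-test per feature; keep features with p < 0.05 (assumed alpha).
_, pvals = stats.ttest_ind(X[y == 0], X[y == 1], axis=0)
selected = np.where(pvals < 0.05)[0]
print("features passing pre-screening:", selected)
```

Only the surviving columns would then be passed on to the classification step; in practice a multiple-testing correction (e.g. Bonferroni or FDR) is usually applied to the p-values first.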
The paper measures accuracy in terms of Type I and Type II error rates.
✅Simulation
- 100 samples, with 12 significant features out of 1,000; the performance of LASSO, Elastic Net, ridge regression, principal components regression, and other feature selection methods was compared.
👍 Elastic Net: showed the lowest mean squared error of prediction
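A scaled-down sketch of this kind of comparison using scikit-learn (this is my own toy version, not the paper's simulation: I use 200 features instead of 1,000, and the penalty strengths `alpha` and `l1_ratio` are arbitrary assumptions rather than tuned values):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# 100 samples, 200 features, 12 of which carry signal (assumed setup).
n, p, k = 100, 200, 12
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:k] = 2.0
y = X @ beta + rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

results = {}
for name, model in [("LASSO", Lasso(alpha=0.1, max_iter=10000)),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("Elastic Net", ElasticNet(alpha=0.1, l1_ratio=0.5,
                                               max_iter=10000))]:
    model.fit(X_tr, y_tr)
    # MSEP = mean squared error of prediction on held-out samples.
    results[name] = mean_squared_error(y_te, model.predict(X_te))
    print(f"{name:11s} MSEP = {results[name]:.2f}")
```

With a sparse true signal and p > n, the sparsity-inducing penalties tend to give a much lower MSEP than ridge here; in a serious comparison the penalties would be chosen by cross-validation (e.g. `LassoCV`, `ElasticNetCV`).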
✅biosignature for Lyme disease prediction
- Sample size: 202 treatment and 259 control samples.
• The number of features before pre-screening = 2,262
• The number of features after pre-screening = 95
- LASSO, Classification Tree, and Linear discriminant analysis were applied.
• LASSO: achieved the best ROC performance
• Elastic Net: had lower MSEP than SVM, superior to stepwise selection.
LASSO, ridge regression & Elastic Net
- penalised regression models.
① Ridge regression: has a closed-form solution for the standard linear model with normal errors and yields shrunken regression coefficients (none of which is exactly zero)
⇨ Ridge regression can therefore be used as a prediction tool, but not as a feature selector.
② LASSO: has no closed-form solution. It uses shrinkage to estimate which regression coefficients are exactly zero, so those features can be eliminated.
⇨ One limitation of this method is that the number of variables it can select must be less than or equal to the sample size n.
⇨ LASSO also often selects only a single feature from a set of highly correlated features.
③ Elastic Net: addressed the drawbacks of the LASSO and ridge regression methods.
This method is a weighted combination of both LASSO and ridge regression penalties.
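The contrast between the three penalties can be seen directly in how many coefficients each model zeroes out. A minimal sketch with two nearly identical informative features (my own toy example; the `alpha` and `l1_ratio` values are arbitrary assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(2)

# 80 samples, 30 features; features 0 and 1 are almost perfectly
# correlated copies of the true signal z, the rest are noise.
n, p = 80, 30
z = rng.normal(size=n)
X = rng.normal(size=(n, p))
X[:, 0] = z + 0.01 * rng.normal(size=n)
X[:, 1] = z + 0.01 * rng.normal(size=n)
y = 3.0 * z + rng.normal(size=n)

counts = {}
for name, model in [("LASSO", Lasso(alpha=0.2)),
                    ("Ridge", Ridge(alpha=1.0)),
                    ("Elastic Net", ElasticNet(alpha=0.2, l1_ratio=0.5))]:
    coef = model.fit(X, y).coef_
    # Ridge shrinks but never zeroes; LASSO and Elastic Net produce zeros.
    counts[name] = int(np.sum(np.abs(coef) > 1e-8))
    print(f"{name:11s} non-zero coefficients: {counts[name]}")
```

Ridge keeps all 30 coefficients non-zero, while the L1-containing penalties discard most of the noise features; in this correlated-pair setting Elastic Net also tends to keep both copies where LASSO tends to pick just one, which is exactly the drawback the weighted combination is meant to fix.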
The above summary is based on: Kirpich, Alexander et al. "Variable selection in omics data: A practical evaluation of small sample sizes." PLoS ONE 13(6): e0197910, 21 Jun. 2018.