
Radiomics: Feature selection 2

 Earlier, I discussed the feature selection methods commonly used in radiomics. This post breaks them down in more detail.

The study reviewed here compares 14 feature selection methods and 12 classification methods in terms of predictive performance and stability.


Methods

❗ Radiomic Features

A total of 440 radiomic features were used and divided into 4 feature groups.

1) tumor intensity

    - statistics of the intensity histogram

2) shape

    - 3D geometric properties of the tumor

3) texture

    - GLCM: gray level co-occurrence matrices

    - GLRLM: gray level run length matrices

     ⇨ quantify intra-tumor heterogeneity

4) wavelet features

    - transformed domain representations of the intensity and textural features.


❗ Datasets

    • survival time > 2 years ⇨ 1

    • survival time < 2 years ⇨ 0

   - 310 lung cancer patients in the training cohort and 154 patients in the validation cohort.

   - All features were normalised using Z-score normalisation.
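Z-score normalisation rescales each feature to zero mean and unit variance. A minimal sketch (illustrative helper, not the paper's code):

```python
from statistics import mean, stdev

def z_score(values):
    """Rescale a feature vector to zero mean and unit variance."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]
```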


❗ Feature Selection Methods

 - 14 feature selection methods based on filter approaches were used.

 - Selection criteria: simplicity, computational efficiency, and popularity in the literature.

  • Fisher score
  • Relief
  • T-score
  • Chi-square
  • Wilcoxon
  • Gini index
  • Mutual information maximisation
  • Mutual information feature selection
  • Minimum redundancy maximum relevance
  • Conditional infomax feature extraction
  • Joint mutual information
  • Conditional mutual information maximisation
  • Interaction capping
  • Double input symmetric relevance
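As an illustration of how such filter scores work, the Fisher score ranks each feature by how far apart the two class means are relative to the within-class variances (a minimal sketch with hypothetical helper names, not the paper's implementation):

```python
from statistics import mean, variance

def fisher_score(feature, labels):
    """Fisher score of one feature for binary labels (0/1):
    between-class separation divided by within-class spread."""
    pos = [x for x, y in zip(feature, labels) if y == 1]
    neg = [x for x, y in zip(feature, labels) if y == 0]
    return (mean(pos) - mean(neg)) ** 2 / (variance(pos) + variance(neg))
```

Features are then ranked by this score and the top-n kept; each of the other filter criteria above substitutes a different scoring function.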


❗ Classifiers

 - 12 machine learning based classification methods were considered.

 - As a supervised learning task, the data were split into training and validation sets.

 - 10-fold cross-validation was used.

 - Predictive performance was evaluated with AUC (area under the ROC curve).

  • Bagging
  • Bayesian
  • Boosting
  • Decision trees
  • Discriminant analysis
  • Generalised linear models
  • Multivariate adaptive regression splines
  • Nearest neighbours
  • Neural networks
  • Partial least squares and principal component regression
  • Random forests
  • Support vector machines
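AUC can be computed directly from pairwise ranks (the Mann-Whitney U formulation): it is the probability that a randomly chosen positive case is scored above a randomly chosen negative one. A small self-contained sketch:

```python
def auc(labels, scores):
    """AUC as the fraction of positive/negative score pairs where the
    positive case receives the higher score (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```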




Analysis

Predictive Performance of Feature Selection Methods

    - The number of selected features was increased stepwise (n = 5, 10, 15, 20, ..., 50) and the median AUC was computed at each step.
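This sweep over subset sizes can be sketched as selecting the top-n features by their filter score at each step (hypothetical helper names; `evaluate_auc` stands in for training a classifier on the selected features):

```python
from statistics import median

def top_n_features(scores, n):
    """Indices of the n highest-scoring features."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:n]

def median_auc_by_size(scores, evaluate_auc, sizes=range(5, 55, 5)):
    """Median AUC per feature-subset size; `evaluate_auc` returns the
    cross-validated AUCs obtained with the given feature indices."""
    return {n: median(evaluate_auc(top_n_features(scores, n))) for n in sizes}
```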


Results

  A total of 440 radiomic features were extracted from the segmented tumor regions.

Predictive performance of feature selection and classification methods

• AUC was used for assessing predictive performance of different feature selection and classification methods.

Classification

   👍 Random Forest showed the highest predictive performance as a classifier.

      (AUC = 0.66 ± 0.03)

   👎 Decision Tree had the lowest predictive performance.

      (AUC = 0.54 ± 0.04)

Feature selection

   👍 Wilcoxon test based methods showed the highest predictive performance.

      (AUC = 0.65 ± 0.02)

   👎 Chi-square & conditional infomax feature extraction displayed the lowest predictive performance. (AUC = 0.60 ± 0.03)

Stability of the feature selection and classification methods

✅ Feature selection

   👍 Mutual Information Maximisation was the most stable (stability = 0.94 ± 0.02)

   👍 Relief was the second best (stability = 0.91 ± 0.05)

   👎 GINI (Gini index), JMI (joint mutual information), CHSQ (chi-square), DISR (double input symmetric relevance), and CIFE (conditional infomax feature extraction) showed relatively low stability.
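Feature-selection stability is scored by how consistently a method picks the same features when the data are resampled. One common overlap measure between two selected subsets is the Jaccard similarity (an illustrative choice here, not necessarily the paper's exact metric):

```python
def jaccard(subset_a, subset_b):
    """Overlap of two selected feature subsets: |A ∩ B| / |A ∪ B|."""
    a, b = set(subset_a), set(subset_b)
    return len(a & b) / len(a | b)
```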

✅ Classification

    - RSD (relative standard deviation) was used for measuring empirical stability.

   👍 Bayesian classifier was the best (RSD = 0.86%)

   👍 Generalised linear models were the second best (RSD = 2.19%)

   👍 Partial least squares and principal component regression was the third best (RSD = 2.24%)

   👎 Boosting had the lowest stability among the classification methods.
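RSD is the standard deviation of a classifier's AUC values expressed as a percentage of their mean, so lower values mean a more stable classifier. A minimal sketch:

```python
from statistics import mean, stdev

def rsd(auc_values):
    """Relative standard deviation (%) of a set of AUC values."""
    return 100.0 * stdev(auc_values) / mean(auc_values)
```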


Stability and Predictive Performance



✅ 👍 Feature selection methods

 Wilcoxon (stability = 0.84 ± 0.05, AUC = 0.65 ± 0.02)

 Mutual information feature selection (stability = 0.8 ± 0.03, AUC = 0.63 ± 0.03)

 Minimum redundancy maximum relevance (stability = 0.74 ± 0.03, AUC = 0.63 ± 0.03)

 Fisher score (stability = 0.78 ± 0.08, AUC = 0.62 ± 0.04)

are preferred, as their stability and predictive performance were both higher than the corresponding median values (stability = 0.735, AUC = 0.615) across all feature selection methods.

✅ 👍 Classification methods

 RF (RSD = 3.52%, AUC = 0.66 ± 0.03)

 BY (RSD = 0.86%, AUC = 0.64 ± 0.05)

 BAG (RSD = 5.56%, AUC = 0.64 ± 0.03)

 GLM (RSD = 2.19%, AUC = 0.63 ± 0.02)

 PLSR (RSD = 2.24%, AUC = 0.63 ± 0.02)

are preferred, as their stability and predictive performance were higher than the corresponding median values (RSD = 5.93%, AUC = 0.61).


Experimental Factors Affecting the Radiomics Based Survival Prediction

 - ANOVA was performed on the AUC scores to quantify the effects of 3 experimental factors (feature selection method, classification method, and the number of selected features).

 - ANOVA result: all 3 factors and their interactions were significant.

 - Classification method was the most dominant source of variability (34.21%).

 - Feature selection accounted for 6.25%.

 - The classification × feature selection interaction explained 23.03%.

 - The size of the selected feature subset accounted for only 1.65% of the total variance.
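The variance shares reported above come from an ANOVA decomposition: each factor's sum of squares divided by the total. For a single factor this reduces to eta-squared, sketched here (illustrative one-way version, not the paper's multi-way ANOVA):

```python
from statistics import mean

def variance_explained(groups):
    """Eta-squared: SS_between / SS_total for one grouping factor,
    where each group holds the AUCs observed at one factor level."""
    all_vals = [v for g in groups for v in g]
    grand = mean(all_vals)
    ss_total = sum((v - grand) ** 2 for v in all_vals)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    return ss_between / ss_total
```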



Discussion

 Feature selection methods fall broadly into 3 categories:

(1) filter methods 

     - This paper only investigated filter methods as these are classifier independent.

    • simple feature ranking methods based on some heuristic scoring criterion

    • computationally efficient

    • high generalisability and scalability

(2) wrapper methods

    • classifier dependent

    ⇨ may produce feature subsets that are overly specific to the classifier, hence low generalisability

    • search through the whole feature space to identify a relevant, non-redundant feature subset

    • computationally expensive

(3) embedded methods

    • classifier dependent

    ⇨ lacks generalisability

    • incorporate feature selection as part of the training process

    • computationally efficient compared to wrappers


Filter Methods

 - J : scoring criterion (relevance index)

 - Y : class labels

 - X : set of all features

 - Xk : the feature to be evaluated

 - S : the set of already selected features
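With this notation, several of the filter criteria compared above can be written explicitly. These are the standard forms from the information-theoretic feature selection literature (not reproduced from the paper itself), where I(·;·) denotes mutual information:

$$ J_{\text{MIM}}(X_k) = I(X_k; Y) $$

$$ J_{\text{mRMR}}(X_k) = I(X_k; Y) - \frac{1}{|S|} \sum_{X_j \in S} I(X_k; X_j) $$

$$ J_{\text{JMI}}(X_k) = \sum_{X_j \in S} I(X_k, X_j; Y) $$

$$ J_{\text{CIFE}}(X_k) = I(X_k; Y) - \sum_{X_j \in S} I(X_k; X_j) + \sum_{X_j \in S} I(X_k; X_j \mid Y) $$

MIM scores each feature in isolation, while the other criteria penalise redundancy with the already selected set S in different ways.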


This post is based on: Parmar, C., Grossmann, P., Bussink, J. et al. Machine Learning methods for Quantitative Radiomic Biomarkers. Sci Rep 5, 13087 (2015).

