
Radiomics: Feature selection 2

 Earlier, I discussed the feature selection methods commonly used in radiomics. This post breaks them down in more detail.

The study reviewed here compares 14 feature selection methods and 12 classification methods in terms of predictive performance and stability.


Methods

❗ Radiomic Features

A total of 440 radiomic features were used and divided into 4 feature groups.

1) tumor intensity

    - statistics of the intensity histogram

2) shape

    - 3D geometric properties of the tumor

3) texture

    - GLCM: gray level co-occurrence matrices

    - GLRLM: gray level run length matrices

     ⇨ quantify intra-tumor heterogeneity

4) wavelet features

    - transformed domain representations of the intensity and textural features.


❗ Datasets

    • survival time > 2 years ⇨ 1

    • survival time < 2 years ⇨ 0

   - 310 lung cancer patients in the training cohort and 154 patients in the validation cohort.

   - All features were normalised using Z-score normalisation.
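Z-score normalisation rescales each feature to zero mean and unit variance. A minimal sketch (illustrative helper, not the paper's code):

```python
from statistics import mean, stdev

def z_score(values):
    """Rescale a feature vector to zero mean and unit variance."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]
```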


❗ Feature Selection Methods

 - 14 feature selection methods based on filter approaches were used.

 - Selection criteria: simplicity, computational efficiency, and popularity in the literature.

  • Fisher score
  • Relief
  • T-score
  • Chi-square
  • Wilcoxon
  • Gini index
  • Mutual information maximisation
  • Mutual information feature selection
  • Minimum redundancy maximum relevance
  • Conditional infomax feature extraction
  • Joint mutual information
  • Conditional mutual information maximisation
  • Interaction capping
  • Double input symmetric relevance
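As an illustration of how such filter scores work, the Fisher score ranks each feature by how far apart the two class means are relative to the within-class variances (a minimal sketch with hypothetical helper names, not the paper's implementation):

```python
from statistics import mean, variance

def fisher_score(feature, labels):
    """Fisher score of one feature for binary labels (0/1):
    between-class separation divided by within-class spread."""
    pos = [x for x, y in zip(feature, labels) if y == 1]
    neg = [x for x, y in zip(feature, labels) if y == 0]
    return (mean(pos) - mean(neg)) ** 2 / (variance(pos) + variance(neg))
```

Features are then ranked by this score and the top-n kept; each of the other filter criteria above substitutes a different scoring function.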


❗ Classifiers

 - 12 machine learning based classification methods were considered.

 - As a supervised learning task, the data were split into training and validation sets.

 - 10-fold cross-validation was used.

 - Predictive performance was evaluated with AUC (area under the ROC curve).

  • Bagging
  • Bayesian
  • Boosting
  • Decision trees
  • Discriminant analysis
  • Generalised linear models
  • Multivariate adaptive regression splines
  • Nearest neighbours
  • Neural networks
  • Partial least squares and principal component regression
  • Random forests
  • Support vector machines
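AUC can be computed directly from pairwise ranks (the Mann-Whitney U formulation): it is the probability that a randomly chosen positive case is scored above a randomly chosen negative one. A small self-contained sketch:

```python
def auc(labels, scores):
    """AUC as the fraction of positive/negative score pairs where the
    positive case receives the higher score (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```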




Analysis

Predictive Performance of Feature Selection Methods

    - The number of selected features was increased stepwise (n = 5, 10, 15, 20, ..., 50) and the median AUC was computed at each step.
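This sweep over subset sizes can be sketched as selecting the top-n features by their filter score at each step (hypothetical helper names; `evaluate_auc` stands in for training a classifier on the selected features):

```python
from statistics import median

def top_n_features(scores, n):
    """Indices of the n highest-scoring features."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:n]

def median_auc_by_size(scores, evaluate_auc, sizes=range(5, 55, 5)):
    """Median AUC per feature-subset size; `evaluate_auc` returns the
    cross-validated AUCs obtained with the given feature indices."""
    return {n: median(evaluate_auc(top_n_features(scores, n))) for n in sizes}
```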


Results

  A total of 440 radiomic features were extracted from the segmented tumor regions.

Predictive performance of feature selection and classification methods

• AUC was used for assessing predictive performance of different feature selection and classification methods.

Classification

   👍 Random Forest showed the highest predictive performance as a classifier.

      (AUC = 0.66 ± 0.03)

   👎 Decision Tree had the lowest predictive performance.

      (AUC = 0.54 ± 0.04)

Feature selection

   👍 Wilcoxon test based methods showed the highest predictive performance.

      (AUC = 0.65 ± 0.02)

   👎 Chi-square & conditional infomax feature extraction displayed the lowest predictive performance. (AUC = 0.60 ± 0.03)

Stability of the feature selection and classification methods

✅ Feature selection

   👍 Mutual Information Maximisation was the most stable (stability = 0.94 ± 0.02)

   👍 Relief was the second best (stability = 0.91 ± 0.05)

   👎 GINI (Gini index), JMI (joint mutual information), CHSQ (chi-square), DISR (double input symmetric relevance), and CIFE (conditional infomax feature extraction) showed relatively low stability.
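Feature-selection stability is scored by how consistently a method picks the same features when the data are resampled. One common overlap measure between two selected subsets is the Jaccard similarity (an illustrative choice here, not necessarily the paper's exact metric):

```python
def jaccard(subset_a, subset_b):
    """Overlap of two selected feature subsets: |A ∩ B| / |A ∪ B|."""
    a, b = set(subset_a), set(subset_b)
    return len(a & b) / len(a | b)
```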

✅ Classification

    - RSD (relative standard deviation) was used for measuring empirical stability.

   👍 Bayesian classifier was the best (RSD = 0.86%)

   👍 Generalised linear models were the second best (RSD = 2.19%)

   👍 Partial least squares and principal component regression was the third best (RSD = 2.24%)

   👎 Boosting had the lowest stability among the classification methods.
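RSD is the standard deviation of a classifier's AUC values expressed as a percentage of their mean, so lower values mean a more stable classifier. A minimal sketch:

```python
from statistics import mean, stdev

def rsd(auc_values):
    """Relative standard deviation (%) of a set of AUC values."""
    return 100.0 * stdev(auc_values) / mean(auc_values)
```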


Stability and Predictive Performance



✅ 👍 Feature selection methods

 Wilcoxon (stability = 0.84 ± 0.05, AUC = 0.65 ± 0.02)

 Mutual information feature selection (stability = 0.8 ± 0.03, AUC = 0.63 ± 0.03)

 Minimum redundancy maximum relevance (stability = 0.74 ± 0.03, AUC = 0.63 ± 0.03)

 Fisher score (stability = 0.78 ± 0.08, AUC = 0.62 ± 0.04)

are preferred, as their stability and predictive performance were both higher than the corresponding median values (stability = 0.735, AUC = 0.615) across all feature selection methods.

✅ 👍 Classification methods

 RF (RSD = 3.52%, AUC = 0.66 ± 0.03)

 BY (RSD = 0.86%, AUC = 0.64 ± 0.05)

 BAG (RSD = 5.56%, AUC = 0.64 ± 0.03)

 GLM (RSD = 2.19%, AUC = 0.63 ± 0.02)

 PLSR (RSD = 2.24%, AUC = 0.63 ± 0.02)

are preferred, as their stability and predictive performance were higher than the corresponding median values (RSD = 5.93%, AUC = 0.61).


Experimental Factors Affecting the Radiomics Based Survival Prediction

 - ANOVA was performed on the AUC scores to quantify the effects of 3 experimental factors (feature selection method, classification method, and the number of selected features).

 - ANOVA result: all 3 factors and their interactions were significant.

 - Classification method was the most dominant source of variability (34.21%).

 - Feature selection accounted for 6.25%.

 - The classification × feature selection interaction explained 23.03%.

 - The size of the selected feature subset accounted for only 1.65% of the total variance.
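The variance shares reported above come from an ANOVA decomposition: each factor's sum of squares divided by the total. For a single factor this reduces to eta-squared, sketched here (illustrative one-way version, not the paper's multi-way ANOVA):

```python
from statistics import mean

def variance_explained(groups):
    """Eta-squared: SS_between / SS_total for one grouping factor,
    where each group holds the AUCs observed at one factor level."""
    all_vals = [v for g in groups for v in g]
    grand = mean(all_vals)
    ss_total = sum((v - grand) ** 2 for v in all_vals)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    return ss_between / ss_total
```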



Discussion

 Feature selection methods fall broadly into 3 categories:

(1) filter methods 

     - This paper only investigated filter methods as these are classifier independent.

    • simple feature ranking methods based on some heuristic scoring criterion

    • computationally efficient

    • high generalisability and scalability

(2) wrapper methods

    • classifier dependent

    ⇨ may produce feature subsets that are overly specific to the classifier, hence low generalisability

    • search through the whole feature space to identify a relevant, non-redundant feature subset

    • computationally expensive

(3) embedded methods

    • classifier dependent

    ⇨ lacks generalisability

    • incorporate feature selection as part of the training process

    • computationally efficient compared to wrappers


Filter Methods

 - J : scoring criterion (relevance index)

 - Y : class labels

 - X : set of all features

 - Xk : the feature to be evaluated

 - S : the set of already selected features
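With this notation, several of the filter criteria compared above can be written explicitly. These are the standard forms from the information-theoretic feature selection literature (not reproduced from the paper itself), where I(·;·) denotes mutual information:

$$ J_{\text{MIM}}(X_k) = I(X_k; Y) $$

$$ J_{\text{mRMR}}(X_k) = I(X_k; Y) - \frac{1}{|S|} \sum_{X_j \in S} I(X_k; X_j) $$

$$ J_{\text{JMI}}(X_k) = \sum_{X_j \in S} I(X_k, X_j; Y) $$

$$ J_{\text{CIFE}}(X_k) = I(X_k; Y) - \sum_{X_j \in S} I(X_k; X_j) + \sum_{X_j \in S} I(X_k; X_j \mid Y) $$

MIM scores each feature in isolation, while the other criteria penalise redundancy with the already selected set S in different ways.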


This post is based on: Parmar, C., Grossmann, P., Bussink, J. et al. Machine Learning methods for Quantitative Radiomic Biomarkers. Sci Rep 5, 13087 (2015).

