Lung cancer is the leading cause of cancer-related deaths worldwide among both men and women [1]. In 2024, an estimated 234,580 people will be diagnosed with lung cancer in the United States [2]. Overall, 1 in 16 men and 1 in 17 women will be diagnosed with lung cancer in their lifetime, and approximately 125,000 people die from the disease annually in the US.
The primary objective of this research is to implement multivariate machine learning methods to improve the accuracy of patient outcome prediction. Patients’ clinical and demographic information is integrated with quantitative tumor features extracted from their lung CT scans. Univariate Cox proportional hazards (Cox PH) models and multivariate models are used to identify significant features in the integrated dataset, and a reduced feature set is built from the identified factors. This reduced feature set is then used in Random Survival Forest (RSF), an ensemble learning algorithm for survival analysis. RSF can capture complex non-linear relationships in survival data and is a nonparametric alternative to conventional linear survival models [3].
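The univariate screening step can be illustrated with a minimal sketch. It assumes a single numeric covariate, Breslow handling of tied event times, and a Wald test for significance; the function name `fit_univariate_cox` and the toy values are hypothetical and are not taken from the study. In practice an established survival-analysis library would normally be used instead of a hand-written fit.

```python
import math

def fit_univariate_cox(times, events, x, n_iter=25, tol=1e-9):
    """Fit a one-covariate Cox PH model by Newton-Raphson (Breslow ties).

    Returns (beta, standard_error, wald_p_value). Illustrative only.
    """
    beta, info = 0.0, 0.0
    for _ in range(n_iter):
        grad, info = 0.0, 0.0
        for i, (t_i, d_i) in enumerate(zip(times, events)):
            if not d_i:
                continue  # censored subjects contribute only through risk sets
            s0 = s1 = s2 = 0.0
            for t_j, x_j in zip(times, x):
                if t_j >= t_i:  # subject j is still at risk at time t_i
                    w = math.exp(beta * x_j)
                    s0 += w
                    s1 += w * x_j
                    s2 += w * x_j * x_j
            mean = s1 / s0
            grad += x[i] - mean              # score contribution
            info += s2 / s0 - mean * mean    # observed information contribution
        if info <= 0.0:
            return beta, float("inf"), 1.0   # covariate carries no information
        step = grad / info
        beta += step
        if abs(step) < tol:
            break
    se = 1.0 / math.sqrt(info)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided Wald test
    return beta, se, p_value

# Hypothetical toy data: follow-up time, event flag (1 = death), one feature.
times = [2, 3, 5, 6, 8, 9, 11, 12, 14, 15]
events = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]
feature = [2.1, 0.4, 1.5, 1.2, 0.3, 1.0, 0.2, 0.9, -0.3, 0.1]

beta, se, p_value = fit_univariate_cox(times, events, feature)
print(f"beta={beta:.3f}, se={se:.3f}, p={p_value:.4f}")
```

A feature would be kept for the reduced set when its p-value falls below a chosen threshold. The threshold itself is a design choice that is not specified here.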
Lung 1, a publicly available dataset of lung cancer patients, was used in this study. The dataset was split into a training set (80%) and a testing set (20%). RSF was trained on the reduced feature set derived from the patients’ information in Lung 1, and the probability of 5-year survival was predicted for the patients in the test set. Lastly, Kaplan-Meier [4] curves were generated to compare the observed survival probabilities with those predicted by the proposed method. The outcomes are promising and encourage us to extend the proposed method and apply it to larger datasets of lung cancer patients.
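The evaluation steps described above (an 80/20 split and an observed Kaplan-Meier survival estimate at 5 years) can be sketched in plain Python. This is an illustration under stated assumptions, not the study's pipeline: the helper names `split_indices`, `kaplan_meier`, and `survival_at` are hypothetical, the toy follow-up times are invented, and the RSF model itself is not shown because it would come from a survival-analysis library.

```python
import random

def split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n-1 and return (train_indices, test_indices)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]

def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time (product-limit estimate)."""
    curve, surv = [], 1.0
    for t in sorted({t for t, d in zip(times, events) if d}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, d in zip(times, events) if u == t and d)
        surv *= 1.0 - deaths / at_risk
        curve.append((t, surv))
    return curve

def survival_at(curve, horizon):
    """Read the step-function estimate S(horizon) off a Kaplan-Meier curve."""
    s = 1.0
    for t, surv in curve:
        if t > horizon:
            break
        s = surv
    return s

# Hypothetical follow-up times in years and event flags (1 = death, 0 = censored).
times = [1, 2, 2, 3, 4, 5, 6, 7]
events = [1, 1, 0, 1, 0, 1, 0, 0]

train_idx, test_idx = split_indices(10)
curve = kaplan_meier(times, events)
observed_5yr = survival_at(curve, 5)
print(f"observed 5-year survival: {observed_5yr:.2f}")
```

The observed 5-year value from the test set could then be set against the mean of the model's predicted 5-year survival probabilities for the same patients.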
Ensemble Learning for Survival Analysis of Lung Cancer Patients using Random Survival Forest
Category
Student Abstract Submission