Emad Elsebakhi, Ognian Asparouhov and Rashid Al-Ali
Owing to the availability of massive biomedical data on each individual, both healthcare and the life sciences are becoming data-driven. The input attributes are structured and unstructured data that pose many challenges, including sparse binary attributes with imbalanced outcomes, non-unique distributed structure, and high dimensionality, all of which hamper clinical decision-making in practice. In recent decades, considerable effort has gone into overcoming most of these challenges, but significant improvements are still needed, especially once omics and phenotype data are integrated for future personalized medicine. These challenges motivate us to use state-of-the-art big data analytics and large-scale machine learning frameworks to address them and to provide clinical solutions that assist physicians at the bedside, delivering high-quality care while reducing its cost.
This research proposes a new recursive-screening, incremental-ranking machine learning paradigm that empowers the desired classifiers, especially on imbalanced training data, creates suitable data-driven clusters without prior information, and subsequently reduces the dimensionality of large biomedical data sets. The new framework combines many binary attributes based on two criteria: (i) the minimum power value for each combination and (ii) the classification power of that combination. Next, physicians review these sets of combined attributes to select the rules that make clinical sense, and the result is used to empower the desired healthcare event (a binary or multinomial target) at the bedside. After empowering the target class categories, we select the k most significant risk drivers that have a suitable volume of data and a high correlation with the desired outcome, and then establish the proper segmentation using AND-OR associative relationships. Finally, we use the propensity score to handle the imbalanced data and build breakthrough machine learning/data mining predictive models based on functional networks' maximum-likelihood estimation and a Newton-Raphson iterative matrix computation mechanism, which expedites implementation on high-performance computing platforms such as scalable MapReduce on HDFS, Spark MLlib, and Google Sibyl.
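As an illustration of the Newton-Raphson maximum-likelihood step mentioned above, the following is a minimal sketch only. It substitutes plain logistic regression with one binary attribute for the functional-network model, which this abstract does not specify; the function name `fit_logistic_newton`, the toy data, and the stopping tolerance are all assumptions made for the example.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic_newton(xs, ys, iterations=25, tol=1e-10):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by Newton-Raphson.

    Hypothetical stand-in for the paper's model: ordinary logistic
    regression with an intercept and a single attribute.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        # Gradient of the log-likelihood: X^T (y - p)
        g0 = g1 = 0.0
        # Entries of X^T W X (the negative Hessian), W = diag(p * (1 - p))
        h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            r = y - p
            w = p * (1.0 - p)
            g0 += r
            g1 += r * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Solve the 2x2 system (X^T W X) * step = gradient explicitly.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Toy data (assumed): 10 patients without the attribute (2 events),
# 10 patients with it (6 events).
xs = [0] * 10 + [1] * 10
ys = [1] * 2 + [0] * 8 + [1] * 6 + [0] * 4

b0_hat, b1_hat = fit_logistic_newton(xs, ys)
print(b0_hat, b1_hat)
```

For a single binary attribute the maximum-likelihood estimates have a closed form (the intercept is the log-odds of the event when the attribute is absent, and the slope is the log odds ratio), so the iterative result can be checked against log(0.2/0.8) and log(6). With many attributes the same update becomes a matrix solve per iteration, which is the part a distributed platform would parallelize.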
Comparative studies on both simulated and real-life biomedical databases are carried out to identify specific biomedical and healthcare outcomes, such as asthma, breast cancer, gene mutation selection, and genomic association studies for specific complex diseases. The results show that the proposed incremental learning scheme empowers the new classifier with reliable and stable performance. The new classifier outperforms existing predictive models in both outcome quality and execution time, especially on imbalanced, sparse, high-dimensional big biomedical data. We recommend that future work use real-life integrated clinico-genomic big data with genome-wide association studies for future personalized medicine.