Title
DECISION TREE LEARNING AND REGRESSION MODELS TO PREDICT ENDOCRINE DISRUPTOR CHEMICALS - A BIG DATA ANALYTICS APPROACH WITH HADOOP AND APACHE SPARK
Int J Mach Intell Vol:7 Iss:1 (2016-06-07) : 469-473
Authors
RENJITH PAULOSE, K. JEGATHEESAN, B. GOPAL SAMY
Published on
07 Jun 2016 Pages : 469-473 Article Id : BIA0002886 Views : 1002 Downloads : 631
Open Access | Research Article
Predictive toxicology calls for innovative and flexible approaches to mine and analyse the growing quantity and complexity of its data. In this study, classification- and regression-based machine learning algorithms are used to computationally predict the affinity of chemicals towards endocrine hormone receptors. Existing toxicity datasets are large and difficult to model because they contain many irrelevant descriptors, missing values, noisy data and skewed distributions, which motivates the use of machine learning and big data analytics in toxicity prediction. This paper reports the results of qualitative and quantitative toxicity prediction of endocrine disrupting chemicals. Datasets of Estrogen Receptor (ER) and Androgen Receptor (AR) disrupting chemicals, along with their binding affinity values, were used to build the predictive models. Fragment counts of the dataset chemicals were generated using the Kier Hall Smarts Descriptor, which exploits electro-topological state (e-state) indices. After the fingerprint calculations, the chemical data were loaded into the Hadoop Distributed File System (HDFS) for parallel processing. A decision tree learning classifier was applied using the Apache Spark big data processing framework to qualitatively predict endocrine disruptor and non-disruptor chemicals. On the training datasets, the ER and AR predictive models achieved 89.5% and 90.03% accuracy respectively, whereas on the corresponding test datasets they achieved 81.25% and 73.33% respectively. A linear regression model built using the R statistical software was used to quantitatively predict the log Relative Binding Affinity (logRBA) of chemicals towards the Androgen and Estrogen Receptors.
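As an illustration of the quantitative step, the following minimal sketch fits a one-descriptor ordinary least squares line and uses it to predict logRBA. It is not the authors' R model: it is a pure-Python stand-in, the function names `fit_line` and `predict_log_rba` are hypothetical, and the fragment counts and logRBA values are made up purely so that the example runs.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict_log_rba(slope, intercept, x):
    """Predict logRBA for a single descriptor value x."""
    return slope * x + intercept


# Made-up illustration data (NOT from the paper):
# one hypothetical e-state fragment count per chemical and its logRBA.
fragment_counts = [0, 1, 2, 3, 4]
log_rba = [-2.0, -1.5, -1.0, -0.5, 0.0]

slope, intercept = fit_line(fragment_counts, log_rba)
predicted = predict_log_rba(slope, intercept, 6)
```

A real model of this kind would regress logRBA on many fragment-count descriptors at once (multiple linear regression); the single-descriptor case is shown only because its closed-form solution fits in a few lines.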
This study demonstrates the power of the decision tree learning algorithm for chemical toxicity prediction in a Hadoop parallel computing environment, which can be leveraged to explore more advanced machine learning techniques for achieving higher accuracy in chemical toxicity prediction.
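The core idea of decision tree classification on fragment counts can be sketched with a single-split tree (a "stump"). This is a pure-Python illustration, not the Apache Spark implementation used in the study. It assumes Gini impurity as the split criterion, which the abstract does not specify, and the two fragment-count features and the disruptor (1) / non-disruptor (0) labels are made up for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary
    split of the form row[feature_index] <= threshold, or None if no split
    separates the data into two non-empty groups."""
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, t, score)
    return best


def fit_stump(rows, labels):
    """Fit a depth-1 tree: (feature_index, threshold, left_label, right_label)."""
    split = best_split(rows, labels)
    if split is None:
        raise ValueError("no valid split found")
    j, t, _ = split
    left = [y for r, y in zip(rows, labels) if r[j] <= t]
    right = [y for r, y in zip(rows, labels) if r[j] > t]
    # Each leaf predicts the majority class of the training rows it holds.
    return (j, t, Counter(left).most_common(1)[0][0],
            Counter(right).most_common(1)[0][0])


def predict(stump, row):
    j, t, left_label, right_label = stump
    return left_label if row[j] <= t else right_label


def accuracy(stump, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict(stump, r) == y for r, y in zip(rows, labels))
    return hits / len(rows)


# Made-up illustration data (NOT from the paper): counts of two
# hypothetical fragments per chemical; 1 = disruptor, 0 = non-disruptor.
rows = [[0, 2], [1, 0], [2, 1], [3, 0], [0, 1], [1, 2], [2, 2]]
labels = [0, 0, 1, 1, 0, 0, 1]

stump = fit_stump(rows, labels)
train_accuracy = accuracy(stump, rows, labels)
```

A full decision tree applies the same split search recursively to each side until a stopping rule is met, and the accuracy function corresponds to the training and test accuracies reported above when evaluated on the respective datasets.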