
    A comparison of machine learning algorithms for chemical toxicity classification using a simulated multi-scale data model

    Background: Bioactivity profiling using high-throughput in vitro assays can reduce the cost and time required for toxicological screening of environmental chemicals and can also reduce the need for animal testing. Several public efforts are aimed at discovering patterns or classifiers in high-dimensional bioactivity space that predict tissue, organ or whole animal toxicological endpoints. Supervised machine learning is a powerful approach to discover combinatorial relationships in complex in vitro/in vivo datasets. We present a novel model to simulate complex chemical-toxicology data sets and use this model to evaluate the relative performance of different machine learning (ML) methods.
    Results: The classification performance of Artificial Neural Networks (ANN), K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), Naïve Bayes (NB), Recursive Partitioning and Regression Trees (RPART), and Support Vector Machines (SVM) in the presence and absence of filter-based feature selection was analyzed using K-way cross-validation testing and independent validation on simulated in vitro assay data sets with varying levels of model complexity, number of irrelevant features and measurement noise. While the prediction accuracy of all ML methods decreased as non-causal (irrelevant) features were added, some ML methods performed better than others. In the limit of using a large number of features, ANN and SVM were always in the top performing set of methods while RPART and KNN (k = 5) were always in the poorest performing set. The addition of measurement noise and irrelevant features decreased the classification accuracy of all ML methods, with LDA suffering the greatest performance degradation. LDA performance is especially sensitive to the use of feature selection. Filter-based feature selection generally improved performance, most strikingly for LDA.
    Conclusion: We have developed a novel simulation model to evaluate machine learning methods for the analysis of data sets in which in vitro bioassay data is being used to predict in vivo chemical toxicology. From our analysis, we can recommend that several ML methods, most notably SVM and ANN, are good candidates for use in real world applications in this area.
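
    The comparison described above can be sketched with standard scikit-learn components; the snippet below is a minimal illustration, not the paper's actual simulation model: the synthetic data, feature counts, noise level and classifier settings are placeholder assumptions, and a plain decision tree stands in for RPART.

```python
# Hypothetical sketch: compare several classifiers with and without a
# filter-based feature-selection step, on simulated data with many
# irrelevant (non-causal) features. Data and settings are placeholders.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Simulated "bioassay" matrix: 20 informative features plus 180 irrelevant ones.
X, y = make_classification(n_samples=500, n_features=200, n_informative=20,
                           n_redundant=0, flip_y=0.05, random_state=0)

models = {
    "SVM": SVC(kernel="rbf", C=1.0),
    "ANN": MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000, random_state=0),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "LDA": LinearDiscriminantAnalysis(),
    "NB": GaussianNB(),
    "Tree (RPART-like)": DecisionTreeClassifier(random_state=0),
}

for name, clf in models.items():
    plain = make_pipeline(StandardScaler(), clf)
    filtered = make_pipeline(StandardScaler(),
                             SelectKBest(f_classif, k=20), clf)  # filter-based selection
    acc_plain = cross_val_score(plain, X, y, cv=5).mean()
    acc_filtered = cross_val_score(filtered, X, y, cv=5).mean()
    print(f"{name:18s}  no selection: {acc_plain:.3f}   with filter: {acc_filtered:.3f}")
```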

    KnowTox: pipeline and case study for confident prediction of potential toxic effects of compounds in early phases of development

    Risk assessment of newly synthesised chemicals is a prerequisite for regulatory approval. In this context, in silico methods have great potential to reduce time, cost, and ultimately animal testing, as they make use of the ever-growing amount of available toxicity data. Here, KnowTox is presented, a novel pipeline that combines three different in silico toxicology approaches to allow for confident prediction of potentially toxic effects of query compounds, i.e. machine learning models for 88 endpoints, alerts for 919 toxic substructures, and computational support for read-across. It is mainly based on the ToxCast dataset, which after preprocessing yields a sparse matrix of 7912 compounds tested against 985 endpoints. When applying machine learning models, applicability and reliability of predictions for new chemicals are of utmost importance. Therefore, first, the conformal prediction technique was deployed, comprising an additional calibration step and by definition creating internally valid predictors at a given significance level. Second, to further improve validity and information efficiency, two adaptations are suggested, exemplified at the androgen receptor antagonism endpoint. An absolute increase in validity of 23% on the in-house dataset of 534 compounds could be achieved by introducing KNNRegressor normalisation. This increase in validity comes at the cost of efficiency, which could again be improved by 20% for the initial ToxCast model by balancing the dataset during model training. Finally, the value of the developed pipeline for risk assessment is discussed using two in-house triazole molecules. Compared to a single toxicity prediction method, complementing the outputs of different approaches can have a higher impact on guiding toxicity testing and on de-selecting most likely harmful development-candidate compounds early in the development process.
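
    As a rough illustration of the conformal prediction step (a calibration set and class-conditional p-values), the sketch below applies a generic inductive Mondrian conformal predictor to synthetic data; the underlying model, the ToxCast preprocessing and the KNNRegressor normalisation described in the paper are not reproduced here.

```python
# Illustrative sketch of inductive (Mondrian) conformal prediction for a
# binary toxicity endpoint; data, model and significance level are placeholders.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Split training data into a proper training set and a calibration set.
X_proper, X_calib, y_proper, y_calib = train_test_split(X_train, y_train,
                                                        test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_proper, y_proper)

# Nonconformity score: 1 - predicted probability of the considered class.
calib_prob = model.predict_proba(X_calib)
test_prob = model.predict_proba(X_test)

significance = 0.2  # i.e. an 80% confidence level
for i, probs in enumerate(test_prob[:5]):
    prediction_set = []
    for label in (0, 1):
        # Mondrian (class-conditional) calibration: compare against calibration
        # compounds of the same class only.
        cal_scores = 1.0 - calib_prob[y_calib == label, label]
        score = 1.0 - probs[label]
        p_value = (np.sum(cal_scores >= score) + 1) / (len(cal_scores) + 1)
        if p_value > significance:
            prediction_set.append(label)
    print(f"compound {i}: prediction set at {1 - significance:.0%} confidence -> {prediction_set}")
```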

    The Temporal and Frequent Pattern Mining Analysis and Machine Learning Forecasting on Mobile Sourced Urban Air Pollutants

    Ground-level ozone and atmospheric fine particles (PM2.5) have been recognized as critical air pollutants that are important contributors to the toxicity of anthropogenic air pollution in urban areas. To limit the adverse impacts of ground-level ozone and PM2.5 on public health and ecosystems, it is necessary to identify a practical and effective way to predict upcoming pollution concentration levels accurately. To meet this need, various studies have been conducted to forecast ground-level ozone and PM2.5, mainly using time-series and neural network analysis. Machine learning has also been adopted for analysis and forecasting in existing research, but it is associated with several limitations that are not easily overcome. (1) The majority of existing forecasting models depend heavily on time-series inputs without considering the influencing factors of the air pollutants; while a relatively accurate prediction may be provided, the influencing factors of the air pollution level arising from real-world complexity are neglected. (2) Existing forecasting models focus mainly on short-term estimation, and some of them need to use previous predictions as part of the input, which increases system complexity and decreases computational efficiency and accuracy. (3) Accurate annual hourly air pollution level forecasting is seldom achieved. The objective of this research is to propose a systematic methodology to forecast long-term hourly future air pollution concentration levels from historical data while considering the concentration influencing factors. To achieve this goal, a series of methodologies to analyze historical air pollution concentrations using temporal characteristics and frequent pattern data mining algorithms is introduced. The association rules between air pollution concentration levels and the influencing factors are revealed. A systematic air pollution level forecasting approach based on supervised machine learning algorithms, able to predict annual hourly values, is proposed and evaluated. To quantify and validate the results, a case study was conducted in the Houston region with the collection and analysis of ten years of historical environmental, meteorological, and transportation-related data. From the results of this research, (1) the complex correlations between the influencing factors and air pollution concentration levels are quantified and presented; (2) the association rules between the dependent and independent parameters are calculated; (3) a supervised machine learning algorithm pool is created and evaluated; and (4) an accurate long-term hourly air pollution level machine learning forecasting procedure is proposed. The methodology of this research improves on existing models in computational complexity while maintaining high accuracy, and it can be readily applied to similar regions for forecasting various types of air pollution concentration levels.
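
    A compressed sketch of the two stages described above (association rule mining over discretised influencing factors, followed by supervised forecasting of hourly concentrations) is given below; the synthetic hourly data, discretisation thresholds and the use of the mlxtend library are illustrative assumptions rather than the study's actual pipeline.

```python
# Rough two-stage sketch on synthetic hourly data:
# (1) mine association rules between discretised influencing factors and ozone
# levels, and (2) train a supervised model to forecast hourly concentrations.
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hours = pd.date_range("2015-01-01", periods=8760, freq="h")
df = pd.DataFrame({
    "temperature": 20 + 10 * np.sin(np.arange(8760) * 2 * np.pi / 24) + rng.normal(0, 2, 8760),
    "wind_speed": rng.gamma(2.0, 2.0, 8760),
    "traffic_volume": rng.poisson(300, 8760),
}, index=hours)
# Synthetic ozone level loosely driven by the influencing factors above.
df["ozone"] = (0.8 * df["temperature"] - 1.5 * df["wind_speed"]
               + 0.02 * df["traffic_volume"] + rng.normal(0, 3, 8760))

# Stage 1: discretise into "high"/"low" levels and mine frequent patterns / rules.
levels = pd.DataFrame({col + "_high": df[col] > df[col].median() for col in df.columns})
itemsets = apriori(levels, min_support=0.2, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]].head())

# Stage 2: supervised forecasting of hourly ozone from the influencing factors.
features = df[["temperature", "wind_speed", "traffic_volume"]].assign(
    hour=df.index.hour, month=df.index.month)
X_train, X_test, y_train, y_test = train_test_split(features, df["ozone"],
                                                    test_size=0.25, shuffle=False)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
print("hold-out MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```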

    Review of QSAR Models and Software Tools for Predicting Developmental and Reproductive Toxicity

    This report provides a state-of-the-art review of available computational models for developmental and reproductive toxicity, including Quantitative Structure-Activity Relationships (QSARs) and related estimation methods such as decision tree approaches and expert systems. At present, there are relatively few models for developmental and reproductive toxicity endpoints, and those available have limited applicability domains. This situation is partly due to the biological complexity of the endpoint, which covers many incompletely understood mechanisms of action, and partly due to the paucity and heterogeneity of high quality data suitable for model development. In contrast, there is an extensive and growing range of software and literature models for predicting endocrine-related activities, in particular models for oestrogen and androgen activity. There is a considerable need to further develop and characterise in silico models for developmental and reproductive toxicity, and to explore their applicability in a regulatory setting.
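
    For readers unfamiliar with the model class being reviewed, the following is a minimal, purely illustrative QSAR-style sketch: a handful of RDKit descriptors feeding a shallow decision tree. The SMILES strings and activity labels are made-up placeholders, and the descriptor set is an arbitrary choice, not a recommendation from the report.

```python
# Toy QSAR-style workflow: RDKit descriptors -> decision tree classifier.
# Molecules, labels and descriptors are hypothetical placeholders.
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.tree import DecisionTreeClassifier, export_text

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O",
          "CCN(CC)CC", "ClC(Cl)(Cl)Cl", "O=C(N)c1ccccc1"]
labels = [0, 1, 0, 0, 1, 1]  # hypothetical active/inactive flags

def descriptors(smi):
    mol = Chem.MolFromSmiles(smi)
    return [Descriptors.MolWt(mol), Descriptors.MolLogP(mol),
            Descriptors.TPSA(mol), Descriptors.NumHDonors(mol)]

X = [descriptors(s) for s in smiles]
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, labels)
print(export_text(tree, feature_names=["MolWt", "LogP", "TPSA", "HDonors"]))
```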

    Machine Learning Approach to Simulate Soil CO2 Fluxes under Cropping Systems

    With the growing number of datasets describing greenhouse gas (GHG) emissions, there is an opportunity to develop novel predictive models that require neither the expense nor the time needed for direct field measurements. This study evaluates the potential for machine learning (ML) approaches to predict soil GHG emissions without the biogeochemical expertise required by many current models for simulating soil GHGs. Ample data from field measurements are now publicly available to test new modeling approaches. The objective of this paper was to develop and evaluate ML models using field data (soil temperature, soil moisture, soil classification, crop type, fertilization type, and air temperature) available in the Greenhouse gas Reduction through Agricultural Carbon Enhancement network (GRACEnet) database to simulate soil CO2 fluxes under different fertilization methods. Four machine learning algorithms were used to develop the models: K nearest neighbor regression (KNN), support vector regression (SVR), random forest (RF) regression, and gradient boosted (GB) regression. The GB regression model outperformed all the other models on the training dataset with R2 = 0.88, MAE = 2177.89 g C ha−1 day−1, and RMSE = 4405.43 g C ha−1 day−1. However, the RF and GB regression models both performed best on the unseen test dataset with R2 = 0.82. Machine learning tools were useful for developing predictors based on soil classification, soil temperature, and air temperature when a large database such as GRACEnet is available, even though these variables were not highly predictive in correlation analysis. This study demonstrates the suitability of tree-based ML algorithms for predictive modeling of CO2 fluxes, although such models cannot describe the underlying biogeochemical processes.
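
    The model comparison described above can be approximated with scikit-learn regressors; the sketch below uses synthetic stand-ins for the GRACEnet fields and arbitrary hyperparameters, so the reported metrics will not match the study's values.

```python
# Hedged sketch of a KNN/SVR/RF/GB comparison for soil CO2 flux regression,
# using synthetic stand-ins for soil temperature, moisture, air temperature,
# and one-hot-encoded crop and fertilization type.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(1)
n = 2000
df = pd.DataFrame({
    "soil_temp": rng.normal(18, 6, n),
    "soil_moisture": rng.uniform(0.1, 0.45, n),
    "air_temp": rng.normal(20, 7, n),
    "crop": rng.choice(["corn", "soybean", "wheat"], n),
    "fertilization": rng.choice(["none", "mineral", "manure"], n),
})
# Synthetic CO2 flux (g C ha-1 day-1) loosely driven by temperature and moisture.
df["co2_flux"] = 800 * np.exp(0.06 * df["soil_temp"]) * df["soil_moisture"] + rng.normal(0, 500, n)

X = pd.get_dummies(df.drop(columns="co2_flux"))
X_train, X_test, y_train, y_test = train_test_split(X, df["co2_flux"], random_state=0)

models = {
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "SVR": SVR(C=10.0),
    "RF": RandomForestRegressor(n_estimators=300, random_state=0),
    "GB": GradientBoostingRegressor(random_state=0),
}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    print(f"{name}: R2={r2_score(y_test, pred):.2f}  "
          f"MAE={mean_absolute_error(y_test, pred):.0f}  RMSE={rmse:.0f}")
```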

    Integrative Chemical–Biological Read-Across Approach for Chemical Hazard Classification

    Traditional read-across approaches typically rely on the chemical similarity principle to predict chemical toxicity; however, the accuracy of such predictions is often inadequate due to the underlying complex mechanisms of toxicity. Here we report on the development of a hazard classification and visualization method that draws upon both chemical structural similarity and comparisons of biological responses to chemicals measured in multiple short-term assays ("biological" similarity). The Chemical-Biological Read-Across (CBRA) approach infers each compound's toxicity from those of both chemical and biological analogs whose similarities are determined by the Tanimoto coefficient. Classification accuracy of CBRA was compared to that of classical RA and other methods using chemical descriptors alone, or in combination with biological data. Different types of adverse effects (hepatotoxicity, hepatocarcinogenicity, mutagenicity, and acute lethality) were classified using several biological data types (gene expression profiling and cytotoxicity screening). CBRA-based hazard classification exhibited consistently high external classification accuracy and applicability to diverse chemicals. Transparency of the CBRA approach is aided by the use of radial plots that show the relative contribution of analogous chemical and biological neighbors. Identification of both chemical and biological features that give rise to the high accuracy of CBRA-based toxicity prediction facilitates mechanistic interpretation of the models.
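
    In the same spirit, a generic Tanimoto-based read-across can be sketched as a similarity-weighted nearest-neighbour vote over chemical fingerprints and binarised assay responses; the fingerprints, assay calls and equal chemical/biological weighting below are placeholder assumptions and do not reproduce the actual CBRA scheme.

```python
# Generic chemical + biological read-across sketch with Tanimoto similarity.
# All profiles and labels are random placeholders for illustration only.
import numpy as np

def tanimoto(a, b):
    """Tanimoto coefficient between two binary vectors."""
    both = np.sum(a & b)
    either = np.sum(a | b)
    return both / either if either else 0.0

rng = np.random.default_rng(0)
n_train = 50
chem_fp = rng.integers(0, 2, size=(n_train, 128))   # chemical fingerprints
bio_fp = rng.integers(0, 2, size=(n_train, 24))     # binarised assay responses
labels = rng.integers(0, 2, size=n_train)           # hazard class (0/1)

query_chem = rng.integers(0, 2, size=128)
query_bio = rng.integers(0, 2, size=24)

def knn_vote(query, profiles, labels, k=5):
    # Similarity-weighted vote over the k most similar training compounds.
    sims = np.array([tanimoto(query, p) for p in profiles])
    top = np.argsort(sims)[-k:]
    return np.average(labels[top], weights=sims[top] + 1e-9)

chem_score = knn_vote(query_chem, chem_fp, labels)
bio_score = knn_vote(query_bio, bio_fp, labels)
combined = 0.5 * chem_score + 0.5 * bio_score  # equal weighting, an arbitrary choice
print(f"chemical neighbours: {chem_score:.2f}, biological neighbours: {bio_score:.2f}, "
      f"combined hazard score: {combined:.2f}")
```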

    Towards the development of a reliable reconfigurable real-time operating system on FPGAs

    In the last two decades, Field Programmable Gate Arrays (FPGAs) have rapidly developed from simple “glue logic” into a powerful platform capable of implementing a System on Chip (SoC). Modern FPGAs achieve not only high performance compared with General Purpose Processors (GPPs), thanks to hardware parallelism and dedicated logic, but also better programming flexibility than Application Specific Integrated Circuits (ASICs). Moreover, the hardware programming flexibility of FPGAs is further harnessed for both performance and manipulability, which makes Dynamic Partial Reconfiguration (DPR) possible. DPR allows a part or parts of a circuit to be reconfigured at run-time without interrupting the rest of the chip’s operation. As a result, hardware resources can be exploited more efficiently, since chip resources can be reused by swapping hardware tasks in and out of the chip in a time-multiplexed fashion. In addition, DPR improves fault tolerance against transient errors and permanent damage; for example, Single Event Upsets (SEUs) can be mitigated by reconfiguring the FPGA to avoid error accumulation. Furthermore, power and heat can be reduced by removing finished or idle tasks from the chip. For all these reasons, DPR has significantly promoted Reconfigurable Computing (RC) and has become a very active research topic. However, since hardware integration is increasing at an exponential rate and applications are becoming more complex with growing user demands, high-level application design and low-level hardware implementation are increasingly separated and layered. As a consequence, users gain little advantage from DPR without the support of system-level middleware. To bridge the gap between high-level applications and low-level hardware implementation, this thesis presents contributions towards a Reliable, Reconfigurable and Real-Time Operating System (R3TOS), which facilitates user exploitation of DPR from the application level by managing the complex hardware in the background. In R3TOS, hardware tasks behave just like software tasks: they can be created, scheduled, and mapped to different computing resources on the fly. The novel contributions of this work are: 1) a novel implementation of an efficient task scheduler and allocator; 2) implementation of a novel real-time scheduling algorithm (FAEDF) and two efficacious allocation algorithms (EAC and EVC), which schedule tasks in real-time and circumvent emerging faults while maintaining more compact empty areas; 3) design and implementation of a fault-tolerant microprocessor by harnessing existing FPGA resources, such as Error Correction Code (ECC) and configuration primitives; 4) a novel symmetric multiprocessing (SMP)-based architecture that supports a shared-memory programming interface; and 5) two demonstrations of the integrated system, including a) the K-Nearest Neighbour classifier, a non-parametric classification algorithm widely used in various fields of data mining, and b) pairwise sequence alignment, namely the Smith-Waterman algorithm, used for identifying similarities between two biological sequences. R3TOS gives considerably higher flexibility to support scalable multi-user, multitasking applications, whereby resources can be dynamically managed with respect to user requirements and hardware availability. Benefiting from this, not only can hardware resources be used more efficiently, but system performance can also be significantly increased. Results show that scheduling and allocation efficiencies have been improved by up to 2x, and overall system performance is further improved by ~2.5x. Future work includes the development of a Network on Chip (NoC), which is expected to further increase communication throughput, as well as the standardization and automation of our system design, which will be carried out in line with the enablement of other high-level synthesis tools, to allow application developers to benefit from the system more efficiently.
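
    One of the demonstration applications named above, Smith-Waterman pairwise alignment, is shown below as a plain-software reference of the recurrence only; the scoring parameters are arbitrary and the FPGA implementation developed in the thesis is structured very differently.

```python
# Plain-software reference of the Smith-Waterman local-alignment recurrence
# mentioned as a demonstration application; scoring parameters are arbitrary.
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if seq_a[i - 1] == seq_b[j - 1] else mismatch)
            up = H[i - 1][j] + gap
            left = H[i][j - 1] + gap
            H[i][j] = max(0, diag, up, left)   # local alignment: scores never go negative
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    return best, best_pos

score, end = smith_waterman("ACACACTA", "AGCACACA")
print(f"best local alignment score: {score}, ending at cell {end}")
```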

    Classification models for 2,4-D formulations in damaged Enlist crops through the application of FTIR spectroscopy and machine learning algorithms

    With new 2,4-Dichlorophenoxyacetic acid (2,4-D) tolerant crops, increases in off-target movement events are expected. New formulations may mitigate these events, but standard lab techniques are ineffective at identifying these 2,4-D formulations. Using Fourier-transform infrared (FTIR) spectroscopy and machine learning algorithms, research was conducted to classify 2,4-D formulations in treated herbicide-tolerant soybean and cotton and to observe the influence of leaf treatment status and collection timing on classification accuracy. Pooled classification models using k-nearest neighbor classified 2,4-D formulations with over 65% accuracy in cotton and soybean. Tissue collected 14 days after treatment (DAT) for cotton and 21 DAT for soybean produced higher accuracies than the pooled model. Tissue directly treated with 2,4-D also performed better than the pooled model. Lastly, models using timing and treatment status as factors resulted in higher accuracies, with the cotton 14 DAT New Growth and Treated models and the 28 DAT and 21 DAT Treated soybean models achieving the best accuracies.
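
    A simplified version of this classification task can be sketched with a k-nearest-neighbour pipeline over spectra treated as feature vectors; the spectra below are synthetic curves on an arbitrary wavenumber grid and the formulation names are placeholders, not the study's leaf-tissue FTIR data.

```python
# Simplified sketch: k-nearest-neighbour classification of formulation type
# from FTIR-like spectra. Spectra and class names are synthetic placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
wavenumbers = np.linspace(4000, 650, 600)          # placeholder mid-IR range, cm^-1
formulations = ["choline salt", "dimethylamine salt", "ester"]

def fake_spectrum(formulation_idx):
    # Each formulation gets a slightly shifted absorption band plus noise.
    centre = 1700 - 30 * formulation_idx
    return np.exp(-((wavenumbers - centre) / 60.0) ** 2) + rng.normal(0, 0.05, wavenumbers.size)

X = np.array([fake_spectrum(i % 3) for i in range(300)])
y = np.array([formulations[i % 3] for i in range(300)])

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)).fit(X_train, y_train)
print(f"hold-out accuracy: {knn.score(X_test, y_test):.2f}")
```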
