    Fermi's Sibyl: Mining the gamma-ray sky for dark matter subhaloes

    Dark matter annihilation signals coming from Galactic subhaloes may account for a small fraction of unassociated point sources detected in the Second Fermi-LAT catalogue (2FGL). To investigate this possibility, we present Sibyl, a Random Forest classifier that offers predictions on class memberships for unassociated Fermi-LAT sources at high Galactic latitudes using gamma-ray features extracted from the 2FGL. Sibyl generates a large ensemble of classification trees that are trained to vote on whether a particular object is an active galactic nucleus (AGN) or a pulsar. After training on a list of 908 identified/associated 2FGL sources, Sibyl reaches individual accuracy rates of up to 97.7% for AGNs and 96.5% for pulsars. Predictions for the 269 unassociated 2FGL sources at |b| > 10 degrees suggest that 216 are potential AGNs and 16 are potential pulsars (with majority votes greater than 70%). The remaining 37 objects are inconclusive, but none is an extreme outlier. These results could guide future quests for dark matter Galactic subhaloes. Comment: 5 pages, 3 figures, 2 tables, accepted for publication in MNRAS. Complete tables can be retrieved at http://www.gae.ucm.es/~mirabal/sibyl.htm
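
The voting rule described in this abstract (a class is assigned only when its share of tree votes exceeds 70%, otherwise the source is inconclusive) can be sketched as follows. This is a minimal illustration and not Sibyl's actual code: the function name `classify_by_vote`, the label strings, and the example vote lists are all hypothetical.

```python
from collections import Counter

def classify_by_vote(tree_votes, threshold=0.70):
    """Return (label, vote_fraction) for one source.

    tree_votes is a list with one predicted label per tree.
    The label is 'inconclusive' unless the leading class holds
    strictly more than `threshold` of the votes (assumed rule).
    """
    counts = Counter(tree_votes)
    label, n_votes = counts.most_common(1)[0]
    fraction = n_votes / len(tree_votes)
    if fraction > threshold:
        return label, fraction
    return "inconclusive", fraction

# Hypothetical vote lists for two sources (not data from the paper).
clear_source = ["AGN"] * 80 + ["pulsar"] * 20
split_source = ["AGN"] * 60 + ["pulsar"] * 40

print(classify_by_vote(clear_source))
print(classify_by_vote(split_source))
```

Under this rule the three groups reported in the abstract partition the sample: 216 + 16 + 37 = 269 sources.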

    A Random Forest model for predicting allosteric and functional sites on proteins

    We thank the Scottish Universities Life Sciences Alliance (SULSA) for funding to JBOM and for PB’s PhD studentship under NJW’s supervision. We created a computational method to identify allosteric sites using a machine learning method trained and tested on protein structures containing bound ligand molecules. The Random Forest machine learning approach was adopted to build our three-way predictive model. Based on descriptors collated for each ligand and binding site, the classification model allows us to assign protein cavities as allosteric, regular or orthosteric, and hence to identify allosteric sites. We derived 43 structural descriptors per complex and used them to characterize individual protein-ligand binding sites belonging to the three classes: allosteric, regular and orthosteric. We carried out a separate validation on a further unseen set of protein structures containing the ligand 2-(N-cyclohexylamino) ethane sulfonic acid (CHES).

    Classification tools for carotenoid content estimation in Manihot esculenta via metabolomics and machine learning

    Cassava (Manihot esculenta Crantz) genotypes with high pro-vitamin A activity have been identified as a means of reducing the prevalence of vitamin A deficiency. The color variability of cassava roots, which can vary from white to red, is related to the presence of several carotenoid pigments. The present study shows how CIELAB color measurements on cassava root tissue can be used as a fast, non-destructive technique to quantify carotenoid levels in cassava root samples, avoiding more expensive analytical techniques for compound quantification, such as UV-visible spectrophotometry and HPLC. For this, we used machine learning techniques, associating the colorimetric data (CIELAB) with the data obtained by UV-vis and HPLC, to obtain predictive models of carotenoid content for this type of biomass. The best R2 values (above 90%) were observed for the predictive variable TCC determined by UV-vis spectrophotometry. When the machine learning models were tested with CIELAB values as inputs and the total carotenoid content quantified by HPLC as the target, the Partial Least Squares (PLS), Support Vector Machine, and Elastic Net models gave the best R2 (above 40%) and root-mean-square error (RMSE) values. For carotenoid quantification by UV-vis spectrophotometry, the R2 (around 60%) and RMSE (around 6.5) values were more satisfactory. Ridge regression and Elastic Net showed the best results. It can be concluded that the colorimetric technique (CIELAB), combined with UV-vis/HPLC and machine learning-based predictive statistical techniques, can predict the total carotenoid content of these samples with good precision and accuracy. CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (407323/2013-9)
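
For readers comparing the R2 and RMSE figures quoted in this abstract, the two metrics can be computed as in the sketch below. It uses the standard definitions (R2 as one minus the ratio of residual to total sum of squares, RMSE as the square root of the mean squared residual). The function names and the sample values are hypothetical and are not taken from the study. An R2 reported as "around 60%" corresponds to a value of about 0.6 on this scale.

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical measured vs. predicted carotenoid contents (illustrative only).
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

print(f"RMSE = {rmse(measured, predicted):.3f}")
print(f"R2   = {r_squared(measured, predicted):.3f}")
```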

    Random forest for gene selection and microarray data classification

    A random forest method has been selected to perform both gene selection and classification of the microarray data. In this embedded method, the selection of the smallest possible sets of genes with the lowest error rates is the key factor in achieving the highest classification accuracy. Hence, an improved gene selection method using random forest has been proposed to obtain the smallest subset of genes as well as the biggest subset of genes prior to classification. The option of selecting the biggest subset is provided to assist researchers who intend to use the informative genes for further research. The enhanced random forest gene selection performed better at selecting the smallest subset as well as the biggest subset of informative genes with the lowest out-of-bag error rates. Furthermore, classification performed on the selected subset of genes using random forest has led to lower prediction error rates compared to the existing method and other similar available methods.

    Predicting postoperative complications for gastric cancer patients using data mining

    Gastric cancer refers to the development of malignant cells that can grow in any part of the stomach. With the vast amount of data being collected daily in healthcare environments, it is possible to develop new algorithms which can support decision-making in the treatment of gastric cancer patients. This paper aims to predict, using the CRISP-DM methodology, the outcome from the hospitalization of gastric cancer patients who have undergone surgery, as well as the occurrence of postoperative complications during surgery. The study showed that, on the one hand, the RF and NB algorithms are the best at detecting the outcome of hospitalization, taking into account patients’ clinical data. On the other hand, the algorithms J48, RF, and NB offer better results in predicting postoperative complications. FCT - Fundação para a Ciência e a Tecnologia (UID/CEC/00319/2013)

    A comprehensive transcript index of the human genome generated using microarrays and computational approaches

    BACKGROUND: Computational and microarray-based experimental approaches were used to generate a comprehensive transcript index for the human genome. Oligonucleotide probes designed from approximately 50,000 known and predicted transcript sequences from the human genome were used to survey transcription from a diverse set of 60 tissues and cell lines using ink-jet microarrays. Further, expression activity over at least six conditions was more generally assessed using genomic tiling arrays consisting of probes tiled through a repeat-masked version of the genomic sequence making up chromosomes 20 and 22. RESULTS: The combination of microarray data with extensive genome annotations resulted in a set of 28,456 experimentally supported transcripts. This set of high-confidence transcripts represents the first experimentally driven annotation of the human genome. In addition, the results from genomic tiling suggest that a large amount of transcription exists outside of annotated regions of the genome and serves as an example of how this activity could be measured on a genome-wide scale. CONCLUSIONS: These data represent one of the most comprehensive assessments of transcriptional activity in the human genome and provide an atlas of human gene expression over a unique set of gene predictions. Before the annotation of the human genome is considered complete, however, the previously unannotated transcriptional activity throughout the genome must be fully characterized.

    Machine vision-assisted analysis of structure-localization relationships in a combinatorial library of prospective bioimaging probes

    With a combinatorial library of bioimaging probes, it is now possible to use machine vision to analyze the contribution of different building blocks of the molecules to their cell-associated visual signals. For this purpose, cell-permeant, fluorescent styryl molecules were synthesized by condensation of 168 aldehydes with 8 pyridinium/quinolinium building blocks. Images of cells incubated with fluorescent molecules were acquired with a high content screening instrument. Chemical and image feature analysis revealed how variation in one or the other building block of the styryl molecules led to variations in the molecules' visual signals. Across each pair of probes in the library, chemical similarity was significantly associated with spectral and total signal intensity similarity. However, chemical similarity was much less associated with similarity in subcellular probe fluorescence patterns. Quantitative analysis and visual inspection of pairs of images acquired from pairs of styryl isomers confirm that many closely-related probes exhibit different subcellular localization patterns. Therefore, idiosyncratic interactions between styryl molecules and specific cellular components greatly contribute to the subcellular distribution of the styryl probes' fluorescence signal. These results demonstrate how machine vision and cheminformatics can be combined to analyze the targeting properties of bioimaging probes, using large image data sets acquired with automated screening systems. © 2009 International Society for Advancement of Cytometry. Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/63004/1/20713_ftp.pd

    Greedy and linear ensembles of machine learning methods outperform single approaches for QSPR regression problems

    The application of Machine Learning to cheminformatics is a large and active field of research, but there exist few papers which discuss whether ensembles of different Machine Learning methods can improve upon the performance of their component methodologies. Here we investigated a variety of methods, including kernel-based, tree, linear, and neural network methods, as well as both greedy and linear ensemble methods. These were all tested against a standardised methodology for regression with data relevant to the pharmaceutical development process. The investigation focused on QSPR problems within drug-like chemical space. We aimed to investigate which methods perform best, and how the ‘wisdom of crowds’ principle can be applied to ensemble predictors. It was found that no single method performs best for all problems, but that a dynamic, well-structured ensemble predictor would perform very well across the board, usually providing an improvement in performance over the best single method. Its use of weighting factors allows the greedy ensemble to acquire a bigger contribution from the better performing models, and this helps the greedy ensemble generally to outperform the simpler linear ensemble. The choice of data pre-processing methodology was also found to be crucial to the performance of each method.
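
The abstract does not say how the greedy ensemble derives its weighting factors. One common way to obtain such weights is greedy forward selection with replacement on held-out predictions, sketched below purely as an assumption. The unweighted mean in `average` is likewise only an assumed stand-in for a simple baseline ensemble. All function names, model names, and numbers are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def average(prediction_lists):
    """Element-wise unweighted mean of several prediction lists."""
    n = len(prediction_lists)
    return [sum(column) / n for column in zip(*prediction_lists)]

def greedy_ensemble_weights(model_preds, y_true, rounds=10):
    """Greedy forward selection with replacement (assumed scheme).

    In each round, add the model whose inclusion gives the lowest
    validation MSE for the averaged ensemble. A model's weight is
    the fraction of rounds in which it was picked.
    """
    chosen = []
    for _ in range(rounds):
        best = min(
            model_preds,
            key=lambda name: mse(
                y_true,
                average([model_preds[m] for m in chosen] + [model_preds[name]]),
            ),
        )
        chosen.append(best)
    return {name: chosen.count(name) / rounds for name in model_preds}

def weighted_prediction(model_preds, weights):
    """Combine model predictions using the given weights."""
    names = list(model_preds)
    n_points = len(model_preds[names[0]])
    return [
        sum(weights[m] * model_preds[m][i] for m in names)
        for i in range(n_points)
    ]

# Hypothetical validation targets and per-model predictions.
y_val = [1.0, 2.0, 3.0, 4.0]
preds = {
    "good": [1.1, 1.9, 3.1, 3.9],
    "ok": [0.5, 2.5, 2.5, 4.5],
    "bad": [3.0, 3.0, 3.0, 3.0],
}

weights = greedy_ensemble_weights(preds, y_val)
print(weights)
print(mse(y_val, weighted_prediction(preds, weights)))
```

In this toy setup the better models are picked more often and so receive larger weights, which mirrors the behaviour the abstract attributes to the greedy ensemble.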

    A Classification Study of Respiratory Syncytial Virus (RSV) Inhibitors by Variable Selection with Random Forest

    Experimental pEC50s for 216 selective respiratory syncytial virus (RSV) inhibitors are used to develop classification models as a potential screening tool for a large library of target compounds. A variable selection algorithm coupled with random forests (VS-RF) is used to extract the physicochemical features most relevant to RSV inhibition. Based on the selected small set of descriptors, four other widely used approaches, i.e., support vector machine (SVM), Gaussian process (GP), linear discriminant analysis (LDA) and k nearest neighbors (kNN) routines, are also employed and compared with the VS-RF method in terms of several rigorous evaluation criteria. The results indicate that the VS-RF model is a powerful tool for classification of RSV inhibitors, producing the highest overall accuracy of 94.34% for the external prediction set, which significantly outperforms the other four methods, whose average accuracy is 80.66%. The proposed model, with excellent predictive capacity in both internal and external validation, should be important for screening and optimization of potential RSV inhibitors prior to chemical synthesis in drug development.