
    Iterative Random Forests to detect predictive and stable high-order interactions

    Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on Random Forests (RF) and Random Intersection Trees (RITs), and through extensive, biologically inspired simulations, we developed the iterative Random Forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with the same order of computational cost as RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human-derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Moreover, novel third-order interactions, e.g. between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated splicing regulation, and identified novel 5th and 6th order interactions, indicative of multi-valent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens new avenues of inquiry into the molecular mechanisms underlying genome biology.
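The stability criterion quoted in the abstract above (an interaction counts as stable when it is returned in more than half of the bootstrap replicates) can be sketched as a simple frequency filter. This is only an illustration of that one filtering step, not the iRF algorithm itself: the function name `stable_interactions`, the `threshold` parameter, and the feature names `f1` to `f5` are hypothetical, and the replicate lists are toy values rather than results from the paper.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, List


def stable_interactions(
    replicates: List[Iterable[Iterable[str]]], threshold: float = 0.5
) -> Dict[FrozenSet[str], float]:
    """Keep interactions recovered in more than `threshold` of the replicates.

    Each replicate is a collection of interactions; each interaction is a
    collection of feature names. Feature order inside an interaction is ignored.
    """
    n = len(replicates)
    if n == 0:
        return {}
    counts: Counter = Counter()
    for recovered in replicates:
        # A set of frozensets counts each interaction at most once per replicate.
        counts.update({frozenset(interaction) for interaction in recovered})
    return {inter: c / n for inter, c in counts.items() if c / n > threshold}


# Toy input: interactions recovered in four hypothetical bootstrap replicates.
replicates = [
    [("f1", "f2"), ("f1", "f2", "f3")],
    [("f2", "f1"), ("f4", "f5")],
    [("f1", "f2"), ("f1", "f2", "f3")],
    [("f1", "f3", "f2")],
]
stable = stable_interactions(replicates)
for interaction, freq in stable.items():
    print(sorted(interaction), freq)
```

In this toy input the pair and the triple each appear in three of four replicates and pass the filter, while the interaction seen only once is dropped.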

    Microarray Data Mining and Gene Regulatory Network Analysis

    Microarray, a novel molecular biological technology, makes it feasible to simultaneously obtain quantitative measurements of the expression of thousands of genes present in a biological sample. Genome-wide expression data generated with this technology promise to uncover implicit, previously unknown biological knowledge. In this study, several problems in microarray data mining were investigated, including feature (gene) selection, classifier gene identification, generation of a reference genetic interaction network for non-model organisms, and gene regulatory network (GRN) reconstruction from time-series gene expression data. Most existing computational models for inferring gene regulatory networks suffer from either low accuracy or high computational complexity. To overcome these limitations, the following strategies were proposed to integrate bioinformatics data mining techniques with existing GRN inference algorithms, enabling the discovery of novel biological knowledge. An integrated statistical and machine learning (ISML) pipeline was developed for feature selection and classifier gene identification to address the curse of dimensionality and the huge search space. Using the selected classifier genes as seeds, a scale-up technique was applied to search major databases of genetic interaction networks, metabolic pathways, etc. By curating relevant genes and running BLAST searches of genomic sequences from non-model organisms against well-studied genetic model organisms, a reference gene regulatory network for less-studied organisms was built and used both as prior knowledge and for model validation in GRN reconstruction. Networks of gene interactions were inferred using a Dynamic Bayesian Network (DBN) approach and analyzed to elucidate the dynamics caused by perturbations.
    Our proposed pipelines were applied to investigate molecular mechanisms of chemical-induced reversible neurotoxicity.

    Statistical Data Modeling and Machine Learning with Applications

    The modeling and processing of empirical data is one of the main subjects and goals of statistics. Nowadays, with the development of computer science, the extraction of useful and often hidden information and patterns from data sets of different volumes and complex data sets in warehouses has been added to these goals. New and powerful statistical techniques with machine learning (ML) and data mining paradigms have been developed. To one degree or another, all of these techniques and algorithms originate from a rigorous mathematical basis, including probability theory and mathematical statistics, operational research, mathematical analysis, numerical methods, etc. Popular ML methods, such as artificial neural networks (ANN), support vector machines (SVM), decision trees, random forest (RF), among others, have generated models that can be considered as straightforward applications of optimization theory and statistical estimation. The wide arsenal of classical statistical approaches combined with powerful ML techniques allows many challenging and practical problems to be solved. This Special Issue belongs to the section “Mathematics and Computer Science”. Its aim is to establish a brief collection of carefully selected papers presenting new and original methods, data analyses, case studies, comparative studies, and other research on the topic of statistical data modeling and ML as well as their applications. Particular attention is given, but is not limited, to theories and applications in diverse areas such as computer science, medicine, engineering, banking, education, sociology, economics, among others. The resulting palette of methods, algorithms, and applications for statistical modeling and ML presented in this Special Issue is expected to contribute to the further development of research in this area. 
    We also believe that the new knowledge acquired here, as well as the applied results, will be attractive and useful for young scientists, doctoral students, and researchers from various scientific specialties.

    Mining of textual databases within the product development process


    Application of machine learning and deep learning for proteomics data analysis


    Dynamic Data Mining: Methodology and Algorithms

    Supervised data stream mining has become an important and challenging data mining task in modern organizations. The key challenges are threefold: (1) a possibly infinite number of streaming examples and time-critical analysis constraints; (2) concept drift; and (3) skewed data distributions. To address these three challenges, this thesis proposes the novel dynamic data mining (DDM) methodology, which applies supervised ensemble models to data stream mining. DDM can be loosely defined as categorization-organization-selection of supervised ensemble models. It is inspired by the idea that although the underlying concepts in a data stream are time-varying, their distinctions can be identified. Therefore, the models trained on the distinct concepts can be dynamically selected in order to classify incoming examples of similar concepts. First, following the general paradigm of DDM, we examine the different concept-drifting stream mining scenarios and propose corresponding effective and efficient data mining algorithms.
    • To address concept drift caused merely by changes of variable distributions, which we term pseudo concept drift, base models built on categorized streaming data are organized and selected in line with their corresponding variable distribution characteristics.
    • To address concept drift caused by changes of variable and class joint distributions, which we term true concept drift, an effective data categorization scheme is introduced. A group of working models is dynamically organized and selected for reacting to the drifting concept.
    Second, we introduce an integration stream mining framework, enabling the paradigm advocated by DDM to be widely applicable to other stream mining problems. With it, we are able to easily introduce six effective algorithms for mining data streams with skewed class distributions. In addition, we introduce a new ensemble model approach for batch learning, following the same methodology.
    Both theoretical and empirical studies demonstrate its effectiveness. Future work will target improving the effectiveness and efficiency of the proposed algorithms. Meanwhile, we will explore the possibilities of using the integration framework to solve other open stream mining research problems.
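The idea of selecting base models "in line with their corresponding variable distribution characteristics" can be illustrated with a minimal sketch. It assumes, purely for illustration, that each base model is summarised by the mean feature vector of the chunk it was trained on, and that the model whose summary is closest (Euclidean distance) to the incoming batch is chosen. The names `ConceptModel`, `mean_vector`, and `select_model` are hypothetical, and the thesis may use a different categorization and selection scheme.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ConceptModel:
    """A base model plus a summary of the data chunk it was trained on."""
    centroid: List[float]                     # mean feature vector of the training chunk
    predict: Callable[[Sequence[float]], int]  # the trained classifier


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of a non-empty batch of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def select_model(models: List[ConceptModel], batch: Sequence[Sequence[float]]) -> ConceptModel:
    """Pick the model whose training-chunk centroid is closest to the batch centroid."""
    target = mean_vector(batch)
    return min(models, key=lambda m: math.dist(m.centroid, target))


# Two toy base models, each a threshold rule learned on a different "concept".
models = [
    ConceptModel(centroid=[0.0, 0.0], predict=lambda x: int(x[0] > 0.0)),
    ConceptModel(centroid=[5.0, 5.0], predict=lambda x: int(x[0] > 5.0)),
]

# An incoming batch whose feature distribution resembles the second concept.
batch = [[4.8, 5.1], [5.3, 4.9]]
chosen = select_model(models, batch)
predictions = [chosen.predict(x) for x in batch]
print(predictions)
```

Comparing only feature means is the simplest possible distribution summary; it can react to changes in variable distributions (pseudo concept drift in the abstract's terms) but says nothing about changes in the joint distribution of variables and classes.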

    Genetic algorithm-neural network: feature extraction for bioinformatics data.

    With the growth of gene expression data in bioinformatics, two questions frequently arise for both computer and medical scientists: which genes are significantly involved in discriminating cancer classes, and which genes are significant with respect to a specific cancer pathology. Numerous computational analysis models have been developed to identify informative genes from microarray data; however, the integrity of the reported genes is still uncertain. This is mainly due to misconceptions about the objectives of microarray studies. Furthermore, the application of various preprocessing techniques to the microarray data has jeopardised its quality. As a result, the integrity of the findings has been compromised by the improper use of techniques and the ill-conceived objectives of the study. This research proposes an innovative hybridised model based on genetic algorithms (GAs) and artificial neural networks (ANNs) to extract the highly differentially expressed genes for a specific cancer pathology. The proposed method can efficiently extract the informative genes from the original data set, which reduces the gene variability errors introduced by preprocessing techniques. The novelty of the research comes from two perspectives. Firstly, the research emphasises extracting informative features from a high-dimensional and highly complex data set rather than improving classification results. Secondly, it uses an ANN to compute the fitness function of the GA, which is rare in the context of feature extraction. Two benchmark microarray data sets were used to investigate the prominent genes expressed during tumour development, and the results show that the genes respond to different stages of tumourigenesis (i.e. different fitness precision levels), which may be useful for early malignancy detection. The extraction ability of the proposed model is validated against the expected results on synthetic data sets.
    In addition, two bioassay data sets were used to examine how efficiently the proposed model extracts significant features from large, imbalanced bioassay data with multiple data representations.
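The general shape of a GA whose fitness is computed by a learned model can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the abstract uses an ANN as the fitness function, whereas this sketch swaps in a much simpler nearest-centroid classifier so that it runs with the standard library alone. The function names, the GA settings (population size, mutation rate, keeping the better half), the per-feature penalty, and the toy data are all hypothetical.

```python
import random
from typing import Callable, List, Sequence

Mask = List[int]  # 1 = feature selected, 0 = feature dropped


def nearest_centroid_fitness(X: Sequence[Sequence[float]], y: Sequence[int], mask: Mask) -> float:
    """Stand-in fitness (NOT an ANN): training accuracy of a nearest-centroid
    classifier on the selected features, minus a small penalty per feature."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for x, t in zip(X, y):
        pred = min(
            centroids,
            key=lambda c: sum((x[j] - m) ** 2 for j, m in zip(cols, centroids[c])),
        )
        correct += pred == t
    return correct / len(y) - 0.01 * len(cols)


def ga_select(n_features: int, fitness: Callable[[Mask], float], pop_size: int = 20,
              generations: int = 30, p_mut: float = 0.1, seed: int = 0) -> Mask:
    """Evolve binary feature masks; requires n_features >= 2."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        elite = sorted(pop, key=fitness, reverse=True)[: pop_size // 2]  # keep the better half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, n_features)                    # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < p_mut) for bit in child]  # bit-flip mutation
            children.append(child)
        pop = elite + children
    return max(pop, key=fitness)


# Toy data: feature 0 separates the classes; features 1-3 carry no class signal.
X = [[0.0, 0.2, 0.9, 0.4], [0.0, 0.8, 0.1, 0.6], [0.0, 0.5, 0.5, 0.1],
     [10.0, 0.2, 0.9, 0.4], [10.0, 0.8, 0.1, 0.6], [10.0, 0.5, 0.5, 0.1]]
y = [0, 0, 0, 1, 1, 1]


def fitness(mask: Mask) -> float:
    return nearest_centroid_fitness(X, y, mask)


best = ga_select(n_features=4, fitness=fitness)
print(best, fitness(best))
```

On this toy data the search should retain feature 0, since any mask containing it classifies every example correctly, and the per-feature penalty pushes the GA toward smaller masks. Replacing `nearest_centroid_fitness` with a function that trains and scores a neural network would bring the sketch closer to the GA-ANN hybrid the abstract describes.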