    Computational Tool for Applications of Sparse Canonical Correlation Analysis on Biological Data

    Sparse canonical correlation analysis (sparse CCA) is a method for identifying sparse linear combinations of two sets of variables that are highly correlated with each other, when both sets are measured on the same observations. Recently, sparse CCA has become a popular method for analyzing genomic data, where the number of features is large relative to the number of observations. Analyzing a data set with sparse CCA requires multiple steps, including data cleaning, normalizing, and using the right programming packages. To make sparse CCA accessible to all researchers regardless of their statistical background, a user-friendly computational tool should be created to walk them through the analysis. After the tool is successfully implemented, a few data sets will be used as case studies for testing the efficiency of the sparse CCA computational tool. Eventually, the tool will be added to the computational website hosted by the Center for Translational Environmental Health Research, which currently hosts services for sequencing classification and differential expression analysis.
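
The core computation described in this abstract can be sketched in a few lines. The example below is a minimal illustration, not the tool the abstract proposes: it estimates one pair of sparse weight vectors by alternating soft-thresholding on the cross-correlation matrix, a simplified form of the penalized matrix decomposition approach. The function names, the relative thresholds `frac_u` and `frac_v`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(a, lam):
    """Shrink each entry of `a` toward zero by `lam`; small entries become exactly 0."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)


def unit(a):
    """Scale `a` to unit Euclidean length (an all-zero vector is returned unchanged)."""
    norm = np.linalg.norm(a)
    return a / norm if norm > 0 else a


def sparse_cca_rank1(X, Y, frac_u=0.5, frac_v=0.5, n_iter=50):
    """Return sparse weights u, v and the correlation of the projections X@u, Y@v.

    frac_u and frac_v (hypothetical tuning knobs, between 0 and 1) set the
    threshold as a fraction of the largest entry, so larger values give
    sparser weights.
    """
    # Normalizing step: center and scale every column (assumes no constant columns).
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    Y = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    C = X.T @ Y / X.shape[0]  # cross-correlation matrix, shape (p, q)

    # Start from the leading right singular vector of C.
    v = np.linalg.svd(C, full_matrices=False)[2][0]
    for _ in range(n_iter):
        a = C @ v
        u = unit(soft_threshold(a, frac_u * np.abs(a).max()))
        b = C.T @ u
        v = unit(soft_threshold(b, frac_v * np.abs(b).max()))

    corr = np.corrcoef(X @ u, Y @ v)[0, 1]
    return u, v, corr


# Synthetic data: the first 5 columns of X and of Y share one latent signal z.
rng = np.random.default_rng(0)
n, p, q, k = 200, 50, 40, 5
z = rng.normal(size=n)
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, q))
X[:, :k] = z[:, None] + 0.5 * rng.normal(size=(n, k))
Y[:, :k] = z[:, None] + 0.5 * rng.normal(size=(n, k))

u, v, corr = sparse_cca_rank1(X, Y)
print("selected X features:", np.flatnonzero(u))
print("selected Y features:", np.flatnonzero(v))
print("correlation of projections:", round(float(corr), 3))
```

In practice the sparsity level would be chosen by cross-validation or permutation rather than fixed in advance, and with many more features than observations the thresholds usually need to be stricter.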

    Modeling the next generation sequencing sample processing pipeline for the purposes of classification

    BACKGROUND: A key goal of systems biology and translational genomics is to utilize high-throughput measurements of cellular states to develop expression-based classifiers for discriminating among different phenotypes. Recent developments in Next Generation Sequencing (NGS) technologies can facilitate classifier design by providing expression measurements for tens of thousands of genes simultaneously via the abundance of their mRNA transcripts. Because NGS technologies result in a nonlinear transformation of the actual expression distributions, their application can result in data that are less discriminative than the actual expression levels themselves would be, were they directly observable. RESULTS: Using state-of-the-art distributional modeling for the NGS processing pipeline, this paper studies how that pipeline, via the resulting nonlinear transformation, affects classification and feature selection. The effects of different factors are considered, and NGS-based classification is compared to SAGE-based classification and to classification directly on the raw expression data, which is represented by a very high-dimensional model previously developed for gene expression. As expected, the nonlinear transformation resulting from NGS processing diminishes classification accuracy; however, owing to a larger number of reads, NGS-based classification outperforms SAGE-based classification. CONCLUSIONS: Having high numbers of reads can mitigate the degradation in classification performance resulting from the effects of NGS technologies. Hence, when performing an RNA-Seq analysis, using the highest possible coverage of the genome is recommended for the purposes of classification.
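
The relationship between read depth and classification error described in this abstract can be illustrated with a toy simulation. The sketch below does not reproduce the paper's distributional model. It assumes, purely for illustration, Gaussian log-expression for two classes, Poisson read counts whose mean is proportional to expression, and a nearest-centroid classifier; every function name and parameter value here is hypothetical.

```python
import numpy as np


def simulate(rng, n_per_class, n_genes=10, shift=0.5, sd=0.5):
    """Draw log-expression for two classes whose means differ by `shift` in every gene."""
    labels = np.repeat([0, 1], n_per_class)
    log_expr = rng.normal(loc=shift * labels[:, None], scale=sd,
                          size=(labels.size, n_genes))
    return log_expr, labels


def sequence(rng, log_expr, depth):
    """Toy sequencing step: Poisson read counts with mean depth * expression."""
    return rng.poisson(depth * np.exp(log_expr))


def nearest_centroid_error(train_x, train_y, test_x, test_y):
    """Fit class centroids on the training set and return the test error rate."""
    c0 = train_x[train_y == 0].mean(axis=0)
    c1 = train_x[train_y == 1].mean(axis=0)
    d0 = ((test_x - c0) ** 2).sum(axis=1)
    d1 = ((test_x - c1) ** 2).sum(axis=1)
    pred = (d1 < d0).astype(int)
    return float((pred != test_y).mean())


rng = np.random.default_rng(1)
train_e, train_y = simulate(rng, 200)
test_e, test_y = simulate(rng, 2000)

# Baseline: classify on the (normally unobservable) log-expression itself.
errors = {"raw": nearest_centroid_error(train_e, train_y, test_e, test_y)}

# Classify on log-transformed read counts at a shallow and a deep read depth.
for depth in (0.5, 200.0):
    train_counts = np.log1p(sequence(rng, train_e, depth))
    test_counts = np.log1p(sequence(rng, test_e, depth))
    errors[depth] = nearest_centroid_error(train_counts, train_y, test_counts, test_y)

print(errors)
```

Under these assumptions the shallow setting loses most of the class signal to counting noise, while the deep setting comes close to the error obtained on the expression values themselves, in line with the abstract's recommendation to use high coverage.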

    Optimal Model-Based Approaches for Predictive Inference in Biology

    Predictive modeling of the dynamic, multivariate, non-linear, stochastic systems of biology is a difficult enterprise. High throughput measurement techniques are enabling new approaches to computational biology, but the small number of samples typically available relative to the number of features measured makes additional sources of information critical for accurate predictions. In this dissertation, we offer an approach to incorporate biological pathway knowledge into a predictive stochastic model for genetic regulatory networks. In addition, we propose a statistical model for shotgun sequencing and use computational approximation strategies to derive optimal estimators for classification. We compare classifiers trained using this framework to other existing classification rules, including non-linear support vector machines. Using both synthetic and real sequencing data, our classifiers delivered lower classification error rates than existing classification techniques. In addition, we demonstrate how prior knowledge can be built into the classifier through properly constructed prior distributions, and present several scenarios in which this improves classification performance. This research establishes a flexible framework to generate optimal estimators with respect to statistical biological models. By demonstrating the role and power of computation in unlocking these estimators, we point future research efforts towards this computationally intensive approach for the computational biology field.
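
The idea of building a classifier from a statistical model for counts plus a prior distribution can be shown in a much simpler setting than the one this dissertation treats. The sketch below swaps in a conjugate Gamma-Poisson model for a single count feature, where the posterior predictive distribution has a closed form (a negative binomial), so no computational approximation is needed. It is an illustrative assumption, not the dissertation's model; the function names, the prior parameters, and the toy counts are hypothetical, and equal class probabilities are assumed.

```python
import math


def log_predictive(x, counts, a=1.0, b=1.0):
    """Log posterior-predictive probability of a new count x.

    Model: counts are Poisson with an unknown rate; the rate has a
    Gamma(shape=a, rate=b) prior. After observing `counts`, the predictive
    distribution of a new count is negative binomial.
    """
    r = a + sum(counts)      # posterior shape
    rate = b + len(counts)   # posterior rate
    p = rate / (rate + 1.0)
    return (math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
            + r * math.log(p) + x * math.log(1.0 - p))


def classify(x, counts_by_class, prior_by_class):
    """Pick the class whose posterior predictive gives x the highest probability."""
    scores = {
        label: log_predictive(x, counts, *prior_by_class[label])
        for label, counts in counts_by_class.items()
    }
    return max(scores, key=scores.get)


# With training data and a vague prior, the data drive the decision.
training = {"low": [1, 3, 2, 0, 2], "high": [9, 12, 10, 8, 11]}
vague = {"low": (1.0, 1.0), "high": (1.0, 1.0)}
print(classify(1, training, vague), classify(12, training, vague))

# With no training data at all, an informative prior alone can still classify.
no_data = {"low": [], "high": []}
informative = {"low": (2.0, 1.0), "high": (10.0, 1.0)}  # prior means 2 and 10
print(classify(9, no_data, informative))
```

The second call shows why prior knowledge matters most when samples are scarce: as the number of training counts grows, the posterior is dominated by the data and the choice of prior has less effect.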

    Stock Market Random Forest-Text Mining (SMRF-TM) Approach to Analyse Critical Indicators of Stock Market Movements

    The Stock Market is a significant sector of a country’s economy and has a crucial role in the growth of commerce and industry. Hence, discovering efficient ways to analyse and visualise stock market data is considered a significant issue in modern finance. The use of data mining techniques to predict stock market movements has been extensively studied using historical market prices, but such approaches are constrained to make assessments within the scope of existing information, and thus they are not able to model any random behaviour of the stock market or identify the causes behind events. One area of limited success in stock market prediction comes from textual data, which is a rich source of information. Analysing textual data related to the Stock Market may provide a better understanding of random behaviours of the market. Text Mining combined with the Random Forest algorithm offers a novel approach to the study of critical indicators, which contribute to the prediction of abnormal stock market movements. In this thesis, a Stock Market Random Forest-Text Mining system (SMRF-TM) is developed and used to mine the critical indicators related to the 2009 Dubai stock market debt standstill. Random forest and expectation maximisation are applied to classify the extracted features into a set of meaningful and semantic classes, thus extending current approaches from three to eight classes: critical down, down, neutral, up, critical up, economic, social and political. The study demonstrates that Random Forest outperformed the other classifiers and achieved the best accuracy in classifying the bigram features extracted from the corpus.
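
The bigram features mentioned in this abstract are pairs of adjacent words counted per document. The sketch below shows one plausible way to extract them with the Python standard library; the tokenisation rule, the function names, and the two example headlines are invented for illustration and are not taken from the thesis or its corpus.

```python
import re
from collections import Counter


def bigrams(text):
    """Lower-case the text, split it into word tokens, and pair adjacent tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return list(zip(tokens, tokens[1:]))


def bigram_matrix(docs):
    """Return the sorted bigram vocabulary and one row of counts per document."""
    counts = [Counter(bigrams(doc)) for doc in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[b] for b in vocab] for c in counts]
    return vocab, rows


# Made-up headlines standing in for a news corpus.
docs = [
    "Dubai debt standstill shocks markets",
    "Markets rally after debt deal",
]
vocab, rows = bigram_matrix(docs)
for bigram, column in zip(vocab, zip(*rows)):
    print(bigram, column)
```

Each row of `rows` is a fixed-length feature vector, so it could be passed together with class labels to any standard classifier, for example a random forest implementation such as scikit-learn's `RandomForestClassifier`.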