7 research outputs found

Modeling the next generation sequencing sample processing pipeline for the purposes of classification

Publication venue: BioMed Central

Springer - Publisher Connector

Modeling the next generation sequencing sample processing pipeline for the purposes of classification

Author: A Mortazavi
B Langmead
BE Boser
C Alkan
C Cortes
Charles D Johnson
DC Hoyle
DR Bentley
DW Craig
Edward R Dougherty
ER Dougherty
ER Mardis
F Hach
H Li
I Shmulevich
Ivan Ivanov
J Hua
J Li
JC Marioni
JH Bullard
L Bianchetti
L Jiang
LA Dalton
M Sultan
MD Robinson
Mohammadmahdi R Yousefi
MR Yousefi
Noushin Ghaffari
PL Auer
R Li
RO Duda
S Anders
S Attoor
SM Rumble
SM Wang
W Sun
Y Sun
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Computational Tool for Applications of Sparse Canonical Correlation Analysis on Biological Data

Author: Koonchanok Ratanond
Publication date: 21/09/2018

Sparse canonical correlation analysis (sparse CCA) is a method for identifying sparse linear combinations of two sets of variables that are highly correlated with each other, given that both sets of measurements are available on the same set of observations. Recently, sparse CCA has become a popular method for analyzing genomic data, where the number of features is large compared to the number of observations. Analyzing a data set using sparse CCA requires multiple steps, including data cleaning, normalization, and selecting the right programming packages. To make sparse CCA accessible to all researchers regardless of their statistical background, a user-friendly computational tool should be created to walk them through the analysis. After the tool is successfully implemented, a few data sets will be used as case studies for testing the efficiency of the sparse CCA computational tool. Eventually, the tool will be added to the computational website hosted by the Center for Translational Environmental Health Research, which currently hosts services for sequencing classification and differential expression analysis.
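For orientation, one common way to write sparse CCA is as an L1-constrained optimization, often called the penalized matrix decomposition form. The abstract does not state which formulation the tool uses, so everything below is an assumption made for illustration: X (n × p) and Y (n × q) are the two column-standardized data matrices measured on the same n observations, u and v are the weight vectors, and c_1, c_2 are tuning parameters that control sparsity.

```latex
\begin{aligned}
\max_{u,\,v}\quad & u^{\top} X^{\top} Y v \\
\text{subject to}\quad & \|u\|_2 \le 1,\quad \|v\|_2 \le 1,\quad \|u\|_1 \le c_1,\quad \|v\|_1 \le c_2
\end{aligned}
```

Smaller values of c_1 and c_2 force more entries of u and v to zero, which is what makes the resulting linear combinations interpretable when p and q are much larger than n.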

Texas A&M Repository

Modeling the next generation sequencing sample processing pipeline for the purposes of classification

Author: Charles D Johnson
Edward R Dougherty
Ghaffari Noushin
Ivan Ivanov
Mohammadmahdi R Yousefi
Noushin Ghaffari
Publication venue: Wiley
Publication date: 01/01/2013

BACKGROUND: A key goal of systems biology and translational genomics is to utilize high-throughput measurements of cellular states to develop expression-based classifiers for discriminating among different phenotypes. Recent developments in Next Generation Sequencing (NGS) technologies can facilitate classifier design by providing expression measurements for tens of thousands of genes simultaneously via the abundance of their mRNA transcripts. Because NGS technologies result in a nonlinear transformation of the actual expression distributions, their application can yield data that are less discriminative than the actual expression levels would be, were they directly observable. RESULTS: Using state-of-the-art distributional modeling for the NGS processing pipeline, this paper studies how that pipeline, via the resulting nonlinear transformation, affects classification and feature selection. The effects of different factors are considered, and NGS-based classification is compared to SAGE-based classification and to classification directly on the raw expression data, which is represented by a very high-dimensional model previously developed for gene expression. As expected, the nonlinear transformation resulting from NGS processing diminishes classification accuracy; however, owing to a larger number of reads, NGS-based classification outperforms SAGE-based classification. CONCLUSIONS: Having high numbers of reads can mitigate the degradation in classification performance resulting from the effects of NGS technologies. Hence, when performing an RNA-Seq analysis, using the highest possible coverage of the genome is recommended for the purposes of classification.
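The conclusion that more reads help classification can be illustrated with a toy calculation. This is a minimal sketch under a simple Poisson sampling assumption, not the distributional model used in the paper: the function name `poisson_separation`, the transcript shares `p0` and `p1`, and the depths tried are all hypothetical.

```python
import math


def poisson_separation(p0: float, p1: float, depth: int) -> float:
    """Standardized gap between the mean read counts of two classes.

    Toy assumption (not from the paper): the read count of a gene is
    Poisson with mean depth * p, where p is the gene's share of all
    transcripts in that class.
    """
    mean0, mean1 = depth * p0, depth * p1
    # For a Poisson variable the variance equals the mean, so the
    # pooled standard deviation is the root of the average mean.
    pooled_sd = math.sqrt((mean0 + mean1) / 2.0)
    return abs(mean1 - mean0) / pooled_sd


if __name__ == "__main__":
    # Hypothetical gene whose transcript share doubles between classes.
    for depth in (10_000, 100_000, 1_000_000):
        print(depth, round(poisson_separation(1e-4, 2e-4, depth), 3))
```

Under this assumption the standardized gap grows with the square root of the sequencing depth, which is one simple way to see why higher coverage can offset part of the information lost in the sampling step.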

Crossref

Springer

Springer - Publisher Connector

PubMed Central

Texas A&M Repository

Optimal Model-Based Approaches for Predictive Inference in Biology

Author: Knight Jason Matthew
Publication date: 21/09/2015

Predictive modeling of the dynamic, multivariate, non-linear, stochastic systems of biology is a difficult enterprise. High-throughput measurement techniques are enabling new approaches to computational biology, but the small number of samples typically available relative to the number of features measured makes additional sources of information critical for accurate predictions. In this dissertation, we offer an approach to incorporate biological pathway knowledge into a predictive stochastic model for genetic regulatory networks. In addition, we propose a statistical model for shotgun sequencing and use computational approximation strategies to derive optimal estimators for classification. We compare classifiers trained using this framework to other existing classification rules, including non-linear support vector machines. Using both synthetic and real sequencing data, our classifiers delivered lower classification error rates than existing classification techniques. In addition, we demonstrate how prior knowledge can be used to construct the classifier through properly constructed prior distributions, and we present several scenarios in which this increases classification performance. This research establishes a flexible framework to generate optimal estimators with respect to statistical biological models. By demonstrating the role and power of computation in unlocking these estimators, we point future research efforts towards this computationally intensive approach for the computational biology field.

Texas A&M Repository

Stock Market Random Forest-Text Mining (SMRF-TM) Approach to Analyse Critical Indicators of Stock Market Movements

Author: ELAGAMY MAZEN NABIL
Publication date: 01/01/2017

The Stock Market is a significant sector of a country’s economy and has a crucial role in the growth of commerce and industry. Hence, discovering efficient ways to analyse and visualise stock market data is considered a significant issue in modern finance. The use of data mining techniques to predict stock market movements has been extensively studied using historical market prices, but such approaches are constrained to make assessments within the scope of existing information, and thus they are not able to model any random behaviour of the stock market or identify the causes behind events. One area of limited success in stock market prediction comes from textual data, which is a rich source of information. Analysing textual data related to the Stock Market may provide a better understanding of random behaviours of the market. Text Mining combined with the Random Forest algorithm offers a novel approach to the study of critical indicators, which contribute to the prediction of abnormal stock market movements. In this thesis, a Stock Market Random Forest-Text Mining system (SMRF-TM) is developed and used to mine the critical indicators related to the 2009 Dubai stock market debt standstill. Random forest and expectation maximisation are applied to classify the extracted features into a set of meaningful and semantic classes, thus extending current approaches from three to eight classes: critical down, down, neutral, up, critical up, economic, social and political. The study demonstrates that Random Forest outperformed other classifiers and achieved the best accuracy in classifying the bigram features extracted from the corpus.

STORE - Staffordshire Online Repository