261 research outputs found
stam – a Bioconductor compliant R package for structured analysis of microarray data
BACKGROUND: Genome wide microarray studies have the potential to unveil novel disease entities. Clinically homogeneous groups of patients can have diverse gene expression profiles. The definition of novel subclasses based on gene expression is a difficult problem not addressed systematically by currently available software tools. RESULTS: We present a computational tool for semi-supervised molecular disease entity detection. It automatically discovers molecular heterogeneities in phenotypically defined disease entities and suggests alternative molecular sub-entities of clinical phenotypes. This is done using both gene expression data and functional gene annotations. We provide stam, a Bioconductor compliant software package for the statistical programming environment R. We demonstrate that our tool detects gene expression patterns, which are characteristic for only a subset of patients from an established disease entity. We call such expression patterns molecular symptoms. Furthermore, stam finds novel sub-group stratifications of patients according to the absence or presence of molecular symptoms. CONCLUSION: Our software is easy to install and can be applied to a wide range of datasets. It provides the potential to reveal so far indistinguishable patient sub-groups of clinical relevance
Microarray Based Diagnosis Profits from Better Documentation of Gene Expression Signatures
Microarray gene expression signatures hold great promise to improve diagnosis and prognosis of disease. However, current documentation standards of such signatures do not allow for an unambiguous application to study-external patients. This hinders independent evaluation, effectively delaying the use of signatures in clinical practice. Data from eight publicly available clinical microarray studies were analyzed and the consistency of study-internal with study-external diagnoses was evaluated. Study-external classifications were based on documented information only. Documenting a signature is conceptually different from reporting a list of genes. We show that even the exact quantitative specification of a classification rule alone does not define a signature unambiguously. We found that discrepancy between study-internal and study-external diagnoses can be as frequent as 30% (worst case) and 18% (median). By using the proposed documentation by value strategy, which documents quantitative preprocessing information, the median discrepancy was reduced to 1%. The process of evaluating microarray gene expression diagnostic signatures and bringing them to clinical practice can be substantially improved and made more reliable by better documentation of the signatures
A novel approach to remote homology detection: jumping alignments
Spang R, Rehmsmeier M, Stoye J. A novel approach to remote homology detection: jumping alignments. Journal of Computational Biology. 2002;9(5):747-760.We describe a new algorithm for protein classification and the detection of remote homologs. The rationale is to exploit both vertical and horizontal information of a multiple alignment in a well-balanced manner. This is in contrast to established methods such as profiles and profile hidden Markov models which focus on vertical information as they model the columns of the alignment independently and to family pairwise search which focuses on horizontal information as it treats given sequences separately. In our setting, we want to select from a given database of "candidate sequences" those proteins that belong to a given superfamily. In order to do so, each candidate sequence is separately tested against a multiple alignment of the known members of the superfamily by means of a new jumping alignment algorithm. This algorithm is an extension of the Smith-Waterman algorithm and computes a local alignment of a single sequence and a multiple alignment. In contrast to traditional methods, however, this alignment is not based on a summary of the individual columns of the multiple alignment. Rather, the candidate sequence is at each position aligned to one sequence of the multiple alignment, called the "reference sequence". In addition, the reference sequence may change within the alignment, while each such jump is penalized. To evaluate the discriminative quality of the jumping alignment algorithm, we compare it to profiles, profile hidden Markov models, and family pairwise search on a subset of the SCOP database of protein domains. The discriminative quality is assessed by median false positive counts (med-FP-counts). For moderate med-FP-counts, the number of successful searches with our method is considerably higher than with the competing methods
Automated in-silico detection of cell populations in flow cytometry readouts and its application to leukemia disease monitoring
BACKGROUND: Identification of minor cell populations, e.g. leukemic blasts within blood samples, has become increasingly important in therapeutic disease monitoring. Modern flow cytometers enable researchers to reliably measure six and more variables, describing cellular size, granularity and expression of cell-surface and intracellular proteins, for thousands of cells per second. Currently, analysis of cytometry readouts relies on visual inspection and manual gating of one- or two-dimensional projections of the data. This procedure, however, is labor-intensive and misses potential characteristic patterns in higher dimensions. RESULTS: Leukemic samples from patients with acute lymphoblastic leukemia at initial diagnosis and during induction therapy have been investigated by 4-color flow cytometry. We have utilized multivariate classification techniques, Support Vector Machines (SVM), to automate leukemic cell detection in cytometry. Classifiers were built on conventionally diagnosed training data. We assessed the detection accuracy on independent test data and analyzed marker expression of incongruently classified cells. SVM classification can recover manually gated leukemic cells with 99.78% sensitivity and 98.87% specificity. CONCLUSION: Multivariate classification techniques allow for automating cell population detection in cytometry readouts for diagnostic purposes. They potentially reduce time, costs and arbitrariness associated with these procedures. Due to their multivariate classification rules, they also allow for the reliable detection of small cell populations
A least angle regression model for the prediction of canonical and non-canonical miRNA-mRNA interactions
microRNAs (miRNAs) are short non-coding RNAs with regulatory functions in various biological processes including cell differentiation, development and oncogenic transformation. They can bind to mRNA transcripts of protein-coding genes and repress their translation or lead to mRNA degradation. Conversely, the transcription of miRNAs is regulated by proteins including transcription factors, co-factors, and messenger molecules in signaling pathways, yielding a bidirectional regulatory network of gene and miRNA expression. We describe here a least angle regression approach for uncovering the functional interplay of gene and miRNA regulation based on paired gene and miRNA expression profiles. First, we show that gene expression profiles can indeed be reconstructed from the expression profiles of miRNAs predicted to be regulating the specific gene. Second, we propose a two-step model where in the first step, sequence information is used to constrain the possible set of regulating miRNAs and in the second step, this constraint is relaxed to find regulating miRNAs that do not rely on perfect seed binding. Finally, a bidirectional network comprised of miRNAs regulating genes and genes regulating miRNAs is built from our previous regulatory predictions. After applying the method to a human cancer cell line data set, an analysis of the underlying network reveals miRNAs known to be associated with cancer when dysregulated are predictors of genes with functions in apoptosis. Among the predicted and newly identified targets that lack a classical miRNA seed binding site of a specific oncomir, miR-19b-1, we found an over-representation of genes with functions in apoptosis, which is in accordance with the previous finding that this miRNA is the key oncogenic factor in the mir-17-92 cluster. In addition, we found genes involved in DNA recombination and repair that underline its importance in maintaining the integrity of the cell
Exact likelihood computation in Boolean networks with probabilistic time delays, and its application in signal network reconstruction
Motivation: For biological pathways, it is common to measure a gene expression time series after various knockdowns of genes that are putatively involved in the process of interest. These interventional time-resolved data are most suitable for the elucidation of dynamic causal relationships in signaling networks. Even with this kind of data it is still a major and largely unsolved challenge to infer the topology and interaction logic of the underlying regulatory network. Results: In this work, we present a novel model-based approach involving Boolean networks to reconstruct small to medium-sized regulatory networks. In particular, we solve the problem of exact likelihood computation in Boolean networks with probabilistic exponential time delays. Simulations demonstrate the high accuracy of our approach. We apply our method to data of Ivanova et al. (2006), where RNA interference knockdown experiments were used to build a network of the key regulatory genes governing mouse stem cell maintenance and differentiation. In contrast to previous analyses of that data set, our method can identify feedback loops and provides new insights into the interplay of some master regulators in embryonic stem cell development. Availability and implementation: The algorithm is implemented in the statistical language R. Code and documentation are available at Bioinformatics online. Contact: [email protected] or [email protected] Supplementary information: Supplementary Materials are available at Bioinfomatics onlin
Modelling cancer progression using Mutual Hazard Networks
Motivation: Cancer progresses by accumulating genomic events, such as mutations and copy number alterations, whose chronological order is key to understanding the disease but difficult to observe. Instead, cancer progression models use co-occurence patterns in cross-sectional data to infer epistatic interactions between events and thereby uncover their most likely order of occurence. State-of-the-art progression models, however, are limited by mathematical tractability and only allow events to interact in directed acyclic graphs, to promote but not inhibit subsequent events, or to be mutually exclusive in distinct groups that cannot overlap.
Results: Here we propose Mutual Hazard Networks (MHN), a new Machine Learning algorithm to infer cyclic progression models from cross-sectional data. MHN model events by their spontaneous rate of fixation and by multiplicative effects they exert on the rates of successive events. MHN compared favourably to acyclic models in cross-validated model fit on four datasets tested. In application to the glioblastoma dataset from The Cancer Genome Atlas, MHN proposed a novel interaction in line with consecutive biopsies: IDH1 mutations are early events that promote subsequent fixation of TP53 mutations.
Availability Implementation and data are available at https://github.com/RudiSchill/MHN
Low-rank tensor methods for Markov chains with applications to tumor progression models
Cancer progression can be described by continuous-time Markov chains whose state space grows exponentially in the number of somatic mutations. The age of a tumor at diagnosis is typically unknown. Therefore, the quantity of interest is the time-marginal distribution over all possible genotypes of tumors, defined as the transient distribution integrated over an exponentially distributed observation time. It can be obtained as the solution of a large linear system. However, the sheer size of this system renders classical solvers infeasible. We consider Markov chains whose transition rates are separable functions, allowing for an efficient low-rank tensor representation of the linear system’s operator. Thus we can reduce the computational complexity from exponential to linear. We derive a convergent iterative method using low-rank formats whose result satisfies the normalization constraint of a distribution. We also perform numerical experiments illustrating that the marginal distribution is well approximated with low rank
- …