Probabilistic Random Walk Models for Comparative Network Analysis
Graph-based systems and data analysis methods have become critical tools in many
fields as they can provide an intuitive way of representing and analyzing interactions between
variables. Due to advances in measurement techniques, a massive amount of
labeled data that can be represented as nodes on a graph (or network) has been archived
in databases. Additionally, novel data without label information have been gradually generated
and archived. Labeling and identifying characteristics of novel data is an important
first step in utilizing the valuable data in an effective and meaningful way. Comparative
network analysis is an effective computational means to identify and predict the properties
of the unlabeled data by comparing the similarities and differences between well-studied
and less-studied networks. Comparative network analysis aims to identify the matching
nodes and conserved subnetworks across multiple networks to enable a prediction of the
properties of the nodes in the less-studied networks based on the properties of the matching
nodes in the well-studied networks (i.e., transferring knowledge between networks).
One of the fundamental and important questions in comparative network analysis is
how to accurately estimate node-to-node correspondence as it can be a critical clue in
analyzing the similarities and differences between networks. Node correspondence is a
comprehensive similarity that integrates various types of similarity measurements in a
balanced manner. However, there are several challenges in accurately estimating the node
correspondence for large-scale networks. First, the scale of the networks is a critical issue.
As networks generally include a large number of nodes, we have to examine an extremely
large space and it can pose a computational challenge due to the combinatorial nature of
the problem. Furthermore, although there are matching nodes and conserved subnetworks
in different networks, structural variations such as node insertions and deletions make it difficult to integrate a topological similarity.
In this dissertation, novel probabilistic random walk models are proposed to accurately
estimate node-to-node correspondence between networks. First, we propose a context-sensitive
random walk (CSRW) model. In the CSRW model, the random walker analyzes
the context of its current position and can switch its
movement to either a simultaneous walk on both networks or an individual walk on one
of the networks. The context-sensitive nature of the random walker enables the method
to effectively integrate different types of similarities by dealing with structural variations.
Second, we propose the CUFID (Comparative network analysis Using the steady-state
network Flow to IDentify orthologous proteins) model. In the CUFID model, we construct
an integrated network by inserting pseudo edges between potential matching nodes in
different networks. Then, we design the random walk protocol to transit more frequently
between potential matching nodes as their node similarity increases and they have more
matching neighboring nodes. We apply the proposed random walk models to comparative
network analysis problems: global network alignment and network querying. Through
extensive performance evaluations, we demonstrate that the proposed random walk models
can accurately estimate node correspondence, which can lead to improved and more reliable
network comparison results.
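The integrated-network idea behind CUFID can be illustrated with a minimal sketch. It is not the dissertation's actual random walk protocol: the function name `pseudo_edge_flow`, the unit weight on intra-network edges, and the pseudo-edge weight (node similarity plus the summed similarity of neighboring candidate pairs) are all assumptions chosen for illustration. The sketch joins two small networks with pseudo edges, runs a lazy random walk to its steady state, and reports the flow across each pseudo edge as a correspondence score.

```python
from itertools import product


def neighbors(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    nb = {}
    for u, v in edges:
        nb.setdefault(u, set()).add(v)
        nb.setdefault(v, set()).add(u)
    return nb


def pseudo_edge_flow(edges_a, edges_b, sim, n_iter=500):
    """Steady-state flow across each pseudo edge (a, b) of the integrated network.

    sim maps candidate pairs (a, b) to a node similarity score.
    The weighting scheme is an illustrative assumption, not the CUFID definition.
    """
    nb_a, nb_b = neighbors(edges_a), neighbors(edges_b)
    w = {}  # weighted adjacency of the integrated network

    def add(u, v, weight):
        w.setdefault(u, {})[v] = weight
        w.setdefault(v, {})[u] = weight

    # Intra-network edges keep unit weight; nodes are tagged by network.
    for u, v in edges_a:
        add(("A", u), ("A", v), 1.0)
    for u, v in edges_b:
        add(("B", u), ("B", v), 1.0)

    # Pseudo edges: heavier when the pair is similar and its neighbors also match.
    for (a, b), s in sim.items():
        support = sum(sim.get(pair, 0.0)
                      for pair in product(nb_a.get(a, ()), nb_b.get(b, ())))
        add(("A", a), ("B", b), s + support)

    # Lazy random walk (stay put with probability 0.5) iterated to steady state.
    nodes = list(w)
    out_weight = {u: sum(w[u].values()) for u in nodes}
    pi = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(n_iter):
        nxt = {u: 0.5 * pi[u] for u in nodes}
        for u in nodes:
            for v, weight in w[u].items():
                nxt[v] += 0.5 * pi[u] * weight / out_weight[u]
        pi = nxt

    # Flow on a pseudo edge: probability mass crossing it in both directions.
    flow = {}
    for a, b in sim:
        u, v = ("A", a), ("B", b)
        flow[(a, b)] = (pi[u] * w[u][v] / out_weight[u]
                        + pi[v] * w[v][u] / out_weight[v])
    return flow


# Toy example: two three-node paths and four candidate node pairs.
edges_a = [("a1", "a2"), ("a2", "a3")]
edges_b = [("b1", "b2"), ("b2", "b3")]
sim = {("a1", "b1"): 0.9, ("a2", "b2"): 0.8, ("a3", "b3"): 0.7, ("a1", "b3"): 0.6}
flow = pseudo_edge_flow(edges_a, edges_b, sim)
```

In this toy setting the pair with the most matching neighbors receives the largest flow, even though its raw similarity is not the highest, which is the qualitative behavior the abstract describes.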
A verified genomic reference sample for assessing performance of cancer panels detecting small variants of low allele frequency
Background
Oncopanel genomic testing, which identifies important somatic variants, is increasingly common in medical practice and especially in clinical trials. Currently, there is a paucity of reliable genomic reference samples having a suitably large number of pre-identified variants for properly assessing oncopanel assay analytical quality and performance. The FDA-led Sequencing and Quality Control Phase 2 (SEQC2) consortium analyzes ten diverse cancer cell lines individually and their pool, termed Sample A, to develop a reference sample with suitably large numbers of coding positions with known (variant) positives and negatives for properly evaluating oncopanel analytical performance.
Results
In reference Sample A, we identify more than 40,000 variants down to 1% allele frequency, with more than 25,000 variants having less than 20% allele frequency with 1653 variants in COSMIC-related genes. This is 5-100x more than existing commercially available samples. We also identify an unprecedented number of negative positions in coding regions, allowing statistical rigor in assessing limit-of-detection, sensitivity, and precision. Over 300 loci are randomly selected and independently verified via droplet digital PCR with 100% concordance. Agilent normal reference Sample B can be admixed with Sample A to create new samples with a similar number of known variants at much lower allele frequency than what exists in Sample A natively, including known variants having allele frequency of 0.02%, a range suitable for assessing liquid biopsy panels.
Conclusion
These new reference samples and their admixtures provide superior capability for performing oncopanel quality control, analytical accuracy, and validation for small to large oncopanels and liquid biopsy assays.
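The dilution arithmetic behind such admixtures can be sketched as follows. This is a simplified illustration, not the consortium's protocol: it assumes the normal reference sample carries no variant allele at the position and that the mixing fraction is expressed in genome copies. The function names are hypothetical.

```python
def admixed_allele_frequency(af_in_a, fraction_a):
    """Expected allele frequency after diluting Sample A into a variant-free background.

    af_in_a    -- allele frequency of the variant in Sample A (0..1)
    fraction_a -- fraction of genome copies in the mix that come from Sample A (0..1)
    """
    if not 0.0 <= af_in_a <= 1.0 or not 0.0 <= fraction_a <= 1.0:
        raise ValueError("frequencies and fractions must lie in [0, 1]")
    return af_in_a * fraction_a


def fraction_needed(af_in_a, target_af):
    """Fraction of Sample A required to bring a variant down to target_af."""
    if af_in_a <= 0.0 or not 0.0 <= target_af <= af_in_a:
        raise ValueError("target must lie between 0 and the native allele frequency")
    return target_af / af_in_a


# A variant at 1% in Sample A reaches 0.02% when Sample A makes up 2% of the mix.
share_of_a = fraction_needed(0.01, 0.0002)
diluted = admixed_allele_frequency(0.01, share_of_a)
```

Under these assumptions, every known variant scales by the same factor, so the admixture keeps the same set of known positives while shifting the whole allele-frequency range downward.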
Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs
Background
A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment.
Results
In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased.
Conclusions
The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference.
Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign
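A minimal sketch of the alignment DAG idea, independent of the WeaveAlign implementation, is given below. It assumes each alignment column is encoded as a tuple holding, for every sequence, the index of the aligned residue or `None` for a gap; this encoding and the names `build_alignment_dag` and `count_paths` are assumptions made for illustration. Columns shared between samples become shared nodes, so the number of start-to-end paths can exceed the number of sampled alignments.

```python
from collections import Counter, defaultdict
from functools import lru_cache

START, END = "START", "END"


def build_alignment_dag(samples):
    """Merge sampled alignments into a DAG whose nodes are alignment columns.

    Returns (successors, frequencies): the edge map of the DAG and the
    empirical frequency of each column across the samples.
    """
    column_counts = Counter()
    successors = defaultdict(set)
    for alignment in samples:
        column_counts.update(alignment)
        path = [START, *alignment, END]
        for u, v in zip(path, path[1:]):
            successors[u].add(v)
    frequencies = {col: n / len(samples) for col, n in column_counts.items()}
    return successors, frequencies


def count_paths(successors):
    """Count distinct START-to-END paths, i.e. alignments encoded by the DAG."""
    @lru_cache(maxsize=None)
    def paths_from(node):
        if node == END:
            return 1
        return sum(paths_from(nxt) for nxt in successors[node])
    return paths_from(START)


# Two sampled alignments of two three-residue sequences.
# Both samples share the middle column (1, 1) but differ on either side of it.
samples = [
    [(0, 0), (1, 1), (2, 2)],
    [(0, None), (None, 0), (1, 1), (2, None), (None, 2)],
]
successors, frequencies = build_alignment_dag(samples)
n_alignments = count_paths(successors)
```

Here the two samples can be recombined on either side of the shared column, so the DAG encodes four alignments rather than two. The recursive path count is adequate for a toy example; long alignments would call for an iterative traversal in topological order.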
PROPER: global protein interaction network alignment through percolation matching
Background
The alignment of protein-protein interaction (PPI) networks enables us to uncover the relationships between different species, which leads to a deeper understanding of biological systems. Network alignment can be used to transfer biological knowledge between species. Although different PPI network alignment algorithms were introduced during the last decade, developing an accurate and scalable algorithm that can find alignments with high biological and structural similarities among PPI networks is still challenging.
Results
In this paper, we introduce a new global network alignment algorithm for PPI networks called PROPER. Compared to other global network alignment methods, our algorithm shows higher accuracy and speed over real PPI datasets and synthetic networks. We show that the PROPER algorithm can detect large portions of conserved biological pathways between species. Also, using a simple parsimonious evolutionary model, we explain why PROPER performs well based on several different comparison criteria.
Conclusions
We highlight that PROPER has high potential in further applications such as detecting biological pathways, finding protein complexes and PPI prediction. The PROPER algorithm is available at http://proper.epfl.ch
An expanded evaluation of protein function prediction methods shows an improvement in accuracy
Background: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. Results: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. Conclusions: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent.