7,887 research outputs found

    Computing and visually analyzing mutual information in molecular co-evolution

    Background: Selective pressure in molecular evolution leads to uneven distributions of amino acids and nucleotides. In fact, one observes correlations among such constituents due to a large number of biophysical mechanisms (folding properties, electrostatics, ...). To quantify these correlations, the mutual information, after proper normalization, has proven most effective. The challenge is to navigate the large amount of data, which in a study for a typical protein cannot simply be plotted. Results: To visually analyze mutual information we developed a matrix visualization tool that allows different views on the mutual information matrix: filtering, sorting, and weighting are among them. The user can interactively navigate a huge matrix in real time and search, e.g., for patterns and unusually high or low values. The mutual information matrix can be computed for a sequence alignment in FASTA format. The respective stand-alone program additionally computes proper normalizations for a null model of neutral evolution and maps the mutual information to Z-scores with respect to the null model. Conclusions: The new tool allows users to compute and visually analyze sequence data for possible co-evolutionary signals. The tool has already been successfully employed in evolutionary studies on HIV1 protease and acetylcholinesterase. The functionality of the tool was defined by users using the tool in real-world research. The software can also be used for visual analysis of other matrix-like data, such as information obtained by DNA microarray experiments. The package is platform-independently implemented in Java and free for academic use under a GPL license.
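As a rough illustration of the quantity this abstract refers to, the following minimal sketch computes the mutual information between two columns of a sequence alignment. It is not the tool's implementation: the function name `mutual_information` and the toy columns are hypothetical, and the normalization and null-model Z-scores mentioned in the abstract are not covered.

```python
from collections import Counter
from math import log2


def mutual_information(col_x, col_y):
    """Mutual information (in bits) between two alignment columns.

    Each column is a sequence of symbols, one per aligned sequence.
    Probabilities are estimated from raw frequencies (no pseudocounts).
    """
    if len(col_x) != len(col_y):
        raise ValueError("columns must contain the same number of sequences")
    n = len(col_x)
    if n == 0:
        raise ValueError("columns must not be empty")

    count_x = Counter(col_x)               # marginal counts for column x
    count_y = Counter(col_y)               # marginal counts for column y
    count_xy = Counter(zip(col_x, col_y))  # joint counts for symbol pairs

    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with counts
        mi += (c / n) * log2(c * n / (count_x[a] * count_y[b]))
    return mi


if __name__ == "__main__":
    # Hypothetical toy columns: the first pair covaries perfectly,
    # the second pair is statistically independent.
    print(mutual_information("AACC", "GGTT"))
    print(mutual_information("AACC", "GTGT"))
```

Applying the function to every pair of columns would give the kind of mutual information matrix the abstract describes; raw frequency estimates like these are biased for small alignments, which is one reason a normalization against a null model is useful.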

    An Empirically Derived Three-Dimensional Laplace Resonance in the Gliese 876 Planetary System

    We report constraints on the three-dimensional orbital architecture for all four planets known to orbit the nearby M dwarf Gliese 876 based solely on Doppler measurements and demanding long-term orbital stability. Our dataset incorporates publicly available radial velocities taken with the ELODIE and CORALIE spectrographs, HARPS, and Keck HIRES as well as previously unpublished HIRES velocities. We first quantitatively assess the validity of the planets thought to orbit GJ 876 by computing the Bayes factors for a variety of different coplanar models using an importance sampling algorithm. We find that a four-planet model is preferred over a three-planet model. Next, we apply a Newtonian MCMC algorithm to perform a Bayesian analysis of the planet masses and orbits using an n-body model in three-dimensional space. Based on the radial velocities alone, we find that a 99% credible interval provides upper limits on the mutual inclinations for the three resonant planets ($\Phi_{cb}<6.20^\circ$ for the "c" and "b" pair and $\Phi_{be}<28.5^\circ$ for the "b" and "e" pair). Subsequent dynamical integrations of our posterior sample find that the GJ 876 planets must be roughly coplanar ($\Phi_{cb}<2.60^\circ$ and $\Phi_{be}<7.87^\circ$), suggesting the amount of planet-planet scattering in the system has been low. We investigate the distribution of the respective resonant arguments of each planet pair and find that at least one argument for each planet pair and the Laplace argument librate. The libration amplitudes in our three-dimensional orbital model support the idea of the outer three planets having undergone significant past disk migration. Comment: 19 pages, 11 figures, 8 tables. Accepted to MNRAS. Posterior samples available at https://github.com/benelson/GJ87

    Computational Molecular Coevolution

    A major goal in computational biochemistry is to obtain three-dimensional structure information from protein sequence. Coevolution represents a biological mechanism through which structural information can be obtained from a family of protein sequences. Evolutionary relationships within a family of protein sequences are revealed through sequence alignment. Statistical analyses of these sequence alignments reveal positions in the protein family that covary, and thus appear to be dependent on one another throughout the evolution of the protein family. These covarying positions are inferred to be coevolving via one of two biological mechanisms, both of which imply that coevolution is facilitated by inter-residue contact. Thus, high-quality multiple sequence alignments and robust coevolution-inferring statistics can produce structural information from sequence alone. This work characterizes the relationship between coevolution statistics and sequence alignments and highlights the implicit assumptions and caveats associated with coevolutionary inference. An investigation of sequence alignment quality and coevolutionary-inference methods revealed that such methods are very sensitive to the systematic misalignments discovered in public databases. However, repairing the misalignments in such alignments restores the predictive power of coevolution statistics. To overcome the sensitivity to misalignments, two novel coevolution-inferring statistics were developed that show increased contact prediction accuracy, especially in alignments that contain misalignments. These new statistics were developed into a suite of coevolution tools, the MIpToolset. Because systematic misalignments produce a distinctive pattern when analyzed by coevolution-inferring statistics, a new method for detecting systematic misalignments was created to exploit this phenomenon. 
This new method, called "local covariation", was used to analyze publicly available multiple sequence alignment databases. Local covariation detected putative misalignments in a database designed to benchmark sequence alignment software accuracy. Local covariation was incorporated into a new software tool, LoCo, which displays regions of potential misalignment during alignment editing and assists in their correction. This work represents advances in multiple sequence alignment creation and coevolutionary inference.

    Art/Sci Nexus, 9 Evenings Revisited

    Following the exhibition of Hybrid Bodies at KKW in 2016, Andrew Carnie and I were invited back to act as mentors to a group of young artists and scientists from all over Europe undertaking a week-long workshop designed to lead to new art/science collaborations. We were also invited to present the Hybrid Bodies project at a one-day public event preceding the workshop.

    A review of random matrix theory with an application to biological data

    Random matrix theory (RMT) is an area of study that has applications in a wide variety of scientific disciplines. The foundation of RMT is the analysis of the eigenvalue behavior of matrices. The eigenvalues of a random matrix (a matrix with stochastic entries) behave differently from the eigenvalues of a matrix with non-random properties. Studying this bifurcation of the eigenvalue behavior provides the means by which system-specific signals can be distinguished from randomness. In particular, RMT provides an algorithmic approach to objectively remove noise from matrices with embedded signals. Major advances in data acquisition capabilities have changed the way research is conducted in many fields. Biological sciences have been revolutionized with the advent of high-throughput techniques that enable genome-wide measurements and a systems-level approach to biology. These new techniques are very promising, yet they produce a massive influx of data, which presents unique data processing challenges. A major task researchers are confronted with is how to properly filter out inherent noise from the data without losing valuable information. Studies have shown that RMT is an effective method to objectively process biological data. In this thesis, the underpinnings of RMT are explained and the function of the RMT algorithm used for data filtering is described. A survey of network analysis tools is also included as a way to provide insight on how to begin a rigorous, mathematical analysis of networks. Furthermore, the results of applying the RMT algorithm to a set of miRNA data extracted from Bos taurus (domestic cow) are provided, along with an implementation of the resulting network in a network analysis tool. These preliminary results demonstrate the facility of RMT coupled with network analysis tools as a basis for biological discovery.

    Utilizing gene co-expression networks for comparative transcriptomic analyses

    The development of high-throughput technologies such as microarray and next-generation RNA sequencing (RNA-seq) has generated large amounts of transcriptomic data that can be used for comparative transcriptomics studies. Transcriptomes obtained from different species can reveal differentially expressed genes that underlie species-specific traits. They also have the potential to identify genes that have conserved gene expression patterns. However, differential expression alone does not provide information about how the genes relate to each other in terms of gene expression or whether groups of genes are correlated in similar ways across species, tissues, etc. This makes gene expression networks, such as co-expression networks, valuable for finding similarities or differences between genes based on their relationships with other genes. The desired outcome of this research was to develop methods for comparative transcriptomics, specifically for comparing gene co-expression networks (GCNs), either within or between any set of organisms. These networks represent genes as nodes, and pairs of genes may be connected by an edge representing the strength of the relationship between the pair. We begin with a review of currently available techniques that can be used or adapted to compare gene co-expression networks. We also work to systematically determine the appropriate number of samples needed to construct reproducible gene co-expression networks for comparison purposes. In order to systematically compare these replicate networks, software to visualize the relationship between replicate networks was created to determine when the consistency of the networks begins to plateau and whether this is affected by factors such as tissue type and sample size. 
Finally, we developed a tool called Juxtapose that utilizes gene embedding to functionally interpret the commonalities and differences between a given set of co-expression networks constructed using transcriptome datasets from various organisms. A set of transcriptome datasets was utilized from publicly available sources as well as from collaborators. GTEx and Gene Expression Omnibus (GEO) RNA-seq datasets were used for the evaluation of the techniques proposed in this research. Skeletal cell datasets of closely related species and more evolutionarily distant organisms were also analyzed to investigate the evolutionary relationships of several skeletal cell types. We found evidence that data characteristics such as tissue origin, as well as the method used to construct gene co-expression networks, can substantially impact the number of samples required to generate reproducible networks. In particular, if a threshold is used to construct a gene co-expression network for downstream analyses, the number of samples used to construct the networks is an important consideration, as many samples may be required to generate networks that have a reproducible edge order when sorted by edge weight. We also demonstrated the capabilities of our proposed method for comparing GCNs, Juxtapose, showing that it is capable of consistently matching up genes in identical networks and that it reflects the similarity between different networks using cosine distance as a measure of gene similarity. Finally, we applied our proposed method to skeletal cell networks, found evidence of conserved gene relationships within skeletal GCNs from the same species, and identified modules of genes with similar embeddings across species that are enriched for biological processes involved in cartilage and osteoblast development. Furthermore, smaller sub-networks of genes reflect the phylogenetic relationships of the species analyzed using our gene embedding strategy to compare the GCNs. 
This research has produced methodologies and tools that can be used for evolutionary studies and that generalize to scenarios other than cross-species comparisons, including co-expression network comparisons across tissues or conditions within the same species.
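The abstract above names cosine distance as its measure of gene similarity between embeddings. A minimal sketch of that measure follows, assuming each gene is represented by a plain list of floats; the function name `cosine_distance` and the example vectors are hypothetical and are not taken from Juxtapose.

```python
import math


def cosine_distance(u, v):
    """Cosine distance, 1 - cos(angle), between two embedding vectors.

    0 means the vectors point in the same direction, 1 means they are
    orthogonal, and 2 means they point in opposite directions.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical two-dimensional embeddings of the same gene in two networks.
    gene_in_network_a = [1.0, 2.0]
    gene_in_network_b = [2.0, 4.0]
    print(cosine_distance(gene_in_network_a, gene_in_network_b))
```

Because the measure depends only on direction, embeddings that differ by a positive scale factor are treated as identical.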

    A Tale of Two Approaches: Comparing Top-Down and Bottom-Up Strategies for Analyzing and Visualizing High-Dimensional Data

    The proliferation of high-throughput and sensory technologies in various fields has led to a considerable increase in data volume, complexity, and diversity. Traditional data storage, analysis, and visualization methods are struggling to keep pace with the growth of modern data sets, necessitating innovative approaches to overcome the challenges of managing, analyzing, and visualizing data across various disciplines. One such approach is utilizing novel storage media, such as deoxyribonucleic acid (DNA), which presents an efficient, stable, compact, and energy-saving storage option. Researchers are exploring the potential use of DNA as a storage medium for long-term storage of significant cultural and scientific materials. In addition to novel storage media, scientists are also focussing on developing new techniques that can integrate multiple data modalities and leverage machine learning algorithms to identify complex relationships and patterns in vast data sets. These newly developed data management and analysis approaches have the potential to unlock previously unknown insights into various phenomena and to facilitate more effective translation of basic research findings to practical and clinical applications. Addressing these challenges necessitates different problem-solving approaches. Researchers are developing novel tools and techniques that require different viewpoints. Top-down and bottom-up approaches are essential techniques that offer valuable perspectives for managing, analyzing, and visualizing complex high-dimensional multi-modal data sets. This cumulative dissertation explores the challenges associated with handling such data and highlights top-down, bottom-up, and integrated approaches that are being developed to manage, analyze, and visualize this data. The work is conceptualized in two parts, each reflecting one of the two problem-solving approaches and its uses in published studies. 
The proposed work showcases the importance of understanding both approaches, the steps of reasoning about the problem within them, and their concretization and application in various domains.