
    Maximal information component analysis: a novel non-linear network analysis method.

    Background: Network construction and analysis algorithms provide scientists with the ability to sift through high-throughput biological outputs, such as transcription microarrays, for small groups of genes (modules) that are relevant for further research. Most of these algorithms ignore the important role of non-linear interactions in the data, and the ability of genes to operate in multiple functional groups at once, despite clear evidence for both of these phenomena in observed biological systems. Results: We have created a novel co-expression network analysis algorithm that incorporates both of these principles by combining the information-theoretic association measure of the maximal information coefficient (MIC) with an Interaction Component Model. We evaluate the performance of this approach on two datasets collected from a large panel of mice, one from macrophages and the other from liver, by comparing the two measures based on a measure of module entropy, Gene Ontology (GO) enrichment, and scale-free topology (SFT) fit. Our algorithm outperforms a widely used co-expression analysis method, weighted gene co-expression network analysis (WGCNA), in the macrophage data, while returning comparable results in the liver dataset when using these criteria. We demonstrate that the macrophage data has more non-linear interactions than the liver dataset, which may explain the increased performance of our method, termed Maximal Information Component Analysis (MICA), in that case. Conclusions: In making our network algorithm more accurately reflect known biological principles, we are able to generate modules with improved relevance, particularly in networks with confounding factors such as gene-by-environment interactions.
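
A minimal pure-Python illustration of the abstract's premise (not the authors' MICA implementation; MIC itself searches over many binning grids rather than the single fixed grid used here): a deterministic parabola has near-zero Pearson correlation but clearly non-zero mutual information, which is why an information-theoretic association measure can recover non-linear gene-gene relationships that linear correlation misses.

```python
import math

def pearson(xs, ys):
    # Standard Pearson correlation coefficient.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mutual_information(xs, ys, bins=4):
    # Histogram-based mutual-information estimate in bits, on one fixed grid.
    def bin_of(v, lo, hi):
        return min(int((v - lo) / (hi - lo) * bins), bins - 1)
    n = len(xs)
    lox, hix, loy, hiy = min(xs), max(xs), min(ys), max(ys)
    joint, px, py = {}, {}, {}
    for x, y in zip(xs, ys):
        i, j = bin_of(x, lox, hix), bin_of(y, loy, hiy)
        joint[i, j] = joint.get((i, j), 0) + 1
        px[i] = px.get(i, 0) + 1
        py[j] = py.get(j, 0) + 1
    return sum(c / n * math.log2(c / n / (px[i] / n * py[j] / n))
               for (i, j), c in joint.items())

xs = [i / 50 - 1.0 for i in range(101)]   # x uniform on [-1, 1]
ys = [x * x for x in xs]                  # deterministic non-linear relation
print(abs(pearson(xs, ys)))               # near zero: linear measure is blind
print(mutual_information(xs, ys))         # clearly positive dependence
```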

    PETA: Evaluating the Impact of Protein Transfer Learning with Sub-word Tokenization on Downstream Applications

    Large protein language models are adept at capturing the underlying evolutionary information in primary structures, offering significant practical value for protein engineering. Compared to natural language models, protein amino acid sequences have a smaller data volume and a limited combinatorial space. Choosing an appropriate vocabulary size to optimize the pre-trained model is a pivotal issue. Moreover, despite the wealth of benchmarks and studies in the natural language community, there remains a lack of a comprehensive benchmark for systematically evaluating protein language model quality. Given these challenges, PETA trained language models with 14 different vocabulary sizes under three tokenization methods. It conducted thousands of tests on 33 diverse downstream datasets to assess the models' transfer learning capabilities, incorporating two classification heads and three random seeds to mitigate potential biases. Extensive experiments indicate that vocabulary sizes between 50 and 200 optimize the model, whereas sizes exceeding 800 detrimentally affect the model's representational performance. Our code, model weights, and datasets are available at https://github.com/ginnm/ProteinPretraining. Comment: 46 pages, 4 figures, 9 tables.
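
The sub-word tokenization axis the paper varies can be made concrete with a toy byte-pair-encoding-style merge learner on amino-acid strings. This is a generic sketch, not PETA's tokenizer, and the two sequences are invented for illustration: each round merges the most frequent adjacent token pair, growing the vocabulary beyond single residues.

```python
from collections import Counter

def learn_merges(seqs, n_merges):
    # Start from single-character (single-residue) tokens.
    merges = []
    toks = [list(s) for s in seqs]
    for _ in range(n_merges):
        # Count all adjacent token pairs across the corpus.
        pairs = Counter()
        for t in toks:
            pairs.update(zip(t, t[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        # Replace every occurrence of the winning pair with the merged token.
        new_toks = []
        for t in toks:
            out, i = [], 0
            while i < len(t):
                if i + 1 < len(t) and t[i] == a and t[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(t[i])
                    i += 1
            new_toks.append(out)
        toks = new_toks
    return merges, toks

seqs = ["MKVLAAKVL", "AKVLMKV"]          # toy amino-acid sequences
merges, toks = learn_merges(seqs, 2)
print(merges)    # learned sub-words: ['KV', 'KVL']
print(toks[0])   # ['M', 'KVL', 'A', 'A', 'KVL']
```

Varying `n_merges` plays the role of the vocabulary-size knob the paper studies: too few merges leave long sequences of single residues, while too many produce rare, over-specific tokens.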

    Hierarchical Dirichlet Process-Based Models For Discovery of Cross-species Mammalian Gene Expression

    An important research problem in computational biology is the identification of expression programs, sets of co-activated genes orchestrating physiological processes, and the characterization of the functional breadth of these programs. The use of mammalian expression data compendia for discovery of such programs presents several challenges, including: 1) cellular inhomogeneity within samples, 2) genetic and environmental variation across samples, and 3) uncertainty in the numbers of programs and sample populations. We developed GeneProgram, a new unsupervised computational framework that uses expression data to simultaneously organize genes into overlapping programs and tissues into groups to produce maps of inter-species expression programs, which are sorted by generality scores that exploit the automatically learned groupings. Our method addresses each of the above challenges by using a probabilistic model that: 1) allocates mRNA to different expression programs that may be shared across tissues, 2) is hierarchical, treating each tissue as a sample from a population of related tissues, and 3) uses Dirichlet Processes, a non-parametric Bayesian method that provides prior distributions over numbers of sets while penalizing model complexity. Using real gene expression data, we show that GeneProgram outperforms several popular expression analysis methods in recovering biologically interpretable gene sets. From a large compendium of mouse and human expression data, GeneProgram discovers 19 tissue groups and 100 expression programs active in mammalian tissues. Our method automatically constructs a comprehensive, body-wide map of expression programs and characterizes their functional generality. This map can be used for guiding future biological experiments, such as discovery of genes for new drug targets that exhibit minimal "cross-talk" with unintended organs, or genes that maintain general physiological responses that go awry in disease states. Further, our method is general, and can be applied readily to novel compendia of biological data.
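
The Dirichlet Process prior the abstract relies on can be illustrated through its standard sampling view, the Chinese restaurant process: each new sample joins an existing group with probability proportional to that group's size, or opens a new group with probability proportional to a concentration parameter, so the number of groups is learned rather than fixed. A minimal sketch (illustrative only, not GeneProgram's model; `alpha` and the seed are arbitrary choices):

```python
import random

def crp_assign(n, alpha, rng):
    # counts[t] = number of samples already assigned to group t
    counts = []
    for i in range(n):
        # existing group chosen with weight = its size; new group with weight alpha
        weights = counts + [alpha]
        r = rng.random() * sum(weights)
        acc = 0.0
        for t, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if t == len(counts):
            counts.append(1)   # a brand-new group is opened
        else:
            counts[t] += 1     # an existing group grows
    return counts

rng = random.Random(0)
counts = crp_assign(200, alpha=2.0, rng=rng)
print(len(counts), sum(counts))  # number of groups grew with the data
```

Larger `alpha` yields more, smaller groups; the model complexity penalty mentioned in the abstract corresponds to new groups becoming ever less likely as existing ones accumulate members.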

    Field-control, phase-transitions, and life's emergence

    Instances of critical-like characteristics in living systems at each organizational level, as well as the spontaneous emergence of computation (Langton), indicate the relevance of self-organized criticality (SOC). But extrapolating complex bio-systems to life's origins brings up a paradox: how could simple organics--lacking the 'soft matter' response properties of today's bio-molecules--have dissipated energy from primordial reactions in a controlled manner for their 'ordering'? Nevertheless, a causal link of life's macroscopic irreversible dynamics to the microscopic reversible laws of statistical mechanics is indicated via the 'functional-takeover' of a soft magnetic scaffold by organics (cf. Cairns-Smith's 'crystal-scaffold'). A field-controlled structure offers a mechanism for bootstrapping--bottom-up assembly with top-down control: its super-paramagnetic components obey reversible dynamics, but its dissipation of H-field energy for aggregation breaks time-reversal symmetry. The responsive adjustments of the controlled (host) mineral system to environmental changes would bring about mutual coupling between random organic sets supported by it; here the generation of long-range correlations within organic (guest) networks could include SOC-like mechanisms. And such cooperative adjustments enable the selection of the functional configuration by altering the inorganic network's capacity to assist a spontaneous process. A non-equilibrium dynamics could now drive the kinetically-oriented system towards a series of phase-transitions with appropriate organic replacements 'taking-over' its functions. Comment: 54 pages, pdf file.

    Delegated causality of complex systems

    A notion of delegated causality is introduced here. This subtle kind of causality is dual to interventional causality. Delegated causality elucidates the causal role of dynamical systems at the “edge of chaos”, explicates evident cases of downward causation, and relates emergent phenomena to Gödel’s incompleteness theorem. Rich implications are noted in biology and Chinese philosophy. The perspective of delegated causality supports cognitive interpretations of self-organization and evolution.

    TOSNet: a topic-based optimal subnetwork identification in academic networks

    Subnetwork identification plays a significant role in analyzing, managing, and comprehending the structure and functions of big networks. Numerous approaches have been proposed to solve the problem of subnetwork identification as well as community detection. Most of the methods focus on detecting communities by considering node attributes, edge information, or both. This study focuses on discovering subnetworks containing researchers with similar or related areas of interest or research topics. A topic-aware subnetwork identification is essential to discover potential researchers on particular research topics and provide quality work. Thus, we propose a topic-based optimal subnetwork identification approach (TOSNet). Based on some fundamental characteristics, this paper addresses the following problems: 1) How to discover topic-based subnetworks with a vigorous collaboration intensity? 2) How to rank the discovered subnetworks and single out one optimal subnetwork? We evaluate the performance of the proposed method against baseline methods by adopting the modularity measure, assess the accuracy based on the size of the identified subnetworks, and check the scalability for different sizes of benchmark networks. The experimental findings indicate that our approach shows excellent performance in identifying contextual subnetworks that maintain intensive collaboration amongst researchers for a particular research topic. © 2020 Institute of Electrical and Electronics Engineers Inc. All rights reserved.
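
The modularity measure used for evaluation has a simple closed form: the fraction of edges falling inside communities minus the fraction expected if edges were rewired at random while preserving node degrees. A small self-contained sketch on a toy graph (not TOSNet's pipeline):

```python
def modularity(edges, community):
    # Newman's modularity Q for an undirected graph.
    # edges: list of (u, v) pairs; community: dict node -> group label.
    m = len(edges)
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    # observed fraction of intra-community edges
    inside = sum(1 for u, v in edges if community[u] == community[v]) / m
    # expected fraction under the degree-preserving null model
    totals = {}
    for node, d in deg.items():
        c = community[node]
        totals[c] = totals.get(c, 0) + d
    expected = sum((t / (2 * m)) ** 2 for t in totals.values())
    return inside - expected

# two triangles bridged by a single edge, grouped as the two triangles
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
community = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(round(modularity(edges, community), 3))  # → 0.357
```

Values near 0 mean the grouping is no better than chance; values approaching 1 indicate dense intra-group and sparse inter-group connectivity, which is what the paper's collaboration subnetworks aim for.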

    Manifold Learning in Atomistic Simulations: A Conceptual Review

    Analyzing large volumes of high-dimensional data requires dimensionality reduction: finding meaningful low-dimensional structures hidden in their high-dimensional observations. Such practice is needed in atomistic simulations of complex systems where even thousands of degrees of freedom are sampled. An abundance of such data makes gaining insight into a specific physical problem strenuous. Our primary aim in this review is to focus on unsupervised machine learning methods that can be used on simulation data to find a low-dimensional manifold providing a collective and informative characterization of the studied process. Such manifolds can be used for sampling long-timescale processes and free-energy estimation. We describe methods that can work on datasets from standard and enhanced sampling atomistic simulations. Unlike recent reviews on manifold learning for atomistic simulations, we consider only methods that construct low-dimensional manifolds based on Markov transition probabilities between high-dimensional samples. We discuss these techniques from a conceptual point of view, including their underlying theoretical frameworks and possible limitations.
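
The review's unifying object, a Markov transition matrix built from pairwise similarities between high-dimensional samples, can be sketched in a few lines. This is a generic diffusion-map-style construction under an assumed Gaussian kernel (the bandwidth `eps` and the sample curve are arbitrary), not any specific method from the review:

```python
import math

def transition_matrix(points, eps=0.5):
    # Gaussian kernel on pairwise squared distances, row-normalized so each
    # row is a probability distribution over "hops" to the other samples.
    def k(p, q):
        d2 = sum((a - b) ** 2 for a, b in zip(p, q))
        return math.exp(-d2 / eps)
    rows = []
    for p in points:
        w = [k(p, q) for q in points]
        s = sum(w)
        rows.append([v / s for v in w])
    return rows

# samples along a curve embedded in 3-D: the underlying manifold is 1-D
pts = [(math.cos(t), math.sin(t), t) for t in (0.0, 0.3, 0.6, 0.9, 1.2)]
P = transition_matrix(pts)
print(all(abs(sum(row) - 1.0) < 1e-9 for row in P))  # valid Markov rows
print(P[0][1] > P[0][4])  # nearby samples are far more reachable than distant ones
```

Methods of the kind the review covers then diagonalize (or otherwise analyze) `P`; its slowly-decaying eigenvectors serve as the low-dimensional collective coordinates.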

    High content imaging of unbiased chemical perturbations reveals that the phenotypic plasticity of the actin cytoskeleton is constrained

    Although F-actin has a large number of binding partners and regulators, the number of phenotypic states available to the actin cytoskeleton is unknown. Here, we quantified 74 features defining filamentous actin (F-actin) and cellular morphology in >25 million cells after treatment with a library of 114,400 structurally diverse compounds. After reducing the dimensionality of these data, only ∌25 recurrent F-actin phenotypes emerged, each defined by distinct quantitative features that could be machine learned. We identified 2,003 unknown compounds as inducers of actin-related phenotypes, including two that directly bind the focal adhesion protein, talin. Moreover, we observed that compounds with distinct molecular mechanisms could induce equivalent phenotypes and that initially divergent cellular responses could converge over time. These findings suggest a conceptual parallel between the actin cytoskeleton and gene regulatory networks, where the theoretical plasticity of interactions is nearly infinite, yet phenotypes in vivo are constrained into a limited subset of practicable configurations.

    New approaches for unsupervised transcriptomic data analysis based on Dictionary learning

    The era of high-throughput data generation enables new access to biomolecular profiles and exploitation thereof. However, the analysis of such biomolecular data, for example, transcriptomic data, suffers from the so-called "curse of dimensionality". This occurs in the analysis of datasets with a significantly larger number of variables than data points. As a consequence, overfitting and unintentional learning of process-independent patterns can appear. This can lead to insignificant results in the application. A common way of counteracting this problem is the application of dimension reduction methods and subsequent analysis of the resulting low-dimensional representation that has a smaller number of variables. In this thesis, two new methods for the analysis of transcriptomic datasets are introduced and evaluated. Our methods are based on the concepts of Dictionary learning, which is an unsupervised dimension reduction approach. Unlike many dimension reduction approaches that are widely applied for transcriptomic data analysis, Dictionary learning does not impose constraints on the components that are to be derived. This allows for great flexibility when adjusting the representation to the data. Further, Dictionary learning belongs to the class of sparse methods. The result of sparse methods is a model with few non-zero coefficients, which is often preferred for its simplicity and ease of interpretation. Sparse methods exploit the fact that the analysed datasets are highly structured. Indeed, transcriptomic data are particularly structured, owing, for example, to the connections between genes and pathways. Nonetheless, the application of Dictionary learning in medical data analysis is mainly restricted to image analysis. Another advantage of Dictionary learning is that it is an interpretable approach. Interpretability is a necessity in biomolecular data analysis to gain a holistic understanding of the investigated processes.
Our two new transcriptomic data analysis methods are each designed for one main task: (1) identification of subgroups for samples from mixed populations, and (2) temporal ordering of samples from dynamic datasets, also referred to as "pseudotime estimation". Both methods are evaluated on simulated and real-world data and compared to other methods that are widely applied in transcriptomic data analysis. Our methods deliver high performance and overall outperform the comparison methods.
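
Dictionary learning alternates two steps: updating the dictionary atoms and sparse-coding each sample against them. The sparse-coding step can be sketched with greedy matching pursuit. This is a generic sketch under a toy orthonormal dictionary (both the dictionary and the "expression profile" are invented), not the thesis's own algorithm:

```python
def matching_pursuit(signal, atoms, n_nonzero=2):
    # Greedily pick the atom most correlated with the residual, record its
    # coefficient, subtract its contribution, and repeat n_nonzero times.
    # Atoms are assumed unit-norm.
    residual = list(signal)
    coef = [0.0] * len(atoms)
    for _ in range(n_nonzero):
        scores = [sum(r * a for r, a in zip(residual, atom)) for atom in atoms]
        best = max(range(len(atoms)), key=lambda i: abs(scores[i]))
        coef[best] += scores[best]
        residual = [r - scores[best] * a
                    for r, a in zip(residual, atoms[best])]
    return coef, residual

# toy sample built from 2 of 3 orthonormal atoms
atoms = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0]]
signal = [3.0, 0.0, -2.0, 0.0]
coef, residual = matching_pursuit(signal, atoms)
print(coef)  # → [3.0, 0.0, -2.0]: only two non-zero coefficients
```

The few-non-zero-coefficients output is exactly the sparsity property the text highlights: each sample is explained by a small, interpretable subset of dictionary components.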
    • 
