4,416 research outputs found

    The Minimum Description Length Principle for Pattern Mining: A Survey

    This is about the Minimum Description Length (MDL) principle applied to pattern mining. The length of this description is kept to the minimum. Mining patterns is a core task in data analysis and, beyond issues of efficient enumeration, the selection of patterns constitutes a major challenge. The MDL principle, a model selection method grounded in information theory, has been applied to pattern mining with the aim of obtaining compact, high-quality sets of patterns. After giving an outline of relevant concepts from information theory and coding, as well as of work on the theory behind the MDL and similar principles, we review MDL-based methods for mining various types of data and patterns. Finally, we open a discussion on some issues regarding these methods, and highlight currently active related data analysis problems.
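
    As a rough illustration of MDL as a model selection method, the sketch below scores a 0/1 sequence with a two-part code: the bits needed to state the model plus the bits needed to encode the data given that model. It is not taken from the survey. The Bernoulli model, the midpoint quantisation of its parameter, and the function names `data_bits`, `two_part_length` and `best_precision` are all assumptions made for this example.

```python
import math


def data_bits(data, p):
    """Shannon code length, in bits, of a 0/1 sequence under Bernoulli(p)."""
    return sum(-math.log2(p if x else 1.0 - p) for x in data)


def two_part_length(data, precision_bits):
    """Total length L(M) + L(D|M) when p is stored using `precision_bits` bits.

    Assumption for this sketch: p is quantised to the midpoint of one of
    2**precision_bits equal-width bins, so it is never exactly 0 or 1.
    """
    levels = 2 ** precision_bits
    freq = sum(data) / len(data)
    index = min(int(freq * levels), levels - 1)
    p = (index + 0.5) / levels
    return precision_bits + data_bits(data, p)


def best_precision(data, max_bits=8):
    """Pick the parameter precision that minimises the total description length."""
    return min(range(max_bits + 1), key=lambda b: two_part_length(data, b))


if __name__ == "__main__":
    skewed = [1] * 90 + [0] * 10
    for bits in range(5):
        print(bits, round(two_part_length(skewed, bits), 2))
    print("chosen precision:", best_precision(skewed))
```

    With zero precision bits the model is a fair coin and every symbol costs one bit. For a skewed sequence, spending a few bits on the parameter shortens the total, while for a balanced sequence the zero-bit model stays the cheapest. Pattern-set miners apply the same trade-off to a set of patterns instead of a single parameter.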

    A visual analytics approach to feature discovery and subspace exploration in protein flexibility matrices

    The vast amount of information generated by domain scientists makes the transition from data to knowledge difficult and often impedes important discoveries. For example, the knowledge gained from protein flexibility data sets can speed advances in genetic therapies and drug discovery. However, these models generate so much data that large scale analysis by traditional methods is almost impossible. This hinders biomedical advances. Visual analytics is a new field that can help alleviate this problem. Visual analytics attempts to seamlessly integrate human abilities in pattern recognition, domain knowledge, and synthesis with automatic analysis techniques. I propose a novel, visual analytics pipeline and prototype which eases discovery, comparison, and exploration in the outputs of complex computational biology datasets. The approach utilizes automatic feature extraction by image segmentation to locate regions of interest in the data, visually presents the features to users in an intuitive way, and provides rich interactions for multi-resolution visual exploration. Functionality is also provided for subspace exploration based on automatic similarity calculation and comparative visualizations. The effectiveness of feature discovery and subspace exploration is shown through a user study and user scenarios. Feedback from analysts confirms the suitability of the proposed solution to domain tasks.

    Subjectively interesting connecting trees and forests

    Consider a large graph or network, and a user-provided set of query vertices between which the user wishes to explore relations. For example, a researcher may want to connect research papers in a citation network, an analyst may wish to connect organized crime suspects in a communication network, or an internet user may want to organize their bookmarks given their location in the world wide web. A natural way to do this is to connect the vertices in the form of a tree structure that is present in the graph. However, in sufficiently dense graphs, most such trees will be large or somehow trivial (e.g. involving high degree vertices) and thus not insightful. Extending previous research, we define and investigate the new problem of mining subjectively interesting trees connecting a set of query vertices in a graph, i.e., trees that are highly surprising to the specific user at hand. Using information theoretic principles, we formalize the notion of interestingness of such trees mathematically, taking into account certain prior beliefs the user has specified about the graph. A remaining problem is efficiently fitting a prior belief model. We show how this can be done for a large class of prior beliefs. Given a specified prior belief model, we then propose heuristic algorithms to find the best trees efficiently. An empirical validation of our methods on large real graphs evaluates the different heuristics and validates the interestingness of the given trees.

    A foundation model for atomistic materials chemistry

    Machine-learned force fields have transformed the atomistic modelling of materials by enabling simulations of ab initio quality on unprecedented time and length scales. However, they are currently limited by: (i) the significant computational and human effort that must go into development and validation of potentials for each particular system of interest; and (ii) a general lack of transferability from one chemical system to the next. Here, using the state-of-the-art MACE architecture we introduce a single general-purpose ML model, trained on a public database of 150k inorganic crystals, that is capable of running stable molecular dynamics on molecules and materials. We demonstrate the power of the MACE-MP-0 model - and its qualitative and at times quantitative accuracy - on a diverse set of problems in the physical sciences, including the properties of solids, liquids, gases, chemical reactions, interfaces and even the dynamics of a small protein. The model can be applied out of the box and as a starting or "foundation model" for any atomistic system of interest and is thus a step towards democratising the revolution of ML force fields by lowering the barriers to entry. Comment: 119 pages, 63 figures, 37MB PD

    The Influence of Allostery Governing the Changes in Protein Dynamics Upon Substitution

    The focus of this research is to investigate the effects of allostery on the function/activity of an enzyme, human immunodeficiency virus type 1 (HIV-1) protease, using well-defined statistical analyses of the dynamic changes of the protein and variants with unique single point substitutions [1]. The experimental data [1] evaluated here only characterized HIV-1 protease with one of its potential target substrates. Probing the dynamic interactions of the residues of an enzyme and its variants can offer insight into the developmental importance of allosteric signaling and its connection to a protein’s function. The realignment of the secondary structure elements can modulate the mobility, the frequency of residue contacts, and which residues make contact with each other [2-5]. We postulate that if more contacts occur within a structure, the mobility is constrained, and therefore gaining novel contacts can negatively influence the function of a protein. The evolutionary importance of protein dynamics is probed by analyzing the residue positions possessing significant correlations and their relationship to the experimental information [1] (variant activities). We propose that the correlated dynamics of residues observed to have considerable correlations, if disrupted, can be used to infer the function of HIV-1 protease and its variants. Given the robustness of HIV-1 protease, the identification of any significant constraint imposed on the dynamics from a potential allosteric site found to disrupt the catalytic activity of the variant is not plainly evident. We also develop machine learning (ML) algorithms to predict the protein function/activity change caused by a single point substitution by using the dynamic cross-correlation (DCC) of each residue pair. Recognition of any substantial association between the dynamics of specific residues and allosteric communication or mechanism requires detailed examination of the dynamics of HIV-1 protease and its variants.
    We also explore the non-linear dependency between each pair of residues using Mutual Information (MI) and how it can influence the dynamics of HIV-1 protease and its variants. We suggest that if the residues of a protein receive more or less information than in the wild type (WT), it will adversely impact the function of the protein, and that this can be used to support the classification of a variant structure. Furthermore, using the MI of residues obtained from the MD simulations for the HIV-1 protease structure, we build an ML model to predict a protein’s change in function caused by a single point substitution. Effectively, the mobility, dynamics, and non-linear features tested in these studies are found to be useful towards the prediction of potentially drug resistant substitutions related to the catalytic efficiency of HIV-1 protease and the variants.
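
    To make the use of MI for non-linear dependencies more concrete, the sketch below computes a plug-in estimate of mutual information from paired discrete observations. It is an illustration only, not the authors' pipeline. It assumes each residue's motion has already been discretised into a small number of states per simulation frame, and the function name `mutual_information` and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete observations."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    # Sum p(x, y) * log2( p(x, y) / (p(x) * p(y)) ) over observed pairs only.
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


if __name__ == "__main__":
    # Hypothetical discretised displacements of two residues over three frames:
    # y depends on x only through its magnitude, so the linear correlation is zero.
    x = [-1, 0, 1]
    y = [1, 0, 1]
    print(round(mutual_information(x, y), 3))
```

    In this toy case a linear measure such as the correlation coefficient is zero, while the MI estimate is positive. That difference is the kind of non-linear dependency a correlation-based measure would miss. Plug-in estimates are biased for short trajectories, so a real analysis would need many frames or a bias correction.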