82 research outputs found

    Cobweb/3: A portable implementation

    An algorithm is examined for data clustering and incremental concept formation. An overview is given of the Cobweb/3 system and the algorithm on which it is based, as well as the practical details of obtaining and running the system code. The implementation features a flexible user interface, including a graphical display of the concept hierarchies that the system constructs.
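Cobweb-family systems such as Cobweb/3 build their concept hierarchies incrementally by choosing, at each node, the restructuring operation that maximises category utility over nominal attributes. As a hedged illustration of that scoring function only (the toy instances and attribute names are invented, and this is not the Cobweb/3 code itself):

```python
from collections import Counter

def category_utility(clusters):
    """Category utility for a partition of nominal-attribute instances.

    `clusters` is a list of clusters; each cluster is a list of
    instances, each instance a dict {attribute: value}.
    CU = (1/K) * sum_k P(C_k) * sum_{a,v} [P(v|C_k)^2 - P(v)^2]
    """
    instances = [x for c in clusters for x in c]
    n = len(instances)
    attrs = instances[0].keys()

    def sq_prob_sum(group):
        # sum over attributes a and values v of P(v)^2 within `group`
        total = 0.0
        for a in attrs:
            counts = Counter(x[a] for x in group)
            total += sum((c / len(group)) ** 2 for c in counts.values())
        return total

    base = sq_prob_sum(instances)
    cu = sum(len(c) / n * (sq_prob_sum(c) - base) for c in clusters)
    return cu / len(clusters)

# Toy partition that separates instances cleanly by 'colour'
part = [
    [{"colour": "red", "size": "small"}, {"colour": "red", "size": "large"}],
    [{"colour": "blue", "size": "small"}, {"colour": "blue", "size": "large"}],
]
print(category_utility(part))
```

A higher score rewards partitions whose clusters make attribute values more predictable than they are in the data as a whole, which is the criterion an incremental learner of this family optimises at each insertion.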

    Data mining and database systems: integrating conceptual clustering with a relational database management system.

    Many clustering algorithms have been developed and improved over the years to cater for large-scale data clustering. However, much of this work has been in developing numeric-based algorithms that use efficient summarisations to scale to large data sets. There is a growing need for scalable categorical clustering algorithms as, although numeric-based algorithms can be adapted to categorical data, they do not always produce good results. This thesis presents a categorical conceptual clustering algorithm that can scale to large data sets using appropriate data summarisations. Data mining is distinguished from machine learning by the use of larger data sets that are often stored in database management systems (DBMSs). Many clustering algorithms require data to be extracted from the DBMS and reformatted for input to the algorithm. This thesis presents an approach that integrates conceptual clustering with a DBMS. The presented approach makes the algorithm main-memory independent and supports on-line data mining.
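The integration described above replaces tuple-by-tuple extraction with summaries computed inside the DBMS itself. A minimal sketch of one such summarisation, per-attribute value counts obtained with SQL aggregation; the table, columns, and use of in-memory SQLite are illustrative assumptions, not the thesis's actual schema:

```python
import sqlite3

# Hypothetical table of categorical records; in the setting described
# above the data would live in a full RDBMS rather than in-memory SQLite.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE records (colour TEXT, size TEXT)")
con.executemany("INSERT INTO records VALUES (?, ?)",
                [("red", "small"), ("red", "large"), ("blue", "small")])

def attribute_summary(con, table, column):
    """Value counts for one attribute, computed inside the DBMS.

    Summaries like these let a clustering algorithm work from
    aggregates instead of loading every tuple into main memory.
    (Table/column names here come from trusted code, not user input.)
    """
    rows = con.execute(
        f"SELECT {column}, COUNT(*) FROM {table} GROUP BY {column}")
    return dict(rows.fetchall())

print(attribute_summary(con, "records", "colour"))
```

Because only the aggregates cross the DBMS boundary, the memory footprint of the clustering step depends on the number of distinct values, not the number of tuples.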

    Advances in Single Molecule, Real-Time (SMRT) Sequencing

    PacBio’s single-molecule real-time (SMRT) sequencing technology offers important advantages over the short-read DNA sequencing technologies that currently dominate the market. These include exceptionally long read lengths (20 kb or more), unparalleled consensus accuracy, and the ability to sequence native, non-amplified DNA molecules. From fungi to insects to humans, long reads are now used to create highly accurate reference genomes by de novo assembly of genomic DNA and to obtain a comprehensive view of transcriptomes through the sequencing of full-length cDNAs. Besides reducing biases, sequencing native DNA also permits the direct measurement of DNA base modifications. SMRT sequencing has therefore become an attractive technology in many fields, such as agriculture, basic science, and medical research. The boundaries of SMRT sequencing are continuously being pushed by developments in bioinformatics and sample preparation. This book contains a collection of articles showcasing the latest developments and the breadth of applications enabled by SMRT sequencing technology.

    An exploration of methodologies to improve semi-supervised hierarchical clustering with knowledge-based constraints

    Clustering algorithms with constraints (also known as semi-supervised clustering algorithms) have been introduced to the field of machine learning as a significant variant of conventional unsupervised clustering algorithms. They have been shown to achieve better performance because they integrate prior knowledge during the clustering process, enabling relevant, useful information to be uncovered from the data being clustered. However, the development of semi-supervised hierarchical clustering techniques remains an open and active area of investigation. The majority of current semi-supervised clustering algorithms are partitional clustering (PC) methods, and only a few research efforts have addressed semi-supervised hierarchical clustering. The aim of this research is to enhance hierarchical clustering (HC) algorithms with prior knowledge, by adopting novel methodologies. [Continues.]
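Knowledge-based constraints in this setting are commonly expressed as must-link and cannot-link pairs. A hedged sketch of how such constraints can steer agglomerative (hierarchical) clustering; the single-linkage distance on 1-D points and all names are illustrative, not the methodologies proposed in the thesis:

```python
def constrained_agglomerative(points, k, must_link=(), cannot_link=()):
    """Single-linkage agglomerative clustering of 1-D points down to k
    clusters: must-link pairs are merged up front, and a merge is skipped
    when it would place a cannot-link pair in the same cluster."""
    clusters = [{i} for i in range(len(points))]

    def find(i):
        return next(c for c in clusters if i in c)

    # Enforce must-link constraints before any distance-based merging.
    for a, b in must_link:
        ca, cb = find(a), find(b)
        if ca is not cb:
            clusters.remove(cb)
            ca |= cb

    def violates(c):
        return any(a in c and b in c for a, b in cannot_link)

    def single_link(c1, c2):
        return min(abs(points[i] - points[j]) for i in c1 for j in c2)

    while len(clusters) > k:
        pairs = sorted(
            ((single_link(c1, c2), n1, n2)
             for n1, c1 in enumerate(clusters)
             for n2, c2 in enumerate(clusters) if n1 < n2),
            key=lambda t: t[0])
        for _, n1, n2 in pairs:
            if not violates(clusters[n1] | clusters[n2]):
                merged = clusters[n1] | clusters[n2]
                clusters = [c for i, c in enumerate(clusters)
                            if i not in (n1, n2)] + [merged]
                break
        else:
            break  # every remaining merge violates a constraint
    return clusters

pts = [0.0, 0.1, 0.2, 5.0, 5.1]
print(constrained_agglomerative(pts, 2, cannot_link=[(0, 2)]))
```

Without the cannot-link pair the three left-hand points would merge into one cluster; the constraint forces point 2 to join the right-hand group instead, which is the kind of knowledge-driven re-shaping the abstract describes.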

    Detecting mutually exclusive interactions in protein-protein interaction maps

    Proteins are responsible for an impressively large variety of functions. To properly understand the significance of protein-protein interactions in the cell, two problems must be addressed: first, identifying the different interactions involved in each biological function, and, second, determining how proteins interact and what the consequences of each interaction are. The identification of protein interactions by high-throughput experiments has led to the development of a number of methods for their analysis, producing a vast amount of interaction data in recent years. However, at least two issues arise from the analysis of such experimental maps: on one side, the significant number of false positives they contain and, on the other, the difficulty of distinguishing whether, when more than one protein interacts with the same partner, they can do so simultaneously, i.e. whether their interactions are mutually exclusive. The general strategy we describe combines known three-dimensional structures with protein-protein interaction networks to determine which of the multiple interactions made by a hub can occur in a mutually exclusive fashion. In such cases it identifies, whenever possible, the shared similarities in the partners' binding regions, concluding that their interactions must be mutually exclusive (i.e. not simultaneous) and that the region identified by similarity is indeed the interaction site. We applied this strategy to the interactomes of seven organisms and show that our methodology identifies mutually exclusive interactions with an accuracy higher than 77%.
The procedure also allows us to predict which residues are likely to lie in the binding interface of the nodes, and in a significant number of cases (between 63% and 75%) we correctly identify at least one of them (5 on average); this has obvious implications for reducing the search space in docking procedures. The coverage of the method varies substantially across organisms, as could be expected; it reaches 42% for human and more than 36% for yeast, averaging about 25%. These figures are bound to increase with time, thanks both to progress in experimental methods and, possibly, to the increasing reliability of modelling techniques. For this reason, we also introduce the Estrella server, which embodies this strategy. It is designed for users interested in validating specific hypotheses about the functional role of a protein-protein interaction, and it also provides access to pre-computed data for seven organisms.
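The heart of the test described above can be sketched as an overlap check on the hub residues each partner binds: if two partners use the same surface patch, they cannot bind at the same time. The residue sets and names below are invented for illustration; in the actual strategy they would be derived from three-dimensional structures:

```python
def mutually_exclusive(site_a, site_b, min_shared=3):
    """True when two binding sites on the same hub share at least
    `min_shared` residues, so the two partners could not bind the hub
    simultaneously. Sites are sets of residue numbers on the hub."""
    return len(set(site_a) & set(site_b)) >= min_shared

# Hypothetical hub with three partners and their binding residues
hub_sites = {
    "partnerA": {10, 11, 12, 13, 45},
    "partnerB": {11, 12, 13, 80},   # overlaps partnerA's patch
    "partnerC": {200, 201, 202},    # a distinct surface patch
}
print(mutually_exclusive(hub_sites["partnerA"], hub_sites["partnerB"]))
print(mutually_exclusive(hub_sites["partnerA"], hub_sites["partnerC"]))
```

Partners A and B compete for the same patch and are flagged mutually exclusive, while A and C use separate patches and could, by this test, bind simultaneously.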

    Qluster: An easy-to-implement generic workflow for robust clustering of health data

    The exploration of health data by clustering algorithms makes it possible to better describe the populations of interest by seeking the sub-profiles that compose them. This reinforces medical knowledge, whether about a disease or a targeted real-life population. Nevertheless, contrary to so-called conventional biostatistical methods, for which numerous guidelines exist, the standardization of data science approaches in clinical research remains a little-discussed subject. This results in significant variability in the execution of data science projects, whether in terms of the algorithms used or the reliability and credibility of the designed approach. Taking the path of a parsimonious and judicious choice of both algorithms and implementations at each stage, this article proposes Qluster, a practical workflow for performing clustering tasks. This workflow strikes a compromise between (1) genericity of application (e.g. usable on small or big data, on continuous, categorical or mixed variables, on high-dimensional databases or not), (2) ease of implementation (few packages, few algorithms, few parameters, ...), and (3) robustness (e.g. use of proven algorithms and robust packages, evaluation of the stability of clusters, management of noise and multicollinearity). The workflow can be easily automated and/or routinely applied to a wide range of clustering projects. It can be useful both for data scientists with little experience in the field, to make data clustering easier and more robust, and for more experienced data scientists looking for a straightforward, reliable solution with which to routinely perform preliminary data mining. A synthesis of the literature on data clustering, as well as the scientific rationale supporting the proposed workflow, is also provided. Finally, a detailed application of the workflow to a concrete use case is given, along with a practical discussion for data scientists.
An implementation on the Dataiku platform is available upon request to the authors.
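One workflow ingredient named above, evaluating the stability of clusters, can be sketched as re-clustering bootstrap resamples and scoring how often pairs of points stay together. The toy 1-D two-means routine and the co-assignment score below are illustrative assumptions, not Qluster's prescribed components:

```python
import random

def two_means_1d(xs, iters=20):
    """Trivial 1-D 2-means; returns a 0/1 label per point."""
    c0, c1 = min(xs), max(xs)
    for _ in range(iters):
        labels = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]
        g0 = [x for x, l in zip(xs, labels) if l == 0]
        g1 = [x for x, l in zip(xs, labels) if l == 1]
        if g0: c0 = sum(g0) / len(g0)
        if g1: c1 = sum(g1) / len(g1)
    return labels

def stability(xs, n_boot=50, seed=0):
    """Mean pairwise co-assignment agreement between the full-data
    clustering and clusterings of bootstrap resamples (1.0 = every pair
    of points is grouped the same way in every resample)."""
    rng = random.Random(seed)
    ref = two_means_1d(xs)
    agree = total = 0
    for _ in range(n_boot):
        idx = [rng.randrange(len(xs)) for _ in xs]
        boot = two_means_1d([xs[i] for i in idx])
        for a in range(len(idx)):
            for b in range(a + 1, len(idx)):
                same_ref = ref[idx[a]] == ref[idx[b]]
                same_boot = boot[a] == boot[b]
                agree += same_ref == same_boot
                total += 1
    return agree / total

xs = [0.1, 0.2, 0.3, 9.0, 9.1, 9.2]  # two well-separated groups
print(stability(xs))  # close to 1.0 for well-separated clusters
```

Comparing pair co-assignments rather than raw labels makes the score invariant to cluster relabelling across resamples, which is why scores of this kind are a common stability diagnostic.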