1,157 research outputs found

    Structures in High-Dimensional Data: Intrinsic Dimension and Cluster Analysis

    With today's improved measurement and data-storage technologies it has become common to collect data in search of hypotheses rather than for testing them---to do exploratory data analysis. Finding patterns and structures in data is the main goal. This thesis deals with two kinds of structures that can convey relationships between different parts of data in a high-dimensional space: manifolds and clusters. They are in a way opposites of each other: a manifold structure shows that it is plausible to connect two distant points through the manifold; a clustering shows that it is plausible to separate two nearby points by assigning them to different clusters. But clusters and manifolds can also be the same: each cluster can be a manifold of its own.

    The first paper in this thesis concerns one specific aspect of a manifold structure, namely its dimension, also called the intrinsic dimension of the data. A novel estimator of intrinsic dimension, taking advantage of ``the curse of dimensionality'', is proposed and evaluated. It is shown to have, in general, less bias than estimators from the literature, and it can therefore better distinguish manifolds of different dimensions.

    The second and third papers in this thesis concern cluster analysis of data generated by flow cytometry---a high-throughput single-cell measurement technology. In this area, clustering is routinely performed by manual assignment of data in two-dimensional plots to identify cell populations. It is a tedious and subjective task, especially since the data often has four, eight, twelve, or even more dimensions, and the analyst needs to decide which two dimensions to look at together, and in which order.

    In the second paper of the thesis a new pipeline for automated cell population identification is proposed that can process multiple flow cytometry samples in parallel, using a hierarchical model that shares information between the clusterings of the samples, thus making corresponding clusters in different samples similar while allowing for variation in cluster location and shape.

    In the third and final paper of the thesis, statistical tests for unimodality are investigated as a tool for quality control of automated cell population identification algorithms. It is shown that the different tests embody different interpretations of unimodality and thus accept different kinds of clusters as sufficiently close to unimodal.
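    As an illustration of intrinsic-dimension estimation (not the novel estimator proposed in the thesis), the classical maximum-likelihood estimator of Levina and Bickel can be sketched in a few lines of numpy; it averages log-ratios of nearest-neighbour distances:

```python
import numpy as np

def mle_intrinsic_dimension(X, k=10):
    """Levina-Bickel MLE of intrinsic dimension (brute-force neighbours)."""
    # pairwise Euclidean distances, sorted per row
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    d.sort(axis=1)
    # column 0 is the zero self-distance; T_j = d[:, j] for j = 1..k
    Tk = d[:, [k]]                       # distance to the k-th neighbour
    log_ratios = np.log(Tk / d[:, 1:k])  # k-1 log ratios per point
    # per-point inverse-dimension estimates, averaged globally
    return 1.0 / log_ratios.mean(axis=1).mean()

# points on a 2-dimensional plane embedded in 5 dimensions:
# the estimate should land close to 2
rng = np.random.default_rng(0)
X = np.zeros((400, 5))
X[:, :2] = rng.random((400, 2))
est = mle_intrinsic_dimension(X, k=10)
```

    Such MLE-style estimators are known to be biased downward as the true dimension grows; reducing that bias is exactly the kind of improvement the first paper targets.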

    ENABLING TECHNIQUES FOR EXPRESSIVE FLOW FIELD VISUALIZATION AND EXPLORATION

    Flow visualization plays an important role in many scientific and engineering disciplines such as climate modeling, turbulent combustion, and automobile design. The most common method for flow visualization is to display integral flow lines, such as streamlines computed by particle tracing. Effective streamline visualization should capture flow patterns and display them with appropriate density, so that critical flow information can be acquired visually. In this dissertation, we present several approaches that facilitate expressive flow field visualization and exploration.

    First, we design a unified information-theoretic framework to model streamline selection and viewpoint selection as symmetric problems. Two interrelated information channels are constructed between a pool of candidate streamlines and a set of sample viewpoints. Based on these channels, we define streamline information and viewpoint information to select the best streamlines and viewpoints, respectively.

    Second, we present a focus+context framework that magnifies small features and reduces occlusion around them while compacting the context region in a full view. This framework partitions the volume into blocks and deforms them to guide streamline repositioning. The desired deformation is formulated as energy terms and achieved by minimizing the energy function.

    Third, measuring the similarity of integral curves is fundamental to many tasks such as feature detection, pattern querying, streamline clustering, and hierarchical exploration. We introduce FlowString, which extracts shape-invariant features from streamlines to form an alphabet of characters and encodes each streamline into a string. The similarity of two streamline segments then becomes a specially designed edit distance between two strings. Leveraging the suffix tree, FlowString provides a string-based method for exploratory streamline analysis and visualization. A universal alphabet is learned from multiple data sets to capture basic flow patterns that exist in a variety of flow fields, allowing easy comparison and efficient querying across data sets.

    Fourth, for the exploration of vascular data sets, which contain a series of vector fields together with multiple scalar fields, we design a web-based approach for users to investigate the relationships among different properties, guided by histograms. The vessel structure is mapped from the 3D volume space to a 2D graph, which allows more efficient interaction and effective visualization on websites. A segmentation scheme is proposed to divide the vessel structure based on a user-specified property, to further explore the distribution of that property over space.
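    FlowString's core idea---encode each streamline as a string over a learned alphabet and compare segments by edit distance---can be illustrated with a plain Levenshtein distance. (FlowString's actual measure is a specially weighted variant with character-similarity costs; this is the textbook version.)

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))              # dp[j] = distance(a[:i], b[:j])
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i           # prev holds the diagonal value
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                      # delete a[i-1]
                        dp[j - 1] + 1,                  # insert b[j-1]
                        prev + (a[i - 1] != b[j - 1]))  # substitute
            prev = cur
    return dp[n]

# the classic example: three edits turn "kitten" into "sitting"
print(edit_distance("kitten", "sitting"))  # -> 3
```

    In the FlowString setting the characters would encode local shape classes of a streamline, so a small edit distance between two encodings indicates geometrically similar segments.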

    Unsupervised learning on social data


    Machine learning methods for genomic high-content screen data analysis applied to deduce organization of endocytic network

    High-content screens are widely used to gain insight into the mechanistic organization of biological systems. Chemical and/or genomic interferences are used to modulate the molecular machinery; light microscopy and quantitative image analysis then yield a large number of parameters describing the phenotype. However, extracting functional information from such high-content datasets (e.g. links between cellular processes or functions of unknown genes) remains challenging. This work is devoted to the analysis of a multi-parametric image-based genomic screen of endocytosis, the process whereby cells take up cargoes (signals and nutrients) and distribute them into different subcellular compartments. The complexity of the quantitative endocytic data was approached using different machine learning techniques, namely clustering methods, Bayesian networks, principal and independent component analysis, and artificial neural networks. The main goal of the analysis is to predict possible modes of action of screened genes and to find candidate genes that may be involved in a process of interest. The degrees of freedom of the multidimensional phenotypic space were identified from the data distributions, and the high-content data were then deconvolved into separate signals from different cellular modules. Some of these basic signals (phenotypic traits) were straightforward to interpret in terms of known molecular processes; the other components pointed to interesting directions for further research. The phenotypic profiles of perturbations of individual genes are sparse in the coordinates of the basic signals and therefore intrinsically suggest their functional roles in cellular processes. Being a very fundamental process, endocytosis is specifically modulated by a variety of different pathways in the cell; endocytic phenotyping can therefore also be used to analyze non-endocytic modules in the cell. 
The proposed approach can also be generalized to the analysis of other high-content screens.

    Contents:
    Objectives
    Chapter 1 Introduction
      1.1 High-content biological data
        1.1.1 Different perturbation types for HCS
        1.1.2 Types of observations in HTS
        1.1.3 Goals and outcomes of MP HTS
        1.1.4 An overview of the classical methods of analysis of biological HT- and HCS data
      1.2 Machine learning for systems biology
        1.2.1 Feature selection
        1.2.2 Unsupervised learning
        1.2.3 Supervised learning
        1.2.4 Artificial neural networks
      1.3 Endocytosis as a system process
        1.3.1 Endocytic compartments and main players
        1.3.2 Relation to other cellular processes
    Chapter 2 Experimental and analytical techniques
      2.1 Experimental methods
        2.1.1 RNA interference
        2.1.2 Quantitative multiparametric image analysis
      2.2 Detailed description of the endocytic HCS dataset
        2.2.1 Basic properties of the endocytic dataset
        2.2.2 Control subset of genes
      2.3 Machine learning methods
        2.3.1 Latent variables models
        2.3.2 Clustering
        2.3.3 Bayesian networks
        2.3.4 Neural networks
    Chapter 3 Results
      3.1 Selection of labeled data for training and validation based on KEGG information about genes pathways
      3.2 Clustering of genes
        3.2.1 Comparison of clustering techniques on control dataset
        3.2.2 Clustering results
      3.3 Independent components as basic phenotypes
        3.3.1 Algorithm for identification of the best number of independent components
        3.3.2 Application of ICA on the full dataset and on separate assays of the screen
        3.3.3 Gene annotation based on revealed phenotypes
        3.3.4 Searching for genes with target function
      3.4 Bayesian network on endocytic parameters
        3.4.1 Prediction of pathway based on parameters values using Naïve Bayesian Classifier
        3.4.2 General Bayesian Networks
      3.5 Neural networks
        3.5.1 Autoencoders as nonlinear ICA
        3.5.2 siRNA sequence motives discovery with deep NN
      3.6 Biological results
        3.6.1 Rab11 ZNF-specific phenotype found by ICA
        3.6.2 Structure of BN revealed dependency between endocytosis and cell adhesion
    Chapter 4 Discussion
      4.1 Machine learning approaches for discovery of phenotypic patterns
        4.1.1 Functional annotation of unknown genes based on phenotypic profiles
        4.1.2 Candidate genes search
      4.2 Adaptation to other HCS data and generalization
    Chapter 5 Outlook and future perspectives
      5.1 Handling sequence-dependent off-target effects with neural networks
      5.2 Transition between machine learning and systems biology models
    Acknowledgements
    References
    Appendix
      A.1 Full list of cellular and endocytic parameters
      A.2 Description of independent components of the full dataset
      A.3 Description of independent components extracted from separate assays of the HC
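    The deconvolution of multi-parametric profiles into independent components (the ICA step of Chapter 3) can be sketched with a minimal deflationary FastICA using the tanh contrast function. This is a numpy-only illustration of the general technique, not the pipeline used in the work:

```python
import numpy as np

def fast_ica(X, n_components, n_iter=200, seed=0):
    """Deflationary FastICA with tanh contrast; X is (samples, features)."""
    X = X - X.mean(0)
    # whiten via SVD: Z has identity covariance
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    K = (Vt[:n_components] / S[:n_components, None]) * np.sqrt(X.shape[0])
    Z = X @ K.T
    rng = np.random.default_rng(seed)
    W = np.zeros((n_components, n_components))
    for i in range(n_components):
        w = rng.standard_normal(n_components)
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            g = np.tanh(Z @ w)
            # fixed-point update: E[Z g(w.Z)] - E[g'(w.Z)] w
            w_new = (Z * g[:, None]).mean(0) - (1 - g ** 2).mean() * w
            w_new -= W[:i].T @ (W[:i] @ w_new)   # decorrelate from earlier rows
            w_new /= np.linalg.norm(w_new)
            converged = abs(abs(w_new @ w) - 1) < 1e-8
            w = w_new
            if converged:
                break
        W[i] = w
    return Z @ W.T                               # estimated independent components

# demo: unmix two independent non-Gaussian sources from a linear mixture
rng = np.random.default_rng(1)
s = np.c_[rng.uniform(-1, 1, 2000), rng.laplace(size=2000)]
A = np.array([[1.0, 0.5], [0.3, 1.0]])
S_hat = fast_ica(s @ A.T, 2)
```

    Each recovered component should correlate strongly (up to sign and order) with one of the true sources; in the screen, such components correspond to separable phenotypic traits.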

    Community landscapes: an integrative approach to determine overlapping network module hierarchy, identify key nodes and predict network dynamics

    Background: Network communities support the functional organization and evolution of complex networks. However, the development of a method that is both fast and accurate, and that provides modular overlaps and partitions of a heterogeneous network, has proven rather difficult. Methodology/Principal Findings: Here we introduce the novel concept of ModuLand, an integrative method family that determines overlapping network modules as hills of an influence-function-based, centrality-type community landscape, and that includes several widely used modularization methods as special cases. As adaptations of the method family, we developed several algorithms that provide an efficient analysis of weighted and directed networks, and that (1) determine pervasively overlapping modules with high resolution; (2) uncover a detailed hierarchical network structure allowing an efficient, zoom-in analysis of large networks; (3) allow the determination of key network nodes; and (4) help to predict network dynamics. Conclusions/Significance: The concept opens a wide range of possibilities for developing new approaches and applications, including network routing, classification, comparison, and prediction. Comment: 25 pages with 6 figures and a Glossary + Supporting Information containing pseudo-codes of all algorithms used, 14 figures, 5 tables (with 18 module definitions, 129 different modularization methods, 13 module comparison methods) and 396 references. All algorithms can be downloaded from this web-site: http://www.linkgroup.hu/modules.ph
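    The "hills of a community landscape" picture can be caricatured in a few lines: give every node a height (here simply its weighted degree, a crude stand-in for ModuLand's influence functions) and let each node climb to its local maximum; nodes reaching the same peak share a module. Note this toy yields a crisp partition, whereas the actual method family builds a continuous landscape and produces overlapping modules:

```python
def hill_modules(adj):
    """Toy hill-climbing module assignment.

    adj: dict mapping node -> {neighbour: edge weight}.
    Returns dict mapping node -> its module centre (a local maximum)."""
    # 'height' of a node: its weighted degree
    height = {v: sum(nbrs.values()) for v, nbrs in adj.items()}

    def climb(v):
        # follow steepest ascent until no neighbour is strictly higher
        while True:
            up = max(adj[v], key=lambda u: height[u], default=v)
            if height[up] <= height[v]:
                return v
            v = up

    return {v: climb(v) for v in adj}

# two triangles joined by a bridge: each triangle forms one module
adj = {
    'a': {'b': 1, 'c': 1}, 'b': {'a': 1, 'c': 1},
    'c': {'a': 1, 'b': 1, 'd': 1},
    'd': {'c': 1, 'e': 1, 'f': 1},
    'e': {'d': 1, 'f': 1}, 'f': {'d': 1, 'e': 1},
}
mods = hill_modules(adj)
```

    On this graph the climb collapses each triangle onto its bridge node, giving two modules centred at 'c' and 'd'.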

    A holistic evaluation concept for long-term structural health monitoring

    [no abstract]

    On the edges of clustering


    New Fundamental Technologies in Data Mining

    The progress of data mining technology and its broad popularity have established a need for a comprehensive text on the subject. The series of books entitled "Data Mining" addresses this need by presenting in-depth descriptions of novel mining algorithms and many useful applications. Beyond a deep treatment of each topic, the two books offer useful hints and strategies for solving the problems discussed in the individual chapters. The contributing authors have highlighted many future research directions that will foster multi-disciplinary collaborations and hence lead to significant development in the field of data mining.

    Large dataset complexity reduction for classification: An optimization perspective

    Doctor of Philosophy
    Computational complexity in data mining is attributed to algorithms, but it lies largely with the data. Different algorithms may exist to solve the same problem, but the simplest is not always the best. At the same time, data of astronomical proportions is now common, boosted by automation, and the fuller the data, the better the resolution of the concept it projects. Paradoxically, it is computing power that is lacking: perhaps a fast algorithm can be run on the data, but not the optimal one. Even then, any modeling is much constrained, involving the serial application of many algorithms. The only other way to relieve the computational load is to make the data lighter. Any representative subset has to preserve the essence of the data and, ideally, suit any algorithm. The reduction should minimize the error of approximation, trading precision for performance. Data mining is a wide field; we concentrate on classification. In the literature review we present a variety of methods, emphasizing the efforts of the past decade. The two major objects of reduction are instances and attributes; the data can also be recast into a more economical format. We address sampling, noise reduction, class domain binarization, feature ranking, feature subset selection, feature extraction, and discretization of continuous features. Achievements are tremendous, but so are the possibilities. We improve an existing technique of data cleansing and suggest a way of data condensing as its extension. We also touch on noise reduction. Instance similarity, excepting the class mix, prompts a technique of feature selection. Additionally, we consider multivariate discretization, enabling a compact data representation without a change in size. We compare the proposed methods with alternative techniques, which we either introduce, implement, or use as available.
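    The "data condensing" direction can be illustrated with Hart's condensed nearest-neighbour rule, a classic instance-reduction technique (a generic sketch, not the improved method proposed in the thesis): keep only the instances a 1-NN classifier needs in order to label all the rest correctly.

```python
import numpy as np

def condense(X, y):
    """Hart's condensed nearest neighbour: return indices of a subset
    that lets a 1-NN rule classify every original instance correctly."""
    keep = [0]                            # seed with the first instance
    changed = True
    while changed:                        # repeat until the subset is stable
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            d = np.linalg.norm(X[keep] - X[i], axis=1)
            if y[keep][int(d.argmin())] != y[i]:
                keep.append(i)            # misclassified -> must be kept
                changed = True
    return np.array(keep)

# two well-separated classes: a handful of instances suffice
rng = np.random.default_rng(0)
X = np.r_[rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))]
y = np.r_[np.zeros(50, int), np.ones(50, int)]
kept = condense(X, y)
```

    The condensed set trades some boundary precision for a large drop in 1-NN cost, which is exactly the precision-for-performance trade-off the abstract describes.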

    Quality-Driven video analysis for the improvement of foreground segmentation

    Unpublished doctoral thesis, defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Tecnología Electrónica y de las Comunicaciones. Date of defense: 15-06-2018. This work was partially supported by the Spanish Government (TEC2014-53176-R, HAVideo).