Algorithmic and Statistical Perspectives on Large-Scale Data Analysis
In recent years, ideas from statistics and scientific computing have begun to
interact in increasingly sophisticated and fruitful ways with ideas from
computer science and the theory of algorithms to aid in the development of
improved worst-case algorithms that are useful for large-scale scientific and
Internet data analysis problems. In this chapter, I will describe two recent
examples---one having to do with selecting good columns or features from a (DNA
Single Nucleotide Polymorphism) data matrix, and the other having to do with
selecting good clusters or communities from a data graph (representing a social
or information network)---that drew on ideas from both areas and that may serve
as a model for exploiting complementary algorithmic and statistical
perspectives in order to solve applied large-scale data analysis problems.
Comment: 33 pages. To appear in Uwe Naumann and Olaf Schenk, editors, "Combinatorial Scientific Computing," Chapman and Hall/CRC Press, 201
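Column selection of the kind described above is often driven by statistical leverage scores. The sketch below is a minimal illustration of that idea, not the chapter's exact algorithm: it scores each column by its squared norm in the top-k right singular subspace and keeps the highest-scoring columns.

```python
import numpy as np

def leverage_score_columns(A, k, c):
    """Pick c columns of A with the largest rank-k statistical
    leverage scores (a common criterion for column/feature selection)."""
    # Top-k right singular vectors of A.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k, :]                       # shape (k, n_columns)
    # Leverage score of column j: squared norm of column j of Vk.
    scores = np.sum(Vk ** 2, axis=0)
    return np.argsort(scores)[::-1][:c]  # indices of the c top columns

# Toy example: a matrix whose first column dominates the spectrum.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 5))
A[:, 0] *= 100.0
cols = leverage_score_columns(A, k=2, c=2)
```

On this toy matrix the heavily scaled first column captures most of the top singular subspace, so it receives a leverage score near one and is selected.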
Seismic data clustering management system
This is the abstract of the paper given at the conference. Copyright @ 2011 The Authors.
Over the last years, seismic images have increasingly played a vital role in the study of earthquakes. The large volume of seismic data that has accumulated has created the need for sophisticated systems to manage this kind of data. Seismic interpretation can play a much more active role in the evaluation of large volumes of data by providing, at an early stage, vital information about the framework of potential producing levels [1]. This work presents a novel method for managing and analysing seismic data. The data are first turned into clustering maps using clustering techniques [2] [3] [4] [5] [6], so that they can be analysed on the platform. These clustering maps can then be explored through the user-friendly interface of Seismic 1, which is built on the .NET Framework [7]. This allows the application to run on any Windows-based computer, as well as on many Linux-based environments via the Mono project [8], so an application can run using No-Touch Deployment [7]. The platform supports two ways of processing seismic data. First, a fast multifunctional version of the classical region-growing segmentation algorithm [9], [10] is applied to areas of interest, permitting their precise definition and labelling. The algorithm also automatically allocates new earthquakes to a particular cluster based on the magnitude of the centre of gravity of the existing clusters, or creates a new cluster if all centres of gravity exceed a user-defined upper threshold. Second, a visual technique records the behaviour of a cluster of earthquakes in a designated area. In this way, the system functions as a dynamic temporal simulator that depicts sequences of earthquakes on a map [11]
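The allocation rule described above (join the nearest existing cluster, or open a new one when every centre of gravity is too far away) can be sketched as follows. The coordinates, threshold, and plain Euclidean distance are illustrative assumptions, not the system's actual implementation.

```python
import math

def assign_event(event, centroids, threshold):
    """Assign an earthquake (lat, lon) to the nearest cluster centroid,
    or start a new cluster when every centroid is farther than `threshold`.
    `centroids` is a list of (lat, lon) tuples; distance is Euclidean in
    degrees for simplicity (a real system would use geodesic distance)."""
    best_idx, best_dist = None, float("inf")
    for i, (clat, clon) in enumerate(centroids):
        d = math.hypot(event[0] - clat, event[1] - clon)
        if d < best_dist:
            best_idx, best_dist = i, d
    if best_dist > threshold:
        centroids.append(event)          # open a new cluster
        return len(centroids) - 1
    return best_idx

# Two hypothetical cluster centres; one nearby event, one distant event.
centroids = [(38.0, 23.7), (35.3, 25.1)]
idx_near = assign_event((38.1, 23.8), centroids, threshold=1.0)  # joins cluster 0
idx_new = assign_event((40.6, 22.9), centroids, threshold=1.0)   # opens cluster 2
```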
Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns in Data
BACKGROUND: A hierarchy, characterized by tree-like relationships, is a natural way of organizing data in many domains. In unsupervised machine learning routines such as clustering, a bottom-up (BU, agglomerative) hierarchical algorithm is used by default and is often the only method applied. METHODOLOGY/PRINCIPAL FINDINGS: We show that hierarchical clustering algorithms that involve global considerations, such as top-down (TD, divisive) or glocal (global-local) algorithms, are better suited to revealing meaningful patterns in the data. This is demonstrated by testing the correspondence between the results of several algorithms (TD, glocal and BU) and the correct annotations provided by experts. The correspondence was tested in multiple domains, including gene expression experiments, stock trade records and functional protein families. The performance of each algorithm is evaluated by statistical criteria assigned to clusters (nodes of the hierarchy tree) based on expert-labeled data. Whereas TD algorithms perform better on global patterns, BU algorithms perform well and are advantageous when a finer granularity of the data is sought. In addition, a novel TD algorithm based on the genuine density of the data points is presented and shown to outperform other divisive and agglomerative methods. Applying the algorithm to more than 500 protein sequences belonging to ion channels illustrates the potential of the method for inferring overlooked functional annotations. ClustTree, a graphical Matlab toolbox for applying various hierarchical clustering algorithms and testing their quality, is made available. CONCLUSIONS: Although currently rarely used, global approaches, in particular TD or glocal algorithms, should be considered in the exploratory process of clustering. In general, applying unsupervised clustering methods can improve the quality of manually created mappings of protein families. As demonstrated, it can also provide insight into erroneous and missed annotations
Tumor Classification Using High-Order Gene Expression Profiles Based on Multilinear ICA
Motivation. Independent Component Analysis (ICA) maximizes the statistical independence of the representational components of a training gene expression profile (GEP) ensemble, but it cannot distinguish relations between different factors, or different modes, and it is not applicable to high-order GEP data mining. To generalize ICA, we introduce Multilinear-ICA and apply it to tumor classification using high-order GEPs. First, we introduce the basic concepts and operations of tensors, and describe the Support Vector Machine (SVM) classifier and Multilinear-ICA. Second, the highest-scoring genes of the original high-order GEPs are selected using t-statistics and arranged into tensors. Third, Multilinear-ICA is applied to the tensors. Finally, the SVM is used to classify the tumor subtypes. Results. To show the validity of the proposed method, we apply it to tumor classification using high-order GEPs. Although we use only three datasets, the experimental results show that the method is effective and feasible. Through this work, we hope to offer some insight into the problem of high-order GEP tumor classification, in aid of developing more effective tumor classification algorithms
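The gene selection step above (rank genes by a two-sample t-statistic, keep the top scorers before arranging them into tensors) can be sketched as follows; the data, class sizes, and cutoff are illustrative assumptions.

```python
import numpy as np

def top_genes_by_t(X_a, X_b, n_keep):
    """Rank genes by a two-sample t-statistic between class A and class B
    expression matrices (samples x genes) and keep the n_keep genes with
    the largest absolute scores -- the selection step described above."""
    mean_a, mean_b = X_a.mean(axis=0), X_b.mean(axis=0)
    var_a, var_b = X_a.var(axis=0, ddof=1), X_b.var(axis=0, ddof=1)
    n_a, n_b = X_a.shape[0], X_b.shape[0]
    t = (mean_a - mean_b) / np.sqrt(var_a / n_a + var_b / n_b)
    return np.argsort(-np.abs(t))[:n_keep]

# Toy data: 8 samples per class, 6 genes, gene 2 strongly differential.
rng = np.random.default_rng(1)
X_a = rng.normal(0.0, 1.0, size=(8, 6))
X_b = rng.normal(0.0, 1.0, size=(8, 6))
X_b[:, 2] += 5.0
keep = top_genes_by_t(X_a, X_b, n_keep=2)
```

The selected genes would then be tabulated into a tensor (e.g. gene x sample x condition) before the Multilinear-ICA and SVM stages.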
A primer on correlation-based dimension reduction methods for multi-omics analysis
The continuing advances of omic technologies mean that it is now feasible to measure the numerous features that collectively reflect the molecular properties of a sample. When multiple omic methods are used, statistical and computational approaches can exploit these large, connected profiles. Multi-omics is the integration of different omic data sources from the same biological sample. In this review, we focus on correlation-based dimension reduction approaches for single omic datasets, followed by methods for pairs of omic datasets, before detailing further techniques for three or more omic datasets. We also briefly describe network methods that apply when three or more omic datasets are available and that complement correlation-oriented tools. To aid readers new to this area, these methods are all linked to relevant R packages that implement them. Finally, we discuss scenarios of experimental design and present road maps that simplify the selection of appropriate analysis methods. This review will help researchers navigate the emerging methods for multi-omics, integrate diverse omic datasets appropriately, and embrace the opportunity of population multi-omics.
Comment: 30 pages, 2 figures, 6 tables
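As a minimal example of a correlation-based method for a single omic dataset, the sketch below performs PCA on the feature correlation matrix with NumPy. The review's own examples are linked to R packages, so this Python version and its toy data are purely illustrative.

```python
import numpy as np

def correlation_pca(X, n_components):
    """PCA on the feature correlation matrix of X (samples x features):
    the simplest correlation-based dimension reduction for one omic layer.
    Returns the component loadings and explained-variance ratios."""
    R = np.corrcoef(X, rowvar=False)        # feature-by-feature correlations
    eigvals, eigvecs = np.linalg.eigh(R)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order], eigvals[order] / eigvals.sum()

# Toy data: three strongly correlated features plus two independent ones.
rng = np.random.default_rng(2)
base = rng.normal(size=(50, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(50, 3)),
               rng.normal(size=(50, 2))])
loadings, ratios = correlation_pca(X, n_components=2)
```

Here the first component absorbs the block of three correlated features, so its explained-variance ratio approaches 3/5; the pairwise and multi-omic methods in the review generalize this idea to cross-dataset correlation structure.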
Computational Network Analysis of the Anatomical and Genetic Organizations in the Mouse Brain
Motivation: The mammalian central nervous system (CNS) generates high-level behavior and cognitive functions. Elucidating the anatomical and genetic organizations of the CNS is a key step toward understanding the functional brain circuitry. The CNS contains an enormous number of cell types, each with unique gene expression patterns; it is therefore of central importance to capture the spatial expression patterns in the brain. A genome-wide atlas of spatial expression patterns in the mouse brain is now available, with the data in the form of aligned 3D data arrays. The sheer volume and complexity of these data pose significant challenges for efficient computational analysis. Results: We employ data reduction and network modeling techniques to explore the anatomical and genetic organizations of the mouse brain. First, to reduce the volume of data, we apply tensor factorization techniques. This tensor formulation treats the stack of 3D volumes as a 4D data array, thereby preserving the mouse brain geometry. We then model the anatomical and genetic organizations as graphical models. To improve the robustness and efficiency of network modeling, we employ stable model selection and an efficient sparsity-regularized formulation. Results on network modeling show that our approach recovers known interactions and predicts novel putative correlations
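Treating the stack of 3D volumes as a 4D array and reducing it by tensor factorization can be illustrated with a truncated higher-order SVD (HOSVD); the paper's exact factorization may differ, and the tiny toy array below stands in for the real voxel-by-gene data.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: bring axis `mode` to the front, flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T, ranks):
    """Truncated higher-order SVD: one orthonormal factor per mode plus a
    small core tensor, a standard scheme for reducing dense data arrays."""
    factors = []
    for mode, r in enumerate(ranks):
        # Leading left singular vectors of the mode-n unfolding.
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = T
    for mode, U in enumerate(factors):
        # Contract mode `mode` of the core with U^T.
        core = np.moveaxis(
            np.tensordot(U.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors

# Toy 4D "x-y-z-gene" array standing in for the stacked brain volumes.
T = np.arange(2 * 3 * 4 * 5, dtype=float).reshape(2, 3, 4, 5)
core, factors = hosvd(T, ranks=(2, 2, 2, 2))

# Reconstruct: multiply the small core back by every factor matrix.
recon = core
for mode, U in enumerate(factors):
    recon = np.moveaxis(
        np.tensordot(U, np.moveaxis(recon, mode, 0), axes=1), 0, mode)
```

Because this toy tensor is linear in its indices, every unfolding has rank 2 and the rank-(2,2,2,2) truncation reconstructs it exactly; on real data the truncation is lossy and the core plus factors serve as the reduced representation fed into downstream network modeling.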