Algorithmic and Statistical Perspectives on Large-Scale Data Analysis
In recent years, ideas from statistics and scientific computing have begun to
interact in increasingly sophisticated and fruitful ways with ideas from
computer science and the theory of algorithms to aid in the development of
improved worst-case algorithms that are useful for large-scale scientific and
Internet data analysis problems. In this chapter, I will describe two recent
examples---one having to do with selecting good columns or features from a (DNA
Single Nucleotide Polymorphism) data matrix, and the other having to do with
selecting good clusters or communities from a data graph (representing a social
or information network)---that drew on ideas from both areas and that may serve
as a model for exploiting complementary algorithmic and statistical
perspectives in order to solve applied large-scale data analysis problems.
Comment: 33 pages. To appear in Uwe Naumann and Olaf Schenk, editors, "Combinatorial Scientific Computing," Chapman and Hall/CRC Press, 201
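Column selection of the kind described above is often driven by statistical leverage scores. The sketch below is a minimal illustration of that idea, not the chapter's exact algorithm: it scores each column by its squared norm in the top-k right singular subspace and keeps the highest-scoring columns.

```python
import numpy as np

def leverage_score_columns(A, k, c):
    """Pick c columns of A with the largest rank-k statistical
    leverage scores (a common criterion for column/feature selection)."""
    # Top-k right singular vectors of A.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k, :]                       # shape (k, n_columns)
    # Leverage score of column j: squared norm of column j of Vk.
    scores = np.sum(Vk ** 2, axis=0)
    return np.argsort(scores)[::-1][:c]  # indices of the c top columns

# Toy example: a matrix whose first column dominates the spectrum.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 5))
A[:, 0] *= 100.0
cols = leverage_score_columns(A, k=2, c=2)
```

On this toy matrix the heavily scaled first column captures most of the top singular subspace, so it receives a leverage score near one and is selected.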
Seismic data clustering management system
This is the abstract of the paper given at the conference. Copyright @ 2011 The Authors.
Over the last years, seismic images have increasingly played a vital role in the study of earthquakes. The large volume of seismic data that has accumulated has created the need for sophisticated systems to manage this kind of data. Seismic interpretation can play a much more active role in the evaluation of large volumes of data by providing, at an early stage, vital information about the framework of potential producing levels [1]. This work presents a novel method for managing and analysing seismic data. The data are first turned into clustering maps using clustering techniques [2] [3] [4] [5] [6], so that they can be analysed on the platform. These clustering maps can then be explored through the user-friendly interface of Seismic 1, which is built on the .NET Framework [7]. This allows the application to run on any Windows-based computer, as well as on many Linux-based environments via the Mono project [8], so an application can run using No-Touch Deployment [7]. The platform supports two ways of processing seismic data. First, a fast multifunctional version of the classical region-growing segmentation algorithm [9], [10] is applied to areas of interest, permitting their precise definition and labelling. The algorithm also automatically allocates new earthquakes to a particular cluster based on the magnitude of the centre of gravity of the existing clusters, or creates a new cluster if all centres of gravity exceed a user-defined upper threshold. Second, a visual technique records the behaviour of a cluster of earthquakes in a designated area. In this way, the system functions as a dynamic temporal simulator that depicts sequences of earthquakes on a map [11]
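The allocation rule described above (join the nearest existing cluster, or open a new one when every centre of gravity is too far away) can be sketched as follows. The coordinates, threshold, and plain Euclidean distance are illustrative assumptions, not the system's actual implementation.

```python
import math

def assign_event(event, centroids, threshold):
    """Assign an earthquake (lat, lon) to the nearest cluster centroid,
    or start a new cluster when every centroid is farther than `threshold`.
    `centroids` is a list of (lat, lon) tuples; distance is Euclidean in
    degrees for simplicity (a real system would use geodesic distance)."""
    best_idx, best_dist = None, float("inf")
    for i, (clat, clon) in enumerate(centroids):
        d = math.hypot(event[0] - clat, event[1] - clon)
        if d < best_dist:
            best_idx, best_dist = i, d
    if best_dist > threshold:
        centroids.append(event)          # open a new cluster
        return len(centroids) - 1
    return best_idx

# Two hypothetical cluster centres; one nearby event, one distant event.
centroids = [(38.0, 23.7), (35.3, 25.1)]
idx_near = assign_event((38.1, 23.8), centroids, threshold=1.0)  # joins cluster 0
idx_new = assign_event((40.6, 22.9), centroids, threshold=1.0)   # opens cluster 2
```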
Global Considerations in Hierarchical Clustering Reveal Meaningful Patterns in Data
BACKGROUND: A hierarchy, characterized by tree-like relationships, is a natural way of organizing data in many domains. In unsupervised machine learning routines such as clustering, a bottom-up (BU, agglomerative) hierarchical algorithm is used by default and is often the only method applied. METHODOLOGY/PRINCIPAL FINDINGS: We show that hierarchical clustering algorithms that involve global considerations, such as top-down (TD, divisive) or glocal (global-local) algorithms, are better suited to revealing meaningful patterns in the data. This is demonstrated by testing the correspondence between the results of several algorithms (TD, glocal and BU) and the correct annotations provided by experts. The correspondence was tested in multiple domains, including gene expression experiments, stock trade records and functional protein families. The performance of each algorithm is evaluated by statistical criteria assigned to clusters (nodes of the hierarchy tree) based on expert-labeled data. Whereas TD algorithms perform better on global patterns, BU algorithms perform well and are advantageous when a finer granularity of the data is sought. In addition, a novel TD algorithm based on the genuine density of the data points is presented and shown to outperform other divisive and agglomerative methods. Applying the algorithm to more than 500 protein sequences belonging to ion channels illustrates the potential of the method for inferring overlooked functional annotations. ClustTree, a graphical Matlab toolbox for applying various hierarchical clustering algorithms and testing their quality, is made available. CONCLUSIONS: Although currently rarely used, global approaches, in particular TD or glocal algorithms, should be considered in the exploratory process of clustering. In general, applying unsupervised clustering methods can improve the quality of manually created mappings of protein families. As demonstrated, it can also provide insight into erroneous and missed annotations
Tumor Classification Using High-Order Gene Expression Profiles Based on Multilinear ICA
Motivation. Independent Component Analysis (ICA) maximizes the statistical independence of the representational components of a training gene expression profile (GEP) ensemble, but it cannot distinguish relations between different factors, or different modes, and it is not applicable to high-order GEP data mining. To generalize ICA, we introduce Multilinear-ICA and apply it to tumor classification using high-order GEPs. First, we introduce the basic concepts and operations of tensors, and describe the Support Vector Machine (SVM) classifier and Multilinear-ICA. Second, the highest-scoring genes of the original high-order GEPs are selected using t-statistics and arranged into tensors. Third, Multilinear-ICA is applied to the tensors. Finally, the SVM is used to classify the tumor subtypes. Results. To show the validity of the proposed method, we apply it to tumor classification using high-order GEPs. Although we use only three datasets, the experimental results show that the method is effective and feasible. Through this work, we hope to offer some insight into the problem of high-order GEP tumor classification, in aid of developing more effective tumor classification algorithms
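The gene selection step above (rank genes by a two-sample t-statistic, keep the top scorers before arranging them into tensors) can be sketched as follows; the data, class sizes, and cutoff are illustrative assumptions.

```python
import numpy as np

def top_genes_by_t(X_a, X_b, n_keep):
    """Rank genes by a two-sample t-statistic between class A and class B
    expression matrices (samples x genes) and keep the n_keep genes with
    the largest absolute scores -- the selection step described above."""
    mean_a, mean_b = X_a.mean(axis=0), X_b.mean(axis=0)
    var_a, var_b = X_a.var(axis=0, ddof=1), X_b.var(axis=0, ddof=1)
    n_a, n_b = X_a.shape[0], X_b.shape[0]
    t = (mean_a - mean_b) / np.sqrt(var_a / n_a + var_b / n_b)
    return np.argsort(-np.abs(t))[:n_keep]

# Toy data: 8 samples per class, 6 genes, gene 2 strongly differential.
rng = np.random.default_rng(1)
X_a = rng.normal(0.0, 1.0, size=(8, 6))
X_b = rng.normal(0.0, 1.0, size=(8, 6))
X_b[:, 2] += 5.0
keep = top_genes_by_t(X_a, X_b, n_keep=2)
```

The selected genes would then be tabulated into a tensor (e.g. gene x sample x condition) before the Multilinear-ICA and SVM stages.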
A primer on correlation-based dimension reduction methods for multi-omics analysis
The continuing advances of omic technologies mean that it is now feasible to measure the numerous features that collectively reflect the molecular properties of a sample. When multiple omic methods are used, statistical and computational approaches can exploit these large, connected profiles. Multi-omics is the integration of different omic data sources from the same biological sample. In this review, we focus on correlation-based dimension reduction approaches for single omic datasets, followed by methods for pairs of omic datasets, before detailing further techniques for three or more omic datasets. We also briefly describe network methods that apply when three or more omic datasets are available and that complement correlation-oriented tools. To aid readers new to this area, these methods are all linked to relevant R packages that implement them. Finally, we discuss scenarios of experimental design and present road maps that simplify the selection of appropriate analysis methods. This review will help researchers navigate the emerging methods for multi-omics, integrate diverse omic datasets appropriately, and embrace the opportunity of population multi-omics.
Comment: 30 pages, 2 figures, 6 tables
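As a minimal example of a correlation-based method for a single omic dataset, the sketch below performs PCA on the feature correlation matrix with NumPy. The review's own examples are linked to R packages, so this Python version and its toy data are purely illustrative.

```python
import numpy as np

def correlation_pca(X, n_components):
    """PCA on the feature correlation matrix of X (samples x features):
    the simplest correlation-based dimension reduction for one omic layer.
    Returns the component loadings and explained-variance ratios."""
    R = np.corrcoef(X, rowvar=False)        # feature-by-feature correlations
    eigvals, eigvecs = np.linalg.eigh(R)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order], eigvals[order] / eigvals.sum()

# Toy data: three strongly correlated features plus two independent ones.
rng = np.random.default_rng(2)
base = rng.normal(size=(50, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(50, 3)),
               rng.normal(size=(50, 2))])
loadings, ratios = correlation_pca(X, n_components=2)
```

Here the first component absorbs the block of three correlated features, so its explained-variance ratio approaches 3/5; the pairwise and multi-omic methods in the review generalize this idea to cross-dataset correlation structure.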
Computational Network Analysis of the Anatomical and Genetic Organizations in the Mouse Brain
Motivation: The mammalian central nervous system (CNS) generates high-level behavior and cognitive functions. Elucidating the anatomical and genetic organizations of the CNS is a key step toward understanding the functional brain circuitry. The CNS contains an enormous number of cell types, each with unique gene expression patterns; it is therefore of central importance to capture the spatial expression patterns in the brain. A genome-wide atlas of spatial expression patterns in the mouse brain is now available, with the data in the form of aligned 3D data arrays. The sheer volume and complexity of these data pose significant challenges for efficient computational analysis. Results: We employ data reduction and network modeling techniques to explore the anatomical and genetic organizations of the mouse brain. First, to reduce the volume of data, we apply tensor factorization techniques. This tensor formulation treats the stack of 3D volumes as a 4D data array, thereby preserving the mouse brain geometry. We then model the anatomical and genetic organizations as graphical models. To improve the robustness and efficiency of network modeling, we employ stable model selection and an efficient sparsity-regularized formulation. Results on network modeling show that our approach recovers known interactions and predicts novel putative correlations
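Treating the stack of 3D volumes as a 4D array and reducing it by tensor factorization can be illustrated with a truncated higher-order SVD (HOSVD); the paper's exact factorization may differ, and the tiny toy array below stands in for the real voxel-by-gene data.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: bring axis `mode` to the front, flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd(T, ranks):
    """Truncated higher-order SVD: one orthonormal factor per mode plus a
    small core tensor, a standard scheme for reducing dense data arrays."""
    factors = []
    for mode, r in enumerate(ranks):
        # Leading left singular vectors of the mode-n unfolding.
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = T
    for mode, U in enumerate(factors):
        # Contract mode `mode` of the core with U^T.
        core = np.moveaxis(
            np.tensordot(U.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors

# Toy 4D "x-y-z-gene" array standing in for the stacked brain volumes.
T = np.arange(2 * 3 * 4 * 5, dtype=float).reshape(2, 3, 4, 5)
core, factors = hosvd(T, ranks=(2, 2, 2, 2))

# Reconstruct: multiply the small core back by every factor matrix.
recon = core
for mode, U in enumerate(factors):
    recon = np.moveaxis(
        np.tensordot(U, np.moveaxis(recon, mode, 0), axes=1), 0, mode)
```

Because this toy tensor is linear in its indices, every unfolding has rank 2 and the rank-(2,2,2,2) truncation reconstructs it exactly; on real data the truncation is lossy and the core plus factors serve as the reduced representation fed into downstream network modeling.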