Algorithmic and Statistical Perspectives on Large-Scale Data Analysis
In recent years, ideas from statistics and scientific computing have begun to
interact in increasingly sophisticated and fruitful ways with ideas from
computer science and the theory of algorithms to aid in the development of
improved worst-case algorithms that are useful for large-scale scientific and
Internet data analysis problems. In this chapter, I will describe two recent
examples---one having to do with selecting good columns or features from a (DNA
Single Nucleotide Polymorphism) data matrix, and the other having to do with
selecting good clusters or communities from a data graph (representing a social
or information network)---that drew on ideas from both areas and that may serve
as a model for exploiting complementary algorithmic and statistical
perspectives in order to solve applied large-scale data analysis problems.
Comment: 33 pages. To appear in Uwe Naumann and Olaf Schenk, editors,
"Combinatorial Scientific Computing," Chapman and Hall/CRC Press, 201
Semantic distillation: a method for clustering objects by their contextual specificity
Techniques for data mining, latent semantic analysis, contextual search of
databases, etc. were developed long ago by computer scientists working on
information retrieval (IR). Experimental scientists, from all disciplines,
having to analyse large collections of raw experimental data (astronomical,
physical, biological, etc.) have developed powerful methods for their
statistical analysis and for clustering, categorising, and classifying objects.
Finally, physicists have developed a theory of quantum measurement, unifying
the logical, algebraic, and probabilistic aspects of queries into a single
formalism. The purpose of this paper is twofold: first to show that when
formulated at an abstract level, problems from IR, from statistical data
analysis, and from physical measurement theories are very similar and hence can
profitably be cross-fertilised, and, secondly, to propose a novel method of
fuzzy hierarchical clustering, termed semantic distillation and strongly
inspired by the theory of quantum measurement, which we developed to
analyse raw data coming from various types of experiments on DNA arrays. We
illustrate the method by analysing DNA array experiments and clustering the
genes of the array according to their specificity.
Comment: Accepted for publication in Studies in Computational Intelligence,
Springer-Verla
SUBIC: A Supervised Bi-Clustering Approach for Precision Medicine
Traditional medicine typically applies one-size-fits-all treatment for the
entire patient population whereas precision medicine develops tailored
treatment schemes for different patient subgroups. The fact that some factors
may be more significant for a specific patient subgroup motivates clinicians
and medical researchers to develop new approaches to subgroup detection and
analysis, which is an effective strategy to personalize treatment. In this
study, we propose a novel patient subgroup detection method, called Supervised
Biclustering (SUBIC), using convex optimization, and apply our approach to detect
patient subgroups and prioritize risk factors for hypertension (HTN) in a
vulnerable demographic subgroup (African-American). Our approach not only finds
patient subgroups under the guidance of a clinically relevant target variable but
also identifies and prioritizes risk factors by pursuing sparsity of the input
variables and encouraging similarity among the input variables and between the
input and target variable
Clustering by soft-constraint affinity propagation: Applications to gene-expression data
Motivation: Similarity-measure based clustering is a crucial problem
appearing throughout scientific data analysis. Recently, a powerful new
algorithm called Affinity Propagation (AP) based on message-passing techniques
was proposed by Frey and Dueck \cite{Frey07}. In AP, each cluster is identified
by a common exemplar to which all other data points of the same cluster refer, and
exemplars have to refer to themselves. Despite its proven power, AP in its
present form suffers from a number of drawbacks. The hard constraint of having
exactly one exemplar per cluster restricts AP to classes of regularly shaped
clusters and leads to suboptimal performance, e.g., in analyzing gene
expression data. Results: This limitation can be overcome by relaxing the AP
hard constraints. A new parameter controls the importance of the constraints
compared to the aim of maximizing the overall similarity, and allows one to
interpolate between the simple case where each data point selects its closest
neighbor as an exemplar and the original AP. The resulting soft-constraint
affinity propagation (SCAP) is more informative and accurate, and leads to
more stable clustering. Even though a new a priori free parameter is
introduced, the overall dependence of the algorithm on external tuning is
reduced, as robustness is increased and an optimal strategy for parameter
selection emerges more naturally. SCAP is tested on biological benchmark data,
including in particular microarray data related to various cancer types. We
show that the algorithm efficiently unveils the hierarchical cluster structure
present in the data sets. Furthermore, it allows one to extract sparse gene
expression signatures for each cluster.
Comment: 11 pages, supplementary material:
http://isiosf.isi.it/~weigt/scap_supplement.pd
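The abstract above describes one end of the SCAP interpolation as "the simple case where each data point selects its closest neighbor as an exemplar." A minimal sketch of that limiting case only (not of AP or SCAP itself) might look as follows. The function names, the toy points, and the choice of negative squared Euclidean distance as the similarity measure are illustrative assumptions and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def negative_squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Similarity measure assumed for illustration: larger means more similar."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_exemplars(similarity: Sequence[Sequence[float]]) -> List[int]:
    """For each point i, return the index of the most similar *other* point.

    This is only the simple limiting case named in the abstract; it applies
    no constraint that a chosen exemplar must refer to itself.
    """
    n = len(similarity)
    if n < 2:
        raise ValueError("need at least two points")
    exemplars = []
    for i in range(n):
        # Exclude i itself so every point picks a genuine neighbor.
        candidates = [j for j in range(n) if j != i]
        exemplars.append(max(candidates, key=lambda j: similarity[i][j]))
    return exemplars


# Hypothetical toy data: two well-separated groups in the plane.
points: List[Tuple[float, float]] = [
    (0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0), (0.3, 0.1),
]
similarity = [[negative_squared_distance(p, q) for q in points] for p in points]
choices = nearest_neighbor_exemplars(similarity)
print(choices)
```

In this toy run, points 0 and 1 choose each other, as do points 2 and 3, so no point acts as a self-referring exemplar. That is the kind of inconsistency the exemplar constraints of AP are meant to rule out, and which SCAP, according to the abstract, penalizes softly rather than forbids.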
An Overview of DNA Microarray Grid Alignment and Foreground Separation Approaches
This paper overviews DNA microarray grid alignment and foreground separation approaches. Microarray grid alignment and foreground separation are the basic processing steps of DNA microarray images that affect the quality of gene expression information, and hence impact our confidence in any data-derived biological conclusions. Thus, understanding microarray data processing steps becomes critical for performing optimal microarray data analysis. In the past, the grid alignment and foreground separation steps have not been covered extensively in the survey literature. We present several classifications of existing algorithms, and describe the fundamental principles of these algorithms. Challenges related to automation and reliability of processed image data are outlined at the end of this overview paper.