360 research outputs found

On Sequence Clustering and Supervised Dimensionality Reduction

Author: Wang Tiexing
Publication venue: SURFACE at Syracuse University
Publication date: 26/06/2020

This dissertation studies two machine learning problems: 1) clustering of independent and identically generated random sequences, and 2) dimensionality reduction for classification problems. For sequence clustering, the focus is on large sample performance of classical clustering algorithms, including the k-medoids algorithm and hierarchical agglomerative clustering (HAC) algorithms. Data sequences are generated from unknown continuous distributions that are assumed to form clusters according to some well-defined distance metrics. The goal is to group data sequences according to their underlying distributions with little or no prior knowledge of either the underlying distributions or the number of clusters. Upper bounds on the clustering error probability are derived for the k-medoids algorithm and a class of HAC algorithms under mild assumptions on the distribution clusters and distance metrics. For both cases, the error probabilities are shown to decay exponentially fast as the number of samples in each data sequence goes to infinity. The obtained error exponent bound has a simple form when either the Kolmogorov-Smirnov distance or the maximum mean discrepancy is used as the distance metric. A tighter upper bound on the error probability of the single-linkage HAC algorithm is derived by taking advantage of the simplified metric updating scheme. Numerical results are provided to validate the analysis. For dimensionality reduction, the focus is on classification problems where label information in the training data can be leveraged for improved learning performance. A supervised dimensionality reduction method maximizing the difference of average projection energy of samples with different labels is proposed. Both synthetic data and WiFi sensing data are used to validate the effectiveness of the proposed method. The numerical results show that the proposed method outperforms existing supervised dimensionality reduction approaches based on Fisher discriminant analysis (FDA) and the Hilbert-Schmidt independence criterion (HSIC). When the kernel trick is applied to all three approaches, the performance of the proposed dimensionality reduction method is comparable to FDA and HSIC and is superior to unsupervised principal component analysis.
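The sequence-clustering setting in this abstract can be illustrated with a minimal sketch: each data sequence is compared to the others with the two-sample Kolmogorov-Smirnov distance, and the resulting distance matrix is passed to a basic k-medoids loop. The function names, the farthest-first initialization, and the Gaussian test data are assumptions made for illustration; they are not taken from the dissertation.

```python
import numpy as np


def ks_distance(x, y):
    """Two-sample Kolmogorov-Smirnov distance: max |F_x - F_y| over the pooled sample."""
    x, y = np.sort(x), np.sort(y)
    grid = np.concatenate([x, y])
    # Empirical CDFs of both sequences, evaluated at every observed point.
    cdf_x = np.searchsorted(x, grid, side="right") / x.size
    cdf_y = np.searchsorted(y, grid, side="right") / y.size
    return float(np.max(np.abs(cdf_x - cdf_y)))


def pairwise_ks(sequences):
    """Symmetric matrix of KS distances between all pairs of sequences."""
    n = len(sequences)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = ks_distance(sequences[i], sequences[j])
    return D


def k_medoids(D, k, max_iter=100):
    """Basic alternating k-medoids on a precomputed distance matrix D."""
    # Assumed initialization: farthest-first traversal starting from sequence 0.
    medoids = [0]
    while len(medoids) < k:
        medoids.append(int(np.argmax(D[:, medoids].min(axis=1))))
    medoids = np.array(medoids)
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # New medoid: the member with the smallest total distance to its cluster.
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return labels, medoids


# Hypothetical data: three sequences from N(0, 1) and three from N(3, 1).
rng = np.random.default_rng(0)
sequences = [rng.normal(0.0, 1.0, 200) for _ in range(3)]
sequences += [rng.normal(3.0, 1.0, 200) for _ in range(3)]
labels, medoids = k_medoids(pairwise_ks(sequences), k=2)
print(labels, medoids)
```

The distance is computed from samples only, so no knowledge of the generating distributions is needed; the number of clusters `k` is still supplied by hand in this sketch.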

Syracuse University Research Facility and Collaborative Environment

Analyzing and Visualizing State Sequences in R with TraMineR

Author: Alexis Gabadinho
Gilbert Ritschard
Matthias Studer
Nicolas S Müller

This article describes the many capabilities offered by the TraMineR toolbox for categorical sequence data. It focuses more specifically on the analysis and rendering of state sequences. Addressed features include the description of sets of sequences by means of transversal aggregated views, the computation of longitudinal characteristics of individual sequences and the measure of pairwise dissimilarities. Special emphasis is put on the multiple ways of visualizing sequences. The core element of the package is the state sequence object in which we store the set of sequences together with attributes such as the alphabet, state labels and the color palette. The functions can then easily retrieve this information to ensure presentation homogeneity across all printed and graphical displays. The article also demonstrates how TraMineR's outcomes give access to advanced analyses such as clustering and statistical modeling of sequence data.

Research Papers in Economics

Clustering Genes of Common Evolutionary History.

Author: Alvarez N.
Dessimoz C.
Goldman N.
Gori K.
Suchan T.
Publication venue: 'Oxford University Press (OUP)'
Publication date: 17/02/2016

Phylogenetic inference can potentially result in a more accurate tree using data from multiple loci. However, if the loci are incongruent (due to events such as incomplete lineage sorting or horizontal gene transfer), it can be misleading to infer a single tree. To address this, many previous contributions have taken a mechanistic approach, by modeling specific processes. Alternatively, one can cluster loci without assuming how these incongruencies might arise. Such "process-agnostic" approaches typically infer a tree for each locus and cluster these. There are, however, many possible combinations of tree distance and clustering methods; their comparative performance in the context of tree incongruence is largely unknown. Furthermore, because standard model selection criteria such as AIC cannot be applied to problems with a variable number of topologies, the issue of inferring the optimal number of clusters is poorly understood. Here, we perform a large-scale simulation study of phylogenetic distances and clustering methods to infer loci of common evolutionary history. We observe that the best-performing combinations are distances accounting for branch lengths followed by spectral clustering or Ward's method. We also introduce two statistical tests to infer the optimal number of clusters and show that they strongly outperform the silhouette criterion, a general-purpose heuristic. We illustrate the usefulness of the approach by 1) identifying errors in a previous phylogenetic analysis of yeast species and 2) identifying topological incongruence among newly sequenced loci of the globeflower fly genus Chiastocheta. We release treeCl, a new program to cluster genes of common evolutionary history (http://git.io/treeCl).

arXiv.org e-Print Archive

Serveur académique lausannois

UCL Discovery

PubMed Central

Fuzzy spectral clustering methods for textual data

Author: COZZOLINO IRENE
Publication date: 30/05/2023

Nowadays, the development of advanced information technologies has led to an increase in the production of textual data. This inevitable growth accentuates the need for new methods and tools able to analyse this kind of data efficiently. Against this background, unsupervised classification techniques can play a key role in this process since most of this data is not classified. Document clustering, which is used for identifying a partition of clusters in a corpus of documents, has proven to perform efficiently in the analysis of textual documents and it has been extensively applied in different fields, from topic modelling to information retrieval tasks. Recently, spectral clustering methods have gained success in the field of text classification. These methods have gained popularity due to their solid theoretical foundations, which do not require any specific assumption on the global structure of the data. However, even though they prove to perform well in text classification problems, little has been done in the field of clustering. Moreover, depending on the type of documents analysed, it is often the case that textual documents do not contain only information related to a single topic: indeed, there might be an overlap of contents characterizing different knowledge domains. Consequently, documents may contain information that is relevant to different areas of interest to some degree. The first part of this work critically analyses the main clustering algorithms used for text data, including the mathematical representation of documents and the pre-processing phase. Then, three novel fuzzy versions of spectral clustering algorithms for text data are introduced. The first one exploits the use of fuzzy K-medoids instead of K-means. The second one derives directly from the first one but is used in combination with Kernel and Set Similarity (KS2M), which takes into account the Jaccard index. Finally, in the third one, in order to enhance the clustering performance, a new similarity measure S∗ is proposed. This last one exploits the inherent sequential nature of text data by means of a weighted combination between the Spectrum string kernel function and a measure of set similarity. The second part of the thesis focuses on spectral bi-clustering algorithms for text mining tasks, which represent an interesting and partially unexplored field of research. In particular, two novel versions of fuzzy spectral bi-clustering algorithms are introduced. The two algorithms differ in the approach followed in the identification of the document and the word partitions. Indeed, the first one follows a simultaneous approach while the second one follows a sequential approach. This difference also leads to a diversification in the choice of the number of clusters. The adequacy of all the proposed fuzzy (bi-)clustering methods is evaluated by experiments performed on both real and benchmark data sets.
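The idea of combining a spectrum string kernel with a set similarity can be sketched as follows. The abstract does not give the definition of S∗, so the normalization, the character-level p-grams, the use of the Jaccard index on word sets, and the weight `alpha` are all assumptions made for illustration only.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    # Convention assumed here: two empty sets have similarity 0.
    return len(a & b) / len(union) if union else 0.0


def spectrum_counts(text, p):
    """Counts of all contiguous character p-grams in text."""
    return Counter(text[i:i + p] for i in range(len(text) - p + 1))


def spectrum_kernel(s, t, p=3):
    """p-spectrum kernel: inner product of the p-gram count vectors of s and t."""
    cs, ct = spectrum_counts(s, p), spectrum_counts(t, p)
    return sum(n * ct[u] for u, n in cs.items())


def normalized_spectrum_kernel(s, t, p=3):
    """Cosine-normalized spectrum kernel, in [0, 1]."""
    denom = math.sqrt(spectrum_kernel(s, s, p) * spectrum_kernel(t, t, p))
    return spectrum_kernel(s, t, p) / denom if denom else 0.0


def combined_similarity(doc_a, doc_b, alpha=0.5, p=3):
    """Hypothetical weighted mix of sequence similarity and set similarity."""
    seq_sim = normalized_spectrum_kernel(doc_a, doc_b, p)
    set_sim = jaccard(doc_a.split(), doc_b.split())
    return alpha * seq_sim + (1.0 - alpha) * set_sim


print(combined_similarity("fuzzy spectral clustering", "spectral clustering of text"))
```

A matrix of such pairwise similarities could serve as the affinity matrix of a spectral clustering method; the sequence term rewards shared substrings in order, while the set term ignores word order entirely.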

Archivio della ricerca- Università di Roma La Sapienza

Practical Strategies for Discovering Regulatory DNA Sequence Motifs

Author: Fraenkel Ernest
MacIsaac Kenzie D
Publication venue: Public Library of Science
Publication date: 01/04/2006

Crossref

Directory of Open Access Journals

PubMed Central

Cross-time scales interactions and rainfall extreme events in southeastern South America for the austral summer. Part II: predictive skill

Author: Goddard Lisa M.
Mason Simon J.
Muñoz Ángel G.
Robertson Andrew W.
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2016

Potential and real predictive skill of the frequency of extreme rainfall in southeastern South America for the December–February season are evaluated in this paper, finding evidence indicating that mechanisms of climate variability at one time scale contribute to the predictability at another scale; that is, taking into account the interference of different potential sources of predictability at different time scales increases the predictive skill. Part I of this study suggested that a set of daily atmospheric circulation regimes, or weather types, was sensitive to these cross–time scale interferences, conducive to the occurrence of extreme rainfall events in the region, and could be used as a potential predictor. At seasonal scale, a combination of those weather types indeed tends to outperform all the other candidate predictors explored (i.e., sea surface temperature patterns, phases of the Madden–Julian oscillation, and combinations of both). Spatially averaged Kendall’s τ improvements of 43% for the potential predictability and 23% for real-time predictions are attained with respect to standard models considering sea surface temperature fields alone. A new subseasonal-to-seasonal predictive methodology for extreme rainfall events is proposed based on probability forecasts of seasonal sequences of these weather types. The cross-validated real-time skill of the new probabilistic approach, as measured by the hit score and the Heidke skill score, is on the order of twice that associated with climatological values. The approach is designed to offer useful subseasonal-to-seasonal climate information to decision-makers interested not only in how many extreme events will happen in the season but also in how, when, and where those events will probably occur.
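For readers unfamiliar with the two verification measures named in this abstract, the sketch below computes them from a forecast/observation contingency table. It uses the standard textbook form of the Heidke skill score and takes "hit score" to mean the proportion of correct forecasts; whether the paper uses exactly these definitions is an assumption, and the tables are made-up examples.

```python
import numpy as np


def hit_score(table):
    """Proportion of forecasts that fall in the observed category."""
    table = np.asarray(table, dtype=float)
    return float(np.trace(table) / table.sum())


def heidke_skill_score(table):
    """Heidke skill score: accuracy relative to that expected by chance.

    table[i, j] counts cases forecast in category i and observed in category j.
    The score is undefined (division by zero) if chance accuracy equals 1.
    """
    table = np.asarray(table, dtype=float)
    total = table.sum()
    correct = np.trace(table) / total
    # Chance agreement from the forecast and observation marginal frequencies.
    expected = np.sum(table.sum(axis=1) * table.sum(axis=0)) / total**2
    return float((correct - expected) / (1.0 - expected))


# Hypothetical 2x2 table: rows are forecast categories, columns are observed ones.
example = [[30, 10],
           [15, 45]]
print(hit_score(example), heidke_skill_score(example))
```

A score of 1 indicates a perfect forecast and 0 indicates no improvement over chance, which is why the measure is a natural companion to the raw hit score.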

Columbia University Academic Commons

Recent Developments in Video Surveillance

Publication venue: 'IntechOpen'
Publication date: 20/04/2021

With surveillance cameras installed everywhere and continuously streaming thousands of hours of video, how can that huge amount of data be analyzed or even be useful? Is it possible to search those countless hours of videos for subjects or events of interest? Shouldn’t the presence of a car stopped at a railroad crossing trigger an alarm system to prevent a potential accident? In the chapters selected for this book, experts in video surveillance provide answers to these questions and other interesting problems, skillfully blending research experience with practical real-life applications. Academic researchers will find a reliable compilation of relevant literature in addition to pointers to current advances in the field. Industry practitioners will find useful hints about state-of-the-art applications. The book also provides directions for open problems where further advances can be pursued.

Directory of Open Access Books (DOAB)