125 research outputs found

The Importance of Being Clustered: Uncluttering the Trends of Statistics from 1970 to 2015

Author: Anderlucci Laura
Montanari Angela
Viroli Cinzia
Publication date: 01/01/2017

In this paper we retrace the recent history of statistics by analyzing all the papers published in five prestigious statistical journals since 1970, namely: Annals of Statistics, Biometrika, Journal of the American Statistical Association, Journal of the Royal Statistical Society, series B and Statistical Science. The aim is to construct a kind of "taxonomy" of the statistical papers by organizing and by clustering them in main themes. In this sense being identified in a cluster means being important enough to be uncluttered in the vast and interconnected world of the statistical research. Since the main statistical research topics are naturally born, evolve, or die over time, we will also develop a dynamic clustering strategy, where a group in a time period is allowed to migrate or to merge into different groups in the following one. Results show that statistics is a very dynamic and evolving science, stimulated by the rise of new research questions and types of data.

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

A theory of (almost) zero resource speech recognition

Author: Bharadwaj Sujeeth Subramanya

Automatic speech recognition has matured into a commercially successful technology, enabling voice-based interfaces for smartphones, smart TVs, and many other consumer devices. The overwhelming popularity, however, is still limited to languages such as English, Japanese, and German, where vast amounts of labeled training data are available. For most other languages, it is prohibitively expensive to 1) collect and transcribe the speech data required to learn good acoustic models; and 2) acquire adequate text to estimate meaningful language models. A theory of unsupervised and semi-supervised techniques for speech recognition is therefore essential. This thesis focuses on HMM-based sequence clustering and examines acoustic modeling, language modeling, and applications beyond the components of an ASR, such as anomaly detection, from the vantage point of PAC-Bayesian theory. The first part of this thesis extends standard PAC-Bayesian bounds to address the sequential nature of speech and language signals. A novel algorithm, based on sparsifying the cluster assignment probabilities with a Renyi entropy prior, is shown to provably minimize the generalization error of any probabilistic model (e.g. HMMs). The second part examines application-specific loss functions such as cluster purity and perplexity. Empirical results on a variety of tasks -- acoustic event detection, class-based language modeling, and unsupervised sequence anomaly detection -- confirm the practicality of the theory and algorithms developed in this thesis

Illinois Digital Environment for Access to Learning and Scholarship Repository

A survey of machine learning techniques applied to self organizing cellular networks

Author: Imran Muhammad Ali
Onireti Oluwakayode
Souza Richard Demo
Valente Klaine Paulo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2017

In this paper, a survey of the literature of the past fifteen years involving Machine Learning (ML) algorithms applied to self organizing cellular networks is performed. In order for future networks to overcome the current limitations and address the issues of current cellular systems, it is clear that more intelligence needs to be deployed, so that a fully autonomous and flexible network can be enabled. This paper focuses on the learning perspective of Self Organizing Networks (SON) solutions and provides, not only an overview of the most common ML techniques encountered in cellular networks, but also manages to classify each paper in terms of its learning solution, while also giving some examples. The authors also classify each paper in terms of its self-organizing use-case and discuss how each proposed solution performed. In addition, a comparison between the most commonly found ML algorithms in terms of certain SON metrics is performed and general guidelines on when to choose each ML algorithm for each SON function are proposed. Lastly, this work also provides future research directions and new paradigms that the use of more robust and intelligent algorithms, together with data gathered by operators, can bring to the cellular networks domain and fully enable the concept of SON in the near future

Enlighten

A graph theoretical perspective for the unsupervised clustering of free text corpora

Author: Altuncu Muhammed Tarık
Publication venue: Mathematics, Imperial College London
Publication date: 01/09/2021

This thesis introduces a robust end to end topic discovery framework that extracts a set of coherent topics stemming intrinsically from document similarities. Some topic clustering methods can support embedded vectors instead of traditional Bag-of-Words (BoW) representation. Some can be free from the number of topics hyperparameter and some others can extract a multi-scale relation between topics. However, no topic clustering method supports all these properties together. This thesis focuses on this gap in the literature by designing a framework that supports any type of document-level features especially the embedded vectors. This framework does not require any uninformed decision making about the underlying data such as the number of topics, instead, the framework extracts topics in multiple resolutions. To achieve this goal, we combine existing methods from natural language processing (NLP) for feature generation and graph theory, first for graph construction based on semantic document similarities, then for graph partitioning to extract corresponding topics in multiple resolutions. Finally, we use specific methods from statistical machine learning to obtain highly generalisable supervised models to deploy topic classifiers for the deployment of topic extraction in real-time. Our applications on both a noisy and specialised corpus of medical records (i.e., descriptions for patient incidents within the NHS) and public news articles in daily language show that our framework extracts coherent topics that have better quantitative benchmark scores than other methods in most cases. The resulting multi-scale topics in both applications enable us to capture specific details more easily and choose the relevant resolutions for the specific objective. This study contributes to topic clustering literature by introducing a novel graph theoretical perspective that provides a combination of new properties. 
These properties are multiple resolutions, independence from uninformed decisions about the corpus, and usage of recent NLP features, such as vector embeddings.

Spiral - Imperial College Digital Repository

Exploiting clustering algorithms in a multiple-level fashion: A comparative study in the medical care scenario

Author: CERQUITELLI TANIA
CHIUSANO SILVIA ANNA
XIAO XIN
Publication venue: 'Elsevier BV'
Publication date: 01/01/2016

Clustering real-world data is a challenging task, since many real-data collections are characterized by an inherent sparseness and variable distribution. An appealing domain that generates such data collections is the medical care scenario where collected data include a large cardinality of patient records and a variety of medical treatments usually adopted for a given disease pathology. This paper proposes a two-phase data mining methodology to iteratively analyze different dataset portions and locally identify groups of objects with common properties. Discovered cohesive clusters are then analyzed using sequential patterns to characterize temporal relationships among data features. To support an automatic classification of new data objects within one of the discovered groups, a classification model is created starting from the computed cluster set. A mobile application has been also designed and developed to visualize and update data under analysis as well as categorizing new unlabeled records. A comparative study has been conducted on real datasets in the medical care scenario using diverse clustering algorithms. Results were compared in terms of cluster quality, execution time, classification performance and discovered sequential patterns. The experimental evaluation showed the effectiveness of MLC to discover interesting knowledge items and to easily exploit them through a mobile application. Results have been also discussed from a medical perspective.

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Three-dimensional Laser-based Classification in Outdoor Environments

Author: Behley Jens
Publication venue: Universitäts- und Landesbibliothek Bonn

Robotics research strives for deploying autonomous systems in populated environments, such as inner city traffic. Autonomous cars need a reliable collision avoidance, but also an object recognition to distinguish different classes of traffic participants. For both tasks, fast three-dimensional laser range sensors generating multiple accurate laser range scans per second, each consisting of a vast number of laser points, are often employed. In this thesis, we investigate and develop classification algorithms that allow us to automatically assign semantic labels to laser scans. We mainly face two challenges: (1) we have to ensure consistent and correct classification results and (2) we must efficiently process a vast number of laser points per scan. In consideration of these challenges, we cover both stages of classification -- the feature extraction from laser range scans and the classification model that maps from the features to semantic labels. As for the feature extraction, we contribute by thoroughly evaluating important state-of-the-art histogram descriptors. We investigate critical parameters of the descriptors and experimentally show for the first time that the classification performance can be significantly improved using a large support radius and a global reference frame. As for learning the classification model, we contribute with new algorithms that improve the classification efficiency and accuracy. Our first approach aims at deriving a consistent point-wise interpretation of the whole laser range scan. By combining efficient similarity-preserving hashing and multiple linear classifiers, we considerably improve the consistency of label assignments, requiring only minimal computational overhead compared to a single linear classifier. In the last part of the thesis, we aim at classifying objects represented by segments. 
We propose a novel hierarchical segmentation approach comprising multiple stages and a novel mixture classification model of multiple bag-of-words vocabularies. We demonstrate superior performance of both approaches compared to their single component counterparts using challenging real world datasets.

bonndoc – Der Publikationsserver der Universität Bonn

Data Mining Techniques for Complex User-Generated Data

Author: XIAO XIN
Publication venue: country:Italy
Publication date: 01/01/2016

Nowadays, the amount of collected information is continuously growing in a variety of different domains. Data mining techniques are powerful instruments to effectively analyze these large data collections and extract hidden and useful knowledge. A vast amount of User-Generated Data (UGD) is being created every day, such as user behavior, user-generated content, user exploitation of available services and user mobility in different domains. Some common critical issues arise for the UGD analysis process such as the large dataset cardinality and dimensionality, the variable data distribution and inherent sparseness, and the heterogeneous data to model the different facets of the targeted domain. Consequently, the extraction of useful knowledge from such data collections is a challenging task, and proper data mining solutions should be devised for the problem under analysis. In this thesis work, we focus on the design and development of innovative solutions to support data mining activities over User-Generated Data characterised by different critical issues, via the integration of different data mining techniques in a unified framework. Real datasets coming from three example domains characterized by the above critical issues are considered as reference cases, i.e., health care, social network, and urban environment domains. Experimental results show the effectiveness of the proposed approaches to discover useful knowledge from different domains.

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

End-to-end Lip-reading: A Preliminary Study

Author: Thapa K.
Publication venue: London South Bank University
Publication date: 01/01/2023

Deep lip-reading is the combination of the domains of computer vision and natural language processing. It uses deep neural networks to extract speech from silent videos. Most works in lip-reading use a multi staged training approach due to the complex nature of the task. A single stage, end-to-end, unified training approach, which is an ideal of machine learning, is also the goal in lip-reading. However, pure end-to-end systems have not yet been able to perform as well as non-end-to-end systems. Some exceptions to this are the very recent Temporal Convolutional Network (TCN) based architectures. This work lays out a preliminary study of deep lip-reading, with a special focus on various end-to-end approaches. The research aims to test whether a purely end-to-end approach is justifiable for a task as complex as deep lip-reading. To achieve this, the meaning of pure end-to-end is first defined and several lip-reading systems that follow the definition are analysed. The system that most closely matches the definition is then adapted for pure end-to-end experiments. Four main contributions have been made: i) An analysis of 9 different end-to-end deep lip-reading systems, ii) Creation and public release of a pipeline to adapt sentence level Lipreading Sentences in the Wild 3 (LRS3) dataset into word level, iii) Pure end-to-end training of a TCN based network and evaluation on LRS3 word-level dataset as a proof of concept, iv) a public online portal to analyse visemes and experiment live end-to-end lip-reading inference. The study is able to verify that pure end-to-end is a sensible approach and an achievable goal for deep machine lip-reading.

LSBU Research Open

Learning Structure in Time Series for Neuroscience and Beyond

Author: Pfau David Benjamin
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2015

Advances in neuroscience are producing data at an astounding rate - data which are fiendishly complex both to process and to interpret. Biological neural networks are high-dimensional, nonlinear, noisy, heterogeneous, and in nearly every way defy the simplifying assumptions of standard statistical methods. In this dissertation we address a number of issues with understanding the structure of neural populations, from the abstract level of how to uncover structure in generic time series, to the practical matter of finding relevant biological structure in state-of-the-art experimental techniques. To learn the structure of generic time series, we develop a new statistical model, which we dub the probabilistic deterministic infinite automata (PDIA), which uses tools from nonparametric Bayesian inference to learn a very general class of sequence models. We show that the models learned by the PDIA often offer better predictive performance and faster inference than Hidden Markov Models, while being significantly more compact than models that simply memorize contexts. For large populations of neurons, models like the PDIA become unwieldy, and we instead investigate ways to robustly reduce the dimensionality of the data. In particular, we adapt the generalized linear model (GLM) framework for regression to the case of matrix completion, which we call the low-dimensional GLM. We show that subspaces and dynamics of neural activity can be accurately recovered from model data, and with only minimal assumptions about the structure of the dynamics can still lead to good predictive performance on real data. Finally, to bridge the gap between recording technology and analysis, particularly as recordings from ever-larger populations of neurons become the norm, automated methods for extracting activity from raw recordings become a necessity. 
We present a number of methods for automatically segmenting biological units from optical imaging data, with applications to light sheet recording of genetically encoded calcium indicator fluorescence in the larval zebrafish, and optical electrophysiology using genetically encoded voltage indicators in culture. Together, these methods are a powerful set of tools for addressing the diverse challenges of modern neuroscience

Columbia University Academic Commons