Model Selection Criteria for Segmented Time Series from a Bayesian Approach to Information Compression
The principle that the simplest model capable of describing observed phenomena should also correspond to the best description has long been a guiding rule of inference. In this paper, a Bayesian approach to formally implementing this principle is employed to develop model selection criteria for detecting structural change in financial and economic time series. Criteria which allow for multiple structural breaks, and which seek the optimal model order and parameter choices within regimes, are derived. Comparative simulations against other popular information-based model selection criteria are performed, and the derived criteria are applied to example financial and economic time series.
Keywords: complexity theory; segmentation; break points; change points; model selection; model choice.
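The abstract does not reproduce the derived criteria; as a generic illustration of information-based break detection, a BIC-style penalized fit can decide whether a single structural break is warranted. The piecewise-constant (segment-mean) model and function names below are illustrative assumptions, not the paper's criteria:

```python
import numpy as np

def best_single_break(y):
    """Best single split of y under a piecewise-constant (mean) model:
    returns (break_index, total_sse)."""
    n = len(y)
    best_k, best_sse = None, float(np.var(y) * n)   # one-segment SSE
    for k in range(2, n - 1):
        sse = np.var(y[:k]) * k + np.var(y[k:]) * (n - k)
        if sse < best_sse:
            best_k, best_sse = k, sse
    return best_k, best_sse

def bic(sse, n, n_params):
    # Gaussian log-likelihood up to constants, plus a log(n) complexity penalty
    return n * np.log(sse / n) + n_params * np.log(n)

rng = np.random.default_rng(0)
# simulated regime change: mean shifts from 0 to 3 at t = 100
y = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])

n = len(y)
k, sse_two = best_single_break(y)
sse_one = float(np.var(y) * n)
# the two-segment model spends extra parameters on a second mean and the break
take_break = bic(sse_two, n, 4) < bic(sse_one, n, 2)
```

The break is accepted only when the fit improvement outweighs the penalty for the extra parameters; criteria allowing multiple breaks apply the same trade-off recursively or via dynamic programming.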
Statistical and Computational Models for Whole Word Morphology
The goal of this work is to formulate an approach to the machine learning of language morphology in which the latter is modelled as string transformations on whole words, rather than as the decomposition of words into smaller structural units. The contribution consists of two main parts. First, a computational model is formulated in which morphological rules are defined as functions on strings. Such functions can easily be translated into finite-state transducers, which provides a solid algorithmic foundation for the approach. Second, a statistical model for graphs of word derivations is introduced. Inference in this model is carried out using the Monte Carlo Expectation Maximization algorithm, and expectations over graphs are approximated with a Metropolis-Hastings sampler. The model is evaluated on a range of practical tasks: clustering of inflected forms, learning of lemmatization, part-of-speech prediction for unknown words, and generation of new words.
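As a minimal illustration of treating morphological rules as functions on whole words, a rule can be written as a string transformation that either rewrites a word or rejects it. The regex encoding and the example rule are illustrative only, not the thesis's transducer construction:

```python
import re

def make_rule(pattern, replacement):
    """A morphological rule as a whole-word string transformation.
    Returns a function that rewrites a matching word, or returns
    None when the rule does not apply."""
    rx = re.compile(pattern)
    def rule(word):
        return rx.sub(replacement, word) if rx.fullmatch(word) else None
    return rule

# a toy German-style rule: infinitive -en  ->  participle ge-...-t
participle = make_rule(r"(\w+)en", r"ge\1t")
```

Because each rule is a total function on strings with a regular pattern, composing or inverting rules stays within operations that finite-state transducers support.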
Learning and mining from personal digital archives
Given the explosion of new sensing technologies, data storage has become significantly cheaper and, consequently, people increasingly rely on wearable devices to create personal digital archives. Lifelogging is the act of recording aspects of life in digital format for a variety of purposes, such as aiding human memory, analysing human lifestyle and diet monitoring. In this dissertation we are concerned with Visual Lifelogging, a form of lifelogging based on the passive capture of photographs by a wearable camera. Cameras such as Microsoft's SenseCam can record up to 4,000 images per day as well as logging data from several incorporated sensors. Considering the volume, complexity and heterogeneous nature of such data collections, it is a significant challenge to interpret and extract knowledge for the practical use of lifeloggers and others.
In this dissertation, time series analysis methods have been used to identify and extract useful information from temporal lifelog image data, without the benefit of prior knowledge. We focus, in particular, on three fundamental topics: noise reduction, structure and characterization of the raw data; the detection of multi-scale patterns; and the mining of important, previously unknown repeated patterns in the time series of lifelog image data.
Firstly, we show that Detrended Fluctuation Analysis (DFA) highlights the very high correlation in lifelogging image collections. Secondly, we show that study of the equal-time cross-correlation matrix demonstrates atypical or non-stationary characteristics in these images. Next, noise reduction in the cross-correlation matrix is addressed by Random Matrix Theory (RMT), before wavelet multiscaling is used to characterize the 'most important' or 'unusual' events through analysis of the associated dynamics of the eigenspectrum. A motif discovery technique is explored for detection of recurring and recognizable episodes of an individual's image data. Finally, we apply these motif discovery techniques to two known lifelog data collections, All I Have Seen (AIHS) and NTCIR-12 Lifelog, in order to examine multivariate recurrent patterns of multiple lifelogging users.
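DFA measures how fluctuation around a locally fitted trend grows with window size; the slope of that growth on log-log axes is the scaling exponent. This minimal sketch (not the thesis code, and applied to synthetic series rather than image data) recovers the standard exponents for uncorrelated and integrated noise:

```python
import numpy as np

def dfa_alpha(x, scales=(4, 8, 16, 32, 64)):
    """Detrended Fluctuation Analysis scaling exponent alpha:
    roughly 0.5 for uncorrelated noise, roughly 1.5 for Brownian motion."""
    y = np.cumsum(x - np.mean(x))              # integrated profile
    F = []
    for n in scales:
        resid = []
        for w in range(len(y) // n):           # non-overlapping windows
            seg = y[w * n:(w + 1) * n]
            t = np.arange(n)
            coef = np.polyfit(t, seg, 1)       # linear detrend per window
            resid.append(np.mean((seg - np.polyval(coef, t)) ** 2))
        F.append(np.sqrt(np.mean(resid)))      # fluctuation at scale n
    # alpha is the slope of log F(n) against log n
    alpha, _ = np.polyfit(np.log(scales), np.log(F), 1)
    return alpha

rng = np.random.default_rng(1)
white = rng.normal(size=4096)                  # uncorrelated series
alpha_white = dfa_alpha(white)                 # expect near 0.5
alpha_brown = dfa_alpha(np.cumsum(white))      # expect near 1.5
```

An exponent well above 0.5, as reported for the lifelog collections, indicates long-range correlation rather than independent samples.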
Advances in Weakly Supervised Learning of Morphology
Morphological analysis provides a decomposition of words into smaller constituents. It is an important problem in natural language processing (NLP), particularly for morphologically rich languages whose large vocabularies make statistical modeling difficult. Morphological analysis has traditionally been approached with rule-based methods that yield accurate results but are expensive to produce. More recently, unsupervised machine learning methods have been shown to perform sufficiently well to benefit applications such as speech recognition and machine translation. Unsupervised methods, however, do not typically model allomorphy, that is, non-concatenative structure, for example pretty/prettier. Moreover, the accuracy of unsupervised methods remains far behind that of rule-based methods, with the best unsupervised methods yielding F-scores between 50% and 66% in Morpho Challenge 2010.
We examine these problems with two approaches that have not previously attracted much attention in the field. First, we propose a novel extension to the popular unsupervised morphological segmentation method Morfessor Baseline to model allomorphy via the use of string transformations. Second, we examine the effect of weak supervision on accuracy by training on a small annotated data set in addition to a large unannotated data set. We propose two novel semi-supervised morphological segmentation methods, namely a semi-supervised extension of Morfessor Baseline and morphological segmentation with conditional random fields (CRF). The methods are evaluated on several languages with different morphological characteristics, including English, Estonian, Finnish, German and Turkish. The proposed methods are compared empirically to recently proposed weakly supervised methods.
For the non-concatenative extension, we find that, while the string transformations identified by the model have high precision, their recall is low. In the overall evaluation the non-concatenative extension improves accuracy on English, but not on other languages. For the weak supervision we find that the semi-supervised extension of Morfessor Baseline improves the accuracy of segmentation markedly over the unsupervised baseline. We find, however, that the discriminatively trained CRFs perform even better. In the empirical comparison, the CRF approach outperforms all other approaches on all included languages. Error analysis reveals that the CRF excels especially on affix accuracy.
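The CRF formulation treats segmentation as character-level sequence labeling. The boundary-tag encoding below is a common scheme for this task (not necessarily the exact label set used in the thesis): it shows how a segmentation maps losslessly to per-character tags and back, which is what lets a sequence model predict segmentations.

```python
def encode(morphs):
    """Map a segmentation to per-character tags:
    'B' starts a morph, 'M' continues one."""
    tags = []
    for m in morphs:
        tags.append('B')
        tags.extend('M' * (len(m) - 1))
    return tags

def decode(word, tags):
    """Invert the tagging back into a list of morphs."""
    morphs = []
    for ch, tag in zip(word, tags):
        if tag == 'B':
            morphs.append(ch)
        else:
            morphs[-1] += ch
    return morphs

# hypothetical analysis of Finnish 'talossa' as talo + ssa
tags = encode(["talo", "ssa"])
```

A CRF then scores entire tag sequences jointly, which is what allows discriminative training on the small annotated set.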
Dynamical models and machine learning for supervised segmentation
This thesis is concerned with the problem of how to outline regions of interest in medical images, when
the boundaries are weak or ambiguous and the region shapes are irregular. The focus on machine learning
and interactivity leads to a common theme of the need to balance conflicting requirements. First,
any machine learning method must strike a balance between how much it can learn and how well it
generalises. Second, interactive methods must balance minimal user demand with maximal user control.
To address the problem of weak boundaries, methods of supervised texture classification are investigated
that do not use explicit texture features. These methods enable prior knowledge about the image to
benefit any segmentation framework. A chosen dynamic contour model, based on probabilistic boundary
tracking, combines these image priors with efficient modes of interaction. We show the benefits of the
texture classifiers over intensity and gradient-based image models, in both classification and boundary
extraction.
To address the problem of irregular region shape, we devise a new type of statistical shape model
(SSM) that does not use explicit boundary features or assume high-level similarity between region
shapes. First, the models are used for shape discrimination, to constrain any segmentation framework
by way of regularisation. Second, the SSMs are used for shape generation, allowing probabilistic segmentation
frameworks to draw shapes from a prior distribution. The generative models also include
novel methods to constrain shape generation according to information from both the image and user
interactions.
The shape models are first evaluated in terms of discrimination capability, and shown to outperform
other shape descriptors. Experiments also show that the shape models can benefit a standard type of
segmentation algorithm by providing shape regularisers. We finally show how to exploit the shape
models in supervised segmentation frameworks, and evaluate their benefits in user trials.
Combination of multiple image segmentations
The thesis concerns combination of multiple image segmentations in the
domains of contour detection and region-based image segmentation. The
goal is to combine multiple segmentations into a final improved result.
In the case of region-based image segmentation combination, a
generalized median concept is proposed to automatically determine the
final number of regions. Extensive experiments demonstrate that our
combination method outperforms the ground truth based training approach.
In addition, an experimental investigation of existing segmentation
evaluation measures is presented, examining their metric properties and
evaluation behaviour. This study is intended to serve as a guideline for
appropriately choosing evaluation measures.
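The generalized median is defined over all possible segmentations and is expensive to compute exactly; the set median, restricted to the input segmentations, is a standard cheap approximation. The sketch below (toy 1-D label vectors, with 1 − Rand index as the distance, both illustrative choices) shows the idea:

```python
from itertools import combinations

def rand_distance(a, b):
    """1 - Rand index: the fraction of element pairs on which two
    labelings disagree about same-region vs different-region."""
    pairs = list(combinations(range(len(a)), 2))
    disagree = sum((a[i] == a[j]) != (b[i] == b[j]) for i, j in pairs)
    return disagree / len(pairs)

def set_median(segmentations):
    """The input segmentation minimising its summed distance to the
    others -- a standard approximation to the generalized median."""
    return min(segmentations,
               key=lambda s: sum(rand_distance(s, t) for t in segmentations))

# three toy 1-D segmentations, one region label per pixel
segs = [[0, 0, 1, 1, 1],
        [0, 0, 0, 1, 1],
        [0, 0, 1, 2, 2]]
median = set_median(segs)
```

Comparing region memberships pairwise, rather than raw labels, makes the distance invariant to how regions happen to be numbered in each input segmentation.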
Unsupervised learning of the statistical association between speech and text
One of the key challenges in artificial cognitive systems is to develop effective algorithms that learn without human supervision to understand qualitatively different realisations of the same abstraction, and thereby also acquire an ability to transcribe a sensory data stream to a completely different modality. This is also true of the so-called Big Data problem. Through learning of associations between multiple types of data of the same phenomenon, it is possible to capture the hidden dynamics that govern the processes that yielded the measured data.
In this thesis, a methodological framework for automatic discovery of statistical associations between two qualitatively different data streams is proposed. The simulations are run on a noisy, high bit-rate, sensory signal (speech) and temporally discrete categorical data (text). In order to distinguish the approach from traditional automatic speech recognition systems, it does not utilize any phonetic or linguistic knowledge in the recognition. It merely learns statistically sound units of speech and text and their mutual mappings in an unsupervised manner. The experiments on child directed speech with limited vocabulary show that, after a period of learning, the method acquires a promising ability to transcribe continuous speech to its textual representation.
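Cross-modal association learning of this kind can be sketched with co-occurrence statistics: count how often units from one stream occur alongside units from the other, then map each unit to its most strongly associated counterpart. The aligned data, unit identifiers and the pointwise-mutual-information scoring below are all invented for illustration; the thesis learns its units from the signals themselves.

```python
import math
from collections import Counter

# toy aligned stream pairs: acoustic-unit ids vs words (invented data)
utterances = [
    (["u1", "u2"], ["dog", "runs"]),
    (["u1", "u3"], ["dog", "sleeps"]),
    (["u4", "u2"], ["cat", "runs"]),
    (["u3"], ["sleeps"]),
    (["u4"], ["cat"]),
]

unit_n, word_n, joint_n, total = Counter(), Counter(), Counter(), 0
for units, words in utterances:
    for u in units:
        for w in words:
            unit_n[u] += 1       # marginals counted per co-occurrence event
            word_n[w] += 1
            joint_n[u, w] += 1
            total += 1

def pmi(u, w):
    """Pointwise mutual information of a unit/word co-occurrence."""
    return math.log(joint_n[u, w] * total / (unit_n[u] * word_n[w]))

# map each unit to its most strongly associated word
mapping = {u: max((w for w in word_n if joint_n[u, w] > 0),
                  key=lambda w: pmi(u, w))
           for u in unit_n}
```

PMI rewards pairs that co-occur more often than their individual frequencies predict, so frequent but uninformative words do not dominate the mapping.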
Statistical language learning
Theoretical arguments based on the "poverty of the stimulus" have denied a
priori the possibility that abstract linguistic representations can be learned
inductively from exposure to the environment, given that the linguistic input
available to the child is both underdetermined and degenerate. I reassess such
learnability arguments by exploring a) the type and amount of statistical
information implicitly available in the input in the form of distributional and
phonological cues; b) psychologically plausible inductive mechanisms for
constraining the search space; c) the nature of linguistic representations,
algebraic or statistical. To do so I use three methodologies: experimental
procedures, linguistic analyses based on large corpora of naturally occurring
speech and text, and computational models implemented in computer
simulations.
In Chapters 1,2, and 5, I argue that long-distance structural dependencies
- traditionally hard to explain with simple distributional analyses based on n-gram
statistics - can indeed be learned associatively provided the amount of
intervening material is highly variable or invariant (the Variability effect). In
Chapter 3, I show that simple associative mechanisms instantiated in Simple
Recurrent Networks can replicate the experimental findings under the same
conditions of variability. Chapter 4 presents successes and limits of such results
across perceptual modalities (visual vs. auditory) and perceptual presentation
(temporal vs. sequential), as well as the impact of long and short training
procedures. In Chapter 5, I show that generalisation to abstract categories from
stimuli framed in non-adjacent dependencies is also modulated by the Variability
effect. In Chapter 6, I show that the putative separation of algebraic and
statistical styles of computation based on successful speech segmentation versus
unsuccessful generalisation experiments (as published in a recent Science paper)
is premature and is the effect of a preference for phonological properties of the
input. In chapter 7 computer simulations of learning irregular constructions
suggest that it is possible to learn from positive evidence alone, despite Gold's
celebrated arguments on the unlearnability of natural languages. Evolutionary
simulations in Chapter 8 show that irregularities in natural languages can emerge
from full regularity and remain stable across generations of simulated agents. In
Chapter 9 I conclude that the brain may be endowed with a powerful statistical
device for detecting structure, generalising, segmenting speech, and recovering
from overgeneralisations. The experimental and computational evidence gathered
here suggests that statistical language learning is more powerful than heretofore
acknowledged by the current literature.
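The speech-segmentation work discussed above relies on distributional statistics of the kind made famous by artificial-language experiments: word boundaries tend to fall where the forward transitional probability between adjacent syllables dips. A minimal sketch (syllable corpus and threshold invented for illustration, not the thesis's models):

```python
from collections import Counter

def transitional_probs(stream):
    """Forward transitional probability P(next | current) for every
    adjacent syllable pair observed in the stream."""
    uni = Counter(stream)
    bi = Counter(zip(stream, stream[1:]))
    return {pair: n / uni[pair[0]] for pair, n in bi.items()}

def segment(stream, tp, threshold):
    """Place a word boundary wherever the transitional probability
    between adjacent syllables falls below the threshold."""
    words, current = [], [stream[0]]
    for a, b in zip(stream, stream[1:]):
        if tp[a, b] < threshold:
            words.append(current)
            current = []
        current.append(b)
    words.append(current)
    return words

# toy corpus built from three invented 'words': ba-bi, tu-to, go-la
stream = ["ba", "bi", "tu", "to", "go", "la", "tu", "to",
          "ba", "bi", "go", "la", "ba", "bi"]
words = segment(stream, transitional_probs(stream), threshold=0.75)
```

Within-word transitions here have probability 1.0 while cross-word transitions stay at 0.5 or below, so a fixed threshold recovers the word boundaries; the Variability effect concerns how such statistics extend to non-adjacent dependencies.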