
    Directional clustering through matrix factorization

    This paper deals with a clustering problem where feature vectors are clustered depending on the angle between them, that is, feature vectors are grouped together if they point roughly in the same direction. This directional distance measure arises in several applications, including document classification and human brain imaging. Using ideas from the fields of constrained low-rank matrix factorization and sparse approximation, a novel approach is presented that differs from classical clustering methods, such as semi-nonnegative matrix factorization, K-EVD, or k-means clustering, yet combines aspects of all of these. As in nonnegative matrix factorization and K-EVD, the matrix decomposition is iteratively refined to optimize a data fidelity term; however, no positivity constraint is enforced directly, nor do eigenvectors need to be computed explicitly. As in k-means and K-EVD, each optimization step is followed by a hard cluster assignment. This leads to an efficient algorithm that is shown here to outperform common competitors in terms of clustering performance and/or computation speed. In addition to a detailed theoretical analysis of some of the algorithm's main properties, the approach is empirically evaluated on a range of toy problems, several standard text clustering data sets, and a high-dimensional problem in brain imaging, where functional magnetic resonance imaging data are used to partition the human cerebral cortex into distinct functional regions.
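
    The abstract gives no pseudocode, but the alternating structure it describes, a factorization refinement step followed by a hard angular assignment, is close in spirit to spherical k-means. Below is a minimal illustrative sketch of that simpler relative (not the paper's exact algorithm; all names are hypothetical):

    ```python
    import numpy as np

    def spherical_kmeans(X, k, n_iter=50, seed=0):
        """Cluster rows of X by angle: assign each unit vector to the
        centroid with the largest cosine similarity, then re-estimate
        each centroid as the renormalized mean direction of its members."""
        rng = np.random.default_rng(seed)
        X = X / np.linalg.norm(X, axis=1, keepdims=True)  # project onto the unit sphere
        C = X[rng.choice(len(X), size=k, replace=False)]  # initial cluster directions
        for _ in range(n_iter):
            labels = np.argmax(X @ C.T, axis=1)           # hard assignment by cosine
            for j in range(k):
                members = X[labels == j]
                if len(members):
                    m = members.sum(axis=0)
                    C[j] = m / np.linalg.norm(m)          # mean direction, renormalized
        return labels, C

    # toy data: two bundles of vectors pointing in roughly opposite directions
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(5, 1, (50, 10)), rng.normal(-5, 1, (50, 10))])
    labels, C = spherical_kmeans(X, k=2)
    ```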

    Hadronic final states in deep-inelastic scattering with Sherpa

    We extend the multi-purpose Monte-Carlo event generator Sherpa to include processes in deeply inelastic lepton-nucleon scattering. Hadronic final states in this kinematical setting are characterised by the presence of multiple kinematical scales, which until now were accounted for only by specific resummations in individual kinematical regions. Using an extension of the recently introduced method for merging truncated parton showers with higher-order tree-level matrix elements, it is possible to obtain predictions that are reliable in all kinematical limits. Different hadronic final states in deep-inelastic scattering, defined by jets or individual hadrons, are analysed and the corresponding results are compared to HERA data. The various sources of theoretical uncertainty in the approach are discussed and quantified. The extension to deeply inelastic processes provides the opportunity to validate the merging of matrix elements and parton showers in multi-scale kinematics inaccessible in other collider environments. It also allows the use of HERA data on hadronic final states in the tuning of hadronisation models.
    Comment: 32 pages, 22 figures
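
    For orientation, the "multiple kinematical scales" of deep-inelastic scattering are conventionally expressed through the invariants Q², Bjorken x and inelasticity y. A small, self-contained helper computing them from four-momenta (purely illustrative; unrelated to Sherpa's internals):

    ```python
    import numpy as np

    def dis_invariants(k, k_prime, P):
        """Q^2, Bjorken x and inelasticity y from the four-momenta
        (E, px, py, pz) of the incoming lepton k, the scattered lepton
        k', and the incoming nucleon P, using the (+,-,-,-) metric."""
        g = np.array([1.0, -1.0, -1.0, -1.0])
        dot = lambda a, b: float(np.sum(g * a * b))
        q = k - k_prime                 # momentum of the exchanged boson
        Q2 = -dot(q, q)                 # photon virtuality, one hard scale
        x = Q2 / (2.0 * dot(P, q))      # Bjorken scaling variable
        y = dot(P, q) / dot(P, k)       # fraction of lepton energy transferred
        return Q2, x, y
    ```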

    Extending the Matrix Element Method beyond the Born approximation: Calculating event weights at next-to-leading order accuracy

    In this article we illustrate how event weights for jet events can be calculated efficiently at next-to-leading order (NLO) accuracy in QCD. This is a crucial prerequisite for the application of the Matrix Element Method at NLO. We modify the recombination procedure used in jet algorithms to allow a factorisation of the phase space for the real corrections into resolved and unresolved regions. Using an appropriate infrared regulator, the latter can be integrated numerically. As an illustration, we reproduce differential distributions at NLO for two sample processes. As a further application and proof of concept, we apply the Matrix Element Method at NLO accuracy to the mass determination of top quarks produced in e+e- annihilation. This analysis is relevant for a future Linear Collider. We observe a significant shift in the extracted mass depending on whether the Matrix Element Method is used at leading or next-to-leading order.
    Comment: 35 pages, 12 figures, references & acknowledgments added, typos corrected, matches published version
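
    At its core, the Matrix Element Method turns per-event weights into a likelihood over a theory parameter. The following is a toy leading-order analogue of the top-mass extraction described above, with an assumed Breit-Wigner stand-in for the true event weight; it is a sketch, not the paper's NLO machinery:

    ```python
    import numpy as np

    def event_weight(m_obs, m, gamma=1.5):
        """Toy normalized event weight: a Breit-Wigner (Cauchy) density in
        the reconstructed invariant mass, standing in for |M|^2 x phase space."""
        return (gamma / (2 * np.pi)) / ((m_obs - m) ** 2 + (gamma / 2) ** 2)

    def mem_mass_estimate(events, m_grid):
        """Scan the hypothesis mass and return the maximum-likelihood value."""
        log_l = [np.log(event_weight(events, m)).sum() for m in m_grid]
        return m_grid[int(np.argmax(log_l))]

    rng = np.random.default_rng(1)
    events = 173.0 + 0.75 * rng.standard_cauchy(500)   # pseudo-data, "true" mass 173
    m_hat = mem_mass_estimate(events, np.linspace(170.0, 176.0, 241))
    ```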

    Non-Negative Matrix Factorization Based Algorithms to Cluster Frequency Basis Functions for Monaural Sound Source Separation.

    Monophonic sound source separation (SSS) refers to a process that separates out the audio signals produced by the individual sound sources in a given acoustic mixture, when the mixture signal is recorded using one microphone or is directly recorded onto one reproduction channel. Many audio applications, such as pitch modification and automatic music transcription, would benefit from the availability of the segregated sound sources for further processing. Recently, non-negative matrix factorization (NMF) has found application in monaural audio source separation due to its ability to factorize audio spectrograms into additive part-based basis functions, where the parts typically correspond to individual notes or chords in music. An advantage of NMF is that there can be a single basis function for each note played by a given instrument, thereby capturing changes in timbre with pitch for each instrument or source. However, these basis functions need to be clustered to their respective sources for the reconstruction of the individual source signals. Many clustering methods have been proposed to map the separated signals to sources, with considerable success. Recently, to avoid the need for clustering, Shifted NMF (SNMF) was proposed, which assumes that the timbre of a note is constant across all the pitches produced by an instrument. SNMF has two drawbacks. Firstly, the assumption that the timbre of the notes played by an instrument remains constant is not true in general. Secondly, the SNMF method uses the constant-Q transform (CQT), and the lack of a true inverse of the CQT compromises the separation quality of the reconstructed signal.

    The principal aim of this thesis is to solve the problem of clustering NMF basis functions. Our first major contribution is the use of SNMF as a method of clustering the basis functions obtained via standard NMF. The proposed SNMF clustering method aims to cluster the frequency basis functions obtained via standard NMF to their respective sources by exploiting shift invariance in a log-frequency domain. Further, a minor contribution is made by improving the separation performance of the standard SNMF algorithm (here used directly to separate sources) through the use of an improved inverse CQT; in this setting, the standard SNMF algorithm finds shift invariance in a constant-Q spectrogram, which contains the frequency basis functions, obtained directly from the spectrogram of the audio mixture. Our next contribution is an improvement of the SNMF clustering algorithm through the incorporation of the CQT matrix inside the SNMF model, in order to avoid the need for an inverse CQT to reconstruct the clustered NMF basis functions. Another major contribution deals with the incorporation of a constraint called group sparsity (GS) into the SNMF clustering algorithm at two stages to improve clustering; the effect of GS is evaluated on the various SNMF clustering algorithms proposed in this thesis. Finally, we introduce a new family of masks to reconstruct the original signal from the clustered basis functions and compare their performance to that of generalized Wiener filter masks using three different factorisation-based separation algorithms. We show that better separation performance can be achieved by using the proposed family of masks.
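
    A minimal sketch of two building blocks the thesis works with: standard NMF with multiplicative updates on a magnitude spectrogram, and per-component Wiener-style masks. The SNMF clustering that constitutes the thesis's actual contribution is omitted, and variable names are illustrative:

    ```python
    import numpy as np

    def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
        """Factorize a non-negative magnitude spectrogram V ~ W @ H with
        the classic multiplicative updates for the Euclidean cost."""
        rng = np.random.default_rng(seed)
        W = rng.random((V.shape[0], rank)) + eps   # frequency basis functions
        H = rng.random((rank, V.shape[1])) + eps   # time-varying activations
        for _ in range(n_iter):
            H *= (W.T @ V) / (W.T @ W @ H + eps)
            W *= (V @ H.T) / (W @ H @ H.T + eps)
        return W, H

    def wiener_masks(W, H, eps=1e-9):
        """One soft mask per basis function: each component's share of the
        model, applied to the complex STFT before inverting to a waveform."""
        V_hat = W @ H + eps
        return [np.outer(W[:, k], H[k]) / V_hat for k in range(W.shape[1])]
    ```

    Summing the masks of basis functions assigned to the same source then yields one mask per source, which is the reconstruction step the thesis's proposed mask family refines.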

    Constructing a data-driven receptor model for organic and inorganic aerosol : a synthesis analysis of eight mass spectrometric data sets from a boreal forest site

    The interactions between organic and inorganic aerosol chemical components are integral to understanding and modelling climate- and health-relevant aerosol physicochemical properties, such as volatility, hygroscopicity, light scattering and toxicity. This study presents a synthesis analysis of eight data sets of non-refractory aerosol composition measured at a boreal forest site. The measurements, performed with an aerosol mass spectrometer, cover in total around 9 months over the course of 3 years. In our statistical analysis, we use the complete organic and inorganic unit-resolution mass spectra, as opposed to the more common approach of including only the organic fraction. The analysis is based on iterative, combined use of (1) data reduction, (2) classification and (3) scaling tools, producing a data-driven, chemical-mass-balance type of model capable of describing site-specific aerosol composition. The receptor model we constructed was able to explain 83 ± 8 % of the variation in the data, which increased to 96 ± 3 % when signals from low signal-to-noise variables were not considered. The resulting interpretation of an extensive set of aerosol mass spectrometric data infers seven distinct aerosol chemical components for a rural boreal forest site: ammonium sulfate (35 ± 7 % of mass), low- and semi-volatile oxidised organic aerosols (27 ± 8 % and 12 ± 7 %), biomass burning organic aerosol (11 ± 7 %), a nitrate-containing organic aerosol type (7 ± 2 %), ammonium nitrate (5 ± 2 %), and hydrocarbon-like organic aerosol (3 ± 1 %). Some of the additionally observed, rare outlier aerosol types likely emerge due to surface ionisation effects and likely represent amine compounds from an unknown source and alkaline metals from emissions of a nearby district heating plant. Compared to traditional, ion-balance-based inorganics apportionment schemes for aerosol mass spectrometer data, our statistics-based method provides an improved, more robust approach, yielding readily useful information for the modelling of the physical and chemical properties of submicron atmospheric aerosols. The results also shed light on the division between organic and inorganic aerosol types and the dynamics of salt formation in aerosol. Equally importantly, the combined methodology exemplifies an iterative analysis using successive steps drawn from a combination of statistical methods. Such an approach offers new ways to home in on physicochemically sensible solutions with minimal need for a priori information or analyst interference. We therefore suggest that similar statistics-based approaches offer significant potential for un- or semi-supervised machine-learning applications in future analyses of aerosol mass spectrometric data.
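
    The iterative reduction/classification/scaling loop is specific to the paper, but its skeleton can be sketched with generic tools. Here is a hedged outline using PCA and k-means as stand-ins for the authors' factorization and classification steps; all parameter choices and names are assumptions:

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    def receptor_model_sketch(X, n_factors=7, seed=0):
        """X: (samples x m/z channels) unit-resolution mass spectra.
        (1) reduce, (2) classify, (3) scale back and report explained variation."""
        pca = PCA(n_components=n_factors, random_state=seed)
        scores = pca.fit_transform(X)                        # (1) data reduction
        labels = KMeans(n_clusters=n_factors, n_init=10,
                        random_state=seed).fit_predict(scores)  # (2) classification
        X_hat = pca.inverse_transform(scores)                # (3) back to data space
        explained = 1.0 - ((X - X_hat) ** 2).sum() / ((X - X.mean(0)) ** 2).sum()
        return labels, explained
    ```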

    Acoustic Adaptation to Dynamic Background Conditions with Asynchronous Transformations

    This paper proposes a framework for performing adaptation to complex and non-stationary background conditions in Automatic Speech Recognition (ASR) by means of asynchronous Constrained Maximum Likelihood Linear Regression (aCMLLR) transforms and asynchronous Noise Adaptive Training (aNAT). The proposed method aims to apply the feature transform that best compensates for the background in every input frame. The implementation uses a new Hidden Markov Model (HMM) topology that expands the usual left-to-right HMM into parallel branches adapted to different background conditions and permits transitions among them. With this topology, the proposed adaptation requires neither ground truth nor prior knowledge about the background in each frame, as it aims to maximise the overall log-likelihood of the decoded utterance. The proposed aCMLLR transforms can be further improved by retraining models in an aNAT fashion and by using speaker-based MLLR transforms in cascade for efficient modelling of background and speaker effects. An initial evaluation on a modified version of the WSJCAM0 corpus incorporating 7 different background conditions provides a benchmark against which to evaluate the use of aCMLLR transforms. A relative reduction of 40.5% in Word Error Rate (WER) was achieved by the combined use of aCMLLR and MLLR in cascade. Finally, this selection of techniques was applied to the transcription of multi-genre media broadcasts, where the use of aNAT training, aCMLLR transforms and MLLR transforms provided a relative improvement of 2–3%.
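
    A toy rendering of the central idea: switching, per frame, to whichever affine feature transform x' = Ax + b best matches the acoustics. For brevity this sketch scores frames against a single Gaussian rather than the full HMM topology and utterance-level decoding the paper actually uses:

    ```python
    import numpy as np
    from scipy.stats import multivariate_normal

    def select_background_transforms(frames, transforms, mean, cov):
        """frames: (T, d) acoustic features; transforms: list of (A, b) pairs,
        one per assumed background condition. Returns, for each frame, the
        index of the transform with the highest likelihood under the model."""
        model = multivariate_normal(mean=mean, cov=cov)
        scores = np.stack([model.logpdf(frames @ A.T + b) for A, b in transforms])
        return np.argmax(scores, axis=0)   # asynchronous switching across frames
    ```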

    Relevance-based language modelling for recommender systems

    Relevance-Based Language Models, commonly known as Relevance Models, are successful approaches to explicitly introducing the concept of relevance into the statistical Language Modelling framework of Information Retrieval. These models achieve state-of-the-art retrieval performance in the pseudo-relevance feedback task. The field of recommender systems, meanwhile, is a fertile research area in which users are provided with personalised recommendations in several applications. In this paper, we propose an adaptation of the Relevance Modelling framework to effectively suggest recommendations to a user. We also propose a probabilistic clustering technique to perform the neighbour selection process, as a way to achieve a better approximation of the set of relevant items in the pseudo-relevance feedback process. These techniques, although well known in the Information Retrieval field, have not yet been applied to recommender systems, and, as the empirical evaluation results show, both proposals individually outperform several baseline methods. Furthermore, by combining both approaches, even larger effectiveness improvements are achieved.
    This is the author's version of a work accepted for publication in Information Processing and Management, 49(4), 2013, DOI: 10.1016/j.ipm.2013.03.001. This work was funded by the Secretaría de Estado de Investigación, Desarrollo e Innovación of the Spanish Government under Projects TIN2012-33867 and TIN2011-28538-C02.
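
    A toy, unsmoothed rendering of the adaptation: a user's neighbours play the role of the pseudo-relevant document set, and unseen items are scored with an RM1-style estimate. This is illustrative only; the paper's models include smoothing and the probabilistic neighbour-selection step, neither shown here:

    ```python
    import numpy as np

    def rm1_recommend(ratings, user, neighbours, top_n=10, eps=1e-12):
        """ratings: (users x items) non-negative matrix. Score item i by
        sum over neighbours v of p(i|v) * p(user's observed items|v)."""
        R = ratings[neighbours]
        p_item_given_v = R / (R.sum(axis=1, keepdims=True) + eps)
        seen = ratings[user] > 0
        # likelihood of the user's observed items under each neighbour's model
        log_lik = np.log(p_item_given_v[:, seen] + eps).sum(axis=1)
        weights = np.exp(log_lik - log_lik.max())      # neighbour weights
        scores = weights @ p_item_given_v              # relevance-model estimate
        scores[seen] = -np.inf                         # never re-recommend seen items
        return np.argsort(scores)[::-1][:top_n]
    ```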

    New approaches for unsupervised transcriptomic data analysis based on Dictionary learning

    The era of high-throughput data generation enables new access to biomolecular profiles and their exploitation. However, the analysis of such biomolecular data, for example transcriptomic data, suffers from the so-called "curse of dimensionality", which occurs when a dataset has significantly more variables than data points. As a consequence, overfitting and the unintentional learning of process-independent patterns can appear, which can lead to insignificant results in the application. A common way of counteracting this problem is to apply dimension reduction methods and subsequently analyse the resulting low-dimensional representation, which has a smaller number of variables.

    In this thesis, two new methods for the analysis of transcriptomic datasets are introduced and evaluated. Our methods are based on the concepts of Dictionary learning, an unsupervised dimension reduction approach. Unlike many dimension reduction approaches that are widely applied in transcriptomic data analysis, Dictionary learning does not impose constraints on the components to be derived, which allows great flexibility when adjusting the representation to the data. Further, Dictionary learning belongs to the class of sparse methods. The result of a sparse method is a model with few non-zero coefficients, which is often preferred for its simplicity and ease of interpretation. Sparse methods exploit the fact that the analysed datasets are highly structured; indeed, transcriptomic data are particularly structured, owing, for example, to the connections between genes and pathways. Nonetheless, the application of Dictionary learning in medical data analysis has so far been largely restricted to image analysis. Another advantage of Dictionary learning is that it is an interpretable approach, and interpretability is a necessity in biomolecular data analysis for gaining a holistic understanding of the investigated processes.

    Our two new transcriptomic data analysis methods are each designed for one main task: (1) identification of subgroups among samples from mixed populations, and (2) temporal ordering of samples from dynamic datasets, also referred to as "pseudotime estimation". Both methods are evaluated on simulated and real-world data and compared to other methods that are widely applied in transcriptomic data analysis, where they deliver high performance and overall outperform the comparison methods.
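
    A minimal sketch of the kind of decomposition involved: sparse dictionary learning on a (samples x genes) matrix via scikit-learn, with the sparse codes then available for subgroup clustering or pseudotime-style ordering. The thesis's methods build on this concept but are not this off-the-shelf call; all parameter values here are assumptions:

    ```python
    import numpy as np
    from sklearn.decomposition import DictionaryLearning

    rng = np.random.default_rng(0)
    X = rng.standard_normal((60, 2000))       # toy: 60 samples x 2000 genes

    dl = DictionaryLearning(n_components=10,              # low-dimensional codes
                            alpha=1.0,                    # sparsity penalty weight
                            max_iter=200,
                            transform_algorithm="lasso_lars",
                            random_state=0)
    codes = dl.fit_transform(X)               # (60, 10) sparse sample representations
    atoms = dl.components_                    # (10, 2000) dictionary in gene space

    # downstream (task 1): cluster `codes` to identify sample subgroups;
    # (task 2): order samples along an atom's activation as a crude pseudotime proxy
    ```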