
No Pattern, No Recognition: a Survey about Reproducibility and Distortion Issues of Text Clustering and Topic Modeling

Author: Beinotti João Vitor Pataca
da Silva Nádia Félix Felipe
da Silva Vinícius Adolfo Pereira
de Carvalho André Carlos Ponce de Leon Ferreira
Gardini Miguel de Mattos
Nunes Augusto Sousa
Silva Marília Costa Rosendo
Siqueira Felipe Alves
Tarrega João Pedro Mantovani
Publication date: 02/08/2022

Extracting knowledge from unlabeled texts using machine learning algorithms can be complex. Document categorization and information retrieval are two applications that may benefit from unsupervised learning (e.g., text clustering and topic modeling), including exploratory data analysis. However, the unsupervised learning paradigm poses reproducibility issues. The initialization can lead to variability depending on the machine learning algorithm. Furthermore, the distortions can be misleading when regarding cluster geometry. Amongst the causes, the presence of outliers and anomalies can be a determining factor. Despite the relevance of initialization and outlier issues for text clustering and topic modeling, the authors did not find an in-depth analysis of them. This survey provides a systematic literature review (2011-2022) of these subareas and proposes a common terminology since similar procedures have different terms. The authors describe research opportunities, trends, and open issues. The appendices summarize the theoretical background of the text vectorization, the factorization, and the clustering algorithms that are directly or indirectly related to the reviewed works.
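
The initialization issue mentioned in the abstract can be illustrated with a minimal sketch. It assumes k-means (Lloyd's algorithm) on a toy one-dimensional dataset; the data, the function name `lloyd_1d`, and the two sets of starting centroids are hypothetical and are not taken from the survey.

```python
def lloyd_1d(points, centroids, max_iter=100):
    """Run Lloyd's k-means on 1-D data from the given initial centroids."""
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    # Within-cluster sum of squared distances for the final centroids.
    inertia = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return centroids, inertia


# Toy data with three well-separated groups (hypothetical).
data = [0, 1, 2, 10, 11, 12, 20, 21, 22]

# Same data and algorithm, two different initializations.
good_centroids, good_inertia = lloyd_1d(data, [1, 11, 21])
poor_centroids, poor_inertia = lloyd_1d(data, [0, 1, 2])

print(good_centroids, good_inertia)
print(poor_centroids, poor_inertia)
```

Both runs converge, but to different partitions with different within-cluster sums of squares, which is the kind of run-to-run variability that makes results hard to reproduce unless the initialization is fixed and reported.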

arXiv.org e-Print Archive

On adaptive decision rules and decision parameter adaptation for automatic speech recognition

Author: Huo Q
Lee CH
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2000

Recent advances in automatic speech recognition are accomplished by designing a plug-in maximum a posteriori decision rule such that the forms of the acoustic and language model distributions are specified and the parameters of the assumed distributions are estimated from a collection of speech and language training corpora. Maximum-likelihood point estimation is by far the most prevailing training method. However, due to the problems of unknown speech distributions, sparse training data, high spectral and temporal variabilities in speech, and possible mismatch between training and testing conditions, a dynamic training strategy is needed. To cope with the changing speakers and speaking conditions in real operational conditions for high-performance speech recognition, such paradigms incorporate a small amount of speaker and environment specific adaptation data into the training process. Bayesian adaptive learning is an optimal way to combine prior knowledge in an existing collection of general models with a new set of condition-specific adaptation data. In this paper, the mathematical framework for Bayesian adaptation of acoustic and language model parameters is first described. Maximum a posteriori point estimation is then developed for hidden Markov models and a number of useful parameter densities commonly used in automatic speech recognition and natural language processing.
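
As a minimal illustration of how maximum a posteriori point estimation combines prior knowledge with adaptation data, the sketch below estimates the mean of a Gaussian with known variance under a conjugate Gaussian prior. The function names, the prior weight `tau`, and the sample values are assumptions for illustration; the paper develops the estimates for hidden Markov models and other densities, which this sketch does not cover.

```python
def map_mean(prior_mean, tau, samples):
    """MAP estimate of a Gaussian mean with known variance sigma^2.

    Assumes the conjugate prior mean ~ N(prior_mean, sigma^2 / tau) with
    tau > 0, so tau acts like a count of pseudo-observations at prior_mean.
    """
    n = len(samples)
    return (tau * prior_mean + sum(samples)) / (tau + n)


def ml_mean(samples):
    """Maximum-likelihood estimate: the plain sample mean."""
    return sum(samples) / len(samples)


general_model_mean = 0.0       # hypothetical mean from the general model
adaptation_data = [3.0, 3.0]   # hypothetical condition-specific samples

# With little data the MAP estimate stays between the prior and the data.
print(map_mean(general_model_mean, 2.0, adaptation_data))
print(ml_mean(adaptation_data))
```

With no adaptation data the estimate equals the prior mean, and as the amount of adaptation data grows it approaches the maximum-likelihood estimate, which is why this style of update suits sparse, condition-specific data.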

CiteSeerX

HKU Scholars Hub

NILM techniques for intelligent home energy management and ambient assisted living: a review

Author: Alvaro Hernandez
Antonio Ruano
Egarter
Firth
Jesus Ureña
Juan Garcia
Kramer
Maria Ruano
Publication venue: 'MDPI AG'
Publication date: 01/01/2019

The ongoing deployment of smart meters and different commercial devices has made electricity disaggregation feasible in buildings and households, based on a single measurement of the current and, sometimes, of the voltage. Energy disaggregation is intended to separate the total power consumption into specific appliance loads, which can be achieved by applying Non-Intrusive Load Monitoring (NILM) techniques with a minimum invasion of privacy. NILM techniques have become increasingly widespread in recent years, as a consequence of the interest companies and consumers have in efficient energy consumption and management. This work presents a detailed review of NILM methods, focusing on recent proposals and their applications, particularly in the areas of Home Energy Management Systems (HEMS) and Ambient Assisted Living (AAL), where the ability to determine the on/off status of certain devices can provide key information for making further decisions. As well as complementing previous reviews on the NILM field and providing a discussion of the applications of NILM in HEMS and AAL, this paper provides guidelines for future research in these topics. Funding agency: Programa Operacional Portugal 2020 and Programa Operacional Regional do Algarve 01/SAICT/2018/39578; Fundação para a Ciência e Tecnologia through IDMEC, under LAETA: SFRH/BSAB/142998/2018 SFRH/BSAB/142997/2018 UID/EMS/50022/2019; Junta de Comunidades de Castilla-La-Mancha, Spain: SBPLY/17/180501/000392; Spanish Ministry of Economy, Industry and Competitiveness (SOC-PLC project): TEC2015-64835-C3-2-R MINECO/FEDER

Multidisciplinary Digital Publishing Institute

Crossref

Sapientia

A Bayesian Approach to Graphical Record Linkage and De-duplication

Author: Fienberg Stephen E.
Hall Rob
Steorts Rebecca C.
Publication date: 02/03/2014

We propose an unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation involves the representation of the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate transitive linkage probabilities across records (and represent this visually), and propagate the uncertainty of record linkage into later analyses. Our method makes it particularly easy to integrate record linkage with post-processing procedures such as logistic regression, capture-recapture, etc. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previous record linkage approaches, despite the high-dimensional parameter space. We illustrate our method using longitudinal data from the National Long Term Care Survey and with data from the Italian Survey on Household and Wealth, where we assess the accuracy of our method and show it to be better in terms of error rates and empirical scalability than other approaches in the literature. Comment: 39 pages, 8 figures, 8 tables. Longer version of arXiv:1403.0211. In press, Journal of the American Statistical Association: Theory and Methods (2015).

arXiv.org e-Print Archive

CiteSeerX

Cohesion and Repulsion in Bayesian Distance Clustering

Author: De Iorio Maria
Glenn Simon
Heinecke Andreas
Mayer Emanuel
Natarajan Abhinav
Publication date: 23/03/2023

Clustering in high dimensions poses many statistical challenges. While traditional distance-based clustering methods are computationally feasible, they lack probabilistic interpretation and rely on heuristics for estimation of the number of clusters. On the other hand, probabilistic model-based clustering techniques often fail to scale, and devising algorithms that are able to effectively explore the posterior space is an open problem. Based on recent developments in Bayesian distance-based clustering, we propose a hybrid solution that entails defining a likelihood on pairwise distances between observations. The novelty of the approach consists in including both cohesion and repulsion terms in the likelihood, which allows for cluster identifiability. This implies that clusters are composed of objects which have small "dissimilarities" among themselves (cohesion) and similar dissimilarities to observations in other clusters (repulsion). We show how this modelling strategy has interesting connections with existing proposals in the literature as well as a decision-theoretic interpretation. The proposed method is computationally efficient and applicable to a wide variety of scenarios. We demonstrate the approach in a simulation study and an application in digital numismatics. Comment: 1 supplementary PDF attached.

arXiv.org e-Print Archive

White Rose Research Online

FigShare

Machine Learning Aided Stochastic Elastoplastic and Damage Analysis of Functionally Graded Structures

Author: Feng Yuan
Publication venue: UNSW, Sydney
Publication date: 01/01/2021

The elastoplastic and damage analyses, which serve as key indicators for the nonlinear performance of engineering structures, have been extensively investigated during the past decades. However, with the development of advanced composite materials, such as the functionally graded material (FGM), evaluating the nonlinear behaviour of such advantageous materials remains a tough challenge. Moreover, despite the widely adopted assumption that structural system parameters are deterministic, it has already been shown that inevitable and mercurial uncertainties in these system properties are inherently associated with the structural models and nonlinear analysis processes concerned. The existence of such fluctuations potentially affects the actual elastoplastic and damage behaviours of FGM structures, which leads to a mismatch between the approximation results and the actual structural safety conditions. Consequently, it is requisite to establish a robust stochastic nonlinear analysis framework that complies with the requirements of modern composite engineering practices. In this dissertation, a novel uncertain nonlinear analysis framework, namely the machine learning aided stochastic elastoplastic and damage analysis framework, is presented for FGM structures. The proposed approach is a favorable alternative to determine structural reliability when full-scale testing is not achievable, thus leading to significant reductions in the manpower and computational effort spent in practical engineering applications. Within the developed framework, a novel extended support vector regression (X-SVR) with Dirichlet feature mapping approach is introduced and then incorporated for the subsequent uncertainty quantification. By successfully establishing the governing relationship between the uncertain system parameters and any concerned structural output, a comprehensive probabilistic profile including means, standard deviations, probability density functions (PDFs), and cumulative distribution functions (CDFs) of the structural output can be effectively established through a sampling scheme. Consequently, by adopting the machine learning aided stochastic elastoplastic and damage analysis framework into real-life engineering application, the advantages of the next generation uncertainty quantification analysis can be highlighted, and appreciable contributions can be delivered to both structural safety evaluation and structural design fields.
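
The sampling step described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: `surrogate` is a hypothetical linear stand-in for a trained regression model (it is not X-SVR), the input distribution is an arbitrary standard normal, and all names are invented for the example.

```python
import bisect
import random
import statistics


def surrogate(parameter):
    """Hypothetical stand-in for a trained regression model (not X-SVR)."""
    return 2.0 * parameter + 1.0


def probabilistic_profile(model, sampler, n_samples, seed=0):
    """Propagate random inputs through a surrogate and summarize the output."""
    rng = random.Random(seed)
    outputs = sorted(model(sampler(rng)) for _ in range(n_samples))

    def empirical_cdf(y):
        # Fraction of sampled outputs that are <= y.
        return bisect.bisect_right(outputs, y) / n_samples

    return {
        "mean": statistics.fmean(outputs),
        "std": statistics.stdev(outputs),
        "cdf": empirical_cdf,
    }


# Assumed input uncertainty: a standard normal system parameter.
profile = probabilistic_profile(
    surrogate, lambda rng: rng.gauss(0.0, 1.0), n_samples=20000
)

print(profile["mean"], profile["std"], profile["cdf"](1.0))
```

Because the surrogate is cheap to evaluate, many samples can be drawn; a histogram of the same sampled outputs would give an approximation of the PDF.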

UNSWorks