1,968 research outputs found
No Pattern, No Recognition: a Survey about Reproducibility and Distortion Issues of Text Clustering and Topic Modeling
Extracting knowledge from unlabeled texts using machine learning algorithms
can be complex. Document categorization and information retrieval are two
applications that may benefit from unsupervised learning (e.g., text clustering
and topic modeling), including exploratory data analysis. However, the
unsupervised learning paradigm poses reproducibility issues. The initialization
can lead to variability depending on the machine learning algorithm.
Furthermore, the distortions can be misleading when regarding cluster geometry.
Amongst the causes, the presence of outliers and anomalies can be a determining
factor. Despite the relevance of initialization and outlier issues for text
clustering and topic modeling, the authors did not find an in-depth analysis of
them. This survey provides a systematic literature review (2011-2022) of these
subareas and proposes a common terminology since similar procedures have
different terms. The authors describe research opportunities, trends, and open
issues. The appendices summarize the theoretical background of the text
vectorization, the factorization, and the clustering algorithms that are
directly or indirectly related to the reviewed works.
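As a toy illustration of the initialization issue described in this abstract, the following sketch runs a plain one-dimensional k-means from two different starting points. The data, the starting centroids, and the function name `kmeans_1d` are assumptions made for illustration; they are not taken from the survey, which covers text clustering and topic modeling more broadly.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Lloyd-style k-means on scalars, started from the given centroids."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda k: abs(p - centroids[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the run has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = sum(members) / len(members)
    # Within-cluster sum of squared errors of the final partition.
    sse = sum((p - centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, sse


# Hypothetical data: three visible groups, but only two clusters requested.
data = [0, 1, 2, 9, 10, 11, 20, 21, 22]
labels_a, _, sse_a = kmeans_1d(data, [1.0, 10.0])
labels_b, _, sse_b = kmeans_1d(data, [10.0, 21.0])
print(labels_a, sse_a)
print(labels_b, sse_b)
```

Both runs converge, but to different partitions with different errors, which is why a single unseeded run is hard to reproduce.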
On adaptive decision rules and decision parameter adaptation for automatic speech recognition
Recent advances in automatic speech recognition are accomplished by designing a plug-in maximum a posteriori decision rule such that the forms of the acoustic and language model distributions are specified and the parameters of the assumed distributions are estimated from a collection of speech and language training corpora. Maximum-likelihood point estimation is by far the most prevalent training method. However, due to the problems of unknown speech distributions, sparse training data, high spectral and temporal variabilities in speech, and possible mismatch between training and testing conditions, a dynamic training strategy is needed. To cope with the changing speakers and speaking conditions in real operational conditions for high-performance speech recognition, such paradigms incorporate a small amount of speaker- and environment-specific adaptation data into the training process. Bayesian adaptive learning is an optimal way to combine prior knowledge in an existing collection of general models with a new set of condition-specific adaptation data. In this paper, the mathematical framework for Bayesian adaptation of acoustic and language model parameters is first described. Maximum a posteriori point estimation is then developed for hidden Markov models and a number of useful parameter densities commonly used in automatic speech recognition and natural language processing.
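The idea of combining prior knowledge with a small amount of adaptation data can be sketched for the simplest case: the mean of a single Gaussian with known variance and a conjugate Gaussian prior. This is a minimal sketch, not the paper's formulation for hidden Markov models; the function name `map_mean` and the prior weight `tau` are assumptions for illustration.

```python
def map_mean(prior_mean, tau, samples):
    """MAP estimate of a Gaussian mean under a conjugate Gaussian prior.

    The prior is assumed to be N(prior_mean, sigma^2 / tau), where sigma^2 is
    the known data variance, so tau acts like a count of pseudo-observations.
    """
    n = len(samples)
    return (tau * prior_mean + sum(samples)) / (tau + n)


# Hypothetical numbers: a general model with mean 0.0 and ten adaptation frames.
adaptation_data = [2.0] * 10
ml_estimate = sum(adaptation_data) / len(adaptation_data)
map_estimate = map_mean(prior_mean=0.0, tau=10.0, samples=adaptation_data)
print(ml_estimate, map_estimate)
```

With no adaptation data the estimate stays at the prior mean, and as the amount of data grows it approaches the maximum-likelihood estimate.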
NILM techniques for intelligent home energy management and ambient assisted living: a review
The ongoing deployment of smart meters and different commercial devices has made electricity disaggregation feasible in buildings and households, based on a single measure of the current and, sometimes, of the voltage. Energy disaggregation is intended to separate the total power consumption into specific appliance loads, which can be achieved by applying Non-Intrusive Load Monitoring (NILM) techniques with a minimum invasion of privacy. NILM techniques have become increasingly widespread in recent years, as a consequence of the interest companies and consumers have in efficient energy consumption and management. This work presents a detailed review of NILM methods, focusing on recent proposals and their applications, particularly in the areas of Home Energy Management Systems (HEMS) and Ambient Assisted Living (AAL), where the ability to determine the on/off status of certain devices can provide key information for making further decisions. As well as complementing previous reviews on the NILM field and providing a discussion of the applications of NILM in HEMS and AAL, this paper provides guidelines for future research in these topics.
Funding agency:
Programa Operacional Portugal 2020 and Programa Operacional Regional do Algarve
01/SAICT/2018/39578
Fundação para a Ciência e Tecnologia through IDMEC, under LAETA:
SFRH/BSAB/142998/2018
SFRH/BSAB/142997/2018
UID/EMS/50022/2019
Junta de Comunidades de Castilla-La-Mancha, Spain:
SBPLY/17/180501/000392
Spanish Ministry of Economy, Industry and Competitiveness (SOC-PLC project):
TEC2015-64835-C3-2-R MINECO/FEDER
A Bayesian Approach to Graphical Record Linkage and De-duplication
We propose an unsupervised approach for linking records across arbitrarily
many files, while simultaneously detecting duplicate records within files. Our
key innovation involves the representation of the pattern of links between
records as a bipartite graph, in which records are directly linked to latent
true individuals, and only indirectly linked to other records. This flexible
representation of the linkage structure naturally allows us to estimate the
attributes of the unique observable people in the population, calculate
transitive linkage probabilities across records (and represent this visually),
and propagate the uncertainty of record linkage into later analyses. Our method
makes it particularly easy to integrate record linkage with post-processing
procedures such as logistic regression, capture-recapture, etc. Our linkage
structure lends itself to an efficient, linear-time, hybrid Markov chain Monte
Carlo algorithm, which overcomes many obstacles encountered by previous
record linkage approaches, despite the high-dimensional parameter space. We
illustrate our method using longitudinal data from the National Long Term Care
Survey and with data from the Italian Survey on Household and Wealth, where we
assess the accuracy of our method and show it to be better in terms of error
rates and empirical scalability than other approaches in the literature.
Comment: 39 pages, 8 figures, 8 tables. Longer version of arXiv:1403.0211. In
press, Journal of the American Statistical Association: Theory and Methods
(2015)
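The bipartite representation in this abstract, where each record points to a latent individual and records are linked only through that individual, can be sketched as follows. The sample format (one list of latent-individual indices per posterior draw) and the function name `linkage_probability` are assumptions for illustration; the paper's sampler and data structures are not shown in the abstract.

```python
def linkage_probability(samples, records):
    """Fraction of posterior draws in which all given records share one
    latent individual.

    samples: list of draws; draw[r] is the latent individual assigned to record r.
    records: indices of the records whose joint linkage is being queried.
    """
    hits = sum(1 for draw in samples if len({draw[r] for r in records}) == 1)
    return hits / len(samples)


# Hypothetical posterior draws for three records.
draws = [
    [0, 0, 1],
    [0, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
]
print(linkage_probability(draws, (0, 1)))     # records 0 and 1 linked
print(linkage_probability(draws, (0, 1, 2)))  # all three records linked
```

Because links go through the latent individual, linkage is transitive within each draw, so the probability for any group of records can be read off the same set of draws.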
Cohesion and Repulsion in Bayesian Distance Clustering
Clustering in high-dimensions poses many statistical challenges. While
traditional distance-based clustering methods are computationally feasible,
they lack probabilistic interpretation and rely on heuristics for estimation of
the number of clusters. On the other hand, probabilistic model-based clustering
techniques often fail to scale and devising algorithms that are able to
effectively explore the posterior space is an open problem. Based on recent
developments in Bayesian distance-based clustering, we propose a hybrid
solution that entails defining a likelihood on pairwise distances between
observations. The novelty of the approach consists in including both cohesion
and repulsion terms in the likelihood, which allows for cluster
identifiability. This implies that clusters are composed of objects which have
small "dissimilarities" among themselves (cohesion) and similar dissimilarities
to observations in other clusters (repulsion). We show how this modelling
strategy has interesting connections with existing proposals in the literature
as well as a decision-theoretic interpretation. The proposed method is
computationally efficient and applicable to a wide variety of scenarios. We
demonstrate the approach in a simulation study and an application in digital
numismatics.
Comment: 1 supplementary PDF attached. To view the supplementary PDF, please
download the attachment under "Ancillary Files".
Machine Learning Aided Stochastic Elastoplastic and Damage Analysis of Functionally Graded Structures
The elastoplastic and damage analyses, which serve as key indicators for the nonlinear performance of engineering structures, have been extensively investigated during the past decades. However, with the development of advanced composite materials, such as the functionally graded material (FGM), evaluating the nonlinear behaviour of such advantageous materials remains a tough challenge. Moreover, although structural system parameters are widely assumed to be deterministic, it has already been shown that inevitable and volatile uncertainties in these system properties are inherently associated with the structural models concerned and the nonlinear analysis process. The existence of such fluctuations potentially affects the actual elastoplastic and damage behaviours of FGM structures, which leads to a mismatch between the approximate results and the actual structural safety conditions. Consequently, it is necessary to establish a robust stochastic nonlinear analysis framework that complies with the requirements of modern composite engineering practices.
In this dissertation, a novel uncertain nonlinear analysis framework, namely the machine learning aided stochastic elastoplastic and damage analysis framework, is presented for FGM structures. The proposed approach is a favorable alternative for determining structural reliability when full-scale testing is not achievable, thus leading to significant savings in the manpower and computational effort spent in practical engineering applications. Within the developed framework, a novel extended support vector regression (X-SVR) with Dirichlet feature mapping approach is introduced and then incorporated for the subsequent uncertainty quantification. By successfully establishing the governing relationship between the uncertain system parameters and any structural output of interest, a comprehensive probabilistic profile including means, standard deviations, probability density functions (PDFs), and cumulative distribution functions (CDFs) of the structural output can be effectively established through a sampling scheme.
Consequently, by applying the machine learning aided stochastic elastoplastic and damage analysis framework to real-life engineering applications, the advantages of next-generation uncertainty quantification analysis can be highlighted, and appreciable contributions can be delivered to both the structural safety evaluation and structural design fields.
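The sampling scheme mentioned in this abstract, in which a fitted model maps uncertain inputs to a structural output and statistics are collected over many draws, can be sketched generically. The `surrogate` function below is a hypothetical stand-in (a simple load-over-stiffness ratio), not the X-SVR model of the dissertation, and the input distributions are invented for illustration.

```python
import bisect
import random
import statistics


def surrogate(stiffness, load):
    # Hypothetical stand-in for a trained regression surrogate.
    return load / stiffness


def probabilistic_profile(n_samples=20000, seed=0):
    """Monte Carlo estimate of the mean, standard deviation, and empirical CDF
    of the surrogate output under assumed Gaussian input uncertainty."""
    rng = random.Random(seed)  # fixed seed so the run can be repeated
    outputs = sorted(
        surrogate(rng.gauss(200.0, 10.0), rng.gauss(50.0, 5.0))
        for _ in range(n_samples)
    )

    def cdf(x):
        # Share of sampled outputs that are less than or equal to x.
        return bisect.bisect_right(outputs, x) / len(outputs)

    return statistics.fmean(outputs), statistics.stdev(outputs), cdf


mean, std, cdf = probabilistic_profile()
print(mean, std, cdf(0.25))
```

Once the surrogate is cheap to evaluate, the whole profile comes from repeated evaluation rather than repeated full nonlinear analyses, which is the saving the abstract points to.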