On adaptive decision rules and decision parameter adaptation for automatic speech recognition
Recent advances in automatic speech recognition are accomplished by designing a plug-in maximum a posteriori decision rule such that the forms of the acoustic and language model distributions are specified and the parameters of the assumed distributions are estimated from a collection of speech and language training corpora. Maximum-likelihood point estimation is by far the most prevalent training method. However, due to the problems of unknown speech distributions, sparse training data, high spectral and temporal variabilities in speech, and possible mismatch between training and testing conditions, a dynamic training strategy is needed. To cope with the changing speakers and speaking conditions in real operational conditions for high-performance speech recognition, such paradigms incorporate a small amount of speaker- and environment-specific adaptation data into the training process. Bayesian adaptive learning is an optimal way to combine prior knowledge in an existing collection of general models with a new set of condition-specific adaptation data. In this paper, the mathematical framework for Bayesian adaptation of acoustic and language model parameters is first described. Maximum a posteriori point estimation is then developed for hidden Markov models and a number of useful parameter densities commonly used in automatic speech recognition and natural language processing.
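The core idea of Bayesian adaptive learning above can be illustrated with a minimal sketch: MAP adaptation of a single Gaussian mean under a conjugate normal prior. This is a toy stand-in for full HMM parameter adaptation, not the paper's formulation; the function name and all values are illustrative assumptions.

```python
import numpy as np

# Hedged sketch: MAP adaptation of one Gaussian mean with a conjugate normal
# prior. mu0/tau play the role of the prior (speaker-independent) model;
# x holds a small amount of speaker-specific adaptation data.
def map_adapt_mean(x, mu0, tau, sigma):
    """MAP estimate of the mean under an N(mu0, tau^2) prior and N(mu, sigma^2) likelihood."""
    n = len(x)
    w = (n / sigma**2) / (n / sigma**2 + 1 / tau**2)  # shrinkage weight on the data
    return w * np.mean(x) + (1 - w) * mu0

rng = np.random.default_rng(0)
x = rng.normal(5.0, 1.0, size=10)                 # adaptation data
mu_map = map_adapt_mean(x, mu0=0.0, tau=2.0, sigma=1.0)
print(round(mu_map, 3))   # lies between the prior mean (0) and the sample mean
```

As more adaptation data arrives, the shrinkage weight moves from the prior mean toward the sample mean, which is the behaviour that makes MAP estimation attractive for small condition-specific corpora.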
Learning from Data Streams with Randomized Forests
Non-stationary streaming data poses a familiar challenge in machine learning: the need to
obtain fast and accurate predictions. A data stream is a continuously generated sequence of
data, with data typically arriving rapidly. They are often characterised by a non-stationary
generative process, with concept drift occurring as the process changes. Such processes are
commonly seen in the real world, such as in advertising, shopping trends, environmental
conditions, electricity monitoring and traffic monitoring.
Typical stationary algorithms are ill-suited for use with concept drifting data, thus necessitating
more targeted methods. Tree-based methods are a popular approach to this problem,
traditionally focussing on the use of the Hoeffding bound in order to guarantee performance
relative to a stationary scenario. However, few single learners are available for
regression scenarios, and those that do exist often struggle to choose between similarly
discriminative splits, leading to longer training times and worse performance. This limited
pool of single learners in turn hampers the performance of ensemble approaches in which
they act as base learners.
In this thesis we seek to remedy this gap in the literature, developing methods which
focus on increasing randomization to both improve predictive performance and reduce the
training times of tree-based ensemble methods. In particular, we have chosen to investigate
the use of randomization as it is known to be able to improve generalization error in
ensembles, and is also expected to lead to fast training times, thus being a natural method
of handling the problems typically experienced by single learners.
We begin in a regression scenario, introducing the Adaptive Trees for Streaming with
Extreme Randomization (ATSER) algorithm, a partially randomized approach based on
the concept of Extremely Randomized (extra) trees. The ATSER algorithm incrementally
trains trees, using the Hoeffding bound to select the best of a random selection of splits.
Simultaneously, the trees also detect and adapt to changes in the data stream. Unlike many
traditional streaming algorithms ATSER trees can easily be extended to include nominal
features. We find that compared to other contemporary methods ensembles of ATSER
trees lead to improved predictive performance whilst also reducing run times.
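The split-selection step described above can be sketched as follows. This is a hedged, simplified illustration of using the Hoeffding bound to pick the best of a random selection of splits, not the ATSER implementation; the function names, the 1.0 value range, and the score values are all assumptions.

```python
import math
import random

# Hedged sketch of Hoeffding-bound split selection over a random subset of
# candidate splits, in the spirit of extremely randomized streaming trees.
def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the observed mean is within epsilon of the true mean w.p. 1 - delta."""
    return math.sqrt(value_range**2 * math.log(1.0 / delta) / (2.0 * n))

def choose_split(scored_splits, value_range=1.0, delta=1e-6, n=2000):
    """Accept the best randomly drawn split once it clearly beats the runner-up."""
    ranked = sorted(scored_splits, key=lambda s: s[1], reverse=True)
    best, second = ranked[0], ranked[1]
    eps = hoeffding_bound(value_range, delta, n)
    # If the gap is within epsilon, keep collecting data before committing.
    return best[0] if best[1] - second[1] > eps else None

random.seed(1)
candidates = [(f"split_{i}", random.random()) for i in range(5)]
print(choose_split(candidates))
```

The bound shrinks as more observations arrive, so a tree commits to a split only once the observed gap between the top two candidates is statistically meaningful.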
We then demonstrate the Adaptive Categorisation Trees for Streaming with Extreme
Randomization (ACTSER) algorithm, an adaptation of the ATSER algorithm to the more
traditional categorization scenario, again showing improved predictive performance and
reduced runtimes. The inclusion of nominal features is particularly novel in this setting
since typical categorization approaches struggle to handle them.
Finally we examine a completely randomized scenario, where an ensemble of trees is generated
prior to having access to the data stream, while also considering multivariate splits
in addition to the traditional axis-aligned approach. We find that through the combination
of a forgetting mechanism in linear models and dynamic weighting for ensemble members,
we are able to avoid explicitly testing for concept drift. This leads to fast ensembles
with strong predictive performance, whilst also requiring fewer parameters than other
contemporary methods.
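The dynamic-weighting idea above, in which recently accurate members dominate without an explicit drift test, can be sketched as follows. This is a hedged toy illustration, not the thesis's scheme; the decay factor and the error sequences are assumptions.

```python
import numpy as np

# Hedged sketch: weight ensemble members by exponentially discounted squared
# error, so models that are accurate on recent data dominate after a concept
# drift without any explicit drift detector.
def update_weights(discounted_loss, errors, gamma=0.95):
    discounted_loss = gamma * discounted_loss + errors**2   # forget old errors
    w = 1.0 / (discounted_loss + 1e-12)                     # low loss -> high weight
    return discounted_loss, w / w.sum()

loss = np.zeros(3)
# Model 0 is accurate before the drift at t = 25; model 2 is accurate after it.
for t in range(50):
    errs = np.array([0.1, 1.0, 2.0]) if t < 25 else np.array([2.0, 1.0, 0.1])
    loss, weights = update_weights(loss, errs)
print(weights.argmax())  # → 2: the post-drift model carries the most weight
```

The forgetting factor gamma controls how quickly the ensemble reallocates weight after a drift, trading reaction speed against stability on noisy streams.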
For each of the proposed methods in this thesis, we demonstrate empirically that they are
effective over a variety of different non-stationary data streams, including on multiple
types of concept drift. Furthermore, in comparison to other contemporary data streaming
algorithms, we find the biggest improvements in performance are on noisy data streams.
Spatial Price Transmission, Transaction Costs, and Econometric Modelling and Modelling Salmonella Spread in Broiler Production
Transaction costs are major determinants of price transmission across space and must be accounted for when modelling price transmission. This article contributes to the literature by evaluating the impact of not properly accounting for transaction cost variation on price transmission parameters, using a Monte Carlo experiment and a real-world application. We show that when transaction costs are variable and nonstationary, threshold vector error correction models assuming fixed thresholds provide biased inference, while a flexible threshold specification accounting for transaction cost variation is able to provide unbiased estimates of market performance indicators. In the second essay, we identify determinants and control strategies for Salmonella in broiler production. The presence of Salmonella spp. in broiler production is a concern as the bacterium can be transmitted to humans via contaminated meat and derived products. A longitudinal study using official results of Salmonella spp. isolation from drag swabs collected at the end of the grow-out period was performed to determine risk factors related to farm and broiler house characteristics and management practices, as recorded by a Brazilian integrated broiler enterprise. A Bayesian hierarchical spatio-temporal model revealed significant spatial and temporal influence and significant effects of broiler house size, total housing area per farm, type of broiler house, and litter recycles on the odds of isolating Salmonella spp. from litter, allowing the implementation of measures to reduce the risk of persistence of the bacterium in the broiler production chain. We find evidence of a principal-agent problem in setting strategies to control the bacteria in litter and suggest the adoption of incentives aiming to reduce prevalence in the integrated enterprise. The possibility of implementing optimal control measures by extending recorded data is discussed.
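The threshold idea in the first essay can be sketched with a toy band-adjustment rule: the price spread between two markets error-corrects only when it exceeds the transaction cost band, since only then is arbitrage profitable. This is a hedged illustration under assumed parameters, not the article's estimated model.

```python
import numpy as np

# Hedged sketch: band-threshold error correction. Inside the band |spread| <= tc
# arbitrage is unprofitable and the spread follows a random walk; outside it,
# the excess over the band is corrected at speed rho. All numbers illustrative.
rng = np.random.default_rng(11)
tc, rho = 1.0, 0.5            # transaction cost band and adjustment speed
spread = 3.0                  # start well outside the band
path = []
for _ in range(200):
    if abs(spread) > tc:                               # outside: error correction
        spread -= rho * np.sign(spread) * (abs(spread) - tc)
    spread += rng.normal(0, 0.2)                       # random shock each period
    path.append(spread)

inside = np.mean(np.abs(path) <= tc + 0.5)
print(inside > 0.5)   # the spread spends most of its time near the band
```

Assuming a fixed band when the true band varies over time is what, per the abstract, biases fixed-threshold TVECM inference.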
Untangling hotel industry’s inefficiency: An SFA approach applied to a renowned Portuguese hotel chain
The present paper explores the technical efficiency of four hotels from the Teixeira Duarte Group, a renowned Portuguese hotel chain. An efficiency ranking is established for these four hotel units located in Portugal using Stochastic Frontier Analysis. This methodology makes it possible to discriminate between measurement error and systematic inefficiency in the estimation process, enabling investigation of the main causes of inefficiency. Several suggestions concerning efficiency improvement are offered for each hotel studied.
Comparative analysis of the frequentist and Bayesian approaches to stress testing
Stress testing is necessary for banks as it is required by the Basel Accords for loss
predictions and regulatory and economic capital computations. It has become
increasingly important especially after the 2008 global financial crisis. Credit models
are essential in controlling credit risk. The search for new ways to more accurately
predict credit risk continues. This thesis concentrates on stress testing the probability
of default using the Bayesian posterior distribution to incorporate estimation
uncertainty and parameter instability. It also explores modelling the probability of
default using Bayesian informative priors to enhance the model predictive accuracy.
A new Bayesian informative prior selection method is proposed to include additional
information to credit risk modelling and improve model performances. We employ
cross-sectional logistic regressions to model the probability of default of mortgage
loans using both the Bayesian approach with various priors and the frequentist
approach. In the Bayesian informative prior selection method that we propose, we
treat coefficients in the PD model as time series variables. We build ARIMA models
to forecast the coefficient values in future time periods and use these ARIMA
forecasts as Bayesian informative priors. We find that the Bayesian models using this
prior selection method outperform both frequentist models and Bayesian models
with other priors in terms of model predictive accuracy.
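The prior-selection method above can be sketched in miniature: fit a time-series model to a coefficient's history and use its one-step forecast as the mean of a normal informative prior for the next period. The thesis uses full ARIMA models; to keep the sketch dependency-free, an AR(1) fitted by least squares stands in, and the coefficient history is invented for illustration.

```python
import numpy as np

# Hedged sketch: treat a PD-model coefficient's past estimates as a time series,
# fit an AR(1) by least squares, and use the one-step forecast as the mean of an
# informative normal prior for the next period's Bayesian estimation.
coef_history = np.array([0.50, 0.55, 0.53, 0.58, 0.60, 0.62])  # illustrative

y, x = coef_history[1:], coef_history[:-1]
phi, c = np.polyfit(x, y, 1)                  # AR(1) slope and intercept
prior_mean = c + phi * coef_history[-1]       # one-step-ahead forecast
prior_sd = np.std(y - (c + phi * x))          # residual spread as prior scale

print(prior_mean > coef_history.mean())       # forecast tracks the upward trend
```

Because the prior mean follows the coefficient's own trend rather than sitting at a fixed value, it carries forward information that a diffuse or static prior would discard.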
We propose a new stress testing method to model both macroeconomic stress and
coefficient uncertainty. Based on U.S. mortgage loan data, we model the probability
of default at the account level using discrete time hazard analysis. We employ both
the frequentist and Bayesian methods in parameter estimation and default rate (DR)
stress testing. By applying the parameter posterior distribution obtained in the
Bayesian approach to simulating the Bayesian estimated DR distribution, we reduce
the estimation risk coming from employing point estimates in stress testing. We find
that the 99% value at risk (VaR) using the Bayesian posterior distribution approach is around 6.5 times the VaR at the same probability level using the frequentist approach
with parameter mean estimates.
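The mechanism behind that gap can be illustrated with a hedged toy simulation: a default-rate distribution generated once from posterior draws of a single logistic intercept and once from its point estimate alone. All numbers are invented for illustration and are unrelated to the thesis's mortgage data or its 6.5x figure.

```python
import numpy as np

# Hedged sketch: 99% VaR of a simulated default-rate (DR) distribution using
# (a) posterior draws of a toy logistic intercept and (b) only its point
# estimate. Posterior uncertainty fattens the loss tail.
rng = np.random.default_rng(42)
n_loans = 10_000

beta_hat, beta_sd = -3.0, 0.3                       # point estimate, posterior spread
posterior = rng.normal(beta_hat, beta_sd, 5_000)    # posterior draws of the intercept

def simulate_dr(betas):
    pd = 1.0 / (1.0 + np.exp(-betas))               # per-draw default probability
    defaults = rng.binomial(n_loans, pd)
    return defaults / n_loans

dr_bayes = simulate_dr(posterior)
dr_point = simulate_dr(np.full(5_000, beta_hat))
var_bayes = np.quantile(dr_bayes, 0.99)
var_point = np.quantile(dr_point, 0.99)
print(var_bayes > var_point)   # estimation risk raises the tail quantile
```

Under the point estimate, DR variation comes only from binomial sampling noise; the posterior approach adds parameter uncertainty on top, which is exactly the estimation risk the thesis quantifies.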
We further simulate DR distributions based on models built on crisis and tranquil time
periods to explore the impact that changes in model parameters between different
scenarios have on stress testing results. We apply the parameter posterior
distribution obtained in a Bayesian approach to stress testing to reduce the
estimation risk that results from using parameter point estimates. We compute the
VaRs and required capital with both parameter instability between scenarios and
with estimation risk considered. The results are compared with those obtained when
coefficient changes in stress testing models or coefficient uncertainty are neglected.
We find that the required capital is considerably underestimated when neither
parameter instability nor estimation risk is addressed.
Hierarchical Bayesian Fuzzy Clustering Approach for High Dimensional Linear Time-Series
This paper develops a computational approach to improve fuzzy clustering and forecasting performance when dealing with endogeneity issues and misspecified dynamics in high dimensional dynamic data. Hierarchical Bayesian methods are used to structure linear time variations, reduce dimensionality, and compute a distance function capturing the most probable set of clusters among univariate and multivariate time-series. Nonlinearities involved in the procedure look like permanent shifts and are replaced by coefficient changes. Monte Carlo implementations are also addressed to compute exact posterior probabilities for each cluster chosen and then minimize the increasing probability of outliers plaguing traditional clustering time-series techniques. An empirical example highlights the strengths and limitations of the estimating procedure. Discussions with related works are also displayed
Essays on the Quantification and Propagation of Uncertainty in Climate Change Impact Assessments for Water Resource Systems
Sustainable water resources planning and management under climate change requires a proper treatment of uncertainties that emerge in an impacts analysis. A primary source of this uncertainty originates from the difficulties in projecting how anthropogenic greenhouse gas emissions will evolve over time and influence the climate system at regional and local scales. However, other sources of uncertainty, such as errors in modeling hydrologic response to climate and the influences of internal climate variability, compound the effects of climate change uncertainty and further obscure our understanding of water resources performance under future climate conditions. This work presents an approach to quantify the interactions, propagation, and relative contributions of different sources of uncertainty in a water resources impacts assessment under climate change. Hydrologic modeling uncertainty is addressed using Bayesian methods that can quantify both parametric and structural errors. Hydrologic uncertainties are propagated through an ensemble of climate projections to explore their joint uncertainty. A new stochastic weather generator is presented to develop a wide ensemble of climate projections that can extend beyond the limited range of change often afforded by global climate models and better explore climate risks. The weather generator also enables the development of multiple realizations of the same mean climate conditions, allowing an exploration of the effects of internal climate variability. The uncertainties from mean climate changes, internal climate variability, and hydrologic modeling errors are then integrated in two climate change analyses of a flood control facility and a multi-purpose surface reservoir system, respectively, to explore their separate and combined effect on future system performance. 
The primary goal of this work is to present methods that can better estimate the precision associated with future projections of water resource system performance under climate change, and thereby provide information that can guide the development of adaptation strategies that are robust to these uncertainties.
Dynamic effective connectivity
Metastability is a key source of itinerant dynamics in the brain; namely, spontaneous spatiotemporal reorganization of neuronal activity. This itinerancy has been the focus of numerous dynamic functional connectivity (DFC) analyses, developed to characterize the formation and dissolution of distributed functional patterns over time using resting-state fMRI. However, aside from technical and practical controversies, these approaches cannot recover the neuronal mechanisms that underwrite itinerant (e.g., metastable) dynamics, owing to their descriptive, model-free nature. We argue that effective connectivity (EC) analyses are more apt for investigating the neuronal basis of metastability. To this end, we appeal to biologically grounded models (i.e., dynamic causal modelling, DCM) and dynamical systems theory (i.e., heteroclinic sequential dynamics) to create a probabilistic, generative model of haemodynamic fluctuations. This model generates trajectories in the parametric space of EC modes (i.e., states of connectivity) that characterize functional brain architectures. In brief, it extends an established spectral DCM to generate functional connectivity data features that change over time. This foundational paper tries to establish the model's face validity by simulating non-stationary fMRI time series and recovering key model parameters (i.e., transition probabilities among connectivity states and the parametric nature of these states) using variational Bayes. These data are further characterized using Bayesian model comparison (within and between subjects). Finally, we consider practical issues that attend applications and extensions of this scheme. Importantly, the scheme operates within a generic Bayesian framework that can be adapted to study metastability and itinerant dynamics in any non-stationary time series.
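The transition probabilities among connectivity states mentioned above can be pictured with a hedged toy: itinerant switching among a few connectivity "modes" as a discrete Markov chain. The transition matrix is invented for illustration and is not part of the DCM scheme itself.

```python
import numpy as np

# Hedged sketch: a three-state Markov chain as a toy stand-in for switching
# among EC modes. Diagonal dominance makes each mode metastable: the chain
# dwells in a mode, then spontaneously moves on.
rng = np.random.default_rng(3)
P = np.array([[0.90, 0.08, 0.02],
              [0.05, 0.90, 0.05],
              [0.02, 0.08, 0.90]])   # rows are transition probabilities (sum to 1)

state, states = 0, []
for _ in range(2000):
    states.append(state)
    state = rng.choice(3, p=P[state])

visited = set(states)
print(visited == {0, 1, 2})   # the chain itinerantly visits every mode
```

Estimating a matrix like P from observed haemodynamic data, together with the parametric character of each mode, is the inverse problem the variational-Bayes scheme addresses.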
Extensions to the Latent Dirichlet Allocation Topic Model Using Flexible Priors
Intrinsically, topic models always have their likelihood functions fixed to multinomial
distributions, as they operate on count data instead of Gaussian data. As a result,
their performances ultimately depend on the flexibility of the chosen prior distributions
when following the Bayesian paradigm compared to classical approaches such as PLSA
(probabilistic latent semantic analysis), unigrams and mixture of unigrams that do not use
prior information. The standard LDA (latent Dirichlet allocation) topic model operates
with symmetric Dirichlet distribution (as a conjugate prior) which has been found to carry
some limitations due to its independent structure that tends to hinder performance for
instance in topic correlation, including positively correlated data processing. Compared to
classical ML estimators, the use of priors offers another advantage:
smoothing the multinomials while enhancing predictive topic models.
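The smoothing advantage of a prior over a classical ML estimator can be shown in one line: a symmetric Dirichlet(alpha) prior gives every word non-zero probability, where the ML multinomial estimate assigns zero mass to unseen words. The counts and alpha below are illustrative.

```python
import numpy as np

# Hedged sketch: posterior-mean estimate of a multinomial under a symmetric
# Dirichlet(alpha) prior, i.e. additive smoothing of the word counts.
def smoothed_multinomial(counts, alpha=0.1):
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

theta_ml = np.array([3, 1, 0]) / 4.0            # ML estimate: zero mass on word 2
theta_bayes = smoothed_multinomial([3, 1, 0])   # smoothed: all words retain mass
print(theta_bayes[2] > 0, abs(theta_bayes.sum() - 1.0) < 1e-12)
```

The more flexible GD and BL priors studied in the thesis generalize this effect by relaxing the Dirichlet's independence structure, at the cost of more involved update equations.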
In this thesis, we propose a series of flexible priors such as generalized Dirichlet (GD)
and Beta-Liouville (BL) for our topic models within the collapsed representation, leading
to much improved CVB (collapsed variational Bayes) update equations compared to ones
from the standard LDA. This is because the flexibility of these priors improves significantly
the lower bounds in the corresponding CVB algorithms. We also show the robustness of our
proposed CVB inferences when using simultaneously the BL and GD in hybrid generative-discriminative models
where the generative stage produces good and heterogeneous topic
features that are used in the discriminative stage by powerful classifiers such as SVMs
(support vector machines) as we propose efficient probabilistic kernels to facilitate processing
(classification) of documents based on topic signatures. Doing so, we implicitly cast topic
modeling which is an unsupervised learning method into a supervised learning technique.
Furthermore, due to the complexity of the CVB algorithm (as it requires second order
Taylor expansions) in general, despite its flexibility, we propose a much simpler and tractable
update equation using a MAP (maximum a posteriori) framework with the standard EM
(expectation-maximization) algorithm. As most Bayesian posteriors are not tractable for
complex models, we ultimately propose the MAP-LBLA (latent BL allocation) where we
characterize the contributions of asymmetric BL priors over the symmetric Dirichlet (Dir).
The proposed MAP technique offers a point estimate (the posterior mode) with a much
more tractable solution. We show that this point estimate is easier to implement
than a full Bayesian analysis that integrates over the entire parameter space. The MAP
implicitly exhibits an equivalence relationship with the CVB, especially with the zero-order
approximation CVB0 and its stochastic version SCVB0. The proposed method enhances
performance in information retrieval for text document analysis.
We show that parametric topic models (being finite-dimensional methods) have a
much smaller hypothesis space and generally suffer from model-selection problems. We therefore
propose a Bayesian nonparametric (BNP) technique that uses the Hierarchical Dirichlet
process (HDP) as conjugate prior to the document multinomial distributions where the
asymmetric BL serves as a diffuse (probability) base measure that provides the global
atoms (topics) that are shared among documents. The heterogeneity in the topic structure
helps in providing an alternative to model selection because the nonparametric topic model
(which is infinite dimensional with a much bigger hypothesis space) could now prune out
irrelevant topics based on the associated probability masses to only retain the most relevant
ones.
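The pruning behaviour described above can be sketched via the stick-breaking construction used at the HDP's top level: global topic weights decay geometrically, and topics whose mass falls below a threshold are discarded, giving automatic model selection. The concentration parameter and threshold below are illustrative assumptions.

```python
import numpy as np

# Hedged sketch: truncated stick-breaking draws of global topic weights, as in
# the HDP's top level. Weights decay, so only a finite number of topics carry
# appreciable mass and the rest can be pruned.
def stick_breaking(gamma, max_topics, rng):
    v = rng.beta(1.0, gamma, size=max_topics)                 # stick proportions
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - v)[:-1]])
    return v * remaining                                      # topic weights

rng = np.random.default_rng(7)
weights = stick_breaking(gamma=2.0, max_topics=50, rng=rng)
active = int((weights > 1e-3).sum())      # topics retained after pruning
print(active < 50, weights.sum() < 1.0)   # far fewer topics survive than the truncation
```

This is the sense in which the nonparametric model "prunes out irrelevant topics based on the associated probability masses": capacity is nominally unbounded, but the prior concentrates mass on a small effective set.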
We also show that for large-scale applications, stochastic optimization using natural
gradients of the objective function yields significant performance gains when rapidly
learning both data and parameters in an online (streaming) fashion. We use both
predictive likelihood and perplexity as evaluation methods to assess the robustness of our
proposed topic models as we ultimately refer to probability as a way to quantify uncertainty
in our Bayesian framework. We improve object categorization in terms of inferences through
the flexibility of our prior distributions in the collapsed space. We also improve information
retrieval technique with the MAP and the HDP-LBLA topic models while extending the
standard LDA. These two applications present the ultimate capability of enhancing a search
engine based on topic models
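The perplexity evaluation used throughout the thesis reduces to a short formula: the exponentiated negative mean per-word held-out log-likelihood, with lower values indicating a better model. The tiny uniform-model example below is illustrative.

```python
import numpy as np

# Hedged sketch: per-word perplexity from held-out per-word log-likelihoods.
def perplexity(log_probs_per_word):
    """exp(-mean per-word log-likelihood); lower is better."""
    return float(np.exp(-np.mean(log_probs_per_word)))

# Sanity check: a model uniform over a 50-word vocabulary has perplexity 50.
uniform = np.log(np.full(100, 1.0 / 50))
print(round(perplexity(uniform)))   # → 50
```

A topic model with perplexity well below the vocabulary size is therefore capturing genuine co-occurrence structure, which is the basis on which the proposed priors are compared to the standard LDA.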