9 research outputs found

    Bayesian Cluster Enumeration Criterion for Unsupervised Learning

    We derive a new Bayesian Information Criterion (BIC) by formulating the problem of estimating the number of clusters in an observed data set as maximization of the posterior probability of the candidate models. Given that some mild assumptions are satisfied, we provide a general BIC expression for a broad class of data distributions. This serves as a starting point when deriving the BIC for specific distributions. Along this line, we provide a closed-form BIC expression for multivariate Gaussian distributed variables. We show that incorporating the data structure of the clustering problem into the derivation of the BIC results in an expression whose penalty term is different from that of the original BIC. We propose a two-step cluster enumeration algorithm. First, a model-based unsupervised learning algorithm partitions the data according to a given set of candidate models. Subsequently, the number of clusters is determined as the one associated with the model for which the proposed BIC is maximal. The performance of the proposed two-step algorithm is tested using synthetic and real data sets. Comment: 14 pages, 7 figures
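The two-step procedure described in this abstract (partition the data for each candidate model, then keep the candidate with the largest criterion value) can be sketched roughly as follows. This is a minimal 1-D illustration under stated assumptions, not the authors' method: the listing does not give the proposed BIC, so the sketch substitutes the classic Schwarz BIC in its "larger is better" form, uses plain k-means as the partitioning step, and all function names (`kmeans_1d`, `classic_bic`, `enumerate_clusters`) are hypothetical.

```python
import math
import random


def kmeans_1d(xs, k, iters=50):
    """Step 1 (assumed): Lloyd's k-means in 1-D with a deterministic quantile start."""
    s = sorted(xs)
    centers = [s[(2 * j + 1) * len(s) // (2 * k)] for j in range(k)]
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: (x - centers[j]) ** 2) for x in xs]
        for j in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = sum(members) / len(members)
    return labels


def classic_bic(xs, labels, k, var_floor=1e-6):
    """Classic BIC (NOT the paper's criterion): log-likelihood minus a penalty."""
    n = len(xs)
    loglik = 0.0
    for j in range(k):
        members = [x for x, lab in zip(xs, labels) if lab == j]
        if not members:
            continue
        m = len(members)
        mean = sum(members) / m
        var = max(sum((x - mean) ** 2 for x in members) / m, var_floor)
        # Hard-assignment Gaussian log-likelihood of this cluster.
        loglik += m * math.log(m / n) - 0.5 * m * (math.log(2 * math.pi * var) + 1.0)
    n_params = 3 * k - 1  # k means, k variances, k - 1 free mixing weights
    return loglik - 0.5 * n_params * math.log(n)


def enumerate_clusters(xs, candidates):
    """Step 2: pick the candidate number of clusters with the maximal criterion."""
    scores = {k: classic_bic(xs, kmeans_1d(xs, k), k) for k in candidates}
    return max(scores, key=scores.get), scores


# Synthetic data: three well-separated 1-D Gaussian clusters.
rng = random.Random(0)
data = [rng.gauss(mu, 1.0) for mu in (0.0, 10.0, 20.0) for _ in range(60)]
best_k, scores = enumerate_clusters(data, range(1, 6))
print(best_k, scores)
```

Swapping `classic_bic` for another criterion changes only the scoring function; the enumeration loop stays the same.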

    Incorporating periodic variability in hidden Markov models for animal movement


    Investigation of the optimal number of clusters by the adaptive EM algorithm

    This paper investigates the optimal number of clusters for datasets that are modeled as a Gaussian mixture. For that purpose, an adaptive method based on a modified Expectation Maximization (EM) algorithm is developed. The modification is made within the hidden variable of the standard EM algorithm. Assuming that the data are multivariate normally distributed, with each component of the Gaussian mixture corresponding to one cluster, the modification exploits the fact that the Mahalanobis distance of samples follows a Chi-square distribution. In addition, a quantitative measure is constructed to determine the number of clusters. The proposed method is illustrated with several numerical examples.
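The distributional fact this abstract relies on can be illustrated with a small sketch. For a sample from a d-dimensional normal distribution, the squared Mahalanobis distance to the mean follows a Chi-square distribution with d degrees of freedom. The example below assumes d = 2, where the Chi-square quantile has the closed form -2 ln(1 - p), and assumes the mean and covariance are known. It is not the paper's modified EM algorithm, and the function names are hypothetical.

```python
import math


def mahalanobis_sq_2d(x, mean, cov):
    """Squared Mahalanobis distance of a 2-D point for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form with the inverse covariance (1/det) * [[d, -b], [-c, a]].
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def chi2_quantile_2dof(p):
    """Quantile of the Chi-square distribution with 2 degrees of freedom."""
    return -2.0 * math.log(1.0 - p)


def is_plausible_member(x, mean, cov, p=0.95):
    """True if x lies inside the probability-p ellipse of the Gaussian component."""
    return mahalanobis_sq_2d(x, mean, cov) <= chi2_quantile_2dof(p)


mean = (0.0, 0.0)
cov = ((4.0, 0.0), (0.0, 1.0))
print(is_plausible_member((2.0, 0.0), mean, cov))  # squared distance 1.0
print(is_plausible_member((0.0, 3.0), mean, cov))  # squared distance 9.0
```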

    Cluster validity in clustering methods


    Oscillating Dispersed-Phase Co-Flow Microfluidic Droplet Generation: Effects on Jet Length and Droplet Size

    Droplet-based microfluidics have emerged as versatile platforms offering unique advantages in biology and chemistry. Although they give adequate control over size and monodispersity, most conventional microfluidic techniques cannot generate more than one droplet size at a time in a continuous and high-throughput manner. Moreover, the widely used co-flow microfluidic droplet generation technique is limited by droplet polydispersity at high throughputs, caused by the transition from a more stable dripping regime to an unstable jetting regime at high dispersed-phase flow rates. We applied nozzle oscillatory motion to generate an axial shear gradient and to induce an additional transverse drag force. We hypothesized that the combined effects of axial and transverse drag can be used to overcome these limitations of co-flow systems. The effect of nozzle oscillation was studied in both the dripping and jetting regimes, to generate repeatable patterns of multi-size monodisperse droplets and to reduce jet length, respectively, in different biphasic systems.

    Causal impacts of transport interventions on air quality

    The transport sector is one of the main sources of air pollution emissions, particularly for carbon monoxide, nitrogen oxides, and particulate matter. Evaluating the effectiveness of transport interventions in improving air quality is essential to informing future policy. However, a comparison of air quality observations before and after an intervention can be biased by various factors, such as weather conditions and seasonality effects. Causal inference methods generally have advantages in intervention evaluation in terms of data requirements, model building, and the interpretation of effect estimates. Causality goes beyond statistical association in the sense that it seeks to measure the net effect of an intervention on an outcome through all possible pathways leading from the intervention to the outcome. Causal inference methods have been applied to this question before; however, important confounders (such as weather conditions) are commonly controlled for by including them as variables in the causal inference model and assuming a parametric relationship. The thesis focuses on understanding the causal impacts of transport interventions on air quality. A novel ex-post policy evaluation framework, combining meteorological normalisation, change point detection, and causal inference, is proposed to overcome the limitations of previous approaches, and it is applied to three distinct transport interventions: improving public transport supply (Jubilee Line Extension), tightening road traffic emission standards (London Ultra Low Emission Zone), and restricting both transport activities and supply (COVID-19 lockdown). The Jubilee Line Extension led to only small (< 1%) or insignificant changes in air pollution on average in London. The Ultra Low Emission Zone showed an average reduction of less than 3% for NO2 concentrations and insignificant effects on O3 and PM2.5 concentrations. The lockdown reduced NO2 concentrations in London by less than 12% on average, and it had an insignificant effect on O3, PM10, and PM2.5. The empirical results of the thesis therefore consistently highlight the necessity of a multi-faceted set of policies that aim to reduce emissions across sectors, with coordination among local, regional, and national government, in order to achieve long-term improvements in air quality in cities. Open Access

    Estatística não-paramétrica: estimação, classificação e uma nova abordagem de seleção automática para largura de banda

    The initial motivation of this thesis was to survey the state of the art in non-parametric probability density estimation, assess the most prominent techniques found in scientific publications, compare them in different situations, and evaluate their impact on likelihood-based classification. To that end, a study was carried out on the automatic choice of bandwidth, the main parameter used by the four classic non-parametric density estimators: Histogram, Average Shifted Histogram (ASH), Frequency Polygon (FP), and Kernel Density Estimation (KDE). In general, the KDE method showed the best results on all tested distributions and, because of this performance, its analysis was developed further, extending to the theory of KDE with variable bandwidth. Furthermore, several tests showed that selectors based on cross-validation are more resilient than Plug-In (PI) methods, leading to better density estimation and classification results in complex problems. Finally, the thesis produced several contributions to the state of the art in the subject under investigation, the main ones being: increased knowledge about some of the main non-parametric estimators discussed in the scientific community; development of a technique for evaluating density estimators, called the Region of Interest Map (RoIMap); a proposal for a hybrid automatic technique for adjusting the variable bandwidth selector, called Region of Interest-based Kernel Density Estimation (ROIKDE); and an evaluation of the impact of non-parametric estimation on the classification of samples. CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
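To make the idea of cross-validation bandwidth selection concrete, here is a minimal 1-D sketch. It assumes a Gaussian kernel and leave-one-out likelihood cross-validation, which is only one member of the cross-validation family the abstract mentions. It is not the thesis's ROIKDE technique, and the function names and the candidate grid are hypothetical.

```python
import math
import random


def gaussian_kde(x, sample, h):
    """Fixed-bandwidth Gaussian kernel density estimate at point x."""
    norm = len(sample) * h * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample) / norm


def loo_log_likelihood(sample, h):
    """Leave-one-out log-likelihood: score each point with the KDE of the others."""
    total = 0.0
    for i, x in enumerate(sample):
        rest = sample[:i] + sample[i + 1:]
        # Guard against log(0) when the kernel sum underflows for tiny h.
        total += math.log(max(gaussian_kde(x, rest, h), 1e-300))
    return total


def select_bandwidth(sample, candidates):
    """Return the candidate bandwidth with the highest leave-one-out score."""
    return max(candidates, key=lambda h: loo_log_likelihood(sample, h))


rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(100)]
candidates = [0.05, 0.1, 0.2, 0.4, 0.8, 1.6, 3.2]
best_h = select_bandwidth(sample, candidates)
print(best_h)
```

A very small bandwidth is penalised because isolated points receive almost no density from their neighbours, while a very large one oversmooths; the leave-one-out score trades these off without assuming a parametric form for the data.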