302 research outputs found

    Simplicial similarity and its application to hierarchical clustering

    This paper introduces an extension of the notion of statistical depth that allows proximities between pairs of points to be measured. In particular, we extend the simplicial depth function, which measures how central a point is by using random simplices (triangles in the two-dimensional case). The paper is structured as follows: first, a brief introduction to statistical depth functions; next, the definition of the simplicial similarity function and a study of its properties; finally, a few graphical examples that show its behavior with symmetric and asymmetric distributions, and an application of the function to hierarchical clustering.
    Keywords: Statistical depth, Similarity measures, Hierarchical clustering
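    As an illustration of the idea behind simplicial depth, the sketch below estimates the depth of a 2-D point as the fraction of triangles with vertices in a sample that contain it. The function names are our own, and the exhaustive enumeration is only practical for small samples.

```python
from itertools import combinations

def _sign(p, a, b):
    # Cross-product sign: which side of the segment a-b the point p lies on.
    return (p[0] - b[0]) * (a[1] - b[1]) - (a[0] - b[0]) * (p[1] - b[1])

def in_triangle(p, a, b, c):
    # p is inside (or on the boundary of) triangle abc iff the three
    # side tests do not disagree in sign.
    d1, d2, d3 = _sign(p, a, b), _sign(p, b, c), _sign(p, c, a)
    has_neg = d1 < 0 or d2 < 0 or d3 < 0
    has_pos = d1 > 0 or d2 > 0 or d3 > 0
    return not (has_neg and has_pos)

def simplicial_depth(p, sample):
    # Fraction of 2-D simplices (triangles) with vertices in the sample
    # that contain the query point p.
    triangles = list(combinations(sample, 3))
    hits = sum(in_triangle(p, *t) for t in triangles)
    return hits / len(triangles)
```

    A central point such as the middle of a square sample is contained in all triangles (depth 1), while a point far outside is contained in none (depth 0).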

    A mixture of variational canonical correlation analysis for nonlinear and quality-relevant process monitoring

    Proper monitoring of quality-related variables in industrial processes is one of today's main challenges worldwide, with significant safety and efficiency implications. A variational Bayesian mixture of canonical correlation analysis (VBMCCA)-based process monitoring method is proposed in this paper to predict and diagnose these hard-to-measure quality-related variables simultaneously. Using Student's t-distribution, rather than the Gaussian distribution, in the VBMCCA model makes the proposed process monitoring scheme insensitive to disturbances, measurement noise, and model discrepancies. A sequential perturbation (SP) method, together with the derived parameter distribution of VBMCCA, is employed to estimate the uncertainty levels; this provides a confidence interval around the predicted values and an additional control line, rather than just a fixed absolute control limit, for process monitoring. The proposed process monitoring framework has been validated on a wastewater treatment plant (WWTP) simulated by a benchmark simulation model, with abrupt changes imposed on a sensor, and on a real WWTP with filamentous sludge bulking. The results show that the proposed methodology is capable of detecting sensor faults and process faults with satisfactory accuracy.

    Complexity-based classification of software modules

    Software plays a major role in many organizations. Organizational success depends partially on the quality of the software used. In recent years, many researchers have recognized that statistical classification techniques are well suited to developing software quality prediction models. Different statistical software quality models, using complexity metrics as early indicators of software quality, have been proposed in the past. At a high level, the problem of software categorization is to classify software modules as fault-prone or non-fault-prone. The focus of this thesis is two-fold. The first is to study selected classification techniques, including unsupervised and supervised learning algorithms widely used for software categorization. The second is to explore a new unsupervised learning model employing Bayesian and deterministic approaches. In addition, we evaluate and compare these approaches experimentally using a real data set. Our experimental results show that different algorithms lead to statistically significantly different results.
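    A toy version of the unsupervised side of this task: the sketch below (our illustration, not the thesis's model) splits modules into two groups with a one-dimensional 2-means on a single complexity metric and flags the high-complexity group as potentially fault-prone.

```python
def two_means(values, iters=100):
    # Minimal 1-D 2-means: alternate between assigning each module's
    # complexity value to the nearer centroid and recomputing centroids.
    lo, hi = min(values), max(values)
    for _ in range(iters):
        a = [v for v in values if abs(v - lo) <= abs(v - hi)]
        b = [v for v in values if abs(v - lo) > abs(v - hi)]
        new_lo = sum(a) / len(a)
        new_hi = sum(b) / len(b) if b else hi
        if (new_lo, new_hi) == (lo, hi):
            break                          # converged
        lo, hi = new_lo, new_hi
    return lo, hi

def fault_prone(value, lo, hi):
    # A module is flagged if it is closer to the high-complexity centroid.
    return abs(value - hi) < abs(value - lo)
```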

    Early Detection of Research Trends

    Being able to rapidly recognise new research trends is strategic for many stakeholders, including universities, institutional funding bodies, academic publishers and companies. The literature presents several approaches to identifying the emergence of new research topics, which rely on the assumption that the topic is already exhibiting a certain degree of popularity and is consistently referred to by a community of researchers. However, detecting the emergence of a new research area at an embryonic stage, i.e., before the topic has been consistently labelled by a community of researchers and associated with a number of publications, is still an open challenge. In this dissertation, we begin to address this challenge by performing a study of the dynamics preceding the creation of new topics. This study indicates that the emergence of a new topic is anticipated by a significant increase in the pace of collaboration between relevant research areas, which can be seen as the 'ancestors' of the new topic. Based on this understanding, we developed Augur, a novel approach to effectively detecting the emergence of new research topics. Augur analyses the diachronic relationships between research areas and is able to detect clusters of topics that exhibit dynamics correlated with the emergence of new research topics. Here we also present the Advanced Clique Percolation Method (ACPM), a new community detection algorithm developed specifically to support this task. Augur was evaluated on a gold standard of 1,408 debutant topics in the 2000-2011 timeframe and outperformed four alternative approaches in terms of both precision and recall.
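    ACPM builds on the classic clique percolation method. As background (this is the standard CPM, not the authors' advanced variant), the sketch below finds k-clique communities for k = 3: triangles that share an edge are merged into the same community.

```python
from itertools import combinations

def k_clique_communities(edges, k=3):
    # Basic clique percolation: k-cliques sharing k-1 nodes are merged
    # into one community (k=3: triangles sharing an edge).
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    nodes = sorted(adj)
    cliques = [frozenset(c) for c in combinations(nodes, k)
               if all(b in adj[a] for a, b in combinations(c, 2))]
    # Union-find over cliques that overlap in exactly k-1 nodes.
    parent = list(range(len(cliques)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    groups = {}
    for i, c in enumerate(cliques):
        groups.setdefault(find(i), set()).update(c)
    return sorted(map(sorted, groups.values()))
```

    Two triangles sharing an edge percolate into one community; a disjoint triangle forms its own.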

    A Multiple Instance Learning Approach to Electrophysiological Muscle Classification for Diagnosing Neuromuscular Disorders Using Quantitative EMG

    Neuromuscular disorder is a broad term that refers to diseases that impair muscle functionality by affecting any part of the nerve or the muscle. Electrodiagnosis of most neuromuscular disorders is based on the electrophysiological classification of the involved muscles, which in turn is performed by inferring the structure and function of the muscles from electromyographic (EMG) signals recorded during low to moderate levels of contraction. The functional unit of muscle contraction is called a motor unit (MU). The morphology and physiology of the MUs of an examined muscle are inferred by extracting motor unit potentials (MUPs) from the EMG signals detected from the muscle. As such, electrophysiological muscle classification is performed by first characterizing extracted MUPs and then aggregating these characterizations. The task of classifying muscles can be represented as an instance of a multiple instance learning (MIL) problem. In the MIL paradigm, a bag of instances shares a label and the instance labels are hidden, contrary to standard supervised learning, where each training instance is labeled. In MIL-based muscle classification, the instances are the MUPs extracted from the EMG signals of the analyzed muscle and the bag is the muscle. Detecting and counting the MUPs that indicate a specific category of neuromuscular disorder can lead to accurately classifying the examined muscle. Three major issues usually arise: how to infer MUP labels without full supervision; how the cardinality relationships between MUP labels contribute to predicting the muscle label; and how the muscle is classified as a whole entity. In this thesis, these three challenges are addressed. To this end, an MIL-based muscle classification system is proposed that has five major steps: 1) MUPs are represented using morphological, stability, and novel near-fiber parameters as well as spectral features extracted from wavelet coefficients.
    This representation helps to analyze MUPs from a variety of aspects. 2) MUP features are selected using an unsupervised similarity-preserving Laplacian score, which is independent of any learning algorithm; hence, the features selected in this work can be used in other electrophysiological muscle classification systems. 3) MUPs are clustered using a novel algorithm called Neighbourhood Distance Entropy Consistency (NDEC), which helps solve the traditional problem of finding representations of MUP normality and abnormality and provides a dynamic number of MUP characterization classes to be used instead of the conventional three (i.e., normal, myopathic, and neurogenic). This clustering highlights the effects of disease on both fiber spatial distributions and fiber diameter distributions, which lead to a continuum of MUP characteristics. These clusters can potentially represent several concepts of MUP normality and abnormality. 4) A muscle is represented by embedding its MUP cluster associations in a feature vector. 5) Muscles are classified using support vector machines or random forests. Quantitative results obtained by applying the proposed method to four electrophysiologically different groups of muscles (proximal arm, proximal leg, distal arm, and distal leg) show the superior and stable performance of the proposed muscle classification system compared to previous work. Additionally, modelling electrophysiological muscle classification as an instance of MIL solves the traditional problem of characterizing MUPs without full supervision. The clustering algorithm proposed in this work can be used as an effective technique in other pattern recognition and medical diagnostic systems in which discovering natural clusters within data is a necessity.
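    Step 4 above (bag embedding) can be sketched compactly. In the toy code below (our illustration; the centroids would come from the NDEC clustering in the actual system), each bag of instance vectors is mapped to a normalized histogram of nearest-centroid assignments, which can then be fed to any standard classifier.

```python
def bag_embedding(instances, centroids):
    # Embed a bag (muscle) as a normalized histogram of its instances'
    # (MUPs') nearest-centroid cluster assignments.
    counts = [0] * len(centroids)
    for x in instances:
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
        counts[dists.index(min(dists))] += 1   # assign to nearest centroid
    n = len(instances)
    return [c / n for c in counts]
```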

    A comparison-based approach to mispronunciation detection

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from the PDF version of the thesis. Includes bibliographical references (p. 89-92). This thesis focuses on the problem of detecting word-level mispronunciations in nonnative speech. Conventional automatic speech recognition-based mispronunciation detection systems have the disadvantage of requiring a large amount of language-specific, annotated training data. Some systems even require a speech recognizer in the target language and another in the students' native language. To reduce human labeling effort and to generalize across languages, we propose a comparison-based framework which only requires word-level timing information from the native training data. Under the assumption that the student is trying to enunciate the given script, dynamic time warping (DTW) is carried out between a student's utterance (nonnative speech) and a teacher's utterance (native speech), and we focus on detecting mis-alignment in the warping path and the distance matrix. The first stage of the system locates word boundaries in the nonnative utterance. To handle the problem that nonnative speech often contains intra-word pauses, we run DTW with a silence model which can align the two utterances while detecting and removing silences at the same time. In order to segment each word into smaller, acoustically similar units for a finer-grained analysis, we develop a phoneme-like unit segmentor which works by segmenting the self-similarity matrix into low-distance regions along the diagonal. Both phone-level and word-level features that describe the degree of mis-alignment between the two utterances are extracted, and the problem is formulated as a classification task. SVM classifiers are trained, and three voting schemes are considered for the cases where there is more than one matching reference utterance.
    The system is evaluated on the Chinese University Chinese Learners of English (CUCHLOE) corpus, with the TIMIT corpus used as the native corpus. Experimental results have shown 1) the effectiveness of the silence model in guiding DTW to capture the word boundaries in nonnative speech more accurately, 2) the complementary performance of the word-level and phone-level features, and 3) the stable performance of the system with or without phonetic unit labels. by Ann Lee.
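    The DTW step at the heart of this framework can be sketched as follows; this is the textbook dynamic-programming recurrence over a pairwise distance matrix, not the thesis's full implementation (which adds the silence model and word-boundary handling).

```python
def dtw(s, t, dist=lambda a, b: abs(a - b)):
    # Classic dynamic time warping: cost of the cheapest monotone
    # alignment between sequences s and t.
    inf = float("inf")
    n, m = len(s), len(t)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(s[i - 1], t[j - 1])
            # Extend the best of: insertion, deletion, or match/step.
            D[i][j] = c + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

    Identical sequences align at zero cost, a repeated frame warps away for free, and a single mismatched frame costs exactly its local distance.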

    Automatic 13C Chemical Shift Reference Correction of Protein NMR Spectral Data Using Data Mining and Bayesian Statistical Modeling

    Nuclear magnetic resonance (NMR) is a highly versatile analytical technique for studying molecular configuration, conformation, and dynamics, especially of biomacromolecules such as proteins. However, due to the intrinsic properties of NMR experiments, results from NMR instruments require a referencing step before downstream analysis. Poor chemical shift referencing, especially for 13C in protein NMR experiments, fundamentally limits and even prevents effective study of biomacromolecules via NMR. No available method can re-reference carbon chemical shifts from protein NMR without secondary experimental information such as structure or resonance assignment. To solve this problem, we constructed a Bayesian probabilistic framework that circumvents the limitations of previous reference correction methods, which required protein resonance assignment and/or a three-dimensional protein structure. Our algorithm, named Bayesian Model Optimized Reference Correction (BaMORC), can detect and correct 13C chemical shift referencing errors before the protein resonance assignment step of analysis and without a three-dimensional structure. By combining the BaMORC methodology with a new intra-peaklist grouping algorithm, we created a combined method called Unassigned BaMORC that utilizes only unassigned experimental peak lists and the amino acid sequence. Unassigned BaMORC kept all experimental three-dimensional HN(CO)CACB-type peak lists tested within ± 0.4 ppm of the correct 13C reference value. On a much larger unassigned chemical shift test set, the base method kept 13C chemical shift referencing errors within ± 0.45 ppm at a 90% confidence interval. With chemical shift assignments, Assigned BaMORC can detect and correct 13C chemical shift referencing errors to within ± 0.22 ppm at a 90% confidence interval.
    Therefore, Unassigned BaMORC can correct 13C chemical shift referencing errors when it will have the most impact: right before protein resonance assignment and other downstream analyses are started. After assignment, the chemical shift reference correction can be further refined with Assigned BaMORC. To support broader usage of these new methods, we also created a software package with a web-based interface for the NMR community. This software will allow non-NMR experts to detect and correct 13C referencing errors at critical early data analysis steps, lowering the bar of NMR expertise required for effective protein NMR analysis.
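    To make the idea of a constant referencing error concrete, the toy sketch below (our illustration, far simpler than BaMORC's Bayesian model) grid-searches the constant offset that makes a set of observed shifts most plausible under an assumed Gaussian model of expected shift statistics. The expected mean and standard deviation here are placeholders, not real amino-acid statistics.

```python
def estimate_reference_offset(observed, expected_mean, expected_sd,
                              grid=None):
    # Find the constant referencing offset (in ppm) that minimizes the
    # squared z-scores of the corrected shifts under a Gaussian model.
    if grid is None:
        grid = [i / 100.0 for i in range(-500, 501)]  # -5.00 .. +5.00 ppm
    def nll(off):
        # Negative log-likelihood up to constants: sum of squared z-scores.
        return sum(((x - off - expected_mean) / expected_sd) ** 2
                   for x in observed)
    return min(grid, key=nll)
```

    With shifts that are exactly 1.7 ppm above a (hypothetical) expected mean of 56.0 ppm, the search recovers the 1.7 ppm offset.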

    Improving Electricity Distribution System State Estimation with AMR-Based Load Profiles

    The ongoing battle against global warming is rapidly increasing the amount of renewable power generation, and smart solutions are needed to integrate these new generation units into existing distribution systems. Smart grids answer this call by introducing intelligent ways of controlling the network and the active resources connected to it. However, before the network can be controlled, the automation system must know the network state, defined by the node voltages and line currents. Distribution system state estimation (DSSE) is needed to find the most likely state of the network when the number and accuracy of measurements are limited. Typically, two types of measurements are used in DSSE: real-time measurements and pseudo-measurements. In recent years, finding cost-efficient ways to improve DSSE accuracy has been a popular subject in the literature. While others have focused on optimizing the type, amount and location of real-time measurements, the main hypothesis of this thesis is that it is possible to enhance DSSE accuracy by using interval measurements collected with automatic meter reading (AMR) to improve the load profiles used as pseudo-measurements. The work done in this thesis can be divided into three stages. In the first stage, methods for creating new AMR-based load profiles are studied. AMR measurements from thousands of customers are used to test and compare the different options for improving load profiling accuracy. Different clustering algorithms are tested, and a novel two-stage clustering method for load profiling is developed. In the second stage, a DSSE algorithm suited for the smart grid environment is developed. Simulations and real-life demonstrations are conducted to verify the accuracy and applicability of the developed state estimator. In the third and final stage, the AMR-based load profiling and DSSE are combined.
    Matlab simulations with real AMR data and a real distribution network model are made, and the developed load profiles are compared with other commonly used pseudo-measurements. The results indicate that clustering is an efficient way to improve load profiling accuracy. With the help of clustering, both the customer classification and the customer class load profiles can be updated simultaneously. Several of the tested clustering algorithms were suited to clustering electricity customers, but the best results were achieved with a modified k-means algorithm. Results from the third-stage simulations supported the main hypothesis that the new AMR-based load profiles improve DSSE accuracy. The results presented in this thesis should motivate distribution system operators and other actors in the field of electricity distribution to utilize AMR data and clustering algorithms in load profiling. This improves not only DSSE accuracy but also many other functions that rely on load flow calculation and need accurate load estimates or forecasts.
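    Clustering customer load profiles with k-means, as tested in the first stage, can be sketched minimally as below. This is plain k-means with deterministic seeding from the first k profiles (our simplification; the thesis develops a modified, two-stage variant).

```python
def kmeans(profiles, k, iters=100):
    # Plain k-means over load profiles (equal-length lists of readings);
    # seeds with the first k profiles to stay deterministic.
    cents = [list(p) for p in profiles[:k]]
    assign = [0] * len(profiles)
    for _ in range(iters):
        # Assignment step: each profile goes to its nearest centroid.
        for i, p in enumerate(profiles):
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in cents]
            assign[i] = d.index(min(d))
        # Update step: each centroid becomes the mean of its members.
        new = []
        for j in range(k):
            members = [p for i, p in enumerate(profiles) if assign[i] == j]
            new.append([sum(col) / len(members) for col in zip(*members)]
                       if members else cents[j])
        if new == cents:
            break                          # converged
        cents = new
    return cents, assign
```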

    Four essays on financial risk quantification

    136 p. This thesis analyzes the performance of risk measures such as Value-at-Risk (VaR) and Expected Shortfall (ES), the latter recently proposed by the Basel Committee, mainly for quantifying market risk, under different distributional models: the Gaussian distribution (as a baseline) and several heavy-tailed distributions, such as the Student's t distribution, the generalized Pareto distribution (GPD), the α-stable distribution, the g-and-h distribution, and the Gram-Charlier distribution. To this end, different assets are employed, such as stock indices of traditional energy and of sustainable financial assets. Given the concern in financial markets about the (ab)use of assets such as exchange-traded funds (ETFs), especially leveraged ETFs (LETFs), these assets are also analyzed in this thesis. Although Expected Shortfall is a coherent risk measure, it is known not to satisfy the elicitability property, a desirable property for forecasts used in model validation (backtesting). This thesis implements two recent ES backtesting techniques, with good results and implications for financial stability. Finally, the Basel Committee's recent proposal for quantifying operational risk is reviewed.
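    The two headline risk measures can be illustrated with their simplest historical (empirical) estimators. The sketch below uses one common convention (losses as positive numbers, VaR as an order statistic of the losses, ES as the mean loss beyond VaR); real implementations differ in interpolation and sign conventions.

```python
def var_es(returns, alpha=0.95):
    # Historical VaR and Expected Shortfall at confidence level alpha.
    losses = sorted(-r for r in returns)       # losses as positive numbers
    idx = int(alpha * len(losses))
    var = losses[idx] if idx < len(losses) else losses[-1]
    tail = [l for l in losses if l >= var] or [var]
    es = sum(tail) / len(tail)                 # mean loss in the tail
    return var, es
```

    On 100 equally spaced losses from 1% to 100%, the 95% VaR is the 96th loss (0.96) and the ES is the mean of the five tail losses (0.98), showing why ES always sits at or beyond VaR.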
