A survey of outlier detection methodologies
Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise from mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error, or simply natural deviation within populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences, and can identify errors and remove their contaminating effect on the data set, thereby purifying the data for processing. The original outlier detection methods were arbitrary, but principled and systematic techniques are now used, drawn from the full gamut of computer science and statistics. In this paper, we present a survey of contemporary techniques for outlier detection, identifying their respective motivations and distinguishing their advantages and disadvantages in a comparative review.
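As a minimal illustration of the kind of principled statistical technique the survey covers, the interquartile-range (IQR) rule below flags points lying far outside the bulk of a sample; the data values are invented for the example:

```python
import numpy as np

def iqr_outliers(x, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] -- the classic
    interquartile-range rule for univariate outlier detection."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return (x < lower) | (x > upper)

data = [10.1, 9.8, 10.3, 9.9, 10.0, 25.0, 10.2]
print(iqr_outliers(data))  # only the 25.0 reading is flagged
```

The IQR rule is robust in the sense that the outlier itself cannot inflate the cutoffs, unlike a naive mean/standard-deviation rule.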
Bayesian neural network learning for repeat purchase modelling in direct marketing.
We focus on purchase-incidence modelling for a European direct mail company. Response models based on statistical and neural network techniques are contrasted. The evidence framework of MacKay is used as an example implementation of Bayesian neural network learning, a method that is fairly robust with respect to problems typically encountered when implementing neural networks. The automatic relevance determination (ARD) method, an integrated feature of this framework, allows one to assess the relative importance of the inputs. The basic response models use operationalisations of the traditionally discussed Recency, Frequency and Monetary (RFM) predictor categories. In a second experiment, the RFM response framework is enriched by the inclusion of other (non-RFM) customer-profiling predictors. We contribute to the literature by providing experimental evidence that: (1) Bayesian neural networks offer a viable alternative for purchase-incidence modelling; (2) a combined use of all three RFM predictor categories is advocated by the ARD method; (3) the inclusion of non-RFM variables significantly augments the predictive power of the constructed RFM classifiers; and (4) this rise is mainly attributable to the inclusion of customer/company interaction variables and a variable measuring whether a customer uses the credit facilities of the direct mailing company.
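The RFM predictor categories referred to above can be operationalised very simply; the sketch below computes recency, frequency and monetary features from a toy transaction log (customer names, dates and amounts are invented for the example):

```python
from collections import defaultdict
from datetime import date

# Toy transaction log: (customer, purchase date, amount spent).
transactions = [
    ("cust1", date(2023, 1, 5), 40.0),
    ("cust1", date(2023, 3, 2), 25.0),
    ("cust2", date(2022, 11, 20), 90.0),
]
today = date(2023, 4, 1)

by_cust = defaultdict(list)
for cust, when, amount in transactions:
    by_cust[cust].append((when, amount))

rfm = {}
for cust, rows in by_cust.items():
    last = max(when for when, _ in rows)
    rfm[cust] = {
        "recency_days": (today - last).days,   # days since last purchase
        "frequency": len(rows),                # number of purchases
        "monetary": sum(a for _, a in rows),   # total amount spent
    }

print(rfm["cust1"])  # {'recency_days': 30, 'frequency': 2, 'monetary': 65.0}
```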
Robustness and Regularization of Support Vector Machines
We consider regularized support vector machines (SVMs) and show that they are precisely equivalent to a new robust optimization formulation. We show that this equivalence of robust optimization and regularization has implications for both algorithms and analysis. In terms of algorithms, the equivalence suggests more general SVM-like algorithms for classification that explicitly build in protection against noise while controlling overfitting. On the analysis front, the equivalence of robustness and regularization provides a robust optimization interpretation for the success of regularized SVMs. We use this new robustness interpretation of SVMs to give a new proof of consistency of (kernelized) SVMs, thus establishing robustness as the reason regularized SVMs generalize well.
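The regularized hinge-loss objective behind soft-margin SVMs can be sketched with a Pegasos-style subgradient solver; this is a generic illustration of the regularization term discussed above, not the paper's formulation, and the toy data and hyperparameters are invented:

```python
import numpy as np

def train_linear_svm(X, y, lam=0.1, epochs=200, seed=0):
    """Soft-margin linear SVM via stochastic subgradient descent on the
    regularized hinge loss: lam/2 * ||w||^2 + mean(max(0, 1 - y * (X @ w))).
    The lam * ||w||^2 term is the regularizer whose robustness
    interpretation the paper studies."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)       # Pegasos step size schedule
            margin = y[i] * X[i].dot(w)
            w *= (1 - eta * lam)        # shrinkage from the regularizer
            if margin < 1:
                w += eta * y[i] * X[i]  # hinge-loss subgradient step
    return w

# Linearly separable toy data, bias handled via an appended constant feature.
X = np.array([[1.0, 2.0, 1.0], [2.0, 3.0, 1.0],
              [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = train_linear_svm(X, y)
print(np.sign(X.dot(w)))  # recovers the labels on this separable toy set
```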
Multivariate Data Analysis for Neuroimaging Data: Overview and Application to Alzheimer's Disease
As clinical and cognitive neuroscience mature, the need for sophisticated neuroimaging analysis becomes more apparent. Multivariate analysis techniques have recently received increasing attention as they have many attractive features that cannot be easily realized by the more commonly used univariate, voxel-wise, techniques. Multivariate approaches evaluate correlation/covariance of activation across brain regions, rather than proceeding on a voxel-by-voxel basis. Thus, their results can be more easily interpreted as a signature of neural networks. Univariate approaches, on the other hand, cannot directly address functional connectivity in the brain. The covariance approach can also result in greater statistical power when compared with univariate techniques, which are forced to employ very stringent, and often overly conservative, corrections for voxel-wise multiple comparisons. Multivariate techniques also lend themselves much better to prospective application of results from the analysis of one dataset to entirely new datasets. Multivariate techniques are thus well placed to provide information about mean differences and correlations with behavior, similarly to univariate approaches, with potentially greater statistical power and better reproducibility checks. In contrast to these advantages is the high barrier of entry to the use of multivariate approaches, preventing more widespread application in the community. To the neuroscientist becoming familiar with multivariate analysis techniques, an initial survey of the field might present a bewildering variety of approaches that, although algorithmically similar, are presented with different emphases, typically by people with mathematics backgrounds. We believe that multivariate analysis techniques have sufficient potential to warrant better dissemination. Researchers should be able to employ them in an informed and accessible manner. 
The following article attempts to provide a basic introduction, with sample applications to simulated and real-world data sets.
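A covariance-based multivariate analysis of the kind described can be sketched with plain PCA on simulated "subject x voxel" data; the sizes, names and signal structure below are illustrative, not from the article:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data: 50 "subjects" x 20 "voxels". A single latent network
# pattern drives shared covariance across the first five voxels.
n_subjects, n_voxels = 50, 20
pattern = np.zeros(n_voxels)
pattern[:5] = 1.0                         # voxels 0-4 covary together
scores = rng.normal(size=n_subjects)      # subject expression of the pattern
data = np.outer(scores, pattern) + 0.1 * rng.normal(size=(n_subjects, n_voxels))

# PCA via eigendecomposition of the voxel-by-voxel covariance matrix --
# the correlation/covariance-across-regions view the abstract contrasts
# with voxel-by-voxel univariate testing.
centered = data - data.mean(axis=0)
cov = centered.T @ centered / (n_subjects - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
top = eigvecs[:, -1]                      # leading covariance pattern

# The leading component loads mainly on the covarying voxels.
loading = np.abs(top)
print(loading[:5].mean() > loading[5:].mean())  # True
```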
Robust Speaker Recognition Based on Latent Variable Models
Automatic speaker recognition in uncontrolled environments is a very challenging task due to channel distortions, additive noise and reverberation. To address these issues, this thesis studies probabilistic latent variable models of short-term spectral information that leverage large amounts of data to achieve robustness in challenging conditions.
Current speaker recognition systems represent an entire speech utterance as a single point in a high-dimensional space. This representation is known as a "supervector". This thesis starts by analyzing the properties of this representation. A novel visualization procedure for supervectors is presented, by which qualitative insight about the information being captured is obtained. We then propose the use of an overcomplete dictionary to explicitly decompose a supervector into a speaker-specific component and an undesired variability component. An algorithm to learn the dictionary from a large collection of data is discussed and analyzed. A subset of the entries of the dictionary is learned to represent speaker-specific information and another subset to represent distortions. After encoding the supervector as a linear combination of the dictionary entries, the undesired variability is removed by discarding the contribution of the distortion components. This paradigm is closely related to the previously proposed paradigm of Joint Factor Analysis modeling of supervectors. We establish a connection between the two approaches and show how our proposed method provides improvements in terms of computation and recognition accuracy.
An alternative way to handle undesired variability in supervector representations is to first project them into a lower dimensional space and then to model them in the reduced subspace. This low-dimensional projection is known as "i-vector". Unfortunately, i-vectors exhibit non-Gaussian behavior, and direct statistical modeling requires the use of heavy-tailed distributions for optimal performance. These approaches lack closed-form solutions, and therefore are hard to analyze. Moreover, they do not scale well to large datasets. Instead of directly modeling i-vectors, we propose to first apply a non-linear transformation and then use a linear-Gaussian model. We present two alternative transformations and show experimentally that the transformed i-vectors can be optimally modeled by a simple linear-Gaussian model (factor analysis). We evaluate our method on a benchmark dataset with a large amount of channel variability and show that the results compare favorably against the competitors. Also, our approach has closed-form solutions and scales gracefully to large datasets.
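One widely used non-linear transformation for i-vector-like representations is length normalization (projection onto the unit sphere); the sketch below is a generic illustration of such a Gaussianizing transform and not necessarily the exact transformation proposed in the thesis:

```python
import numpy as np

def length_normalize(ivectors):
    """Project each row vector onto the unit sphere. Length normalization
    is a simple non-linear transformation commonly applied to i-vector-like
    representations so that a linear-Gaussian model fits them better."""
    ivectors = np.asarray(ivectors, dtype=float)
    norms = np.linalg.norm(ivectors, axis=1, keepdims=True)
    return ivectors / norms

vecs = np.array([[3.0, 4.0], [0.0, 2.0]])
normed = length_normalize(vecs)
print(normed)  # each row now has unit Euclidean norm
```

After such a transformation, a closed-form linear-Gaussian model (e.g. factor analysis) can be fitted in place of heavy-tailed alternatives, which is the scalability argument the abstract makes.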
Finally, a multi-classifier architecture trained in a multicondition fashion is proposed to address the problem of speaker recognition in the presence of additive noise. A large number of experiments are conducted to analyze the proposed architecture and to obtain guidelines for optimal performance in noisy environments. Overall, it is shown that multicondition training of multi-classifier architectures not only produces great robustness in the anticipated conditions, but also generalizes well to unseen conditions.
Detection of severe obstructive sleep apnea through voice analysis
This paper deals with the potential and limitations of using voice and speech processing to detect Obstructive Sleep Apnea (OSA). An extensive body of voice features has been extracted from patients who present various degrees of OSA as well as healthy controls. We analyse the utility of a reduced set of features for detecting OSA. We apply various feature selection and reduction schemes (statistical ranking, Genetic Algorithms, PCA, LDA) and compare various classifiers (Bayesian classifiers, kNN, Support Vector Machines, neural networks, Adaboost). S-fold cross-validation performed on 248 subjects shows that in the extreme cases (that is, 127 controls and 121 patients with severe OSA) voice alone is able to discriminate quite well between the presence and absence of OSA. However, this is not the case with mild OSA and healthy snoring patients, where voice seems to play a secondary role. We found that the best classification schemes are achieved using a Genetic Algorithm for feature selection/reduction.
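Statistical-ranking feature selection, the simplest of the schemes listed, can be sketched by ranking features on a two-sample t-statistic; the simulated data below (one informative feature out of ten) is invented for the example and is not the paper's voice-feature set:

```python
import numpy as np

def tstat_rank(X, y):
    """Rank features by absolute two-sample t-statistic between the two
    classes (Welch form) -- a minimal stand-in for statistical-ranking
    feature selection. Returns feature indices, most discriminative first."""
    X0, X1 = X[y == 0], X[y == 1]
    num = X1.mean(axis=0) - X0.mean(axis=0)
    den = np.sqrt(X1.var(axis=0, ddof=1) / len(X1) +
                  X0.var(axis=0, ddof=1) / len(X0))
    return np.argsort(-np.abs(num / den))

rng = np.random.default_rng(0)
n = 100
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, 10))
X[y == 1, 3] += 2.0   # feature 3 carries the class signal

ranking = tstat_rank(X, y)
print(ranking[0])  # 3 -- the informative feature ranks first
```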
Multistage classification of multispectral Earth observational data: The design approach
An algorithm is proposed which predicts the optimal features at every node in a binary tree procedure. The algorithm estimates the probability of error by approximating the area under the likelihood ratio function for the two classes, taking into account the number of training samples used in estimating each class. Some results on feature selection techniques, particularly in the presence of a very limited set of training samples, are presented. Results comparing the probabilities of error predicted by the proposed algorithm, as a function of dimensionality, with experimental observations are shown for aircraft and LANDSAT data. Results are obtained for both real and simulated data. Finally, two binary tree examples which use the algorithm are presented to illustrate the usefulness of the procedure.
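Choosing the best feature at a single node of a binary tree can be sketched with an exhaustive decision stump; this toy search illustrates the per-node feature-selection idea only, and is not the paper's error-prediction algorithm:

```python
import numpy as np

def best_stump(X, y):
    """Exhaustively choose the (feature, threshold) pair whose single binary
    split minimizes training error -- a toy analogue of selecting the best
    feature at one node of a binary decision tree."""
    best_feat, best_thr, best_err = None, None, np.inf
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            pred = (X[:, j] > thr).astype(int)
            for p in (pred, 1 - pred):          # try both split polarities
                err = (p != y).mean()
                if err < best_err:
                    best_feat, best_thr, best_err = j, thr, err
    return best_feat, best_thr, best_err

# Feature 1 separates the classes perfectly; feature 0 is uninformative.
X = np.array([[0.1, 5.0], [0.4, 6.0], [0.2, 1.0], [0.3, 2.0]])
y = np.array([1, 1, 0, 0])
best = best_stump(X, y)
print(best)  # feature 1 is chosen with zero training error
```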
A machine learning based personalized system for driving state recognition
Reliable driving state recognition (e.g. normal, drowsy, and aggressive) plays a significant role in improving road safety, driving experience and fuel efficiency. It lays the foundation for a number of advanced functions such as driver safety monitoring systems and adaptive driving assistance systems. In these applications, state recognition accuracy is of paramount importance to guarantee user acceptance. This paper is mainly focused on developing a personalized driving state recognition system by learning from non-intrusive, easily accessible vehicle-related measurements, and on its validation using real-world driving data. Compared to conventional approaches, this paper first highlights the necessity of adopting a personalized system by analysing the feature distributions of individual drivers' data and all drivers' data via advanced data visualization and statistical analysis. If significant differences are identified, a dedicated personalized model is learnt to predict the driver's driving state. Spearman distance is also employed to evaluate the differences between individual drivers' data and all drivers' data in a quantitative manner. In addition, five categories of classifiers are tested and compared to identify a suitable one for classification, where random forest with Bayesian parameter optimization outperforms the others and is therefore adopted in this paper. A recently collected dataset from real-world driving experiments is adopted to evaluate the proposed system. Comparative experimental results indicate that the personalized learning system with road information significantly outperforms conventional approaches without personalized characteristics or road information, with overall accuracy increasing from 81.3% to 91.6%. It is believed that the newly developed personalized learning system can find a wide range of applications where diverse behaviours exist.
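The Spearman distance used to compare an individual driver's data with the pooled data can be sketched as one minus the Spearman rank correlation; the exact usage in the paper may differ, and the sample values below are invented:

```python
import numpy as np

def spearman_distance(x, y):
    """1 - Spearman rank correlation: quantifies how differently two
    variables are ordered (0 for identical ordering, 2 for fully reversed).
    Assumes no ties, so ranks can be computed via double argsort."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rho = np.corrcoef(rx, ry)[0, 1]
    return 1.0 - rho

a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(spearman_distance(a, a))        # ~0 for identical ordering
print(spearman_distance(a, a[::-1]))  # ~2 for reversed ordering
```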
Statistical learning in complex and temporal data: distances, two-sample testing, clustering, classification and Big Data
Programa Oficial de Doutoramento en Estatística e Investigación Operativa. 555V01
[Abstract]
This thesis deals with the problem of statistical learning in complex objects, with emphasis on time series data. The problem is approached by facilitating the introduction of domain knowledge of the underlying phenomena by means of distances and features.
A distance-based two-sample test is proposed, and its performance is studied under a wide range of scenarios. Distances for time series classification and clustering are also shown to increase statistical power when applied to two-sample testing. Our test compares favorably to other methods regarding its flexibility against different alternatives. A new distance for time series is defined by considering an innovative way of comparing the lagged distributions of the series. This distance inherits the good empirical performance of existing methods while removing some of their limitations.
A forecast method based on time series features is proposed. The method works by combining individual standard forecasting algorithms using a weighted average, with the weights coming from a learning model fitted on a large training set. Finally, a distributed classification algorithm is proposed, based on comparing, using a distance, the empirical distribution functions of the common test set and of the data that each computing node receives.
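A generic example of a distance-based two-sample statistic is the energy statistic of Székely and Rizzo, which is zero in expectation when the two distributions coincide; this illustrates the family of tests the thesis builds on, not its specific proposal:

```python
import numpy as np

def energy_statistic(x, y):
    """Energy-distance two-sample statistic for univariate samples:
    2*E|X-Y| - E|X-X'| - E|Y-Y'|, estimated by sample averages.
    Large values indicate the two samples come from different distributions."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[:, None]
    dxy = np.abs(x - y.T).mean()   # cross-sample mean distance
    dxx = np.abs(x - x.T).mean()   # within-sample mean distance, sample X
    dyy = np.abs(y - y.T).mean()   # within-sample mean distance, sample Y
    return 2 * dxy - dxx - dyy

rng = np.random.default_rng(1)
same = energy_statistic(rng.normal(size=500), rng.normal(size=500))
diff = energy_statistic(rng.normal(size=500), rng.normal(loc=1.0, size=500))
print(same < diff)  # the shifted samples yield a larger statistic
```

In practice the statistic's null distribution is obtained by permutation, which is one reason such distance-based tests adapt flexibly to many alternatives.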