467 research outputs found
The Shape of Learning Curves: a Review
Learning curves provide insight into the dependence of a learner's
generalization performance on the training set size. This important tool can be
used for model selection, to predict the effect of more training data, and to
reduce the computational complexity of model training and hyperparameter
tuning. This review recounts the origins of the term, provides a formal
definition of the learning curve, and briefly covers basics such as its
estimation. Our main contribution is a comprehensive overview of the literature
regarding the shape of learning curves. We discuss empirical and theoretical
evidence that supports well-behaved curves that often have the shape of a power
law or an exponential. We consider the learning curves of Gaussian processes,
the complex shapes they can display, and the factors influencing them. We draw
specific attention to examples of learning curves that are ill-behaved, showing
worse learning performance with more training data. To wrap up, we point out
various open problems that warrant deeper empirical and theoretical
investigation. All in all, our review underscores that learning curves are
surprisingly diverse and that no universal model can be identified.
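The power-law shape discussed in the review can be illustrated with a minimal sketch (the data values below are synthetic assumptions, not taken from the paper): fit err(n) = a·n^(−b) + c to error measurements at increasing training set sizes, then extrapolate to predict the effect of more training data.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    """Power-law learning curve: error decays as a * n**(-b) toward a floor c."""
    return a * np.power(n, -b) + c

# Synthetic error-vs-training-size measurements (assumed values for illustration)
sizes = np.array([50, 100, 200, 400, 800, 1600, 3200], dtype=float)
errors = 2.0 * sizes ** -0.5 + 0.05  # generated from a known power law

# Fit the three parameters, then extrapolate to a larger training set
(a, b, c), _ = curve_fit(power_law, sizes, errors, p0=[1.0, 0.5, 0.0])
predicted = power_law(10_000.0, a, b, c)
```

Because the synthetic data are noise-free, the fit recovers a ≈ 2, b ≈ 0.5 and floor c ≈ 0.05; on real learning curves the same fit would be applied to held-out error estimates at several training set sizes.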
HIGH-DIMENSIONAL SIGNAL PROCESSING AND STATISTICAL LEARNING
Classical statistical and signal processing techniques are generally not
useful in situations where the dimensionality (p) of the observations is
comparable to or exceeds the sample size (n). This is mainly because the
performance of these techniques is guaranteed through the classical notion of
statistical consistency, which is itself fashioned for situations where n >> p.
Statistical consistency was used vigorously throughout the past century to develop
many signal processing and statistical learning techniques. In recent
years, however, two bodies of mathematical machinery have emerged that show the
possibility of developing superior techniques suitable for analyzing
high-dimensional observations, i.e., situations where p >> n. In this thesis, we
refer to these techniques, which are grounded either in double asymptotic
regimes or in sparsity assumptions, as high-dimensional techniques.
In this thesis, we examine and develop a set of high-dimensional techniques
with applications in classification. The thesis is divided into three main
parts. In the first part, we introduce a novel approach based on double
asymptotics to estimate the regularization parameter used in a well-known
technique, the RLDA (regularized linear discriminant analysis) classifier. We
examine the robustness of the developed approach to Gaussianity, an assumption
used in deriving the core estimator. The performance of the technique in terms
of accuracy and efficiency is verified against other popular methods such as
cross-validation. In the second part of the thesis, the performance of the
newly developed RLDA and several other classifiers is compared in situations
where p is comparable to or exceeds n. While the first two parts of the thesis
focus on double asymptotic methods, in the third part we study two important
classes of techniques based on sparsity assumptions. One of these techniques,
known as the LASSO, has gained much attention in recent years within the
statistical community, while the second, known as compressed sensing, has
become very popular in the signal processing literature. Although both of
these techniques use sparsity assumptions as well as L1 minimization, the
objective functions and constraints they are built on are different. In the
third part of the thesis, we demonstrate the application of both techniques in
high-dimensional classification and compare them in terms of shrinkage rate
and classification accuracy.
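The p >> n regime and the role of the sparsity assumption can be sketched with the LASSO (the data, dimensions and regularization strength below are illustrative assumptions, not the thesis's experiments): with only n = 50 observations of p = 200 features, L1-regularized least squares can still recover a 5-sparse coefficient vector.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 200                      # p >> n: more features than samples
X = rng.standard_normal((n, p))

beta = np.zeros(p)                  # true coefficient vector, 5-sparse
beta[:5] = [3.0, -2.0, 4.0, 1.5, -3.0]
y = X @ beta + 0.01 * rng.standard_normal(n)

# L1-regularized least squares shrinks most coefficients exactly to zero
lasso = Lasso(alpha=0.1).fit(X, y)
support = np.flatnonzero(lasso.coef_)
```

Compressed sensing solves a related but different problem, constrained basis pursuit (minimize the L1 norm of w subject to Xw = y), which matches the abstract's point that the two techniques share the sparsity assumption and L1 minimization but differ in their objective functions and constraints.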
Factors influencing the accuracy of remote sensing classifications: a comparative study
Within the last 20 years, a number of methods have been employed for classifying remote sensing data, including parametric methods (e.g. the maximum likelihood classifier) and non-parametric classifiers (such as neural network classifiers). Each of these classification algorithms has some specific problems which limit its use. This research studies some alternative classification methods for land cover classification and compares their performance with that of well-established classification methods. The areas selected for this study are located near Littleport (Ely), in East Anglia, UK, and in the La Mancha region of Spain. Images in the optical bands of Landsat ETM+ for the year 2000 and InSAR data from May to September 1996 are used for the UK area; DAIS hyperspectral data and Landsat ETM+ for the year 2000 are used for the Spanish area. In addition, field data for 1996 were collected from farmers, and field data for 2000 were collected by field visits to both areas in the UK and Spain, to generate the ground reference data set.
The research was carried out in three main stages. The overall aim of this study is to assess the relative performance of four approaches to classification in remote sensing - the maximum likelihood, artificial neural network, decision tree and support vector machine methods - and to examine factors which affect their performance in terms of overall classification accuracy.
Firstly, this research studies the behaviour of decision tree and support vector machine classifiers for land cover classification using ETM+ (UK) data. This stage discusses some factors affecting the classification accuracy of a decision tree classifier, and also compares the performance of the decision tree with that of the maximum likelihood and neural network classifiers. The use of an SVM requires the user to set the values of some parameters, such as the type of kernel, the kernel parameters, and the multi-class method, as these parameters can significantly affect the accuracy of the resulting classification. This stage involves studying the effects of varying these user-defined parameters and noting their effect on classification accuracy. It is concluded that SVMs perform far better than the decision tree, maximum likelihood and neural network classifiers for this type of study.
The second stage involves applying the decision tree, maximum likelihood and neural network classifiers to InSAR coherence and intensity data and evaluating the utility of this type of data for land cover classification studies. Finally, the last stage involves studying the response of SVMs, decision trees, and the maximum likelihood and neural network classifiers to different training data sizes, numbers of features, sampling plans, and scales of the data used. The conclusion from the experiments presented in this stage is that the SVMs are unaffected by the Hughes phenomenon and perform far better than the other classifiers in all cases. The performance of decision-tree-based feature selection is found to be quite good in comparison with the MNF transform. This study indicates that good classification performance depends on various parameters such as data type, scale of data, training sample size and the type of classification method employed.
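The sensitivity of SVM accuracy to the user-defined parameters described above (kernel type and the penalty C) can be sketched as follows; the synthetic multi-class data stand in for the Landsat/InSAR imagery, which is not reproduced here, and the parameter grid is an illustrative assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-class "pixels" standing in for multispectral band values
X, y = make_classification(n_samples=600, n_features=8, n_informative=6,
                           n_redundant=0, n_classes=4, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# Vary kernel and C, the user-defined parameters highlighted in the abstract
svm_acc = {(k, C): SVC(kernel=k, C=C).fit(Xtr, ytr).score(Xte, yte)
           for k in ("linear", "rbf") for C in (0.1, 1.0, 10.0)}

# A decision tree trained on the same split provides a baseline comparison
tree_acc = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr).score(Xte, yte)
```

Comparing the entries of `svm_acc` against each other and against `tree_acc` mirrors the study's design: the same training and test pixels, with only the classifier and its parameters varied.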
Compensação digital de distorções da fibra em sistemas de comunicação óticos de longa distância [Digital compensation of fiber distortions in long-haul optical communication systems]
The continuous increase of traffic demand in long-haul communications has
motivated network operators to look for receiver-side techniques to mitigate
the nonlinear effects resulting from signal-signal and signal-noise
interaction, thus pushing the current capacity boundaries. Machine learning
techniques are a very active topic, with proven results in the most diverse
applications. This dissertation aims to study nonlinear impairments in
long-haul coherent optical links and the current state of the art in DSP
techniques for impairment mitigation, as well as the integration of machine
learning strategies into optical networks. Starting with a simplified fiber
model impaired only by ASE noise, we studied how to integrate an ANN-based
symbol estimator into the signal pipeline, enabling us to validate the
implementation by matching the theoretical performance. We then moved to a
nonlinear proof of concept with the incorporation of NLPN in the fiber link.
Finally, we evaluated the performance of the estimator under realistic
simulations of single- and multi-channel links over both SSMF and NZDSF
fibers. The obtained results indicate
that even though it may be hard to find the best architecture, Nonlinear Symbol
Estimator networks have the potential to surpass more conventional DSP strategies.
Mestrado em Engenharia Eletrónica e Telecomunicações (Master's in Electronics and Telecommunications Engineering)
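The ANN-based symbol estimator for the ASE-noise-only baseline can be sketched roughly as follows (the QPSK format, the noise level, and the network size are illustrative assumptions, and an MLP classifier stands in for the dissertation's estimator network):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)

# QPSK constellation; ASE noise modelled as complex AWGN (simplified assumption)
constellation = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
labels = rng.integers(0, 4, size=4000)
tx = constellation[labels]
rx = tx + 0.15 * (rng.standard_normal(4000) + 1j * rng.standard_normal(4000))

# Feed received I/Q samples to a small ANN that estimates the transmitted symbol
features = np.column_stack([rx.real, rx.imag])
clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000,
                    random_state=0).fit(features[:3000], labels[:3000])
accuracy = clf.score(features[3000:], labels[3000:])
```

Under AWGN alone, the learned decision regions essentially reproduce the quadrant boundaries of the maximum-likelihood detector, which is the "matching the theoretical performance" validation step; the nonlinear (NLPN) case is where a learned estimator can depart from those simple boundaries.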
Why are married men working so much?
Empirical patterns of labor supply at the micro level tend to reject the unitary-model assumption implicit in most macro theories, where households are deemed to be rational agents. This paper examines the rise in per-capita labor since 1975 and asks how the inclusion of bargaining between spouses in a standard macro model would alter the analysis of recent trends in aggregate labor supply. The main findings are that the stationarity of married men's work hours reflects a weakening of men's bargaining position as women's wages rose, and that the unitary model seriously overstates the response of aggregate labor to trends in relative wages.
The detection of fraudulent financial statements using textual and financial data
Fraudulent financial statements inhibit markets from allocating resources efficiently and induce considerable economic cost. Therefore, market participants strive to identify fraudulent financial statements. Reliable automated fraud detection systems based on publicly available data may help to allocate audit resources more effectively. This study examines how quantitative data (financials) and corporate narratives can both be used to identify accounting fraud (proxied by the SEC's AAERs). The detection models are built upon a sound foundation from fraud theory, highlighting how accounting fraud is carried out and discussing the causes for companies to engage in fraudulent alteration of financial records. The study relies on a comprehensive methodological approach to create the detection model: the design process is divided into eight design questions and three enhancing questions, shedding light on important issues during model creation, improvement and testing. The corporate narratives are analysed using multi-word phrases, including an extensive language standardisation that allows narrative peculiarities to be captured more precisely and partly addresses context. The narrative clues are enriched by successful predictors from company financials identified in previous studies. The results indicate a reliable and robust detection performance over a timeframe of 15 years. Furthermore, they suggest that text-based predictors are superior to financial ratios and that a combination of both is required to achieve the best possible results. Moreover, text-based predictors are found to vary considerably over time, which shows the importance of updating fraud detection systems frequently.
The achieved detection performance was, on average, slightly higher than that of comparable approaches.
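How multi-word text predictors and financial ratios can be combined into a single detection model may be sketched as follows; the four toy filings, the ratio values, and the logistic-regression classifier are all illustrative assumptions rather than the study's actual data or model.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy filing narratives (1 = fraudulent, 0 = non-fraudulent; invented examples)
filings = [
    "revenue recognition policies were adjusted significantly this period",
    "stable cash flow and conservatively audited results were reported",
    "revenue recognition policies were adjusted again before year end",
    "audited results show stable margins and conservative cash flow",
]
labels = np.array([1, 0, 1, 0])

# Multi-word phrases (bigrams and trigrams) as text-based predictors
vectorizer = TfidfVectorizer(ngram_range=(2, 3))
X_text = vectorizer.fit_transform(filings)

# Hypothetical financial-ratio predictors (e.g. accruals, leverage), one row per filing
ratios = np.array([[0.9, 0.1], [0.2, 0.5], [0.8, 0.2], [0.1, 0.6]])

# Combine both predictor families into one feature matrix and fit a classifier
X = hstack([X_text, csr_matrix(ratios)])
model = LogisticRegression().fit(X, labels)
train_accuracy = model.score(X, labels)
```

Keeping the text and ratio features in one matrix lets the fitted coefficients indicate which family carries more signal, which parallels the study's finding that the combination outperforms either family alone.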