
    The Shape of Learning Curves: a Review

    Learning curves provide insight into the dependence of a learner's generalization performance on the training set size. This important tool can be used for model selection, to predict the effect of more training data, and to reduce the computational complexity of model training and hyperparameter tuning. This review recounts the origins of the term, provides a formal definition of the learning curve, and briefly covers basics such as its estimation. Our main contribution is a comprehensive overview of the literature regarding the shape of learning curves. We discuss empirical and theoretical evidence that supports well-behaved curves that often have the shape of a power law or an exponential. We consider the learning curves of Gaussian processes, the complex shapes they can display, and the factors influencing them. We draw specific attention to examples of learning curves that are ill-behaved, showing worse learning performance with more training data. To wrap up, we point out various open problems that warrant deeper empirical and theoretical investigation. All in all, our review underscores that learning curves are surprisingly diverse and no universal model can be identified.
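    A minimal sketch (not taken from the review) of fitting the two parametric learning-curve shapes named in this abstract, a power law and an exponential, to measured error rates. The training-set sizes and error values below are hypothetical; in practice err(n) would come from training a model on n examples and evaluating it on held-out data.

        import numpy as np
        from scipy.optimize import curve_fit

        n = np.array([50, 100, 200, 400, 800, 1600, 3200])          # training set sizes
        err = np.array([0.42, 0.35, 0.29, 0.25, 0.22, 0.20, 0.19])  # hypothetical test error

        def power_law(n, a, b, c):
            # err(n) = a * n^(-b) + c, with c the irreducible (asymptotic) error
            return a * n ** (-b) + c

        def exponential(n, a, b, c):
            # err(n) = a * exp(-b * n) + c
            return a * np.exp(-b * n) + c

        p_pow, _ = curve_fit(power_law, n, err, p0=[1.0, 0.5, 0.1], maxfev=10000)
        p_exp, _ = curve_fit(exponential, n, err, p0=[0.5, 0.001, 0.2], maxfev=10000)

        # Compare fits by residual sum of squares and extrapolate, which is one of
        # the uses of learning curves mentioned in the abstract (predicting the
        # effect of more training data).
        for name, f, p in [("power law", power_law, p_pow), ("exponential", exponential, p_exp)]:
            rss = np.sum((err - f(n, *p)) ** 2)
            print(f"{name}: params={np.round(p, 4)}, RSS={rss:.5f}, "
                  f"predicted err at n=10000: {f(10000, *p):.3f}")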

    HIGH-DIMENSIONAL SIGNAL PROCESSING AND STATISTICAL LEARNING

    Classical statistical and signal processing techniques are not generally useful in situations wherein the dimensionality (p) of the observations is comparable to or exceeds the sample size (n). This is mainly because the performance of these techniques is guaranteed through the classical notion of statistical consistency, which is itself fashioned for situations wherein n >> p. Statistical consistency was vigorously used in the past century to develop many signal processing and statistical learning techniques. However, in recent years, two sets of mathematical machinery have emerged that show the possibility of developing superior techniques suitable for analyzing high-dimensional observations, i.e., situations where p >> n. In this thesis, we refer to these techniques, which are grounded either in double asymptotic regimes or in sparsity assumptions, as high-dimensional techniques. We examine and develop a set of high-dimensional techniques with applications in classification. The thesis is divided into three parts. In the first part, we introduce a novel approach based on double asymptotics to estimate the regularization parameter used in a well-known technique known as the RLDA classifier. We examine the robustness of the developed approach to Gaussianity, an assumption used in developing the core estimator. The performance of the technique in terms of accuracy and efficiency is verified against other popular methods such as cross-validation. In the second part of the thesis, the performance of the newly developed RLDA and several other classifiers is compared in situations where p is comparable to or exceeds n. While the first two parts of the thesis focus on double asymptotic methods, the third part studies two important classes of techniques based on sparsity assumptions. One of these techniques, the LASSO, has gained much attention in recent years within the statistical community, while the second, compressed sensing, has become very popular in the signal processing literature. Although both techniques rely on sparsity assumptions and L1 minimization, the objective functions and constraints they are built on are different. In the third part of the thesis, we demonstrate the application of both techniques to high-dimensional classification and compare them in terms of shrinkage rate and classification accuracy.
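    A minimal sketch (my own, not from the thesis) contrasting the two sparsity-based techniques the abstract names in a p >> n setting: the LASSO, which penalizes the L1 norm inside a regression objective, and a compressed-sensing-style recovery, here approximated by greedy orthogonal matching pursuit under an explicit sparsity constraint. The problem sizes, noise level, and regularization value are all hypothetical.

        import numpy as np
        from sklearn.linear_model import Lasso, OrthogonalMatchingPursuit

        rng = np.random.default_rng(0)
        n, p, k = 60, 400, 5                      # n samples, p features, k-sparse truth
        X = rng.standard_normal((n, p))
        beta_true = np.zeros(p)
        beta_true[rng.choice(p, size=k, replace=False)] = rng.standard_normal(k)
        y = X @ beta_true + 0.01 * rng.standard_normal(n)

        # LASSO: minimize (1/2n)||y - X b||_2^2 + alpha * ||b||_1
        lasso = Lasso(alpha=0.01).fit(X, y)

        # Compressed-sensing flavour: seek the sparsest coefficient vector consistent
        # with the measurements, via a greedy surrogate (OMP) rather than basis pursuit.
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k).fit(X, y)

        for name, coef in [("LASSO", lasso.coef_), ("OMP", omp.coef_)]:
            support = np.flatnonzero(np.abs(coef) > 1e-6)
            err = np.linalg.norm(coef - beta_true)
            print(f"{name}: recovered support size={support.size}, L2 error={err:.4f}")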

    Factors influencing the accuracy of remote sensing classifications: a comparative study

    Within the last 20 years, a number of methods have been employed for classifying remote sensing data, including parametric methods (e.g. the maximum likelihood classifier) and non-parametric classifiers (such as neural network classifiers). Each of these classification algorithms has specific problems that limit its use. This research studies some alternative classification methods for land cover classification and compares their performance with well-established classification methods. The areas selected for this study are located near Littleport (Ely), in East Anglia, UK, and in the La Mancha region of Spain. Images in the optical bands of Landsat ETM+ for the year 2000 and InSAR data from May to September 1996 are used for the UK area, and DAIS hyperspectral data and Landsat ETM+ for the year 2000 are used for the Spanish area. In addition, field data for the year 1996 were collected from farmers, and data for the year 2000 were collected by field visits to both areas in the UK and Spain to generate the ground reference data set. The research was carried out in three main stages. The overall aim of this study is to assess the relative performance of four approaches to classification in remote sensing - the maximum likelihood, artificial neural network, decision tree and support vector machine methods - and to examine factors which affect their performance in terms of overall classification accuracy. Firstly, this research studies the behaviour of decision tree and support vector machine classifiers for land cover classification using ETM+ (UK) data. This stage discusses some factors affecting the classification accuracy of a decision tree classifier, and also compares the performance of the decision tree with that of the maximum likelihood and neural network classifiers. The use of an SVM requires the user to set the values of some parameters, such as the type of kernel, the kernel parameters, and the multi-class method, as these parameters can significantly affect the accuracy of the resulting classification. This stage involves varying these user-defined parameters and noting their effect on classification accuracy. It is concluded that SVMs perform far better than the decision tree, maximum likelihood and neural network classifiers for this type of study. The second stage involves applying the decision tree, maximum likelihood and neural network classifiers to InSAR coherence and intensity data and evaluating the utility of this type of data for land cover classification studies. Finally, the last stage studies the response of the SVM, decision tree, maximum likelihood and neural network classifiers to different training data sizes, numbers of features, sampling plans, and the scale of the data used. The conclusion from the experiments presented in this stage is that SVMs are unaffected by the Hughes phenomenon and perform far better than the other classifiers in all cases. The performance of decision-tree-based feature selection is found to be quite good in comparison with the MNF transform. This study indicates that good classification performance depends on various parameters such as the data type, the scale of the data, the training sample size and the type of classification method employed.
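    A minimal, hypothetical sketch of the kind of user-defined SVM choices the abstract discusses (kernel type, kernel parameters, multi-class handling) and how their effect on classification accuracy can be measured with cross-validated search. The data here are synthetic stand-ins, not the Landsat ETM+, InSAR, or DAIS imagery used in the thesis, and the parameter grid is illustrative.

        from sklearn.datasets import make_classification
        from sklearn.model_selection import GridSearchCV, train_test_split
        from sklearn.svm import SVC

        # Synthetic multi-class data standing in for per-pixel spectral features.
        X, y = make_classification(n_samples=600, n_features=8, n_informative=6,
                                   n_classes=4, n_clusters_per_class=1, random_state=0)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        param_grid = [
            {"kernel": ["rbf"], "gamma": [0.01, 0.1, 1.0], "C": [1, 10, 100]},
            {"kernel": ["poly"], "degree": [2, 3], "C": [1, 10, 100]},
        ]
        # SVC handles multi-class via one-vs-one; the grid search scores each
        # kernel/parameter combination by cross-validated accuracy.
        search = GridSearchCV(SVC(decision_function_shape="ovo"), param_grid, cv=5)
        search.fit(X_train, y_train)
        print("best parameters:", search.best_params_)
        print("test accuracy:  ", search.best_estimator_.score(X_test, y_test))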

    Digital compensation of fiber distortions in long-haul optical communication systems

    The continuous increase of traffic demand in long-haul communications has motivated network operators to look for receiver-side techniques to mitigate the nonlinear effects resulting from signal-signal and signal-noise interaction, thus pushing the current capacity boundaries. Machine learning techniques are a very active topic, with proven results in the most diverse applications. This dissertation aims to study nonlinear impairments in long-haul coherent optical links and the current state of the art in DSP techniques for impairment mitigation, as well as the integration of machine learning strategies in optical networks. Starting with a simplified fiber model impaired only by ASE noise, we studied how to integrate an ANN-based symbol estimator into the signal pipeline, validating the implementation by matching the theoretical performance. We then moved to a nonlinear proof of concept with the incorporation of NLPN in the fiber link. Finally, we evaluated the performance of the estimator under realistic simulations of single- and multi-channel links in both SSFM and NZDSF fibers. The obtained results indicate that even though it may be hard to find the best architecture, nonlinear symbol estimator networks have the potential to surpass more conventional DSP strategies.
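    A minimal sketch, loosely analogous to the dissertation's first validation step (an ANN-based symbol estimator on a fiber model impaired only by ASE noise), but not its actual implementation. Here the channel is plain additive Gaussian noise on a 16-QAM constellation, and the estimator is a small MLP classifier over the received in-phase and quadrature components; the constellation, noise level, and network size are all hypothetical.

        import numpy as np
        from sklearn.neural_network import MLPClassifier

        rng = np.random.default_rng(1)
        levels = np.array([-3, -1, 1, 3])
        constellation = np.array([complex(i, q) for i in levels for q in levels])  # 16-QAM

        # Transmit random symbols and add complex Gaussian noise (ASE-like impairment).
        labels = rng.integers(0, 16, size=20000)
        tx = constellation[labels]
        rx = tx + 0.4 * (rng.standard_normal(labels.size) + 1j * rng.standard_normal(labels.size))

        # Real/imaginary parts of the received samples are the estimator's inputs.
        features = np.column_stack([rx.real, rx.imag])
        split = 15000
        mlp = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500, random_state=1)
        mlp.fit(features[:split], labels[:split])

        ser = np.mean(mlp.predict(features[split:]) != labels[split:])
        print(f"symbol error rate on held-out symbols: {ser:.4f}")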

    Why are married men working so much?

    Empirical patterns of labor supply at the micro level tend to reject the unitary model assumption implicit in most macro theories, where households are deemed to be rational agents. This paper examines the rise in per-capita labor since 1975 and asks how the inclusion of bargaining between spouses in a standard macro model would alter the analysis of recent trends in aggregate labor supply. The main findings are that the stationarity of married men's work hours reflects a weakening of men's bargaining position as women's wages rose, and that the unitary model seriously overstates the response of aggregate labor to trends in relative wages.

    The detection of fraudulent financial statements using textual and financial data

    Fraudulent financial statements inhibit markets from allocating resources efficiently and induce considerable economic cost. Market participants therefore strive to identify fraudulent financial statements. Reliable, automated fraud detection systems based on publicly available data may help to allocate audit resources more effectively. This study examines how quantitative data (financials) and corporate narratives can both be used to identify accounting fraud (proxied by the SEC's AAERs). The detection models are built on a sound foundation from fraud theory, highlighting how accounting fraud is carried out and discussing the causes for companies to engage in fraudulent alteration of financial records. The study relies on a comprehensive methodological approach to create the detection model: the design process is divided into eight design questions and three enhancing questions, shedding light on important issues during model creation, improvement and testing. The corporate narratives are analysed using multi-word phrases, including an extensive language standardisation that captures narrative peculiarities more precisely and partly addresses context. The narrative clues are enriched with successful predictors from company financials found in previous studies. The results indicate a reliable and robust detection performance over a timeframe of 15 years. Furthermore, they suggest that text-based predictors are superior to financial ratios and that a combination of both is required to achieve the best possible results. Moreover, text-based predictors are found to vary considerably over time, which shows the importance of updating fraud detection systems frequently. The achieved detection performance was slightly higher on average than that of comparable approaches.
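    A minimal, hypothetical sketch of the abstract's central idea: extracting signals from both corporate narratives (multi-word phrases) and financial predictors, and combining them in a single detection model. The texts, ratio columns ("accruals_ratio", "leverage"), and labels below are invented placeholders, not the SEC/AAER data or the feature set used in the study.

        import numpy as np
        import pandas as pd
        from sklearn.compose import ColumnTransformer
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import Pipeline

        data = pd.DataFrame({
            "mdna_text": [
                "revenue recognition policy changed during the period",
                "strong organic growth across all operating segments",
                "receivables increased substantially relative to sales",
                "cash flow from operations remained stable year over year",
            ],
            "accruals_ratio": [0.21, 0.03, 0.18, 0.02],
            "leverage": [0.7, 0.4, 0.8, 0.3],
            "fraud": [1, 0, 1, 0],  # placeholder labels (1 = alleged misstatement)
        })

        features = ColumnTransformer([
            # multi-word phrases (bigrams/trigrams) from the narrative sections
            ("phrases", TfidfVectorizer(ngram_range=(2, 3)), "mdna_text"),
            # financial predictors passed through unchanged
            ("financials", "passthrough", ["accruals_ratio", "leverage"]),
        ])

        model = Pipeline([("features", features),
                          ("clf", LogisticRegression(max_iter=1000))])
        model.fit(data, data["fraud"])
        print("in-sample fraud scores:", np.round(model.predict_proba(data)[:, 1], 2))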
