3,934 research outputs found

    Developments in the theory of randomized shortest paths with a comparison of graph node distances

    There have lately been several proposals for parametrized graph node distances that generalize the shortest path distance and the commute time (or resistance) distance. The need for such distances has arisen from the observation that the common distances mentioned above often fail to take into account the global structure of the graph. In this article, we develop the theory of one family of graph node distances, the randomized shortest path dissimilarity, which has its foundation in statistical physics. We show that the randomized shortest path dissimilarity can be easily computed in closed form for all pairs of nodes of a graph. Moreover, we propose a new distance measure that we call the free energy distance. The free energy distance can be seen as an upgrade of the randomized shortest path dissimilarity: it defines a metric and, in addition, satisfies the graph-geodetic property. The derivation and computation of the free energy distance are also straightforward. We then compare a set of generalized distances that interpolate between the shortest path distance and the commute time (or resistance) distance, focusing on their applicability to graph node clustering and classification. The comparison shows that the parametrized distances generally perform well in these tasks; in particular, the results obtained with the free energy distance are among the best in all the experiments. Comment: 30 pages, 4 figures, 3 tables
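The closed-form computation mentioned in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes the construction used in the related randomized-shortest-path literature, where Z = (I - W)^{-1} with W = P_ref ∘ exp(-θC), the directed free energy is φ(i,j) = -(1/θ) log(z_ij / z_jj), and the distance is the symmetrized φ.

```python
import numpy as np

def free_energy_distance(A, C, theta=1.0):
    """Symmetrized free energy distance on a strongly connected graph.

    A: nonnegative adjacency/weight matrix; C: edge cost matrix;
    theta: inverse-temperature parameter. Formulas are assumptions
    taken from the randomized-shortest-path literature, not quoted
    from this paper.
    """
    # Reference random-walk transition probabilities.
    P_ref = A / A.sum(axis=1, keepdims=True)
    # Killed walk: transition probabilities discounted by edge costs.
    W = P_ref * np.exp(-theta * C)
    Z = np.linalg.inv(np.eye(len(A)) - W)
    # Directed free energies phi(i, j) = -(1/theta) * log(z_ij / z_jj);
    # broadcasting divides each column j by its diagonal entry z_jj.
    phi = -np.log(Z / np.diag(Z)) / theta
    # Symmetrize to obtain a distance between node pairs.
    return (phi + phi.T) / 2

# Toy example: unweighted triangle graph with unit edge costs.
A = np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
C = np.where(A > 0, 1.0, 0.0)
D = free_energy_distance(A, C, theta=5.0)
```

For large θ the distance approaches the shortest path cost; for small θ it moves toward commute-time-like behavior, which is the interpolation the abstract describes.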

    One-Class Classification: Taxonomy of Study and Review of Techniques

    One-class classification (OCC) algorithms aim to build classification models when the negative class is absent, poorly sampled, or not well defined. This unique situation constrains the learning of efficient classifiers by defining the class boundary using only knowledge of the positive class. The OCC problem has been considered and applied under many research themes, such as outlier/novelty detection and concept learning. In this paper, we present a unified view of the general problem of OCC through a taxonomy of study for OCC problems, based on the availability of training data, the algorithms used, and the application domains. We further delve into each category of the proposed taxonomy and present a comprehensive literature review of OCC algorithms, techniques, and methodologies, with a focus on their significance, limitations, and applications. We conclude by discussing some open research problems in the field of OCC and present our vision for future research. Comment: 24 pages + 11 pages of references, 8 figures
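One concrete instance of the setting the abstract describes, training on the positive class alone, is the one-class SVM, one of the technique families such surveys typically cover. A minimal sketch with scikit-learn's `OneClassSVM` on synthetic data (all data and parameter choices here are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Training data: positive class only -- no negative examples exist.
positives = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# Fit a boundary around the positive class; nu bounds the fraction of
# training points treated as outliers.
occ = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.1).fit(positives)

# At test time, +1 means inlier (positive class), -1 means outlier/novelty.
inlier = occ.predict(np.array([[0.0, 0.0]]))
outlier = occ.predict(np.array([[8.0, 8.0]]))
```

The key point matches the abstract: the decision boundary is learned entirely from positive examples, and anything falling outside it is flagged as not belonging to the class.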

    Ward's Hierarchical Clustering Method: Clustering Criterion and Agglomerative Algorithm

    The Ward error sum of squares hierarchical clustering method has been very widely used since its first description by Ward in a 1963 publication. It has also been generalized in various ways. However, the literature contains differing interpretations, and commonly used software systems implement the Ward agglomerative algorithm differently, including with differing expressions of the agglomerative criterion. Our survey work and case studies will be useful for all those involved in developing software for data analysis using Ward's hierarchical clustering method. Comment: 20 pages, 21 citations, 4 figures
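As one example of the implementations the abstract compares, SciPy's `ward` linkage merges, at each step, the pair of clusters whose fusion gives the minimum increase in the total within-cluster error sum of squares. A short sketch on synthetic data (the data and cluster count are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Two well-separated Gaussian blobs of 20 points each.
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
               rng.normal(5.0, 0.3, (20, 2))])

# SciPy's Ward linkage expects raw observations (it computes Euclidean
# distances internally), not a precomputed distance matrix.
Z = linkage(X, method="ward")

# Cut the dendrogram into two flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
```

The abstract's point is exactly that this behavior is not uniform across software: different packages accept different inputs (raw data vs. squared or unsquared distances) and express the criterion differently, which can yield different dendrograms for the "same" method.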

    Forecasting loss given default with the nearest neighbor algorithm

    Mestrado em Matemática Financeira (Master's in Financial Mathematics). In recent years, forecasting loss given default (LGD) has been a major challenge in the field of credit risk management. Practitioners and academic researchers have focused on the study of this particular risk dimension. Despite all the different approaches that have been developed and published so far, LGD forecasting remains an area of intense academic study, and no consensual solution has yet emerged in the banking industry. This paper presents an LGD forecasting approach based on a simple and intuitive machine learning algorithm: the nearest neighbor algorithm. To evaluate the performance of this non-parametric technique, appropriate evaluation metrics are used to compare it to a more "classical" parametric model and to the use of historical recovery rates to predict LGD.
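The core idea, predicting a loan's LGD from the observed LGDs of its most similar historical defaults, can be sketched with scikit-learn's `KNeighborsRegressor`. The features, data, and choice of k below are invented for illustration and are not taken from the thesis:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
# Synthetic historical defaults: columns are hypothetical features,
# e.g. [loan-to-value ratio, borrower score], both scaled to [0, 1].
X_hist = rng.uniform(0.0, 1.0, size=(500, 2))
# Synthetic LGD: higher loan-to-value -> higher loss, plus noise,
# clipped to the valid [0, 1] range.
lgd_hist = np.clip(0.8 * X_hist[:, 0] + 0.1 * rng.normal(size=500), 0.0, 1.0)

# Predict LGD as the mean LGD of the k nearest historical defaults.
model = KNeighborsRegressor(n_neighbors=10).fit(X_hist, lgd_hist)
pred = model.predict(np.array([[0.9, 0.5]]))  # a high loan-to-value loan
```

This is the non-parametric baseline in spirit: no functional form is assumed for the LGD distribution, which is exactly what distinguishes it from the "classical" parametric models it is compared against.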

    Survey of data mining approaches to user modeling for adaptive hypermedia

    Get PDF
    The ability of an adaptive hypermedia system to create tailored environments depends mainly on the amount and accuracy of information stored in each user model. Among the difficulties that user modeling faces are the amount of data available to create user models, the adequacy of that data, the noise within it, and the necessity of capturing the imprecise nature of human behavior. Data mining and machine learning techniques can handle large amounts of data and process uncertainty; these characteristics make them suitable for automatic generation of user models that simulate human decision making. This paper surveys different data mining techniques that can be used to efficiently and accurately capture user behavior. The paper also presents guidelines indicating which techniques may be used most effectively according to the task implemented by the application.