
    Machine Learning in Automated Text Categorization

    The automated categorization (or classification) of texts into predefined categories has witnessed booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting of the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation. Comment: Accepted for publication in ACM Computing Surveys
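
    As a concrete illustration of the inductive workflow described above (a minimal sketch, not code from the survey), the snippet below represents documents as TF-IDF vectors, learns a linear classifier from a handful of preclassified placeholder documents, and applies it to unseen text; scikit-learn and the toy categories are assumptions of this sketch.

```python
# Minimal sketch of inductive text categorization (illustrative only):
# document representation (TF-IDF), classifier construction (logistic
# regression), and application to an unseen document. The documents
# and category names are toy placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_docs = [
    "stocks rallied as central banks cut interest rates",
    "the bank reported record quarterly earnings",
    "the striker scored twice in the final match",
    "the team clinched the championship on penalties",
]
train_labels = ["finance", "finance", "sports", "sports"]

# A pipeline couples the document representation and the learned classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_docs, train_labels)

# Evaluation would use a held-out preclassified test set; here we just
# categorize one new document.
print(clf.predict(["midfielder signs a new contract with the club"]))
```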

    A review of multi-instance learning assumptions

    Multi-instance (MI) learning is a variant of inductive machine learning in which each learning example contains a bag of instances instead of a single feature vector. The term commonly refers to the supervised setting, where each bag is associated with a label. This type of representation is a natural fit for a number of real-world learning scenarios, including drug activity prediction and image classification, and hence many MI learning algorithms have been proposed. Any MI learning method must relate instances to bag-level class labels, but many types of relationships between instances and class labels are possible. All early work in MI learning assumes a specific MI concept class known to be appropriate for a drug activity prediction domain; however, this 'standard MI assumption' is not guaranteed to hold in other domains. Much of the recent work in MI learning has concentrated on a relaxed view of the MI problem, where the standard MI assumption is dropped and alternative assumptions are considered instead. However, it is often not clearly stated which particular assumption is used and how it relates to other assumptions that have been proposed. In this paper, we aim to clarify the use of alternative MI assumptions by reviewing the work done in this area.
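
    To make the assumptions concrete, here is a small sketch (not from the paper) of how the standard MI assumption, and one relaxed alternative, turn instance-level decisions into a bag label; the scores and thresholds are invented for illustration.

```python
# Sketch of MI assumptions (illustrative): each learning example is a
# bag of instances, reduced here to one score per instance from some
# hypothetical instance-level model.

def bag_label_standard(instance_scores, threshold=0.5):
    """Standard MI assumption: a bag is positive iff at least one of
    its instances is positive (max over instance scores)."""
    return max(instance_scores) > threshold

def bag_label_fraction(instance_scores, frac=0.5, threshold=0.5):
    """One relaxed alternative: require a minimum fraction of positive
    instances rather than a single one."""
    hits = sum(s > threshold for s in instance_scores)
    return hits / len(instance_scores) >= frac

bags = {
    "bag_a": [0.1, 0.9, 0.2],  # one positive instance
    "bag_b": [0.2, 0.3, 0.1],  # no positive instance
}
for name, scores in bags.items():
    print(name, bag_label_standard(scores), bag_label_fraction(scores))
```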

    Customer intimacy analytics: leveraging operational data to assess customer knowledge and relationships and to measure their business impact

    The ability to capture customer needs and to tailor the provided solutions accordingly, also defined as customer intimacy, has become a significant success factor in the B2B space, in particular for increasingly "servitizing" businesses. This book elaborates on the CI Analytics solution for assessing and monitoring the impact of customer intimacy strategies by leveraging business analytics and social network analysis technology. This solution thereby effectively complements existing CRM solutions.

    Machine learning: Challenges and opportunities on credit risk

    The constant challenge of anticipating the risk of default by borrowers has led financial institutions to develop techniques and models to improve their credit risk monitoring and to predict how likely certain customers are to default on a loan, as well as how likely others are to meet their financial obligations. It is therefore interesting to investigate how financial institutions can anticipate such defaults using Machine Learning algorithms. This dissertation aims to demonstrate the power of Machine Learning algorithms in credit risk analysis, focusing on building the models, training them, and testing them on data, and to present the opportunities and challenges of Machine Learning that remain open for future studies. For this purpose, we present two Machine Learning classification algorithms: Decision Trees and Logistic Regression. In addition, we present numerical results obtained from various comparisons of these algorithms, which were programmed and run in Python using the Jupyter Notebook application. The initial sample, consisting of 850 observations, contained credit details about borrowers in the United States of America and is freely available. To compare the execution and performance of the Logistic Regression and Decision Tree models, we used measures such as AUC, precision, and F1-score.
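
    The comparison described above can be sketched as follows. This is an illustrative reconstruction rather than the dissertation's code, and it substitutes a synthetic dataset for the 850-observation borrower sample.

```python
# Sketch of the Logistic Regression vs. Decision Tree comparison on a
# binary default/no-default task, scored with AUC, precision, and
# F1-score. The data is synthetic, standing in for the 850-observation
# borrower sample used in the dissertation.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=850, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = [("Logistic Regression", LogisticRegression(max_iter=1000)),
          ("Decision Tree", DecisionTreeClassifier(max_depth=5))]
for name, model in models:
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    print(f"{name}: AUC={roc_auc_score(y_test, proba):.3f} "
          f"precision={precision_score(y_test, pred):.3f} "
          f"F1={f1_score(y_test, pred):.3f}")
```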

    A Fast Quartet Tree Heuristic for Hierarchical Clustering

    The Minimum Quartet Tree Cost problem is to construct an optimal weight tree from the $3\binom{n}{4}$ weighted quartet topologies on $n$ objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as nonoptimal topologies). We present a Monte Carlo heuristic, based on randomized hill climbing, for approximating the optimal weight tree, given the quartet topology weights. The method repeatedly transforms a dendrogram, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. The problem and the solution heuristic have been extensively used for general hierarchical clustering of nontree-like (non-phylogeny) data in various domains and across domains with heterogeneous data. We also present a greatly improved heuristic, reducing the running time by a factor of order a thousand to ten thousand. All of this is implemented and available as part of the CompLearn package. We compare the performance and running time of the original and improved versions with those of UPGMA, BioNJ, and NJ, as implemented in the SplitsTree package, on genomic data for which the latter are optimized. Keywords: data and knowledge visualization; pattern matching; clustering; algorithms/similarity measures; hierarchical clustering; global optimization; quartet tree; randomized hill-climbing. Comment: LaTeX, 40 pages, 11 figures; this paper has substantial overlap with arXiv:cs/0606048 in cs.D
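
    A minimal sketch of the randomized hill-climbing idea follows (not the CompLearn implementation): the topology a candidate tree embeds for each quartet is read off via the four-point condition on path lengths, and a random mutation is kept only when the summed quartet weight does not decrease, which gives the monotonic approximation described above. The adjacency-dict tree representation and the simple leaf-swap mutation are stand-ins for the paper's richer operators, and rescoring all quartets after every mutation is deliberately naive.

```python
# Illustrative randomized hill climbing for the quartet tree problem.
import copy
import random
from collections import deque
from itertools import combinations

def random_tree(leaves):
    """Grow a random unrooted binary tree (adjacency-dict form) by
    attaching each new leaf to a randomly chosen existing edge."""
    adj = {leaves[0]: {leaves[1]}, leaves[1]: {leaves[0]}}
    for i, leaf in enumerate(leaves[2:]):
        edges = [(u, v) for u in adj for v in adj[u] if str(u) < str(v)]
        u, v = random.choice(edges)
        node = ("internal", i)               # subdivide edge (u, v)
        adj[u].discard(v); adj[v].discard(u)
        adj[u].add(node); adj[v].add(node)
        adj[node] = {u, v, leaf}
        adj[leaf] = {node}
    return adj

def leaf_distances(adj, leaves):
    """BFS path lengths from every leaf to all nodes."""
    dist = {}
    for s in leaves:
        d, q = {s: 0}, deque([s])
        while q:
            x = q.popleft()
            for y in adj[x]:
                if y not in d:
                    d[y] = d[x] + 1
                    q.append(y)
        dist[s] = d
    return dist

def embedded_topology(dist, a, b, c, d):
    """Which pairing ab|cd, ac|bd, or ad|bc the tree embeds: by the
    four-point condition, the embedded split has the smallest sum of
    within-pair path lengths."""
    sums = {(a, b): dist[a][b] + dist[c][d],
            (a, c): dist[a][c] + dist[b][d],
            (a, d): dist[a][d] + dist[b][c]}
    return min(sums, key=sums.get)

def hill_climb(leaves, weights, steps=500):
    """Random leaf swaps, keeping a mutant only if the summed quartet
    weight does not decrease (monotonic acceptance)."""
    def score(t):
        dist = leaf_distances(t, leaves)
        return sum(weights[(q, embedded_topology(dist, *q))]
                   for q in combinations(leaves, 4))
    tree = random_tree(leaves)
    best = score(tree)
    for _ in range(steps):
        cand = copy.deepcopy(tree)
        a, b = random.sample(leaves, 2)
        pa, pb = next(iter(cand[a])), next(iter(cand[b]))
        if pa == pb:                         # siblings: swap is a no-op
            continue
        cand[pa].discard(a); cand[pa].add(b)
        cand[pb].discard(b); cand[pb].add(a)
        cand[a], cand[b] = {pb}, {pa}
        s = score(cand)
        if s >= best:
            tree, best = cand, s
    return tree, best

# Toy usage: random weights for the three topologies of each quartet.
leaves = ["a", "b", "c", "d", "e"]
weights = {(q, (q[0], q[i])): random.random()
           for q in combinations(leaves, 4) for i in (1, 2, 3)}
_, cost = hill_climb(leaves, weights)
print("summed weight of embedded quartet topologies:", round(cost, 3))
```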

    An Overview on Application of Machine Learning Techniques in Optical Networks

    Today's telecommunication networks have become sources of enormous amounts of widely heterogeneous data. This information can be retrieved from network traffic traces, network alarms, signal quality indicators, users' behavioral data, etc. Advanced mathematical tools are required to extract meaningful information from these data and to make decisions pertaining to the proper functioning of the networks. Among these mathematical tools, Machine Learning (ML) is regarded as one of the most promising methodological approaches to perform network-data analysis and to enable automated network self-configuration and fault management. The adoption of ML techniques in the field of optical communication networks is motivated by the unprecedented growth in network complexity faced by optical networks in the last few years. This complexity increase is due to the introduction of a huge number of adjustable and interdependent system parameters (e.g., routing configurations, modulation format, symbol rate, coding schemes) that are enabled by the use of coherent transmission/reception technologies, advanced digital signal processing, and compensation of nonlinear effects in optical fiber propagation. In this paper we provide an overview of the application of ML to optical communications and networking. We classify and survey the relevant literature, and we also provide an introductory tutorial on ML for researchers and practitioners interested in this field. Although a good number of research papers have recently appeared, the application of ML to optical networks is still in its infancy; to stimulate further work in this area, we conclude the paper by proposing possible new research directions.
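
    As one concrete, hypothetical example of the kind of network-data analysis surveyed here (not drawn from the paper), the sketch below trains a classifier to predict whether a candidate lightpath will meet a quality-of-transmission target from a few invented indicators such as path length, span count, and symbol rate.

```python
# Hypothetical quality-of-transmission (QoT) classification sketch:
# predict from made-up network indicators whether a lightpath will be
# acceptable. Features, thresholds, and labels are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
length_km = rng.uniform(50, 2000, n)            # total lightpath length
n_spans = (length_km / 80).astype(int) + 1      # ~80 km amplifier spans
symbol_rate = rng.choice([32, 64], n)           # symbol rate in GBd

# Invented ground truth: long, high-rate paths tend to miss the target.
qot_ok = (length_km / 2000 + symbol_rate / 128
          + rng.normal(0, 0.15, n)) < 1.0

X = np.column_stack([length_km, n_spans, symbol_rate])
X_tr, X_te, y_tr, y_te = train_test_split(X, qot_ok, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("QoT classification accuracy:",
      round(accuracy_score(y_te, model.predict(X_te)), 3))
```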