Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting of the manual
definition of a classifier by domain experts) are very good effectiveness,
considerable savings in expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation.
Comment: Accepted for publication in ACM Computing Surveys
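The inductive process this survey describes, learning the characteristics of categories from preclassified documents, can be sketched with a minimal multinomial Naive Bayes text classifier, one of the standard learners in this literature. The toy corpus and its two categories below are invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """Learn per-category word counts and document priors."""
    word_counts = defaultdict(Counter)  # category -> word frequencies
    doc_counts = Counter()              # category -> number of documents
    vocab = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the category with the highest log-posterior (Laplace smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best, best_score = None, float("-inf")
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        total = sum(word_counts[label].values())
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

# Invented toy training set of preclassified documents.
corpus = [
    ("stocks fell sharply on wall street", "finance"),
    ("the central bank raised interest rates", "finance"),
    ("the team won the championship game", "sports"),
    ("the striker scored twice in the match", "sports"),
]
model = train(corpus)
print(classify("interest rates and stocks", model))    # -> finance
print(classify("the team scored in the game", model))  # -> sports
```

The classifier is built entirely from the labeled examples, with no hand-written rules, which is exactly the contrast with knowledge engineering that the abstract draws.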
A review of multi-instance learning assumptions
Multi-instance (MI) learning is a variant of inductive machine learning, where each learning example contains a bag of instances instead of a single feature vector. The term commonly refers to the supervised setting, where each bag is associated with a label. This type of representation is a natural fit for a number of real-world learning scenarios, including drug activity prediction and image classification, hence many MI learning algorithms have been proposed. Any MI learning method must relate instances to bag-level class labels, but many types of relationships between instances and class labels are possible. All early work in MI learning assumes a specific MI concept class known to be appropriate for a drug activity prediction domain; however, this ‘standard MI assumption’ is not guaranteed to hold in other domains. Much of the recent work in MI learning has concentrated on a relaxed view of the MI problem, where the standard MI assumption is dropped and alternative assumptions are considered instead. However, it is often not clearly stated which particular assumption is used and how it relates to other assumptions that have been proposed. In this paper, we aim to clarify the use of alternative MI assumptions by reviewing the work done in this area.
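The ‘standard MI assumption’ this review refers to states that a bag is positive if and only if at least one of its instances is positive. A minimal sketch, assuming a hypothetical threshold-based instance-level concept:

```python
# The instance-level concept (a single feature above a threshold) is an
# invented stand-in; in drug activity prediction it would be whether one
# molecular conformation binds to the target.
def instance_positive(x, threshold=0.5):
    """Hypothetical instance-level concept."""
    return x > threshold

def bag_label(bag):
    """Standard MI assumption: positive iff SOME instance is positive."""
    return any(instance_positive(x) for x in bag)

bags = {
    "all_low":  [0.1, 0.2, 0.3],   # no positive instance -> negative bag
    "one_high": [0.1, 0.9, 0.2],   # one positive instance -> positive bag
}
print({name: bag_label(b) for name, b in bags.items()})
# -> {'all_low': False, 'one_high': True}
```

The relaxed views surveyed in the paper replace `any` with other instance-to-bag relationships, such as requiring several positive instances or aggregating instance scores.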
Customer intimacy analytics : leveraging operational data to assess customer knowledge and relationships and to measure their business impact
The ability to capture customer needs and to tailor the provided solutions accordingly, also known as customer intimacy, has become a significant success factor in the B2B space - in particular for increasingly "servitizing" businesses. This book elaborates on the solution CI Analytics to assess and monitor the impact of customer intimacy strategies by leveraging business analytics and social network analysis technology. This solution thereby effectively complements existing CRM solutions.
Machine learning: Challenges and opportunities on credit risk
The constant challenge in anticipating the risk of default by borrowers has led financial
institutions to develop techniques and models to improve their credit risk monitoring, and to
predict how likely it is for certain customers to default on a loan, as well as how likely it is for
others to meet their financial obligations. Thus, it is interesting to investigate how financial
institutions can anticipate this occurrence using Machine Learning algorithms.
This dissertation aims to demonstrate the power of Machine Learning algorithms in credit
risk analysis, focusing on building the models, training them, and testing them
on data, and presenting the opportunities and challenges of Machine Learning that
remain open for future studies. For this purpose, we present two Machine Learning
classification
algorithms: Decision Trees and Logistic Regression. In addition, numerical results obtained
from various comparisons of these algorithms, which were programmed and run in Python using
the Jupyter Notebook application, are also presented. The initial sample data, consisting of 850
observations, contained credit details about borrowers in the United States of America, and is
freely available data. To compare the execution and performance of the two
models, Logistic Regression and Decision Trees, we used measures such as AUC,
precision, and F1-score.
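The three evaluation measures named above (AUC, precision, F1-score) can be sketched in a few lines of pure Python. The labels and model scores below are invented toy values, not the dissertation's data:

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def precision_f1(labels, preds):
    """Precision and F1-score from binary predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, 2 * precision * recall / (precision + recall)

labels = [1, 0, 1, 1, 0, 0]               # 1 = default, 0 = repaid (toy data)
scores = [0.9, 0.3, 0.7, 0.6, 0.4, 0.2]   # hypothetical model probabilities
preds = [1 if s >= 0.5 else 0 for s in scores]
print(auc(labels, scores))     # -> 1.0 (every default ranked above every repayment)
print(precision_f1(labels, preds))  # -> (1.0, 1.0)
```

AUC scores the ranking of defaulters above non-defaulters regardless of threshold, while precision and F1 depend on the chosen cutoff, which is why comparisons such as the one above typically report both kinds of measure.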
A Fast Quartet Tree Heuristic for Hierarchical Clustering
The Minimum Quartet Tree Cost problem is to construct an optimal weight tree
from the weighted quartet topologies on n objects, where
optimality means that the summed weight of the embedded quartet topologies is
optimal (so it can be the case that the optimal tree embeds all quartets as
nonoptimal topologies). We present a Monte Carlo heuristic, based on randomized
hill climbing, for approximating the optimal weight tree, given the quartet
topology weights. The method repeatedly transforms a dendrogram, with all
objects involved as leaves, achieving a monotonic approximation to the exact
single globally optimal tree. The problem and the solution heuristic have been
extensively used for general hierarchical clustering of nontree-like
(non-phylogeny) data in various domains and across domains with heterogeneous
data. We also present a greatly improved heuristic, reducing the running time
by a factor of order a thousand to ten thousand. All this is implemented and
available, as part of the CompLearn package. We compare performance and running
time of the original and improved versions with those of UPGMA, BioNJ, and NJ,
as implemented in the SplitsTree package on genomic data for which the latter
are optimized.
Keywords: Data and knowledge visualization, Pattern
matching--Clustering--Algorithms/Similarity measures, Hierarchical clustering,
Global optimization, Quartet tree, Randomized hill-climbing
Comment: LaTeX, 40 pages, 11 figures; this paper has substantial overlap with
arXiv:cs/0606048 in cs.D
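The randomized hill-climbing idea described above can be sketched in a much-simplified form: start from a random dendrogram and keep a random mutation only when it increases the summed weight of the embedded quartet topologies. The paper's heuristic uses a richer mutation set and an efficient cost update; here the only mutation is a leaf swap, the quartet weights are random, and the cost is recomputed from scratch:

```python
import itertools
import random

def random_tree(leaves, rng):
    """Build a random binary dendrogram as nested tuples over the leaves."""
    nodes = list(leaves)
    rng.shuffle(nodes)
    while len(nodes) > 1:
        nodes.append((nodes.pop(), nodes.pop()))
    return nodes[0]

def clades(tree, acc):
    """Collect the leaf set of every subtree; each set is one edge split."""
    if not isinstance(tree, tuple):
        return frozenset([tree])
    s = clades(tree[0], acc) | clades(tree[1], acc)
    acc.append(s)
    return s

def embedded_topology(splits, quartet):
    """The pairing ab|cd that the tree induces on a set of four leaves."""
    q = set(quartet)
    for s in splits:
        inside = q & s
        if len(inside) == 2:  # this edge separates the quartet 2-2
            return frozenset([frozenset(inside), frozenset(q - inside)])

def swap_leaves(tree, x, y):
    """Mutation operator: exchange two leaf labels."""
    if not isinstance(tree, tuple):
        return y if tree == x else (x if tree == y else tree)
    return (swap_leaves(tree[0], x, y), swap_leaves(tree[1], x, y))

def cost(tree, weights):
    """Summed weight of the quartet topologies embedded in the tree."""
    acc = []
    clades(tree, acc)
    return sum(w[embedded_topology(acc, q)] for q, w in weights.items())

def hill_climb(leaves, weights, steps, rng):
    """Keep a random leaf swap only when it improves the summed weight."""
    tree = random_tree(leaves, rng)
    best = cost(tree, weights)
    for _ in range(steps):
        x, y = rng.sample(leaves, 2)
        cand = swap_leaves(tree, x, y)
        c = cost(cand, weights)
        if c > best:
            tree, best = cand, c
    return tree, best

# Random weights for the three possible topologies of every quartet.
rng = random.Random(42)
leaves = list("abcdef")
weights = {}
for qt in itertools.combinations(leaves, 4):
    qs = frozenset(qt)
    weights[qs] = {}
    for other in qt[1:]:  # pair the first leaf with each of the other three
        pair = frozenset([qt[0], other])
        weights[qs][frozenset([pair, qs - pair])] = rng.random()

tree, best = hill_climb(leaves, weights, 300, rng)
print(tree, round(best, 3))
```

By construction the summed weight approximates the optimum monotonically, which mirrors the "monotonic approximation to the exact single globally optimal tree" claimed in the abstract, though this sketch makes no claim about reaching that optimum.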
An Overview on Application of Machine Learning Techniques in Optical Networks
Today's telecommunication networks have become sources of enormous amounts of
widely heterogeneous data. This information can be retrieved from network
traffic traces, network alarms, signal quality indicators, users' behavioral
data, etc. Advanced mathematical tools are required to extract meaningful
information from these data and to make decisions pertaining to the proper
functioning of the networks. Among these
mathematical tools, Machine Learning (ML) is regarded as one of the most
promising methodological approaches to perform network-data analysis and enable
automated network self-configuration and fault management. The adoption of ML
techniques in the field of optical communication networks is motivated by the
unprecedented growth of network complexity faced by optical networks in the
last few years. Such complexity increase is due to the introduction of a huge
number of adjustable and interdependent system parameters (e.g., routing
configurations, modulation format, symbol rate, coding schemes, etc.) that are
enabled by the usage of coherent transmission/reception technologies, advanced
digital signal processing and compensation of nonlinear effects in optical
fiber propagation. In this paper we provide an overview of the application of
ML to optical communications and networking. We classify and survey relevant
literature dealing with the topic, and we also provide an introductory tutorial
on ML for researchers and practitioners interested in this field. Although a
good number of research papers have recently appeared, the application of ML to
optical networks is still in its infancy: to stimulate further work in this
area, we conclude the paper by proposing possible new research directions.
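As one hypothetical instance of the network-data analysis this survey motivates, a nearest-centroid rule could flag a degraded optical link from monitored signal-quality indicators. The feature names and numbers below are purely illustrative, not drawn from the paper:

```python
import math

def centroid(rows):
    """Per-feature mean of a set of samples."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def nearest_centroid(sample, centroids):
    """Assign the label whose class centroid is closest to the sample."""
    return min(centroids, key=lambda label: math.dist(sample, centroids[label]))

# (OSNR in dB, bit-error-rate exponent) -- invented monitoring data.
training = {
    "normal":   [(22.0, -9.0), (21.5, -8.5), (23.0, -9.5)],
    "degraded": [(14.0, -4.0), (13.5, -3.5), (15.0, -4.5)],
}
centroids = {label: centroid(rows) for label, rows in training.items()}
print(nearest_centroid((21.0, -8.0), centroids))  # -> normal
print(nearest_centroid((14.5, -4.2), centroids))  # -> degraded
```

A real fault-management system would of course use far richer features and learners; the point of the sketch is only the workflow the survey describes: learn from network-generated data, then classify new observations automatically.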