10 research outputs found
Automatic Classification of Text Databases through Query Probing
Many text databases on the web are "hidden" behind search interfaces, and
their documents are only accessible through querying. Search engines typically
ignore the contents of such search-only databases. Recently, Yahoo-like
directories have started to manually organize these databases into categories
that users can browse to find these valuable resources. We propose a novel
strategy to automate the classification of search-only text databases. Our
technique starts by training a rule-based document classifier, and then uses
the classifier's rules to generate probing queries. The queries are sent to the
text databases, which are then classified based on the number of matches that
they produce for each query. We report initial exploratory experiments
showing that our approach is promising for automatically characterizing the
contents of text databases accessible on the web. Comment: 7 pages, 1 figure
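The probing procedure described in this abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the rules in `RULES`, the in-memory `count_matches` function standing in for a real search interface, and the `min_share` cutoff are hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical rules, as a rule-based document classifier might produce:
# each rule pairs a conjunction of words with a category.
RULES = [
    (("ibm", "computer"), "Computers"),
    (("jordan", "bulls"), "Sports"),
    (("hepatitis",), "Health"),
]


def count_matches(documents, terms):
    """Stand-in for a search interface: count documents containing all terms."""
    return sum(all(t in doc.lower().split() for t in terms) for doc in documents)


def classify_database(documents, rules, min_share=0.5):
    """Send one probe per rule; keep categories with at least min_share of all matches."""
    matches = Counter()
    for terms, category in rules:
        matches[category] += count_matches(documents, terms)
    total = sum(matches.values())
    if total == 0:
        return []  # no probe matched, so nothing can be said about the database
    return sorted(c for c, n in matches.items() if n / total >= min_share)


db = [
    "hepatitis vaccine trial results",
    "new hepatitis treatment approved",
    "ibm ships a new computer",
]
print(classify_database(db, RULES))
```

In a real setting only the match counts reported by the database would be available, which is why the sketch never inspects documents outside `count_matches`.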
Application of machine learning to combating money laundering and the financing of terrorism (Aplicação de machine learning no combate ao branqueamento de capitais e ao financiamento do terrorismo)
Master's in Quantitative Methods for Economic and Business Decision-Making. This work results from an internship at Quidgest, S.A. This master's final work deals with an application of machine learning to the problem of combating money laundering and the financing of terrorism. This problem is a known case of imbalanced data, so the issue is addressed throughout the work and several ways of handling it are presented. The concepts of Machine Learning, Data Mining, and Knowledge Discovery in Databases are also discussed. Within machine learning, this work focuses only on supervised algorithms, specifically the classifiers Random Forest, AdaBoost, and Boosting C5.0. These methods were applied to a data repository hosted in the Microsoft SQL Server database management system.
The research followed the CRISP-DM methodology and was implemented in R.
Causality-aware predictor selection for domain adaptation (Syy-seuraustietoinen ennustajavalinta ympäristöön mukautumiseen)
Despite progress in many areas of machine learning in recent decades, a change in the data source between the domain in which a model is trained and the domain in which the same model is used for prediction remains a fundamental and common problem. In domain adaptation, this situation has been studied by incorporating causal knowledge about the information flow between features into feature selection for the model. That work has shown promising results toward so-called invariant causal prediction, in which prediction performance is immune to changes between domains. Within these approaches, identifying the Markov blanket of the target variable has served as the principal workhorse for finding the optimal starting point.
In this thesis, we investigate more closely the property of invariant prediction performance within the Markov blanket of the target variable. We also include scenarios in which latent parents are part of the Markov blanket, to understand how the covariates related to a latent parent affect the invariant prediction properties. Before the experiments, we cover the concepts of Markov blankets, structural causal models, causal feature selection, covariate shift, and target shift. We also look into ways to measure bias between changing domains by introducing transfer bias and incomplete information bias; these biases play an important role in feature selection, which often involves a trade-off between them.
In the experiments, simulated data sets are generated from structural causal models to construct test scenarios with the changing conditions of interest. Across the scenarios, we investigate changes in the features of the Markov blanket between the training and prediction domains. Some scenarios also involve changes in latent covariates. As a result, we show that parent features are generally stable predictors that enable invariant prediction. An exception is a changing target, which requires more information about the changes in other, earlier domains to enable invariant prediction. Also, when latent parents are present, it is important to have some real direct causes in the feature set to achieve invariant prediction performance.
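The claim that parent features give stable predictions under domain change can be illustrated with a small simulation. The sketch below is a toy example under stated assumptions, not the thesis's experimental setup: the structural causal model X1 -> Y -> X2, its coefficients, the sign flip in the child mechanism, and the helper names `simulate`, `fit_ols`, and `mse` are all hypothetical.

```python
import random


def simulate(n, child_sign, rng):
    """Sample a toy SCM X1 -> Y -> X2; child_sign changes the child's mechanism."""
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0, 1)                      # parent of the target
        y = 2.0 * x1 + rng.gauss(0, 1)            # target
        x2 = child_sign * y + rng.gauss(0, 0.5)   # child of the target
        rows.append((x1, x2, y))
    return rows


def fit_ols(rows, cols):
    """Least squares without intercept on one or two feature columns."""
    if len(cols) == 1:
        c = cols[0]
        return [sum(r[c] * r[2] for r in rows) / sum(r[c] * r[c] for r in rows)]
    a, b = cols
    saa = sum(r[a] * r[a] for r in rows)
    sbb = sum(r[b] * r[b] for r in rows)
    sab = sum(r[a] * r[b] for r in rows)
    say = sum(r[a] * r[2] for r in rows)
    sby = sum(r[b] * r[2] for r in rows)
    det = saa * sbb - sab * sab
    # Solve the 2x2 normal equations by Cramer's rule.
    return [(say * sbb - sab * sby) / det, (saa * sby - sab * say) / det]


def mse(rows, cols, w):
    """Mean squared prediction error of the linear model with weights w."""
    return sum((r[2] - sum(wk * r[c] for wk, c in zip(w, cols))) ** 2
               for r in rows) / len(rows)


rng = random.Random(0)
train = simulate(2000, 1.0, rng)     # training domain
same = simulate(2000, 1.0, rng)      # prediction domain, unchanged
shifted = simulate(2000, -1.0, rng)  # prediction domain, child mechanism changed

w_parent = fit_ols(train, [0])
w_blanket = fit_ols(train, [0, 1])
for name, data in (("same", same), ("shifted", shifted)):
    print(name, mse(data, [0], w_parent), mse(data, [0, 1], w_blanket))
```

In this toy setting the model that uses both features is more accurate while the domain is unchanged, but its error grows sharply once the child mechanism changes, whereas the parent-only model keeps roughly the same error in both domains.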
Ensemble learning in the presence of noise
Unpublished doctoral thesis defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Defense date: 14-02-2019.
The availability of large amounts of data from diverse sources greatly expands the possibilities for intelligent exploitation of information. However, extracting knowledge from raw data is a complex task that requires the development of efficient and robust learning methods. One of the main difficulties in machine learning is the presence of noise in the data. In this thesis, we address the problem of machine learning in the presence of noise. To this end, we focus on the use of ensembles of classifiers. Our goal is to create collections of base learners whose outputs, when combined, improve not only the accuracy but also the robustness of the predictions.
A first contribution of this thesis is to exploit the subsampling ratio to build accurate and robust bootstrap-based ensembles of classifiers (such as bagging or random forests). The idea of using subsampling as a regularization mechanism is also exploited for the detection of noisy examples. Specifically, examples that are misclassified by more than a given fraction of the ensemble members are marked as noise. The optimal value of this threshold is determined by cross-validation. Noisy instances are either removed (filtering) or have their class labels corrected (cleaning). Finally, an ensemble of classifiers is built using the clean (filtered or cleaned) training data.
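The noise-detection step described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions of my own, not the thesis's implementation: the base learner is a one-dimensional decision stump, the data set is synthetic, and the subsampling ratio (0.2) and threshold (0.7) are fixed by hand, whereas the thesis selects the threshold by cross-validation.

```python
import random


def fit_stump(sample):
    """Pick the threshold t minimising training error for 'predict 1 if x > t'."""
    xs = sorted({x for x, _ in sample})
    candidates = ([xs[0] - 1.0]
                  + [(a + b) / 2 for a, b in zip(xs, xs[1:])]
                  + [xs[-1] + 1.0])
    return min(candidates, key=lambda t: sum((x > t) != bool(y) for x, y in sample))


def misclassification_rates(data, n_members=101, ratio=0.2, seed=0):
    """Fraction of bagged stumps, each trained on a subsample, that misclassify each example."""
    rng = random.Random(seed)
    size = max(1, int(ratio * len(data)))
    wrong = [0] * len(data)
    for _ in range(n_members):
        sample = [rng.choice(data) for _ in range(size)]  # bootstrap subsample
        t = fit_stump(sample)
        for i, (x, y) in enumerate(data):
            wrong[i] += (x > t) != bool(y)
    return [w / n_members for w in wrong]


# Synthetic data: the true label is 1 when x > 0; four labels are flipped as noise.
xs = [-1 + (2 * i + 1) / 100 for i in range(100)]
labels = [int(x > 0) for x in xs]
flipped = [5, 20, 80, 95]
for i in flipped:
    labels[i] = 1 - labels[i]
data = list(zip(xs, labels))

threshold = 0.7  # hand-picked here; an assumption, not a tuned value
rates = misclassification_rates(data)
flagged = [i for i, r in enumerate(rates) if r > threshold]

filtered = [d for i, d in enumerate(data) if i not in flagged]     # filtering
cleaned = [(x, 1 - y) if i in flagged else (x, y)                  # cleaning
           for i, (x, y) in enumerate(data)]
print(flagged)
```

Either `filtered` or `cleaned` would then serve as the training set for the final ensemble.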
Another contribution of this thesis is vote-boosting, a sequential ensemble method specifically designed to be robust to noise in the class labels. Vote-boosting reduces the excessive sensitivity to this type of noise of boosting-based algorithms such as AdaBoost. In general, boosting-based algorithms progressively modify the distribution of weights over the training data to emphasize misclassified instances. This greedy approach can end up assigning excessively high weight to instances whose class label is incorrect. By contrast, in vote-boosting the emphasis is based on the level of uncertainty (agreement or disagreement) of the ensemble prediction, independently of the class label. As with boosting, vote-boosting can be analyzed as a gradient-descent optimization in functional space.
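The idea of weighting instances by ensemble disagreement rather than by classification error can be sketched as follows. The emphasis function used here, p(1 - p) plus a small constant, is an illustrative assumption; the abstract does not state which function the thesis uses, and the name `disagreement_weights` is hypothetical.

```python
def disagreement_weights(votes, eps=1e-3):
    """Return normalised instance weights that grow with ensemble disagreement.

    votes[i] holds the 0/1 predictions of the current ensemble members for
    instance i. The class labels are never consulted.
    """
    raw = []
    for v in votes:
        p = sum(v) / len(v)            # fraction of members voting for class 1
        raw.append(p * (1 - p) + eps)  # largest at p = 0.5 (assumed emphasis function)
    total = sum(raw)
    return [r / total for r in raw]


votes = [
    [1, 1, 1, 1],  # unanimous ensemble
    [1, 0, 1, 0],  # maximal disagreement
    [0, 0, 0, 1],  # mild disagreement
]
weights = disagreement_weights(votes)
print(weights)
```

Because the weights depend only on the votes, an instance with an incorrect label that the ensemble confidently classifies does not accumulate weight, in contrast to error-driven reweighting.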
One of the open problems in ensemble learning is how to build combinations of strong classifiers. The main difficulty is achieving diversity among the base classifiers without significantly degrading their performance and without excessively increasing the computational cost. In this thesis, we propose building ensembles of SVMs with the help of randomization and optimization mechanisms. Thanks to this combination of complementary strategies, it is possible to create SVM ensembles that are much faster to train and potentially more accurate than a single optimized SVM.
Finally, we have developed a procedure for building heterogeneous ensembles that interpolate their decisions from homogeneous ensembles composed of different types of classifiers. The optimal composition of the ensemble is determined by cross-validation.