5 research outputs found
Duplicate detection among unstructured information from different data sources
This thesis reports the exploration of two approaches to detecting duplicates between the company descriptions in an internal database and those in an unstructured external source in commercial insurance.
Since it is costly and tedious for an insurer to collect the information required to calculate an insurance premium, our motivation is to help insurers minimize the resources needed by extracting that information directly from external databases. In this thesis, we first observed that similarity algorithms can detect most of the duplicates between databases using the company name: our experiments indicate that when the name is used as the basis of comparison between entities, a vast majority of these duplicates can be identified. Similar experiments using the address instead showed that duplicate companies can also be identified by this feature, but to a lesser extent. Subsequently, we trained machine learning models to match duplicate companies using the name and the address jointly; it is with these models that we observed the best results. In a final attempt to further improve our results, we used the N most likely entities to be a duplicate of a company, instead of only the first one, thus maximizing the recall at 91.07%.
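The name-based matching step described above can be sketched with a simple string-similarity ranking; here is a minimal illustration using Python's standard-library difflib (not the thesis's actual algorithms, and the company names are invented):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] after simple case/whitespace normalisation."""
    norm = lambda s: " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def top_n_candidates(query: str, candidates: list[str], n: int = 3) -> list[str]:
    """Rank external entities by name similarity and keep the N most likely,
    mirroring the relaxed top-N matching described in the abstract."""
    return sorted(candidates, key=lambda c: name_similarity(query, c), reverse=True)[:n]

matches = top_n_candidates("Acme Insurance Inc.",
                           ["ACME insurance incorporated", "Beta Corp", "Acme Roofing"])
```

In practice, systems of this kind often combine several measures (Levenshtein, Jaro-Winkler, token-based scores) as features for the machine learning step.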
Postal address parsing using deep neural networks
Nowadays, companies offering products and services need to know and locate the customer's address for a purchase to go through. This can be seen in the marketplace sector, in food delivery, or, in the case of this work, in the basic services of telecommunications companies such as landline telephony and internet access. Companies in the telco sector need to know the location of their customers in order to determine whether fibre optic or ADSL exists at the customer's address and who holds title to it, so as to assess the profitability of signing up that new customer. Determining ownership first requires determining the location, which means translating the customer's postal address into a Spanish housing identifier called GESCAL. Many brands depend on a Google API for this process, but Grupo MásMóvil has decided to build its own tool, allowing it to end its dependence on the tech giant with the sole purpose of reducing the cost of acquiring a new customer.
This work proposes a solution to a problem found in the tool designed by the group, which prevents it from achieving an effectiveness similar to that provided by the Google API. The goal of this thesis is to create an address parser that separates each of the Spanish postal addresses entered in the company's front ends into its corresponding fields, thereby facilitating the subsequent lookup of the address in the company's database and thus improving the overall effectiveness of the tool.
The document contains all the parts needed to build an address parser, from context and a theoretical survey of previously built parsers to the presentation and evaluation of different models. It also includes an appendix presenting the project plan and estimating the costs incurred by a company undertaking a similar project.
Doble Grado en Ingeniería Informática y Administración de Empresa
What's in the laundromat? Mapping and characterising offshore owned domestic property in London
The UK, particularly London, is a global hub for money laundering, a
significant portion of which uses domestic property. However, understanding the
distribution and characteristics of offshore domestic property in the UK is
challenging due to data availability. This paper attempts to remedy that
situation by enhancing a publicly available dataset of UK property owned by
offshore companies. We create a data processing pipeline which draws on several
datasets and machine learning techniques to create a parsed set of addresses
classified into six use classes. The enhanced dataset contains 138,000
properties, 44,000 more than the original dataset. The majority are domestic
(95k), with a disproportionate number of those in London (42k). The average
offshore domestic property in London is worth 1.33 million GBP; collectively,
this amounts to approximately 56 billion GBP. We perform an in-depth analysis
of the offshore domestic property in London, comparing the price, distribution
and entropy/concentration with Airbnb property, low-use/empty property and
conventional domestic property. We estimate that the total amount of offshore,
low-use and Airbnb property in London is between 144,000 and 164,000 and that
it is collectively worth between 145 and 174 billion GBP. Furthermore, offshore
domestic property is more expensive and has higher entropy/concentration than
all other property types. In addition, we identify two different types of
offshore property, nested and individual, which have different price and
distribution characteristics. Finally, we release the enhanced offshore
property dataset, the complete low-use London dataset and the pipeline for
creating the enhanced dataset to reduce the barriers to studying this topic.
Comment: 27 pages, 7 figures, 7 tables
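The entropy/concentration comparison can be illustrated with Shannon entropy over area shares, one common way to measure how evenly properties spread across boroughs; the counts below are invented for illustration and are not from the paper:

```python
import math
from collections import Counter

def shannon_entropy(counts: Counter) -> float:
    """Shannon entropy (bits) of a count distribution: lower values mean
    the properties are more concentrated in a few areas."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

# Invented borough counts for one property type
counts = Counter({"Westminster": 900, "Kensington": 700, "Camden": 100})
h = shannon_entropy(counts)  # compare across property types on the same set of areas
```

Comparing this value across property types (offshore, Airbnb, low-use, conventional) over the same set of areas is one plausible reading of the paper's entropy comparison.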
Automatic Identification of Addresses: A Systematic Literature Review
Cruz, P., Vanneschi, L., Painho, M., & Rita, P. (2022). Automatic Identification of Addresses: A Systematic Literature Review. ISPRS International Journal of Geo-Information, 11(1), 1-27. https://doi.org/10.3390/ijgi11010011
The work by Leonardo Vanneschi, Marco Painho and Paulo Rita was supported by Fundação para a Ciência e a Tecnologia (FCT) within the project UIDB/04152/2020 (Centro de Investigação em Gestão de Informação, MagIC). The work by Prof. Leonardo Vanneschi was also partially supported by FCT, Portugal, through funding of project AICE (DSAIPA/DS/0113/2019).
Address matching continues to play a central role at various levels, through geocoding and data integration from different sources, with a view to promoting activities such as urban planning, location-based services, and the construction of databases like those used in census operations. However, the task of address matching continues to face several challenges, such as non-standard or incomplete address records or addresses written in more complex languages. In order to better understand how current limitations can be overcome, this paper conducted a systematic literature review focused on automated approaches to address matching and their evolution across time. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were followed, resulting in a final set of 41 papers published between 2002 and 2021, the great majority of them after 2017, with Chinese authors leading the way. The main findings revealed a consistent move from more traditional approaches to deep learning methods based on semantics, encoder-decoder architectures, and attention mechanisms, as well as the very recent adoption of hybrid approaches making increased use of spatial constraints and entities. The adoption of evolutionary-based approaches and privacy-preserving methods stands out among the research gaps to address in future studies.