5 research outputs found
Duplicate detection among unstructured information from different data sources
This thesis reports the exploration of two approaches to detecting duplicates between the company descriptions in an internal database and those in an unstructured external source in commercial insurance.
Since it is costly and tedious for an insurer to collect the information required to calculate an insurance premium, our motivation is to help insurers minimize the resources needed by extracting that information directly from external databases. In this thesis, we first observed that similarity algorithms can detect most of the duplicates between databases using the company name: our experiments indicate that when the name is used as the basis of comparison between entities, a vast majority of these duplicates can be identified. Similar experiments using the address instead showed that duplicate companies can also be identified by this feature, but to a lesser extent. Subsequently, we trained machine learning models to match duplicate companies using the name and the address jointly; it is with these models that we observed the best results. In a final attempt to further improve our results, we used the N most likely entities to be a duplicate of a company, instead of only the first one, thus maximizing the recall at 91.07%.
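The name-based matching step described above can be sketched with a simple string-similarity ranking; here is a minimal illustration using Python's standard-library difflib (not the thesis's actual algorithms, and the company names are invented):

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] after simple case/whitespace normalisation."""
    norm = lambda s: " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def top_n_candidates(query: str, candidates: list[str], n: int = 3) -> list[str]:
    """Rank external entities by name similarity and keep the N most likely,
    mirroring the relaxed top-N matching described in the abstract."""
    return sorted(candidates, key=lambda c: name_similarity(query, c), reverse=True)[:n]

matches = top_n_candidates("Acme Insurance Inc.",
                           ["ACME insurance incorporated", "Beta Corp", "Acme Roofing"])
```

In practice, systems of this kind often combine several measures (Levenshtein, Jaro-Winkler, token-based scores) as features for the machine learning step.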
Postal address parsing using deep neural networks
Nowadays, companies offering products and services need to know and locate the customer's address for a purchase to go through. This can be seen in the marketplace sector, in food delivery, or, in the case of this work, in the basic services of telecommunications companies such as landline telephony and internet access. Companies in the telco sector need to know the location of their customers in order to determine whether fibre optic or ADSL exists at the customer's address and who holds title to it, so as to assess the profitability of signing up that new customer. Determining ownership first requires determining the location, which means translating the customer's postal address into a Spanish housing identifier called GESCAL. Many brands depend on a Google API for this process, but Grupo MásMóvil has decided to build its own tool, allowing it to end its dependence on the tech giant with the sole purpose of reducing the cost of acquiring a new customer.
This work proposes a solution to a problem found in the tool designed by the group, which prevents it from achieving an effectiveness similar to that provided by the Google API. The goal of this thesis is to create an address parser that separates each of the Spanish postal addresses entered in the company's front ends into its corresponding fields, thereby facilitating the subsequent lookup of the address in the company's database and thus improving the overall effectiveness of the tool.
The document contains all the parts needed to build an address parser, from context and a theoretical survey of previously built parsers to the presentation and evaluation of different models. It also includes an appendix presenting the project plan and estimating the costs incurred by a company undertaking a similar project.
Doble Grado en Ingeniería Informática y Administración de Empresa
What's in the laundromat? Mapping and characterising offshore owned domestic property in London
The UK, particularly London, is a global hub for money laundering, a
significant portion of which uses domestic property. However, understanding the
distribution and characteristics of offshore domestic property in the UK is
challenging due to data availability. This paper attempts to remedy that
situation by enhancing a publicly available dataset of UK property owned by
offshore companies. We create a data processing pipeline which draws on several
datasets and machine learning techniques to create a parsed set of addresses
classified into six use classes. The enhanced dataset contains 138,000
properties, 44,000 more than the original dataset. The majority are domestic
(95k), with a disproportionate number of those in London (42k). The average
offshore domestic property in London is worth 1.33 million GBP; collectively,
this amounts to approximately 56 billion GBP. We perform an in-depth analysis
of the offshore domestic property in London, comparing the price, distribution
and entropy/concentration with Airbnb property, low-use/empty property and
conventional domestic property. We estimate that the total amount of offshore,
low-use and Airbnb property in London is between 144,000 and 164,000 and that
it is collectively worth between 145 and 174 billion GBP. Furthermore, offshore
domestic property is more expensive and has higher entropy/concentration than
all other property types. In addition, we identify two different types of
offshore property, nested and individual, which have different price and
distribution characteristics. Finally, we release the enhanced offshore
property dataset, the complete low-use London dataset and the pipeline for
creating the enhanced dataset to reduce the barriers to studying this topic.
Comment: 27 pages, 7 figures, 7 tables
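The entropy/concentration comparison can be illustrated with Shannon entropy over area shares, one common way to measure how evenly properties spread across boroughs; the counts below are invented for illustration and are not from the paper:

```python
import math
from collections import Counter

def shannon_entropy(counts: Counter) -> float:
    """Shannon entropy (bits) of a count distribution: lower values mean
    the properties are more concentrated in a few areas."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

# Invented borough counts for one property type
counts = Counter({"Westminster": 900, "Kensington": 700, "Camden": 100})
h = shannon_entropy(counts)  # compare across property types on the same set of areas
```

Comparing this value across property types (offshore, Airbnb, low-use, conventional) over the same set of areas is one plausible reading of the paper's entropy comparison.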
Automatic Identification of Addresses: A Systematic Literature Review
Cruz, P., Vanneschi, L., Painho, M., & Rita, P. (2022). Automatic Identification of Addresses: A Systematic Literature Review. ISPRS International Journal of Geo-Information, 11(1), 1-27. https://doi.org/10.3390/ijgi11010011
The work by Leonardo Vanneschi, Marco Painho and Paulo Rita was supported by Fundação para a Ciência e a Tecnologia (FCT) within the project UIDB/04152/2020 (Centro de Investigação em Gestão de Informação, MagIC). The work by Prof. Leonardo Vanneschi was also partially supported by FCT, Portugal, through funding of project AICE (DSAIPA/DS/0113/2019).
Address matching continues to play a central role at various levels, through geocoding and data integration from different sources, with a view to promoting activities such as urban planning, location-based services, and the construction of databases like those used in census operations. However, the task of address matching continues to face several challenges, such as non-standard or incomplete address records or addresses written in more complex languages. In order to better understand how current limitations can be overcome, this paper conducted a systematic literature review focused on automated approaches to address matching and their evolution across time. The Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines were followed, resulting in a final set of 41 papers published between 2002 and 2021, the great majority of them after 2017, with Chinese authors leading the way. The main findings revealed a consistent move from more traditional approaches to deep learning methods based on semantics, encoder-decoder architectures, and attention mechanisms, as well as the very recent adoption of hybrid approaches making increased use of spatial constraints and entities. The adoption of evolutionary-based approaches and privacy-preserving methods stands out among the research gaps to address in future studies.