4 research outputs found

    A hybrid quantum approach to leveraging data from HTML tables

    The Web provides a great deal of data encoded in HTML tables. This encoding makes the data easy to render, but it obscures their structure and makes it difficult for automated business processes to leverage them. This has motivated many authors to work on proposals to extract them as automatically as possible. In this article, we present a new unsupervised proposal that uses a hybrid approach in which a standard computer performs the pre- and post-processing tasks and a quantum computer performs the core task: guessing whether the cells hold labels or values. The problem is addressed using a clustering approach that is known to be NP using standard computers, but our proposal can solve it in polynomial time, which implies a significant performance improvement. It is novel in that it relies on an entropy-preservation metaphor that has proven to work very well on two large collections of real-world tables from Wikipedia and the Dresden Web Table Corpus. Our experiments show that our proposal can beat the state-of-the-art proposal in terms of both effectiveness and efficiency; the key difference is that our proposal is totally unsupervised, whereas the state-of-the-art proposal is supervised.
    Funding: Ministerio de Economía y Competitividad TIN2016-75394-R; Ministerio de Ciencia e Innovación PID2020-112540RB-C44; Junta de Andalucía P18-RT-106

    A clustering approach to extract data from HTML tables

    HTML tables have become pervasive on the Web. Extracting their data automatically is difficult because finding the relationships between their cells is not trivial due to the many different layouts, encodings, and formats available. In this article, we introduce Melva, an unsupervised, domain-agnostic proposal to extract data from HTML tables without requiring any external knowledge bases. It relies on a clustering approach that helps tell label cells apart from value cells and establish their relationships. We compared Melva to four competitors on more than 3 000 HTML tables from Wikipedia and the Dresden Web Table Corpus. The conclusion is that our proposal is 21.70% better than the best unsupervised competitor and equals the best supervised competitor regarding effectiveness, but it is 99.14% better regarding efficiency.
    Funding: Ministerio de Ciencia e Innovación PID2020-112540RB-C44; Ministerio de Economía y Competitividad TIN2016-75394-R; Junta de Andalucía P18-RT-106

    Chatbot for customer service in legal advice at Estudio Rodríguez Angobaldo

    This work is a study carried out for Estudio Rodríguez Angobaldo on a virtual assistant, or chatbot, for customer service in legal advice. Before the virtual assistant was deployed, the firm had a problem delivering its service: clients who tried to get in touch through Facebook or a web page that described the process were left dissatisfied because there were not enough advisors to attend to them. For this reason, the theoretical aspects of the business chat service process are discussed first. The Common KADS methodology was used to develop the intelligent agents best suited to the needs of this project. In addition, the Dialog Flow tool was used to develop a communication flow derived from the knowledge base. The aim of this study is to support the customer service process by using the virtual assistant as a means of responding to client requests.

    Enterprise Data Integration: On Extracting Data from HTML Tables

    The Web is a universal communication channel that provides a vast amount of valuable data about a plethora of topics. In recent years, there has been a quick rise of data-hungry products and services, which has motivated the need for ways to extract web data to feed them with as little effort as possible. HTML tables are a source of up-to-date data that is not being extracted and loaded into major knowledge bases in an automated manner. Extracting them is challenging because there are several common layouts in which data are displayed and they present several encoding and formatting problems; furthermore, the available general-purpose data extractors ignore the particularities of HTML table encodings and do not suffice to deal with the intricacies of web tables. In this dissertation, we have studied the problem of extracting data from HTML tables with no supervision. After completing an extensive review of the literature, we realised that none of the available table-specific proposals provided a holistic approach to solve this problem. This motivated us to work on TOMATE, a table extraction proposal that encompasses every table extraction task with an emphasis on the crucial task of identifying cell functions. Our experimental analysis showed that we have advanced the state of the art with several proposals that are intended to help both researchers and practitioners. While working on this dissertation, we have developed a number of marginal contributions, namely: Aquila, a proposal to synthesise meta-data tags for HTML documents; Kizomba, a general-purpose extractor of web data; and Romulo, a proposal to cluster data. Furthermore, we have collaborated internationally on the inception of a start-up project called Stargazr, where we hope to put much of the knowledge generated in this dissertation into practice.