    A clustering approach to extract data from HTML tables

    HTML tables have become pervasive on the Web. Extracting their data automatically is difficult because finding the relationships between their cells is not trivial, given the many different layouts, encodings, and formats in use. In this article, we introduce Melva, an unsupervised, domain-agnostic proposal to extract data from HTML tables without requiring any external knowledge bases. It relies on a clustering approach that helps tell label cells apart from value cells and establish their relationships. We compared Melva to four competitors on more than 3 000 HTML tables from Wikipedia and the Dresden Web Table Corpus. The conclusion is that our proposal is 21.70% better than the best unsupervised competitor and equals the best supervised competitor regarding effectiveness, but it is 99.14% better regarding efficiency. Funding: Ministerio de Ciencia e Innovación PID2020-112540RB-C44; Ministerio de Economía y Competitividad TIN2016-75394-R; Junta de Andalucía P18-RT-106
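
    To illustrate what "establishing relationships" between label cells and value cells can mean in the simplest case, the sketch below pairs every value cell with the label at the top of its column. It is not Melva's algorithm: the function name `to_records` and the assumption that the labels occupy a single, already known row are hypothetical simplifications chosen for illustration.

```python
from typing import Dict, List


def to_records(table: List[List[str]], label_row: int = 0) -> List[Dict[str, str]]:
    """Pair each value cell with the label cell at the top of its column.

    Assumption (not from the abstract): the labels sit in one known row.
    """
    labels = [cell.strip() for cell in table[label_row]]
    records = []
    for r, row in enumerate(table):
        if r == label_row:
            continue  # the label row itself carries no values
        records.append({label: cell.strip() for label, cell in zip(labels, row)})
    return records


table = [
    ["Year", "Sales"],
    ["2019", "120"],
    ["2020", "150"],
]
records = to_records(table)
```

    Real tables rarely have such a regular layout, which is why the label row cannot usually be assumed and has to be discovered first.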

    TOMATE: A heuristic-based approach to extract data from HTML tables

    Extracting data from user-friendly HTML tables is difficult because of their different layouts, formats, and encoding problems. In this article, we present a new proposal that first applies several pre-processing heuristics to clean the tables, then performs functional analysis, and finally applies some post-processing heuristics to produce the output. Our most important contribution concerns functional analysis, which we address by projecting the cells onto a high-dimensional feature space in which a standard clustering technique is used to tell the meta-data cells apart from the data cells. We experimented with two large repositories of real-world HTML tables, and our results confirm that our proposal can extract data from them with an F1 score of 89.50% in just 0.09 CPU seconds per table. We compared our proposal with several competitors, and the statistical analysis confirmed its superiority in terms of effectiveness, while it remains very competitive in terms of efficiency. Funding: Ministerio de Economía y Competitividad TIN2013-40848-R; Ministerio de Economía y Competitividad TIN2016-75394-R; Junta de Andalucía P18-RT-1060; Ministerio de Ciencia e Innovación PID2020-112540RB-C4
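
    The idea of projecting cells onto a feature space and clustering them can be sketched as follows. This is a minimal illustration, not the TOMATE implementation: the two features (digit ratio and a first-row flag), the plain two-cluster k-means, and the rule that the cluster with fewer digits holds the meta-data cells are all assumptions made for the example, and every function name is hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def cell_features(text: str, row: int) -> Vector:
    """Hypothetical two-dimensional features: digit ratio and a first-row flag."""
    stripped = text.strip()
    digits = sum(ch.isdigit() for ch in stripped)
    digit_ratio = digits / len(stripped) if stripped else 0.0
    return (digit_ratio, 1.0 if row == 0 else 0.0)


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points: Sequence[Vector]) -> Vector:
    return tuple(sum(p[i] for p in points) / len(points) for i in range(len(points[0])))


def two_means(points: Sequence[Vector], iterations: int = 10) -> List[int]:
    """Plain k-means with k=2 and a deterministic start (first point, farthest point)."""
    centroids = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [
            0 if sq_dist(p, centroids[0]) <= sq_dist(p, centroids[1]) else 1
            for p in points
        ]
        for k in (0, 1):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = centroid(members)
    return assignment


def split_cells(table: List[List[str]]):
    """Return (meta-data positions, data positions) as lists of (row, col)."""
    positions = [(r, c) for r, row in enumerate(table) for c in range(len(row))]
    points = [cell_features(table[r][c], r) for r, c in positions]
    assignment = two_means(points)

    def mean_digit_ratio(k: int) -> float:
        ratios = [p[0] for p, a in zip(points, assignment) if a == k]
        return sum(ratios) / len(ratios) if ratios else float("inf")

    # Assumed heuristic: the cluster with fewer digits holds the meta-data cells.
    meta = 0 if mean_digit_ratio(0) <= mean_digit_ratio(1) else 1
    meta_cells = [pos for pos, a in zip(positions, assignment) if a == meta]
    data_cells = [pos for pos, a in zip(positions, assignment) if a != meta]
    return meta_cells, data_cells


table = [
    ["Year", "Sales", "Profit"],
    ["2019", "120", "30"],
    ["2020", "150", "45"],
]
meta_cells, data_cells = split_cells(table)
```

    On this toy table the header row ends up in one cluster and the numeric body in the other. A realistic feature space would need many more dimensions (for example style, markup, and position features) to cope with headers that contain numbers or data cells that contain text.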

    Table-to-Text: Generating Descriptive Text for Scientific Tables from Randomized Controlled Trials

    Unprecedented amounts of data have been generated in the biomedical domain, and the bottleneck for biomedical research has shifted from data generation to data management, interpretation, and communication. Therefore, it is highly desirable to develop systems that assist in text generation from biomedical data, which would greatly improve the dissemination of scientific findings. However, very few studies have investigated data-to-text generation in the biomedical domain. Here I present a systematic study of generating descriptive text from tables in randomized clinical trial (RCT) articles, which includes: (1) an information model for representing RCT tables; (2) annotated corpora containing pairs of RCT tables and descriptive text, with labeled structural and semantic information of the RCT tables; (3) methods for recognizing structural and semantic information of RCT tables; (4) methods for generating text from RCT tables, evaluated by a user study on three aspects: relevance, grammatical quality, and matching. The proposed hybrid text generation method achieved a low bilingual evaluation understudy (BLEU) score of 5.69, but human review yielded scores of 9.3, 9.9, and 9.3 for relevance, grammatical quality, and matching, respectively, which are comparable to those given to the original human-written text. To the best of our knowledge, this is the first study to generate text from scientific tables in the biomedical domain. The proposed information model, labeled corpora, and developed methods for recognizing tables and generating descriptive text could also facilitate other biomedical and informatics research and applications.
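
    A gap between a low BLEU score and high human ratings is easier to interpret with the metric's definition in view: BLEU combines clipped n-gram precisions with a brevity penalty, so a fluent paraphrase that shares few exact n-grams with the reference scores poorly. The sketch below is a simplified, unsmoothed, single-reference version written for illustration; it is not the evaluation script used in the study, and the example sentences are invented.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: List[str], reference: List[str], max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU against one reference, in the range 0..1."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # The brevity penalty punishes candidates shorter than the reference.
    if len(candidate) >= len(reference):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(reference) / len(candidate))
    return penalty * math.exp(sum(log_precisions) / max_n)


reference = "the mean age was 52 years".split()
near_copy = "the mean age was 54 years".split()
paraphrase = "participants were 52 years old on average".split()

score_same = bleu(reference, reference)
score_near = bleu(near_copy, reference, max_n=2)
score_para = bleu(paraphrase, reference)
```

    The paraphrase conveys much the same content as the reference but shares no four-word sequence with it, so its unsmoothed score is zero, whereas a human reader could still judge it relevant and grammatical.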

    Automated system for processing structured data

    The Master's thesis contains 111 pages, 18 figures, 36 tables, 9 appendices, and 40 sources. Object of research: structured data. Subject of research: automation of structured data processing. Relevance: data processing is used every day to make decisions, analyse past events, and forecast future ones. The low flexibility of information systems in the face of constant change creates a need for secondary data processing, which is often performed by hand. This thesis aims to create a software solution that provides simple and intuitive functionality for automated processing of small-scale structured data. Its use will significantly reduce the time a person needs to process the data and, as a result, the time needed to obtain valuable information from it. Aim of the study: to analyse methods for automating the processing of small structured data and to implement an automated structured data processing system. To achieve this aim, the following tasks were set: • analyse the subject area and existing solutions; • formulate the requirements for the system; • design, implement, and test the automated structured data processing system according to the stated requirements; • conduct a marketing analysis for a startup project based on the implemented system.