23 research outputs found

    Extracting Web Information using Representation Patterns

    Get PDF
    Feeding decision support systems with Web information typically requires sifting through an unwieldy amount of information that is available in human-friendly formats only. Our focus is on a scalable proposal to extract information from semi-structured documents in a structured format, with an emphasis on it being scalable and open. By semi-structured we mean that it must focus on informa tion that is rendered using regular formats, not free text; by scal able, we mean that the system must require a minimum amount of human intervention and it must not be targeted to extracting in formation from a particular domain or web site; by open, we mean that it must extract as much useful information as possible and not be subject to any pre-defined data model. In the literature, there is only one open but not scalable proposal, since it requires human supervision on a per-domain basis. In this paper, we present a new proposal that relies on a number of heuristics to identify patterns that are typically used to represent the information in a web docu ment. Our experimental results confirm that our proposal is very competitive in terms of effectiveness and efficiency.Ministerio de Economía y Competitividad TIN2016-75394-RMinisterio de Economía y Competitividad TIN2013-40848-

    On exploring data lakes by finding compact, isolated clusters

    Get PDF
    Data engineers are very interested in data lake technologies due to the incredible abun dance of datasets. They typically use clustering to understand the structure of the datasets before applying other methods to infer knowledge from them. This article presents the first proposal that explores how to use a meta-heuristic to address the problem of multi-way single-subspace automatic clustering, which is very appropriate in the context of data lakes. It was confronted with five strong competitors that combine the state-of-the-art attribute selection proposal with three classical single-way clustering proposals, a recent quantum-inspired one, and a recent deep-learning one. The evaluation focused on explor ing their ability to find compact and isolated clusterings as well as the extent to which such clusterings can be considered good classifications. The statistical analyses conducted on the experimental results prove that it ranks the first regarding effectiveness using six stan dard coefficients and it is very efficient in terms of CPU time, not to mention that it did not result in any degraded clusterings or timeouts. Summing up: this proposal contributes to the array of techniques that data engineers can use to explore their data lakesMinisterio de Economía y Competitividad TIN2016-75394-RMinisterio de Ciencia e Innovación PID2020-112540RB-C44Junta de Andalucía P18-RT-1060Junta de Andalucía US-138137

    On Extracting Data from Tables that are Encoded using HTML

    Get PDF
    Tables are a common means to display data in human-friendly formats. Many authors have worked on proposals to extract those data back since this has many interesting applications. In this article, we summarise and compare many of the proposals to extract data from tables that are encoded using HTML and have been published between 2000 and 2018. We first present a vocabulary that homogenises the terminology used in this field; next, we use it to summarise the proposals; finally, we compare them side by side. Our analysis highlights several challenges to which no proposal provides a conclusive solution and a few more that have not been addressed sufficiently; simply put, no proposal provides a complete solution to the problem, which seems to suggest that this research field shall keep active in the near future. We have also realised that there is no consensus regarding the datasets and the methods used to evaluate the proposals, which hampers comparing the experimental results.Ministerio de Economía y Competitividad TIN2013-40848-RMinisterio de Economía y Competitividad TIN2016-75394-

    A hybrid quantum approach to leveraging data from HTML tables

    Get PDF
    The Web provides many data that are encoded using HTML tables. This facilitates rendering them, but obfuscates their structure and makes it difficult for automated business processes to leverage them. This has motivated many authors to work on proposals to extract them as automatically as possible. In this article, we present a new unsupervised proposal that uses a hybrid approach in which a standard computer is used to perform pre and post-processing tasks and a quantum computer is used to perform the core task: guessing whether the cells have labels or values. The problem is addressed using a clustering approach that is known to be NP using standard computers, but our proposal can solve it in polynomial time, which implies a significant performance improvement. It is novel in that it relies on an entropy-preservation metaphor that has proven to work very well on two large collections of real-world tables from the Wikipedia and the Dresden Web Table Corpus. Our experiments prove that our proposal can beat the state-of-the-art proposal in terms of both effectiveness and efficiency; the key difference is that our proposal is totally unsupervised, whereas the state-of-the-art proposal is supervised.Ministerio de Economía y Competitividad TIN2016-75394-RMinisterio de Ciencia e Innovación PID2020-112540RB-C44Junta de Andalucía P18-RT-106

    A clustering approach to extract data from HTML tables

    Get PDF
    HTML tables have become pervasive on the Web. Extracting their data automatically is difficult because finding the relationships between their cells is not trivial due to the many different layouts, encodings, and formats available. In this article, we introduce Melva, which is an unsupervised domain-agnostic proposal to extract data from HTML tables without requiring any external knowledge bases. It relies on a clustering approach that helps make label cells apart from value cells and establish their relationships. We compared Melva to four competitors on more than 3 000 HTML tables from the Wikipedia and the Dresden Web Table Corpus. The conclusion is that our proposal is 21.70% better than the best unsupervised competitor and equals the best supervised competitor regarding effectiveness, but it is 99.14% better regarding efficiencyMinisterio de Ciencia e Innovación PID2020-112540RB-C44Ministerio de Economía y Competitividad TIN2016-75394-RJunta de Andalucía P18-RT-106

    TOMATE: A heuristic-based approach to extract data from HTML tables

    Get PDF
    Extracting data from user-friendly HTML tables is difficult because of their different lay outs, formats, and encoding problems. In this article, we present a new proposal that first applies several pre-processing heuristics to clean the tables, then performs functional anal ysis, and finally applies some post-processing heuristics to produce the output. Our most important contribution is regarding functional analysis, which we address by projecting the cells onto a high-dimensional feature space in which a standard clustering technique is used to make the meta-data cells apart from the data cells. We experimented with two large repositories of real-world HTML tables and our results confirm that our proposal can extract data from them with an F1 score of 89:50% in just 0:09 CPU seconds per table. We confronted our proposal with several competitors and the statistical analysis confirmed its superiority in terms of effectiveness, while it keeps very competitive in terms of efficiency.Ministerio de Economía y Competitividad TIN2013-40848-RMinisterio de Economía y Competitividad TIN2016-75394-RJunta de Andalucía P18-RT-1060Ministerio de Ciencia e Innovación PID2020-112540RB-C4

    On the synthesis of metadata tags for HTML files

    Get PDF
    RDFa, JSON-LD, Microdata, and Microformats allow to endow the data in HTML files with metadata tags that help software agents understand them. Unluckily, there are many HTML files that do not have any metadata tags, which has motivated many authors to work on proposals to synthesize them. But they have some problems: the authors either provide an overall picture of their designs without too many details on the techniques behind the scenes or focus on the techniques but do not describe the design of the software systems that support them; many of them cannot deal with data that are encoded using semistructured formats like forms, listings, or tables; and the few proposals that can work on tables can deal with horizontal listings only. In this article, we describe the design of a system that overcomes the previous limitations using a novel embedding approach that has proven to outperform four state-of-the-art techniques on a repository with randomly selected HTML files from 40 differ ent sites. According to our experimental analysis, our proposal can achieve an F1 score that outperforms the others by 10.14%; this difference was confirmed to be statistically significant at the standard confidence level.Junta de Andalucía P18-RT-1060Ministerio de Economía y Competitividad TIN2013-40848-RMinisterio de Economía y Competitividad TIN2016-75394-

    Tocilizumab in refractory Caucasian Takayasu's arteritis: a multicenter study of 54 patients and literature review

    Get PDF
    Objective: To assess the efficacy and safety of tocilizumab (TCZ) in Caucasian patients with refractory Takayasu's arteritis (TAK) in clinical practice. Methods: A multicenter study of Caucasian patients with refractory TAK who received TCZ. The outcome variables were remission, glucocorticoid-sparing effect, improvement in imaging techniques, and adverse events. A comparative study between patients who received TCZ as monotherapy (TCZMONO) and combined with conventional disease modifying anti-rheumatic drugs (cDMARDs) (TCZCOMBO) was performed. Results: The study comprised 54 patients (46 women/8 men) with a median [interquartile range (IQR)] age of 42.0 (32.5-50.5) years. TCZ was started after a median (IQR) of 12.0 (3.0-31.5) months since TAK diagnosis. Remission was achieved in 12/54 (22.2%), 19/49 (38.8%), 23/44 (52.3%), and 27/36 (75%) patients at 1, 3, 6, and 12 months, respectively. The prednisone dose was reduced from 30.0 mg/day (12.5-50.0) to 5.0 (0.0-5.6) mg/day at 12 months. An improvement in imaging findings was reported in 28 (73.7%) patients after a median (IQR) of 9.0 (6.0-14.0) months. Twenty-three (42.6%) patients were on TCZMONO and 31 (57.4%) on TCZCOMBO: MTX (n = 28), cyclosporine A (n = 2), azathioprine (n = 1). Patients on TCZCOMBO were younger [38.0 (27.0-46.0) versus 45.0 (38.0-57.0)] years; difference (diff) [95% confidence interval (CI) = -7.0 (-17.9, -0.56] with a trend to longer TAK duration [21.0 (6.0-38.0) versus 6.0 (1.0-23.0)] months; diff 95% CI = 15 (-8.9, 35.5), and higher c-reactive protein [2.4 (0.7-5.6) versus 1.3 (0.3-3.3)] mg/dl; diff 95% CI = 1.1 (-0.26, 2.99). Despite these differences, similar outcomes were observed in both groups (log rank p = 0.862). Relevant adverse events were reported in six (11.1%) patients, but only three developed severe events that required TCZ withdrawal. Conclusion: TCZ in monotherapy, or combined with cDMARDs, is effective and safe in patients with refractory TAK of Caucasian origin.Funding: This work was partially supported by RETICS Programs, RD08/0075 (RIER), RD12/0009/0013 and RD16/0012 from “Instituto de Salud Carlos III” (ISCIII) (Spain)

    Clonal chromosomal mosaicism and loss of chromosome Y in elderly men increase vulnerability for SARS-CoV-2

    Full text link
    The pandemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2, COVID-19) had an estimated overall case fatality ratio of 1.38% (pre-vaccination), being 53% higher in males and increasing exponentially with age. Among 9578 individuals diagnosed with COVID-19 in the SCOURGE study, we found 133 cases (1.42%) with detectable clonal mosaicism for chromosome alterations (mCA) and 226 males (5.08%) with acquired loss of chromosome Y (LOY). Individuals with clonal mosaic events (mCA and/or LOY) showed a 54% increase in the risk of COVID-19 lethality. LOY is associated with transcriptomic biomarkers of immune dysfunction, pro-coagulation activity and cardiovascular risk. Interferon-induced genes involved in the initial immune response to SARS-CoV-2 are also down-regulated in LOY. Thus, mCA and LOY underlie at least part of the sex-biased severity and mortality of COVID-19 in aging patients. Given its potential therapeutic and prognostic relevance, evaluation of clonal mosaicism should be implemented as biomarker of COVID-19 severity in elderly people. Among 9578 individuals diagnosed with COVID-19 in the SCOURGE study, individuals with clonal mosaic events (clonal mosaicism for chromosome alterations and/or loss of chromosome Y) showed an increased risk of COVID-19 lethality
    corecore