110 research outputs found

Web Content Extraction - a Meta-Analysis of its Past and Thoughts on its Future

Authors: Valter Crescenzi, Thomas Gottron, Paolo Merialdo, Rodrigo Palacios, Tim Weninger
Publication date: 01/01/2015

In this paper, we present a meta-analysis of several Web content extraction algorithms and make recommendations for the future of content extraction on the Web. First, we find that nearly all Web content extractors fail to consider a very large, and growing, portion of modern Web pages. Second, it is well understood that wrapper induction extractors tend to break as the Web changes; heuristic/feature engineering extractors were thought to be immune to a Web site's evolution, but we find that this is not the case: heuristic content extractor performance also tends to degrade over time due to the evolution of Web site forms and practices. We conclude with recommendations for future work that address these and other findings.

Comment: Accepted for publication in SIGKDD Explorations

arXiv.org e-Print Archive

Archivio della Ricerca - Università di Roma 3

Towards Knowledge Discovery from the Vatican Secret Archives. In Codice Ratio -- Episode 1: Machine Transcription of the Manuscripts

Authors: Donatella Firmani, Marco Maiorino, Paolo Merialdo, Elena Nieddu
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2018

In Codice Ratio is a research project to study tools and techniques for analyzing the contents of historical documents conserved in the Vatican Secret Archives (VSA). In this paper, we present our efforts to develop a system to support the transcription of medieval manuscripts. The goal is to provide paleographers with a tool that reduces the effort of transcribing large volumes, such as those stored in the VSA, by producing good transcriptions for significant portions of the manuscripts. We propose an original approach based on character segmentation. Our solution is able to deal with the dirty segmentation that inevitably occurs in handwritten documents. We use a convolutional neural network to recognize characters and language models to compose word transcriptions. Our approach requires minimal training effort, making the transcription process more scalable, as producing the training sets requires only a few pages and can be easily crowdsourced. We have conducted experiments on manuscripts from the Vatican Registers, an unreleased corpus containing the correspondence of the popes. With training data produced by 120 high school students, our system has been able to produce good transcriptions that paleographers can use as a solid basis to speed up the transcription process at a large scale.

Comment: Donatella Firmani, Marco Maiorino, Paolo Merialdo, and Elena Nieddu. 2018. Towards Knowledge Discovery from the Vatican Secret Archives. In Codice Ratio - Episode 1: Machine Transcription of the Manuscripts. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 263-27

arXiv.org e-Print Archive

Crossref

Archivio della Ricerca - Università di Roma 3

Archivio della ricerca- Università di Roma La Sapienza

CERTEM: Explaining and Debugging Black-box Entity Resolution Systems with CERTA

Authors: Divesh Srivastava, Donatella Firmani, Nick Koudas, Paolo Merialdo, Tommaso Teofili
Publication venue: place:New York
Publication date: 01/01/2022

Entity resolution (ER) aims at identifying record pairs that refer to the same real-world entity. Recent works have focused on deep learning (DL) techniques to solve this problem. While such works have brought tremendous enhancements in terms of effectiveness in solving the ER problem, understanding their matching predictions is still a challenge because of the intrinsic opaqueness of DL-based solutions. Interpreting and trusting the predictions made by ER systems is crucial for humans in order to employ such methods in decision-making pipelines. We demonstrate CERTEM, an explanation system for ER based on CERTA, a recently introduced explainability framework for ER that is able to provide both saliency explanations, which associate each attribute with a saliency score, and counterfactual explanations, which provide examples of values that can flip a prediction. In this demonstration we will showcase how CERTEM can be effectively employed to better understand and debug the behavior of state-of-the-art DL-based ER systems on data from publicly available ER benchmarks.

Archivio della ricerca- Università di Roma La Sapienza

Table2Vec: Neural Word and Entity Embeddings for Table Population and Retrieval

Authors: Wolfgang Lehner, Robert Wrembel, Julian Eberius, Anna Lisa Gentile, David Milne, Paolo Merialdo, Denilson Barbosa
Publication venue: Association for Computing Machinery (ACM)
Publication date: 31/05/2019

Tables contain valuable knowledge in a structured form. We employ neural language modeling approaches to embed tabular data into vector spaces. Specifically, we consider different table elements, such as captions, column headings, and cells, for training word and entity embeddings. These embeddings are then utilized in three table-related tasks (row population, column population, and table retrieval) by incorporating them into existing retrieval models as additional semantic similarity signals. Evaluation results show that table embeddings can significantly improve upon the performance of state-of-the-art baselines.

Comment: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '19), 201

arXiv.org e-Print Archive

Crossref

Experiences and Lessons Learned from the SIGMOD Entity Resolution Programming Contests

Authors: Sonia Bergamaschi, Xu Chu, Andrea De Angelis, Donatella Firmani, Peng Li, Maurizio Mazzei, Paolo Merialdo, Federico Piai, Giovanni Simonini, Renzhi Wu, Luca Zecchini
Publication date: 01/01/2023

We report our experience in running three editions (2020, 2021, 2022) of the SIGMOD programming contest, a well-known event for students to engage in solving exciting data management problems. During this period we had the opportunity to introduce participants to the entity resolution task, which is of paramount importance in the data integration community. We aim to share the executive decisions made by the people co-authoring this report, and the lessons learned.

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

Design and Maintenance of Web-Based Information Systems

Author: Paolo Merialdo
Publication date: 01/01/1998

Doctoral research programme in medical informatics, 10th cycle. Consiglio Nazionale delle Ricerche - Biblioteca Centrale - P.le Aldo Moro, 7 Rome; Biblioteca Nazionale Centrale - P.za Cavalleggeri, 1, Florence / CNR - Consiglio Nazionale delle Ricerche. SIGLE; IT; Ital

CiteSeerX

OpenGrey Repository

The Startup Ecosystem: a Quick Tour

Author: Paolo Merialdo
Publication venue: Curran Associates Inc.
Publication date: 01/01/2015

Archivio della Ricerca - Università di Roma 3

Design and development of data-intensive web sites: The araneus approach

Authors: Giansalvatore Mecca, Paolo Atzeni, Paolo Merialdo
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2003

Archivio della Ricerca - Università di Roma 3

Efficient Techniques for Effective Wrapper Induction

Authors: Valter Crescenzi, Paolo Merialdo
Publication date: 01/01/2006

Archivio della Ricerca - Università di Roma 3