201,496 research outputs found

    Column-specific Context Extraction for Web Tables

    Relational Web tables have become an important resource for applications such as factual search and entity augmentation. A major challenge for the automatic identification of relevant tables on the Web is that many of these tables have missing or non-informative column labels. Research has focused largely on recovering the meaning of columns by inferring class labels from the instances using external knowledge bases. The table context, which often contains additional information on the table's content, is frequently considered an indicator of the general content of a table, but not a source of column-specific details. In this paper, we propose a novel approach to identify and extract column-specific information from the context of Web tables. In our extraction framework, we consider different techniques to extract both directly and indirectly related phrases. We perform a number of experiments on Web tables extracted from Wikipedia. The results show that column-specific information extracted using our simple heuristics significantly boosts precision and recall for table and column search.
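The core idea of matching context phrases to individual columns can be illustrated with a toy heuristic. This is a minimal sketch under our own assumptions (token overlap as the relatedness signal; all names and data invented), not the paper's actual extraction framework:

```python
# Hypothetical sketch: assign context sentences to the table column whose
# cells (and label) share the most tokens with the sentence.

def tokenize(text):
    return {t.lower().strip(".,()") for t in text.split()}

def column_context(context_sentences, columns):
    """columns: dict label -> list of cell values.
    Returns dict label -> context sentences assigned to that column."""
    assigned = {label: [] for label in columns}
    for sentence in context_sentences:
        s_tokens = tokenize(sentence)
        best, best_overlap = None, 0
        for label, cells in columns.items():
            overlap = len(s_tokens & tokenize(" ".join(cells + [label])))
            if overlap > best_overlap:
                best, best_overlap = label, overlap
        if best is not None:
            assigned[best].append(sentence)
    return assigned

columns = {"Country": ["Germany", "France"],
           "Population": ["83 million", "67 million"]}
context = ["Population figures are in million as of 2020.",
           "Each country is listed with its capital."]
print(column_context(context, columns))
```

A real system would use richer signals than raw token overlap, but the sketch shows why context sentences can carry column-specific rather than table-general information.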

    Structured digital tables on the Semantic Web: toward a structured digital literature

    In parallel to the growth in bioscience databases, biomedical publications have increased exponentially in the past decade. However, the extraction of high-quality information from the corpus of scientific literature has been hampered by the lack of machine-interpretable content, despite text-mining advances. To address this, we propose creating a structured digital table as part of an overall effort in developing machine-readable, structured digital literature. In particular, we envision transforming publication tables into standardized triples using Semantic Web approaches. We identify three canonical types of tables (conveying information about properties, networks, and concept hierarchies) and show how more complex tables can be built from these basic types. We envision that authors would create tables initially using the structured triples for canonical types and then have them visually rendered for publication, and we present examples for converting representative tables into triples. Finally, we discuss how 'stub' versions of structured digital tables could be a useful bridge for connecting together the literature with databases, allowing the former to more precisely document the latter.
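The property-table-to-triples transformation can be sketched in a few lines. The namespace, predicate names, and example data below are invented for illustration; the paper's actual vocabulary and rendering pipeline are not reproduced here:

```python
# Hypothetical sketch: each row of a "property" table becomes
# subject-predicate-object triples in N-Triples-like form. The base URI
# is a placeholder, not a real vocabulary.

BASE = "http://example.org/"

def table_to_triples(header, rows):
    """header[0] names the subject column; remaining headers become
    predicates, and the corresponding cells become literal objects."""
    triples = []
    for row in rows:
        subject = f"<{BASE}{row[0].replace(' ', '_')}>"
        for prop, value in zip(header[1:], row[1:]):
            predicate = f"<{BASE}{prop.replace(' ', '_')}>"
            triples.append(f'{subject} {predicate} "{value}" .')
    return triples

header = ["Protein", "Molecular weight", "Organism"]
rows = [["p53", "43.7 kDa", "Homo sapiens"]]
for t in table_to_triples(header, rows):
    print(t)
```

In practice one would use an RDF library and a controlled vocabulary rather than string formatting, but the row-to-triples mapping is the essential step.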

    Information extraction from Webpages based on DOM distances

    Retrieving information from the Internet is a difficult task, as demonstrated by the lack of real-time tools able to extract information from webpages. The main cause is that most webpages on the Internet are implemented in plain (X)HTML, a language that lacks structured semantic information. For this reason, much of the effort in this area has been directed at techniques for URL extraction, a field that has produced good results, as implemented by modern search engines. In contrast, extracting information from a single webpage has produced poor results or very limited tools. In this work we define a novel technique for information extraction from single webpages or collections of interconnected webpages. The technique retrieves information based on DOM distances, which allows it to work with any webpage and thus to retrieve information online. Our implementation and experiments demonstrate the usefulness of the technique.
    Castillo, C.; Valero Llinares, H.; Guadalupe Ramos, J.; Silva Galiana, J.F. (2012). Information extraction from Webpages based on DOM distances. In: Computational Linguistics and Intelligent Text Processing. Springer, pp. 181-193. doi:10.1007/978-3-642-28601-8_16
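A natural reading of "DOM distance" is the number of tree edges between two nodes through their lowest common ancestor. The following toy model (our own assumption, not the authors' implementation) shows that measure on a hand-built DOM tree:

```python
# Toy DOM tree with parent pointers; distance between two nodes is the
# edge count on the path through their lowest common ancestor.

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def ancestors(node):
    path = []
    while node is not None:      # node itself first, root last
        path.append(node)
        node = node.parent
    return path

def dom_distance(a, b):
    depth_of = {id(n): d for d, n in enumerate(ancestors(a))}
    for depth_b, n in enumerate(ancestors(b)):
        if id(n) in depth_of:    # first shared ancestor = lowest one
            return depth_of[id(n)] + depth_b
    raise ValueError("nodes are not in the same tree")

html_root = Node("html")
body = Node("body", html_root)
div1 = Node("div", body)
p1 = Node("p", div1)
div2 = Node("div", body)
print(dom_distance(p1, div2))    # p -> div -> body -> div
```

A real implementation would build the tree from parsed HTML, but the distance computation itself stays this simple.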

    A hybrid filtering approach for question answering

    We describe a question answering system that took part in the bilingual CLEF QA task (German-English), where German is the source language and English the target language. We used the Babel Fish online translation system to translate the German questions into English. The system is targeted at Factoid and Definition questions. Our focus in designing the current system is on testing our online methods, which are based on information extraction and linguistic filtering. Our system does not make use of precompiled tables or gazetteers but uses Web snippets to rerank candidate answers extracted from the document collections. WordNet is also used as a lexical resource in the system. Our question answering system consists of the following core components: Question Analysis, Passage Retrieval, Sentence Analysis and Answer Selection. These components employ various Natural Language Processing (NLP) and Machine Learning (ML) tools, a set of heuristics and different lexical resources. Seamless integration of the various components is one of the major challenges of QA system development. In order to facilitate our development process, we used the Unstructured Information Management Architecture (UIMA) as our underlying framework.
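The snippet-based reranking step can be sketched as follows. This is a deliberately simplified stand-in (candidates, snippets, and the occurrence-count score are invented), not the system's actual scoring function:

```python
# Hypothetical sketch: candidate answers extracted from the document
# collection are rescored by how many retrieved Web snippets mention them.

def rerank(candidates, snippets):
    def score(candidate):
        c = candidate.lower()
        return sum(c in s.lower() for s in snippets)
    return sorted(candidates, key=score, reverse=True)

snippets = ["Mount Everest is the highest mountain on Earth.",
            "At 8,849 m, Mount Everest tops K2."]
candidates = ["K2", "Mount Everest"]
print(rerank(candidates, snippets))
```

The appeal of this kind of redundancy signal is that it needs no precompiled tables or gazetteers, matching the design goal stated in the abstract.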

    Towards a Large Corpus of Richly Annotated Web Tables for Knowledge Base Population

    Ell B, Hakimov S, Braukmann P, et al. Towards a Large Corpus of Richly Annotated Web Tables for Knowledge Base Population. Presented at the Fifth International Workshop on Linked Data for Information Extraction (LD4IE) at ISWC 2017, Vienna. Web Table Understanding in the context of Knowledge Base Population and the Semantic Web is the task of i) linking the content of tables retrieved from the Web to an RDF knowledge base, ii) building hypotheses about the tables' structures and contents, iii) extracting novel information from these tables, and iv) adding this new information to a knowledge base. Knowledge Base Population has gained increasing interest in recent years due to the growing demand for large knowledge graphs, which have become relevant for Artificial Intelligence applications such as Question Answering and Semantic Search. In this paper we describe a set of basic tasks which are relevant for Web Table Understanding in this context. These tasks incrementally enrich a table with hypotheses about its content. In doing so, when multiple interpretations are possible, selecting one interpretation and thus deciding against the others is avoided as much as possible. By postponing these decisions, we enable table understanding approaches to decide for themselves, thus increasing the usability of the annotated table data. We present statistics from analyzing and annotating 1,000,000 tables from the Web Table Corpus 2015 and make this dataset as well as our code available online.
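The "keep all hypotheses rather than commit" idea can be made concrete with a small sketch. The type gazetteer and support threshold below are invented for illustration; the paper's annotation scheme is richer:

```python
# Hypothetical sketch: a column is annotated with *every* candidate type
# whose known instances cover enough of its cells, instead of forcing a
# single interpretation. Downstream approaches decide for themselves.

TYPE_INSTANCES = {                      # tiny made-up gazetteer
    "City":    {"Vienna", "Paris", "Berlin"},
    "Capital": {"Vienna", "Paris"},
}

def column_hypotheses(cells, min_support=0.5):
    hypotheses = {}
    for type_name, instances in TYPE_INSTANCES.items():
        support = sum(c in instances for c in cells) / len(cells)
        if support >= min_support:
            hypotheses[type_name] = support  # keep every viable reading
    return hypotheses

print(column_hypotheses(["Vienna", "Paris", "Berlin", "Graz"]))
```

Here both "City" and "Capital" survive as annotations with their support scores, which is exactly the postponed-decision behaviour the abstract argues for.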

    Interactive Tuples Extraction from Semi-Structured Data

    This paper studies, from a machine learning viewpoint, the problem of extracting tuples of a target n-ary relation from tree-structured data such as XML or XHTML documents. Our system can extract, without any post-processing, tuples for all data structures including nested, rotated and cross tables. The wrapper induction algorithm we propose is based on two main ideas. It is incremental: partial tuples are extracted by increasing length. It is based on a representation-enrichment procedure: partial tuples of length i are encoded with the knowledge of extracted tuples of length i − 1. The algorithm is then set in a friendly interactive wrapper induction system for Web documents. We evaluate our system on several information extraction tasks over corporate Web sites. It achieves state-of-the-art results on simple data structures and succeeds on complex data structures where previous approaches fail. Experiments also show that our interactive framework significantly reduces the number of user interactions needed to build a wrapper.
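The incremental idea, where length-i tuples are built with knowledge of the extracted length-(i − 1) tuples, can be shown on a nested structure. The data model below is our own invention, far simpler than the tree encodings the paper uses:

```python
# Hypothetical sketch: first extract length-1 partial tuples, then extend
# each one to length 2 using its already-extracted prefix. Conditioning on
# the prefix is what lets nested structures be flattened correctly.

data = {"France": ["Paris", "Lyon"], "Austria": ["Vienna"]}  # nested source

def extract_pairs(nested):
    partial = [(country,) for country in nested]   # length-1 partial tuples
    full = []
    for (country,) in partial:                     # extend each to length 2
        for city in nested[country]:
            full.append((country, city))
    return full

print(extract_pairs(data))
```

The nesting means no single flat pattern matches all (country, city) pairs, which is why extending partial tuples beats extracting both fields independently.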

    Entity Discovery and Annotation in Tables

    The Web is rich in tables (e.g., HTML tables, spreadsheets, Google Fusion Tables) that host a considerable wealth of high-quality relational data. Unlike unstructured texts, tables usually favour the automatic extraction of data because of their regular structure and properties. The data extraction is usually complemented by the annotation of the table, which determines its semantics by identifying a type for each column, the relations between columns, if any, and the entities that occur in each cell. In this paper, we focus on the problem of discovering and annotating entities in tables. More specifically, we describe an algorithm that identifies the rows of a table that contain information on entities of specific types (e.g., restaurant, museum, theatre) derived from an ontology and determines the cells in which the names of those entities occur. We implemented this algorithm while developing a faceted browser over a repository of RDF data on points of interest of cities that we extracted from Google Fusion Tables. We claim that our algorithm complements the existing approaches, which annotate entities in a table based on a pre-compiled reference catalogue that lists the types of a finite set of entities; as a result, they are unable to discover and annotate entities that do not belong to the reference catalogue. Instead, we train our algorithm to look for information on previously unseen entities on the Web so as to annotate them with the correct type.
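A crude version of row-level entity typing without a reference catalogue might look like this. The cue words here stand in for whatever evidence a trained model would gather from the Web; everything below is an invented illustration:

```python
# Hypothetical sketch: a row is taken to describe an entity of a given type
# when its cells contain cue terms associated with that type, so previously
# unseen entity names can still be typed. The first cell is assumed to hold
# the entity name (an assumption of this sketch, not of the paper).

TYPE_CUES = {
    "restaurant": {"cuisine", "menu", "michelin"},
    "museum":     {"exhibition", "collection", "gallery"},
}

def annotate_rows(rows):
    annotations = []
    for row in rows:
        cell_tokens = {w.lower() for cell in row for w in cell.split()}
        matched = [t for t, cues in TYPE_CUES.items() if cues & cell_tokens]
        annotations.append((row[0], matched))
    return annotations

rows = [["Chez Nous", "French cuisine", "Michelin star"],
        ["Louvre", "permanent collection", "Paris"]]
print(annotate_rows(rows))
```

Note that "Chez Nous" need not appear in any catalogue: the type comes from the surrounding cells, which is the complementarity the abstract claims over catalogue-based annotation.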

    Ontology extraction from tables on the Web

    Previous work on information extraction from tables makes use of prior knowledge such as a cognition model of tables or lexical knowledge bases for specific domains. However, we often need to interpret the structures of individual tables differently and to handle lexicons in various domains in order to more fully utilize the broad range of tables available on the Web. The method proposed in this paper uses relations represented by structures to extract an ontology from a table. Once the interpretations of table structures are given by humans, the table structures are automatically generalized to extract relations from the whole table. We define a formal representation of generalized table structure based on the adjacency of cells and iterative structures. A comparison with a method proposed in a previous work shows that our method is well suited to extracting the various relations needed for descriptions in RDF/OWL.
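The generalize-a-human-interpretation step admits a very small sketch. The column-index interpretation format and the predicate name are our own simplifications, not the paper's formal representation of cell adjacency:

```python
# Hypothetical sketch: a human interprets one example row as a relation
# (subject column, predicate, object column); the same structural pattern
# is then applied mechanically to every row of the table.

def generalize(table, interpretation):
    """interpretation: (subject_col, predicate, object_col),
    supplied by a human for one row and generalized to the whole table."""
    s_col, predicate, o_col = interpretation
    return [(row[s_col], predicate, row[o_col]) for row in table]

table = [["Tokyo", "Japan"], ["Vienna", "Austria"]]
print(generalize(table, (0, "capitalOf", 1)))
```

The paper's contribution lies in generalizing far richer structures (iterative and adjacency patterns) than a fixed column pair, but the human-seeds-then-machine-generalizes workflow is the same.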