41 research outputs found

Extracting N-ary Facts from Wikipedia Table Clusters

Author: Cafarella Michael J
Hadley Wickham
Lehmberg Oliver
Pennington Jeffrey
Rosenberg Andrew
Wang J
Zhu Erkang
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 19/10/2020

Tables in Wikipedia articles contain a wealth of knowledge that would be useful for many applications if it were structured in a more coherent, queryable form. One important problem is that many such tables contain the same type of knowledge but have different layouts and/or schemata. Moreover, some tables refer to entities that we can link to Knowledge Bases (KBs), while others do not. Finally, some tables express entity-attribute relations, while others contain more complex n-ary relations. We propose a novel knowledge extraction technique that tackles these problems. Our method first transforms and clusters similar tables into fewer unified ones to overcome the problem of table diversity. Then, the unified tables are linked to the KB so that knowledge about popular entities propagates to the unpopular ones. Finally, our method applies a technique that relies on functional dependencies to judiciously interpret the table and extract n-ary relations. Our experiments over 1.5M Wikipedia tables show that our clustering can group many semantically similar tables. This leads to the extraction of many novel n-ary relations.
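The abstract mentions functional dependencies as the signal for deciding whether a table column belongs to an n-ary relation. The sketch below is only an illustration of that general idea, not the authors' method: it checks whether a set of columns determines a target column and searches for the smallest such sets. The function names, the table, and all values in it are hypothetical.

```python
from itertools import combinations


def holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in rows."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # Two rows that agree on lhs must also agree on rhs.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def minimal_determinants(rows, columns, rhs):
    """Return the smallest column sets (excluding rhs) that determine rhs."""
    others = [c for c in columns if c != rhs]
    for size in range(1, len(others) + 1):
        found = [lhs for lhs in combinations(others, size)
                 if holds(rows, lhs, rhs)]
        if found:
            return found
    return []


# Hypothetical table: invented values, used only to exercise the functions.
columns = ["season", "club", "player", "goals"]
rows = [
    {"season": "2018", "club": "A", "player": "P1", "goals": 10},
    {"season": "2019", "club": "A", "player": "P1", "goals": 12},
    {"season": "2018", "club": "B", "player": "P2", "goals": 10},
    {"season": "2019", "club": "B", "player": "P2", "goals": 7},
]

determinants = minimal_determinants(rows, columns, "goals")
print(determinants)
```

In this toy table no single column determines "goals", but the pair (season, player) does, which is the kind of evidence that would suggest a relation over more than two arguments rather than a plain entity-attribute pair.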

Crossref

VU Research Portal

CWI's Institutional Repository

A Search Engine for Natural Language Applications

Author: Michael J. Cafarella
Oren Etzioni

Many modern natural language processing applications utilize search engines to locate large numbers of Web documents or to compute statistics over the Web corpus. Yet Web search engines are designed and optimized for simple human queries; they are not well suited to support such applications. As a result, these applications are forced to issue millions of successive queries, resulting in unnecessary search engine load and in slow applications with limited scalability. In response, this paper introduces the Bindings Engine (BE), which supports queries containing typed variables and string-processing functions. For example, in response to the query “powerful 〈noun〉”, BE will return all the nouns in its index that immediately follow the word “powerful”, sorted by frequency. In response to the query “Cities such as ProperNoun(Head(〈NounPhrase〉))”, BE will return a list of proper nouns likely to be city names. BE’s novel neighborhood index enables it to do so with O(k) random disk seeks and O(k) serial disk reads, where k is the number of non-variable terms in its query. As a result, BE can yield several orders of magnitude speedup for large-scale language-processing applications. The main cost is a modest increase in space to store the index. We report on experiments validating these claims, and analyze how BE’s space-time tradeoff scales with the size of its index and the number of variable types. Finally, we describe how a BE-based application extracts thousands of facts from the Web at interactive speeds in response to simple user queries.
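The typed-variable query in the abstract (“powerful 〈noun〉”) can be mimicked on a toy scale. The sketch below is an in-memory illustration of the query semantics only; it is not the paper's disk-based neighborhood index and says nothing about its seek costs. It assumes the corpus is already part-of-speech tagged, and the function names, tag labels, and sentences are hypothetical.

```python
from collections import Counter, defaultdict


def build_follow_index(tagged_sentences):
    """Map each word to a count of the (next word, next tag) pairs after it."""
    index = defaultdict(Counter)
    for sentence in tagged_sentences:
        for (word, _tag), (nxt, nxt_tag) in zip(sentence, sentence[1:]):
            index[word.lower()][(nxt.lower(), nxt_tag)] += 1
    return index


def query_following(index, term, wanted_tag):
    """Return words with wanted_tag that immediately follow term, by frequency."""
    counts = Counter()
    for (word, tag), n in index.get(term.lower(), {}).items():
        if tag == wanted_tag:
            counts[word] += n
    return counts.most_common()


# Hypothetical pre-tagged corpus; a real system would tag Web text itself.
corpus = [
    [("a", "DET"), ("powerful", "ADJ"), ("engine", "NOUN")],
    [("powerful", "ADJ"), ("engine", "NOUN"), ("roared", "VERB")],
    [("a", "DET"), ("powerful", "ADJ"), ("storm", "NOUN")],
    [("powerful", "ADJ"), ("and", "CONJ"), ("fast", "ADJ")],
]

index = build_follow_index(corpus)
result = query_following(index, "powerful", "NOUN")
print(result)  # [('engine', 2), ('storm', 1)]
```

Storing the neighbors alongside each term is what lets the lookup avoid scanning documents at query time; the extra space this takes is the same kind of trade-off the abstract describes for the index.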

CiteSeerX