Extracting N-ary Facts from Wikipedia Table Clusters
Tables in Wikipedia articles contain a wealth of knowledge that would be useful for many applications if it were structured in a more coherent, queryable form. An important problem is that many such tables contain the same type of knowledge but have different layouts and/or schemata. Moreover, some tables refer to entities that we can link to Knowledge Bases (KBs), while others do not. Finally, some tables express entity-attribute relations, while others contain more complex n-ary relations. We propose a novel knowledge extraction technique that tackles these problems. Our method first transforms and clusters similar tables into fewer unified ones to overcome the problem of table diversity. Then, the unified tables are linked to the KB so that knowledge about popular entities propagates to the unpopular ones. Finally, our method applies a technique that relies on functional dependencies to judiciously interpret the table and extract n-ary relations. Our experiments over 1.5M Wikipedia tables show that our clustering can group many semantically similar tables. This leads to the extraction of many novel n-ary relations.
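The abstract does not say how functional dependencies are used to interpret a table, so the following is only a minimal sketch of the underlying idea: checking which column sets determine a target column in a small table. The function names, the `max_size` limit, and the sample rows are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def holds(rows, lhs, rhs):
    """Return True if the columns in lhs functionally determine column rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # The first value seen for a key must match every later value.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def minimal_determinants(rows, columns, rhs, max_size=2):
    """Return the smallest column sets (up to max_size) that determine rhs."""
    others = [c for c in columns if c != rhs]
    for size in range(1, max_size + 1):
        found = [lhs for lhs in combinations(others, size) if holds(rows, lhs, rhs)]
        if found:
            return found
    return []


# Hypothetical unified table; the values are made up for illustration.
columns = ["athlete", "year", "medal"]
rows = [
    {"athlete": "A", "year": 2008, "medal": "gold"},
    {"athlete": "A", "year": 2012, "medal": "silver"},
    {"athlete": "B", "year": 2008, "medal": "silver"},
    {"athlete": "B", "year": 2012, "medal": "gold"},
]

determinants = minimal_determinants(rows, columns, "medal")
print(determinants)
```

In this toy table no single column determines `medal`, but the pair (`athlete`, `year`) does, which is the kind of signal that would suggest a ternary rather than a binary relation.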
A Search Engine for Natural Language Applications
Many modern natural-language-processing applications use search engines to locate large numbers of Web documents or to compute statistics over the Web corpus. Yet Web search engines are designed and optimized for simple human queries; they are not well suited to support such applications. As a result, these applications are forced to issue millions of successive queries, resulting in unnecessary search engine load and in slow applications with limited scalability. In response, this paper introduces the Bindings Engine (BE), which supports queries containing typed variables and string-processing functions. For example, in response to the query “powerful 〈noun〉”, BE will return all the nouns in its index that immediately follow the word “powerful”, sorted by frequency. In response to the query “Cities such as ProperNoun(Head(〈NounPhrase〉))”, BE will return a list of proper nouns likely to be city names. BE’s novel neighborhood index enables it to do so with O(k) random disk seeks and O(k) serial disk reads, where k is the number of non-variable terms in its query. As a result, BE can yield several orders of magnitude speedup for large-scale language-processing applications. The main cost is a modest increase in space to store the index. We report on experiments validating these claims, and analyze how BE’s space-time tradeoff scales with the size of its index and the number of variable types. Finally, we describe how a BE-based application extracts thousands of facts from the Web at interactive speeds in response to simple user queries.
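The “powerful 〈noun〉” query can be illustrated with a small in-memory sketch. It assumes that a neighborhood index keeps each term's typed right-hand neighbor next to the posting, a detail the abstract does not spell out, and it uses hand-tagged tokens in place of a real tagger and an on-disk index. The names `build_index` and `bind_right` are hypothetical.

```python
from collections import Counter, defaultdict


def build_index(tagged_docs):
    """Map each term to postings that also store its right-hand neighbor and type."""
    index = defaultdict(list)
    for doc_id, tokens in enumerate(tagged_docs):
        for pos, (word, _tag) in enumerate(tokens):
            # Assumption: the neighbor (word, tag) is stored with the posting,
            # so a query never has to fetch the document itself.
            right = tokens[pos + 1] if pos + 1 < len(tokens) else None
            index[word.lower()].append((doc_id, pos, right))
    return index


def bind_right(index, term, wanted_type):
    """Answer 'term <wanted_type>': typed right neighbors of term, by frequency."""
    counts = Counter()
    for _doc_id, _pos, right in index.get(term.lower(), []):
        if right is not None and right[1] == wanted_type:
            counts[right[0].lower()] += 1
    return counts.most_common()


# Hand-tagged toy corpus; the tags and sentences are made up for illustration.
docs = [
    [("a", "det"), ("powerful", "adj"), ("engine", "noun")],
    [("powerful", "adj"), ("engine", "noun"), ("runs", "verb")],
    [("powerful", "adj"), ("ideas", "noun")],
    [("powerful", "adj"), ("enough", "adv")],
]

index = build_index(docs)
bindings = bind_right(index, "powerful", "noun")
print(bindings)
```

The query reads only the posting list for the single non-variable term, which mirrors the claim that cost grows with the number of non-variable terms rather than with the number of matching documents.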