4,837 research outputs found
Information Extraction on Para-Relational Data.
Para-relational data (such as spreadsheets and diagrams) refers to a type of nearly
relational data that shares the important qualities of relational data but does not
present itself in a relational format. Para-relational data often conveys highly valuable
information and is widely used in many different areas. If we can convert para-relational
data into the relational format, many existing tools can be leveraged for a
variety of interesting applications, such as data analysis with relational query systems
and data integration applications.
This dissertation aims to convert para-relational data into a high-quality relational
form with little user assistance. We have developed four standalone systems, each
addressing a specific type of para-relational data. Senbazuru is a prototype spreadsheet
database management system that extracts relational information from a large
number of spreadsheets. Anthias is an extension of the Senbazuru system to convert
a broader range of spreadsheets into a relational format. Lyretail is an extraction
system to detect long-tail dictionary entities on webpages. Finally, DiagramFlyer is
a web-based search system that obtains a large number of diagrams automatically
extracted from web-crawled PDFs. Together, these four systems demonstrate that
converting para-relational data into the relational format is possible today, and also
suggest directions for future systems.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/120853/1/chenzhe_1.pd
Improving Hypernymy Extraction with Distributional Semantic Classes
In this paper, we show how distributionally-induced semantic classes can be
helpful for extracting hypernyms. We present methods for inducing sense-aware
semantic classes using distributional semantics and using these induced
semantic classes for filtering noisy hypernymy relations. Denoising of
hypernyms is performed by labeling each semantic class with its hypernyms. On
the one hand, this allows us to filter out wrong extractions using the global
structure of distributionally similar senses. On the other hand, we infer
missing hypernyms via label propagation to cluster terms. We conduct a
large-scale crowdsourcing study showing that processing of automatically
extracted hypernyms using our approach improves the quality of the hypernymy
extraction in terms of both precision and recall. Furthermore, we show the
utility of our method in the domain taxonomy induction task, achieving the
state-of-the-art results on a SemEval'16 task on taxonomy induction.Comment: In Proceedings of the 11th Conference on Language Resources and
Evaluation (LREC 2018). Miyazaki, Japa
Adaptive text mining: Inferring structure from sequences
Text mining is about inferring structure from sequences representing natural language text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Although hand-crafted heuristics are a common practical approach for extracting information from text, a general, and generalizable, approach requires adaptive techniques. This paper studies the way in which the adaptive techniques used in text compression can be applied to text mining. It develops several examples: extraction of hierarchical phrase structures from text, identification of keyphrases in documents, locating proper names and quantities of interest in a piece of text, text categorization, word segmentation, acronym extraction, and structure recognition. We conclude that compression forms a sound unifying principle that allows many text mining problems to be tacked adaptively
- …