INRIASAC: Simple Hypernym Extraction Methods
Given a set of terms from a given domain, how can we structure them into a
taxonomy without manual intervention? This is Task 17 of SemEval 2015. Here
we present our simple taxonomy structuring techniques which, despite their
simplicity, ranked first in this 2015 benchmark. We use large quantities of
text (English Wikipedia) and simple heuristics such as term overlap and
document and sentence co-occurrence to produce hypernym lists. We describe
these techniques and present an initial evaluation of results. Comment: SemEval 2015, Jun 2015, Denver, United States.
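As an illustration of the term-overlap heuristic mentioned above, the following Python sketch proposes hypernym candidates whenever a shorter term in the vocabulary is a suffix of a longer one; it is a minimal reconstruction under that assumption, not the INRIASAC system itself.

# A minimal sketch of a term-overlap heuristic (not the authors' code): a
# multiword term such as "convolutional neural network" takes its shorter
# head-preserving suffixes ("neural network", "network") as hypernym candidates.

def overlap_hypernyms(terms):
    """Map each term to candidate hypernyms found by suffix overlap."""
    term_set = {t.lower() for t in terms}
    candidates = {}
    for term in term_set:
        tokens = term.split()
        # progressively strip leading modifiers: "a b c" -> "b c" -> "c"
        for i in range(1, len(tokens)):
            suffix = " ".join(tokens[i:])
            if suffix in term_set:
                candidates.setdefault(term, set()).add(suffix)
    return candidates

if __name__ == "__main__":
    vocab = ["network", "neural network", "convolutional neural network"]
    print(overlap_hypernyms(vocab))
    # e.g. {'neural network': {'network'}, 'convolutional neural network': {'neural network', 'network'}}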
Determining the Characteristic Vocabulary for a Specialized Dictionary using Word2vec and a Directed Crawler
Specialized dictionaries are used to understand concepts in specific domains,
especially where those concepts are not part of the general vocabulary or have
meanings that differ from ordinary language. The first step in creating
a specialized dictionary involves detecting the characteristic vocabulary of
the domain in question. Classical methods for detecting this vocabulary involve
gathering a domain corpus, calculating statistics on the terms found there, and
then comparing these statistics to a background or general language corpus.
Terms which are found significantly more often in the specialized corpus than
in the background corpus are candidates for the characteristic vocabulary of
the domain. Here we present two tools, a directed crawler and a distributional
semantics package, that can be used together, circumventing the need for a
background corpus. Both tools are available on the web.
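A minimal sketch of the distributional step, assuming gensim's Word2Vec and a tokenized domain corpus gathered by a crawler; the function and parameter choices below are illustrative, not the released tools' interface.

# Train word2vec on the crawled domain corpus, then expand a small set of
# seed terms with their nearest neighbours, so that no background corpus is
# needed. Assumes `domain_sentences` is a list of tokenized sentences.

from gensim.models import Word2Vec

def expand_seed_vocabulary(domain_sentences, seed_terms, topn=10):
    """Return candidate domain vocabulary distributionally close to the seeds."""
    model = Word2Vec(domain_sentences, vector_size=100,
                     window=5, min_count=2, workers=4)
    candidates = set()
    for seed in seed_terms:
        if seed in model.wv:                     # skip seeds unseen in the corpus
            for word, _score in model.wv.most_similar(seed, topn=topn):
                candidates.add(word)
    return candidates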
On the Place of Text Data in Lifelogs, and Text Analysis via Semantic Facets
Current research in lifelog data has not paid enough attention to analysis of
cognitive activities in comparison to physical activities. We argue that as we
look into the future, wearable devices are going to be cheaper and more
prevalent and textual data will play a more significant role. Data captured by
lifelogging devices will increasingly include speech and text, potentially
useful in analysis of intellectual activities. By analyzing what a person hears,
reads, and sees, we should be able to measure the extent of cognitive activity
devoted to a certain topic or subject by a learner. Text-based lifelog records
can benefit from semantic analysis tools developed for natural language
processing. We show how semantic analysis of such text data can be achieved
through the use of taxonomic subject facets and how these facets might be
useful in quantifying cognitive activity devoted to various topics in a
person's day. We are currently developing a method to automatically create
taxonomic topic vocabularies that can be applied to this detection of
intellectual activity.
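A minimal sketch of how taxonomic subject facets might quantify the text a learner encounters in a day; the facet vocabulary and the term-counting approach are illustrative assumptions, not the method under development.

# Given a mapping from facet labels to characteristic terms, count how much
# of a day's captured text (transcribed speech, text read on screen, etc.)
# falls under each facet.

from collections import Counter

def facet_profile(day_text, facet_vocab):
    """Count term hits per taxonomic facet in one day of lifelog text."""
    tokens = day_text.lower().split()
    counts = Counter()
    for facet, terms in facet_vocab.items():
        counts[facet] = sum(tokens.count(t.lower()) for t in terms)
    return counts

if __name__ == "__main__":
    vocab = {"machine learning": ["classifier", "training", "model"],
             "cooking": ["recipe", "oven", "simmer"]}
    print(facet_profile("the model needs more training data for the classifier", vocab))
    # Counter({'machine learning': 3, 'cooking': 0})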
Transforming Wikipedia into an Ontology-based Information Retrieval Search Engine for Local Experts using a Third-Party Taxonomy
Wikipedia is widely used for finding general information about a wide variety
of topics. Its vocation is not to provide local information. For example, it
provides plot, cast, and production information about a given movie, but not
the showing times at your local movie theatre. Here we describe how we can connect
local information to Wikipedia, without altering its content. The case study we
present involves finding local scientific experts. Using a third-party
taxonomy, independent from Wikipedia's category hierarchy, we index information
connected to our local experts, present in their activity reports, and we
re-index Wikipedia content using the same taxonomy. The connections between
Wikipedia pages and local expert reports are stored in a relational database,
accessible through a public SPARQL endpoint. A Wikipedia gadget (or plugin),
activated by the interested user, accesses the endpoint each time a Wikipedia page
is accessed. An additional tab on the Wikipedia page allows the user to open up
a list of teams of local experts associated with the subject matter in the
Wikipedia page. The technique, though presented here as a way to identify local
experts, is generic, in that any third-party taxonomy can be used in this way to
connect Wikipedia to any non-Wikipedia data source. Comment: Joint Second Workshop on Language and Ontology & Terminology and
Knowledge Structures (LangOnto2 + TermiKS) LO2TKS, May 2016, Portoroz,
Slovenia. 201
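A minimal sketch of how a client could query a public SPARQL endpoint of the kind described above, using the SPARQLWrapper library; the endpoint URL, prefix, and predicate names are hypothetical placeholders, not the project's actual schema.

# Look up the expert teams linked to a Wikipedia page title. For a real
# deployment the query should be parameterized safely rather than built
# with string substitution.

from SPARQLWrapper import SPARQLWrapper, JSON

def local_experts_for_page(page_title, endpoint="https://example.org/sparql"):
    """Return teams of local experts linked to a given Wikipedia page."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery("""
        PREFIX ex: <http://example.org/schema#>
        SELECT ?team WHERE {
            ?link ex:wikipediaPage "%s" ;
                  ex:expertTeam    ?team .
        }
    """ % page_title)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["team"]["value"] for b in results["results"]["bindings"]]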
Mining social media to create personalized recommendations for tourist visits
Photo-sharing platform users often annotate their trip photos with landmark names. These annotations can be aggregated in order to recommend lists of popular visitor attractions similar to those found in classical tourist guides. However, individual tourist preferences can vary significantly, so good recommendations should be tailored to individual tastes. Here we pose this visit personalization as a collaborative filtering problem. We mine the record of visited landmarks exposed in online user data to build a user-user similarity matrix. When a user wants to visit a new destination, a list of potentially interesting visitor attractions is produced based on the experience of like-minded users who already visited that destination. We compare our recommender to a baseline which simulates classical tourist guides on a large sample of Flickr users.
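A minimal sketch of the user-user collaborative-filtering idea, using Jaccard similarity over sets of visited landmarks mined from photo annotations; the data layout and scoring below are simplified assumptions, not the paper's exact pipeline.

# visits: dict mapping each user to the set of landmarks they have visited.
# Landmarks at the new destination are ranked by votes from similar users,
# weighted by how much their visit history overlaps with the target user's.

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(target_user, visits, destination_landmarks, k=10):
    """Rank unvisited landmarks at a destination for one user."""
    target = visits[target_user]
    scores = {}
    for user, seen in visits.items():
        if user == target_user:
            continue
        sim = jaccard(target, seen)
        for landmark in seen & (destination_landmarks - target):
            scores[landmark] = scores.get(landmark, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)[:k]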
The WWW as a Resource for Lexicography
Until the appearance of the Brown Corpus with its 1 million words in the 1960s and then, on a larger scale, the British National Corpus (the BNC) with its 100 million words, the lexicographer had to rely pretty much on his or her intuition (and amassed scraps of paper) to describe how words were used. Since the task of a lexicographer was to summarize the senses and usages of a word, that person was called upon to be very well read, with a good memory, and a great sensitivity to nuance. These qualities are still and always will be needed when one must condense the description of a great variety of phenomena into a fixed amount of space. But what if this last constraint, a fixed amount of space, disappears? One can then imagine fuller descriptions of how words are used. Taking this imaginative step, the FrameNet project has begun collecting new, fuller descriptions into a new type of lexicographical resource in which '[e]ach entry will in principle provide an exhaustive account of the semantic and syntactic combinatorial properties of one "lexical unit" (i.e., one word in one of its uses)' (Fillmore & Atkins 1998). This ambition to provide an exhaustive accounting of these properties implies access to a large number of examples of words in use. Though the Brown Corpus and the British National Corpus can provide a certain number of these, the World Wide Web (WWW) presents a vastly larger collection of examples of language use. The WWW is a new resource for lexicographers in their task of describing word patterns and their meanings. In this chapter, we look at the WWW as a corpus, and see how this will change how lexicographers model word meaning.
The Future of Linguistics and Lexicographers: Will there be Lexicographers in the year 3000?
Lexicography has been a respected tradition of scientific pursuit since the eighteenth century. Will the science of lexicography survive into the third millennium? I will discuss what present-day computational linguistics can offer the lexicographer in the lexicographic task. I will show that the data can be purified into clearer and clearer forms through approximate linguistics. I will then speculate about what aspects of the lexicographic task can be automated, what tasks remain incumbent on humans, and wonder whether a new vision of the lexicon will emerge.
Estimating the Number of Concepts
Most Natural Language Processing systems have been built around the idea of a word being something found between white spaces and punctuation. This is a normal and efficient way to proceed. Tasks such as Word Sense Disambiguation, Machine Translation, or even indexing rarely go beyond the single word. Language models used in NLP applications are built on the word, with a few multiword expressions taken as exceptions. But future NLP systems will necessarily venture out into the uncharted areas of multiword expressions. The dimensions and the topology of multiword concepts are unknown: Are there hundreds of thousands or tens of millions? Which words participate in multiword concepts and which do not? As the corpus grows, will their number keep on increasing? In this paper, I estimate the number of multiword concepts that are used in English, systematically probing the Web as our corpus.
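As a rough illustration of what counting multiword concepts could involve, the sketch below extracts candidate two-word concepts from a corpus sample using frequency and pointwise mutual information thresholds; the criterion and thresholds are assumptions for illustration, not the paper's Web-probing procedure.

# Keep bigrams that occur often enough and whose words co-occur far more
# than chance, as a crude proxy for multiword concepts in a token list.

import math
from collections import Counter

def candidate_multiword_concepts(tokens, min_count=5, min_pmi=3.0):
    """Return (bigram, count, PMI) triples above the given thresholds."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    candidates = []
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue
        pmi = math.log2((c / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
        if pmi >= min_pmi:
            candidates.append((w1 + " " + w2, c, round(pmi, 2)))
    return sorted(candidates, key=lambda x: -x[2])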
Taxonomy Induction using Hypernym Subsequences
We propose a novel, semi-supervised approach towards domain taxonomy
induction from an input vocabulary of seed terms. Unlike all previous
approaches, which typically extract direct hypernym edges for terms, our
approach utilizes a novel probabilistic framework to extract hypernym
subsequences. Taxonomy induction from extracted subsequences is cast as an
instance of the minimum-cost flow problem on a carefully designed directed
graph. Through experiments, we demonstrate that our approach outperforms
state-of-the-art taxonomy induction approaches across four languages.
Importantly, we also show that our approach is robust to the presence of noise
in the input vocabulary. To the best of our knowledge, no previous approach
has been empirically shown to be robust to noise in the input
vocabulary.
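A minimal sketch, not the paper's actual probabilistic framework, of casting taxonomy induction as a minimum-cost flow problem with networkx; the node demands, edge costs, and fallback attachment to a virtual root are illustrative assumptions.

# Assumes the candidate hypernym edges form an acyclic graph over the seed
# terms. Confident edges get cheaper costs, each term must absorb one unit
# of flow from a virtual root, and edges carrying flow form the taxonomy.

import networkx as nx

def induce_taxonomy(hypernym_edges, terms, root="<ROOT>"):
    """hypernym_edges: dict (hypernym, hyponym) -> probability in (0, 1]."""
    G = nx.DiGraph()
    G.add_node(root, demand=-len(terms))          # the root supplies all flow
    for t in terms:
        G.add_node(t, demand=1)                   # each term absorbs one unit
    for (hyper, hypo), p in hypernym_edges.items():
        cost = int(100 * (1.0 - p))               # confident edges are cheap
        G.add_edge(hyper, hypo, weight=cost, capacity=len(terms))
    for t in terms:                               # terms with no candidate
        if G.in_degree(t) == 0:                   # hypernym attach to the root
            G.add_edge(root, t, weight=0, capacity=len(terms))
    flow = nx.min_cost_flow(G)
    return [(u, v) for u, out in flow.items() if u != root
            for v, f in out.items() if f > 0]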