278 research outputs found
Towards a Universal Wordnet by Learning from Combined Evidence
Lexical databases are invaluable sources of knowledge about words and their meanings, with numerous applications in areas like NLP, IR, and AI. We propose a methodology for the automatic construction of a large-scale multilingual lexical database where words of many languages are hierarchically organized in terms of their meanings and their semantic relations to other words. This resource is bootstrapped from WordNet, a well-known English-language resource. Our approach extends WordNet with around 1.5 million meaning links for 800,000 words in over 200 languages, drawing on evidence extracted from a variety of resources including existing (monolingual) wordnets, (mostly bilingual) translation dictionaries, and parallel corpora. Graph-based scoring functions and statistical learning techniques are used to iteratively integrate this information and build an output graph. Experiments show that this wordnet has a high level of precision and coverage, and that it can be useful in applied tasks such as cross-lingual text classification.
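The abstract leaves the scoring step abstract; as a hedged sketch of graph-based evidence aggregation in this spirit (illustrative names, weights, and thresholds, not the authors' actual scoring functions), candidate word-to-synset links can be scored by a weighted sum over evidence sources, with confident links lending a small bonus to neighbours that share a synset:

```python
# Hypothetical sketch of iterative, graph-based evidence aggregation for
# word-to-synset links. Source names, weights, and the propagation bonus
# are illustrative only. Assumes source_weights includes a weight for the
# "graph" pseudo-source so that propagated evidence affects later rounds.
def score_links(evidence, source_weights, threshold=0.5, iterations=3):
    """evidence: {(word, synset): {source_name: raw_score}}"""
    accepted = {}
    for _ in range(iterations):
        # Weighted sum of independent evidence sources (wordnets,
        # bilingual dictionaries, parallel corpora, propagated bonuses).
        scores = {link: sum(source_weights.get(s, 0.0) * v
                            for s, v in by_source.items())
                  for link, by_source in evidence.items()}
        accepted = {link: sc for link, sc in scores.items() if sc >= threshold}
        # Confident links reinforce other candidates pointing to the same
        # synset (a simple form of graph propagation).
        for (word, synset), sc in accepted.items():
            for other in evidence:
                if other[1] == synset and other != (word, synset):
                    evidence[other]["graph"] = (
                        evidence[other].get("graph", 0.0) + 0.1 * sc)
    return accepted
```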
Improving Hypernymy Extraction with Distributional Semantic Classes
In this paper, we show how distributionally induced semantic classes can be helpful for extracting hypernyms. We present methods for inducing sense-aware semantic classes using distributional semantics, and for using these induced semantic classes to filter noisy hypernymy relations. Denoising of hypernyms is performed by labeling each semantic class with its hypernyms. On the one hand, this allows us to filter out wrong extractions using the global structure of distributionally similar senses. On the other hand, we infer missing hypernyms via label propagation to cluster terms. We conduct a large-scale crowdsourcing study showing that processing automatically extracted hypernyms with our approach improves the quality of hypernymy extraction in terms of both precision and recall. Furthermore, we show the utility of our method in the domain taxonomy induction task, achieving state-of-the-art results on the SemEval-2016 task on taxonomy induction.
Comment: In Proceedings of the 11th Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.
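As a hedged, simplified illustration of class-based denoising (not the authors' implementation; the threshold is invented), one can label each induced semantic class with the hypernyms common among its members, drop member hypernyms outside that label set, and propagate the class labels to members that lack them:

```python
# Hypothetical sketch of hypernym denoising via semantic classes.
# `clusters` maps a class id to its member terms; `hypernyms` maps a
# term to its noisy candidate hypernyms. min_share is illustrative.
from collections import Counter

def denoise(clusters, hypernyms, min_share=0.3):
    cleaned = {}
    for cid, members in clusters.items():
        # Label the class with hypernyms shared across enough members.
        counts = Counter(h for m in members for h in hypernyms.get(m, ()))
        labels = {h for h, c in counts.items() if c / len(members) >= min_share}
        for m in members:
            kept = set(hypernyms.get(m, ())) & labels  # filter wrong extractions
            cleaned[m] = kept | labels                 # propagate missing hypernyms
    return cleaned
```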
Ontology-Based MEDLINE Document Classification
An increasing and overwhelming amount of biomedical information is available in the research literature, mainly in the form of free text. Biologists need tools that automate their information search and deal with the high volume and ambiguity of free text. Ontologies can help automatic information processing by providing standard concepts and information about the relationships between concepts. The Medical Subject Headings (MeSH) ontology is already available and used by MEDLINE indexers to annotate the conceptual content of biomedical articles. This paper presents a domain-independent method that uses the MeSH ontology's inter-concept relationships to extend the existing MeSH-based representation of MEDLINE documents. The extension method is evaluated within a document triage task organized by the Genomics track of the 2005 Text REtrieval Conference (TREC). Our method for extending the representation of documents leads to an improvement of 17% over a non-extended baseline in terms of normalized utility, the metric defined for the task. The SVMlight software is used to classify documents.
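The exact expansion rules are not given in this abstract; as a hedged sketch of one plausible reading (hypothetical data structures, not the paper's method), a document's MeSH descriptors can be extended with concepts reachable through inter-concept relationships, such as ancestors in the MeSH hierarchy, before classifier training:

```python
# Hypothetical sketch: extend a document's MeSH-based representation with
# inter-concept relationships (here, ancestors in the MeSH hierarchy).
def extend_representation(doc_mesh_terms, parents, depth=1):
    extended = set(doc_mesh_terms)
    frontier = set(doc_mesh_terms)
    for _ in range(depth):
        # Walk one level up the hierarchy and add the reached concepts.
        frontier = {p for t in frontier for p in parents.get(t, ())}
        extended |= frontier  # ancestor concepts become extra features
    return extended
```

The extended concept set would then feed a bag-of-concepts feature vector for the SVM classifier.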
Usage-Driven Unified Model for User Profile and Data Source Profile Extraction
This thesis addresses a problem related to usage analysis in information retrieval systems. We exploit the history of search queries as a support for analysis to extract a profile model. The objective is to characterize the user and the data source that interact in a system, so as to allow different types of comparison (user-to-user, source-to-source, user-to-source). From the study we conducted of existing work on profile models, we concluded that the large majority of contributions are strongly tied to the applications within which they are proposed. As a result, the proposed profile models are not reusable and suffer from several weaknesses. For instance, these models do not consider the data source, they lack semantic mechanisms, and they do not deal with scalability (in terms of complexity). Therefore, we propose a generic model of user and data source profiles. The characteristics of this model are the following. First, it is generic, being able to represent both the user and the data source. Second, it makes it possible to construct profiles implicitly from histories of search queries. Third, it defines the profile as a set of topics of interest, each topic corresponding to a semantic cluster of keywords extracted by a specific clustering algorithm. Finally, the profile is represented according to the vector space model. The model is composed of several components organized in the form of a framework, in which we assessed the complexity of each component.
The main components of the framework are:
• a method for disambiguating keyword queries;
• a method for semantically representing search query logs in the form of a taxonomy;
• a clustering algorithm that allows fast and efficient identification of topics of interest as semantic clusters of keywords;
• a method for identifying user and data source profiles according to the generic model.
In particular, this framework makes it possible to perform various tasks related to the usage-based structuring of a distributed environment. As examples of application, the framework is used for the discovery of user communities and for the categorization of data sources. To validate the proposed framework, we conduct a series of experiments on real logs from the AOL search engine, which demonstrate the efficiency of the disambiguation method on short queries and show the relation between quality-based clustering and structure-based clustering.
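As a hedged illustration of the generic profile model described above (hypothetical data structures; this is not the thesis' own implementation), a profile can be built from query logs as a set of topics, each a semantic cluster of keywords, and then flattened into a vector-space representation:

```python
# Hypothetical sketch: build a profile (for a user or a data source) from
# query logs as topic clusters of keywords, then flatten it into a
# vector-space representation for profile-to-profile comparison.
from collections import Counter

def build_profile(queries, cluster_of):
    """queries: list of keyword lists; cluster_of: keyword -> topic id."""
    topics = {}
    for query in queries:
        for kw in query:
            # Group keyword occurrences by their semantic cluster (topic).
            topics.setdefault(cluster_of.get(kw, "misc"), Counter())[kw] += 1
    # Vector space model: keyword -> weight, normalised over the profile.
    total = sum(sum(c.values()) for c in topics.values()) or 1
    vector = {kw: n / total for c in topics.values() for kw, n in c.items()}
    return topics, vector
```

Because users and data sources share this representation, user-to-user, source-to-source, and user-to-source comparisons reduce to standard vector similarity (e.g., cosine) over the resulting vectors.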
Extraction automatique de paraphrases à partir de petits corpus
This paper presents a versatile system intended to acquire paraphrastic phrases from a small representative corpus. In order to decrease the time spent on elaborating resources for NLP systems (for example, for Information Extraction), we suggest using a knowledge acquisition module that helps extract new information despite linguistic variation. This knowledge is semi-automatically derived from the text collection, in interaction with a large semantic network.
Automatic Concept Extraction in Semantic Summarization Process
The Semantic Web offers a generic infrastructure for the interchange, integration, and creative reuse of structured data, which can help to cross some of the boundaries that Web 2.0 is facing. Currently, Web 2.0 offers poor query possibilities apart from searching by keywords or tags. There has been a great deal of interest in the development of semantic-based systems to facilitate knowledge representation, extraction, and content integration [1], [2]. A semantic-based approach to retrieving relevant material can be useful for addressing issues such as determining the type or the quality of the information suggested by a personalized environment. In this context, standard keyword search has very limited effectiveness. For example, it cannot filter by the type, level, or quality of information.
Potentially, one of the biggest application areas of content-based exploration might be personalized search frameworks (e.g., [3], [4]). Whereas search engines nowadays provide largely anonymous information, new frameworks might highlight or recommend web pages related to key concepts. We can consider semantic information representation an important step towards the efficient, large-scale manipulation and retrieval of information [5], [6], [7]. In the digital library community, a flat list of attribute/value pairs is often assumed to be available. In the Semantic Web community, annotations are often assumed to be instances of an ontology. Through ontologies, the system can express key entities and relationships describing resources in a formal, machine-processable representation. An ontology-based knowledge representation could be used for content analysis and object recognition, for reasoning processes, and for enabling user-friendly and intelligent multimedia content search and retrieval.
Text summarization has been an interesting and active research area since the 1960s. The underlying assumption is that a small portion, or several keywords, of the original long document can represent the whole informatively and/or indicatively. Reading or processing this shorter version of the document saves time and other resources [8]. This property is especially valuable, and urgently needed, at present due to the vast availability of information. A concept-based approach to representing dynamic and unstructured information can be useful for addressing issues such as determining the key concepts in, and summarizing, the information exchanged within a personalized environment.
In this context, a concept is represented by a Wikipedia article. With millions of articles and thousands of contributors, this online repository of knowledge is the largest and fastest-growing encyclopedia in existence.
The problem described above can then be divided into three steps (a minimal sketch follows the list):
• Mapping each term (or series of terms) to the most appropriate Wikipedia article (disambiguation).
• Assigning each identified article a score based on its importance in the given context.
• Extracting the n articles with the highest scores.
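A minimal, hypothetical sketch of this three-step pipeline follows; the candidate lookup table and the context scorer are placeholders, not the chapter's actual Wikipedia/WordNet machinery:

```python
# Hypothetical sketch of the three-step concept extraction pipeline:
# disambiguation, scoring, and top-n selection. `term_to_articles` and
# `context_score` stand in for a real Wikipedia-backed disambiguator
# and importance scorer.
def extract_concepts(terms, term_to_articles, context_score, n=10):
    concepts = {}
    for term in terms:
        candidates = term_to_articles.get(term, ())
        if not candidates:
            continue
        # Step 1: disambiguation - pick the article best fitting the context.
        article = max(candidates, key=lambda a: context_score(a, terms))
        # Step 2: score the chosen article by its contextual importance.
        concepts[article] = max(concepts.get(article, 0.0),
                                context_score(article, terms))
    # Step 3: keep the n highest-scoring articles.
    return sorted(concepts, key=concepts.get, reverse=True)[:n]
```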
Text summarization can be applied to many fields, from information retrieval to text mining processes and text display. It could also be very useful in a personalized search framework.
The chapter is organized as follows: the next Section introduces the personalized search framework as one of the possible application areas of automatic concept extraction systems. Section three describes the summarization process, providing details on the system architecture, methodology, and tools. Section four provides an overview of recently developed document summarization approaches. Section five summarizes a number of real-world applications that might benefit from word sense disambiguation (WSD). Section six introduces Wikipedia and WordNet as used in our project. Section seven describes the logical structure of the project, its software components, and its databases. Finally, Section eight provides some concluding considerations.
- …