9 research outputs found

Recherche d'information et fouille de textes (Information retrieval and text mining)

Author: Bellot Patrice
Grau Brigitte
Publication venue: HAL CCSD
Publication date: 01/01/2014

National audience. Introduction. Understanding a text is a goal that Artificial Intelligence (AI) has set itself since its beginnings, and the first work offering answers appeared in the 1970s. Since then the topic has remained current, although the goals and methods it covers have evolved considerably. It is therefore necessary to look more closely at what lies behind the general label "text understanding". The earliest work, carried out from the mid-1970s to the mid-1980s [Charniak 1972; Dyer 1983; Schank et al. 1977], studied texts telling short stories, and understanding meant bringing out the ins and outs of the story (the topics addressed, the events described, the causal relations linking them) as well as the role of each character, their motivations and their intentions. Understanding was seen as an inference process that aimed to make explicit everything left implicit in a text, by recovering it from the semantic and pragmatic knowledge available to the machine. This presupposed that such knowledge had been modelled beforehand. It ties in with work on the various knowledge representation formalisms in AI, which describe on the one hand the meanings associated with the words of the language (semantic networks vs. logic, and in particular conceptual graphs [Sowa 1984]) and on the other hand pragmatic knowledge [Schank 1982]. All of this work showed its limits as soon as the knowledge had to be modelled manually for every domain, or learned automatically. The problem of automatic understanding in an open domain therefore remained unsolved. Since the problem stated in this way is unsolvable given the current state of knowledge, an alternative approach is to redefine it and decompose it into subtasks that are potentially easier to solve.
Text understanding can thus be redefined according to different points of view on the text, each answering specific needs. Just as a reader does not read a text in the same way depending on whether they want to assess its relevance to a topic of interest (a document retrieval type of task), classify documents, learn about the events reported, or look for a precise piece of information, so automatic processes will be multiple and will focus on different aspects of the text depending on the target task. Depending on the type of knowledge sought in a document, the reader extracts from the text only the information of interest, relying on the cues and knowledge that make it possible to carry out the reading task, and hence the understanding task, without having to assimilate everything. One can then speak of variable-level understanding, which gives access to different levels of meaning. This approach is well illustrated by work on information extraction, evaluated within the MUC conferences [Grishman and Sundheim 1996], which took place from the late 1980s until 1998. Information extraction then consisted of modelling an information need as a template, described by a set of typed attributes, and trying to fill these attributes from the information contained in the texts. This is how research developed, in particular, on "named entities" (that is, identifying names of persons, organizations, places, dates, etc.) and on the relations between these entities. It is also from this perspective that document-level approaches developed, whether for information retrieval or for determining the document's structure

HAL AMU

Linking named entities to Wikipedia

Author: Radford William Edward John
Publication venue: Faculty of Engineering and Information Technologies, School of Information Technologies
Publication date: 01/01/2015

Natural language is fraught with problems of ambiguity, including name reference. A name in text can refer to multiple entities, just as an entity can be known by different names. This thesis examines how a mention in text can be linked to an external knowledge base (KB), in our case Wikipedia. The named entity linking (NEL) task requires systems to identify the KB entry, or Wikipedia article, that a mention refers to or, if the KB does not contain the correct entry, to return NIL. Entity linking systems can be complex, and we present a framework for analysing their different components. We use it to analyse three seminal systems, evaluated on a common dataset, and show the importance of precise search for linking. The Text Analysis Conference (TAC) is a major venue for NEL research. We report on our submissions to the entity linking shared task in 2010, 2011 and 2012. The information required to disambiguate entities is often found in the text, close to the mention. We explore apposition, a common way for authors to provide information about entities. We model syntactic and semantic restrictions with a joint model that achieves state-of-the-art apposition extraction performance. We generalise from apposition to examine local descriptions specified close to the mention. We add local description to our state-of-the-art linker by using patterns to extract the descriptions and matching against this restricted context. Not only does this make for a more precise match, but we are also able to model failure to match. Local descriptions help disambiguate entities, further improving our state-of-the-art linker. The work in this thesis seeks to link textual entity mentions to knowledge bases. Linking is important for any task where external world knowledge is used, and resolving ambiguity is fundamental to advancing research into these problems.
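
The NEL task described in this abstract (return the matching KB entry, or NIL when none fits) can be illustrated with a minimal sketch. This is not the thesis's linker: the toy knowledge base `KB`, the alias table `ALIASES`, the function `link`, and the word-overlap score with its `min_overlap` threshold are all hypothetical, chosen only to show the shape of the task.

```python
from typing import Optional

# Hypothetical toy knowledge base: entry title -> descriptive words.
KB = {
    "Mercury (planet)": {"planet", "orbit", "sun", "solar"},
    "Mercury (element)": {"metal", "liquid", "element", "toxic"},
}

# Hypothetical alias table: lowercased name -> candidate KB entries.
ALIASES = {
    "mercury": ["Mercury (planet)", "Mercury (element)"],
}


def link(mention: str, context: str, min_overlap: int = 1) -> Optional[str]:
    """Return the best-matching KB title, or None to stand for NIL."""
    # Candidate search: look the mention up in the alias table.
    candidates = ALIASES.get(mention.lower(), [])
    context_words = set(context.lower().split())

    # Disambiguation: score each candidate by word overlap with the context.
    best_title, best_score = None, 0
    for title in candidates:
        score = len(KB[title] & context_words)
        if score > best_score:
            best_title, best_score = title, score

    # NIL decision: no candidate, or too little evidence, yields None.
    return best_title if best_score >= min_overlap else None
```

Under these assumptions, `link("Mercury", "the planet closest to the sun")` selects the planet entry, while a mention with no candidates, or a context that shares no words with any candidate, yields `None` as the NIL answer.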

Sydney eScholarship

Enhancing knowledge acquisition systems with user generated and crowdsourced resources

Author: Xu Fang
Publication venue: Fakultät 7 - Naturwissenschaftlich-Technische Fakultät II. Fachrichtung 7.4 - Mechatronik
Publication date: 01/01/2012

This thesis is about enhancing knowledge acquisition systems with collaborative data and crowdsourced work from the internet. We propose two strategies and apply them to build effective entity linking and question answering (QA) systems. The first strategy integrates an information extraction system with online collaborative knowledge bases, such as Wikipedia and Freebase. We construct a Cross-Lingual Entity Linking (CLEL) system to connect Chinese entities, such as people and locations, with the corresponding English pages in Wikipedia. The main focus is to break the language barrier between Chinese entities and the English KB, and to resolve the synonymy and polysemy of Chinese entities. To address those problems, we create a cross-lingual taxonomy and a Chinese knowledge base (KB). We investigate two methods of connecting the query representation with the KB representation. Building on the CLEL system that took part in the TAC KBP 2011 evaluation, we finally propose a simple and effective generative model, which achieved much better performance. The second strategy creates annotations for QA systems with the help of crowdsourcing. Crowdsourcing means distributing a task via the internet and recruiting many people to complete it simultaneously. Various annotated data are required to train the data-driven statistical machine learning algorithms for the underlying components of our QA system. This thesis demonstrates how to convert the annotation task into crowdsourcing micro-tasks, investigates different statistical methods for enhancing the quality of crowdsourced annotation, and finally uses the enhanced annotation to train learning-to-rank models for passage ranking in QA.

THU QUANTA at TAC 2009 KBP and RTE Track

Author: Fan Bu
Fangtao Li
Minlie Huang
Xiaoyan Zhu
Yang Tang
Zhicheng Zheng
Publication venue: Text Analysis Conference (TAC 2009)
Publication date: 01/01/2009

This paper describes the systems of THU QUANTA in the Text Analysis Conference (TAC) 2009. We participated in the Knowledge Base Population (KBP) track and the Recognizing Textual Entailment (RTE) track. For the KBP track, we investigate two ranking strategies for the Entity Linking task. We employ a Listwise "Learning to Rank" model and an Augmenting Naïve Bayes model to rank the candidates. We try to use learned patterns to solve the Slot Filling task. For the RTE track, we propose a method called SEGraph (Semantic Elements based Graph). This method divides the Hypothesis and the Text into two types of semantic elements: Entity Semantic Elements and Relation Semantic Elements. The SEGraph is then constructed for both Text and Hypothesis, with Entity Elements as nodes and Relation Elements as edges. Finally, we recognize textual entailment based on the SEGraph of the Text and the SEGraph of the Hypothesis. The evaluation results show that our two proposed frameworks are very effective for the KBP and RTE tasks, respectively.

CiteSeerX

International Development Research Centre: IDRC Digital Library