12 research outputs found

    Improving Term Extraction with Terminological Resources

    Studies of different term extractors on a biomedical corpus revealed decreasing performance when they are applied to highly technical texts. The difficulty or impossibility of customising them to new domains is an additional limitation. In this paper, we propose to use external terminologies to influence generic linguistic data in order to improve the quality of the extraction. The tool we implemented exploits attested terms at different steps of the process: chunking, parsing and extraction of term candidates. Experiments reported here show that, using this method, more term candidates can be acquired with a higher level of reliability. We further describe the extraction process involving endogenous disambiguation implemented in the term extractor YaTeA.

    Acronyms as an integral part of multi–word term recognition - A token of appreciation

    Term conflation is the process of linking together different variants of the same term. In automatic term recognition approaches, all term variants should be aggregated into a single normalized term representative, which is associated with a single domain–specific concept as a latent variable. In a previous study, we described FlexiTerm, an unsupervised method for recognition of multi–word terms from a domain–specific corpus. It uses a range of methods to normalize three types of term variation – orthographic, morphological and syntactic variation. Acronyms, which represent a highly productive type of term variation, were not supported. In this study, we describe how the functionality of FlexiTerm has been extended to recognize acronyms and incorporate them into the term conflation process. The main contribution of this study is not acronym recognition per se, but rather its integration with other types of term variation into the term conflation process. We evaluated the effects of term conflation in the context of information retrieval as one of its most prominent applications. On average, relative recall increased by 32 percentage points, whereas index compression factor increased by 7 percentage points. This evidence suggests that integrating acronyms provides a non–trivial improvement in term conflation.

    A Task-based Evaluation of French Morphological Resources and Tools

    Morphology is a key component for many Language Technology applications. However, morphological relations, especially those relying on the derivation and compounding processes, are often addressed in a superficial manner. In this article, we focus on assessing the relevance of deep and motivated morphological knowledge in Natural Language Processing applications. We first describe an annotation experiment whose goal is to evaluate the role of morphology for one task, namely Question Answering (QA). We then highlight the kind of linguistic knowledge that is necessary for this particular task and propose a qualitative analysis of morphological phenomena in order to identify the morphological processes that are most relevant. Based on this study, we perform an intrinsic evaluation of existing tools and resources for French morphology, in order to quantify their coverage. Our conclusions provide helpful insights for using and building appropriate morphological resources and tools that could have a significant impact on application performance.

    Expansion of Multi-Word Terms for Indexing and Retrieval Using Morphology and Syntax

    A system for the automatic production of controlled index terms is presented using linguistically-motivated techniques. This includes a finite-state part-of-speech tagger, a derivational morphological processor for analysis and generation, and a unification-based shallow-level parser using transformational rules over syntactic patterns. The contribution of this research is the successful combination of parsing over a seed term list coupled with derivational morphology to achieve greater coverage of multi-word terms for indexing and retrieval. Final results are evaluated for precision and recall, and implications for indexing and retrieval are discussed.

    Terminology Integration in Statistical Machine Translation

    The electronic version does not contain the appendices. The doctoral thesis describes methods and tools researched and developed by the author for bilingual terminology integration into statistical machine translation (SMT) systems. The author presents novel methods for terminology integration in SMT systems during training (through static integration) and during translation (through dynamic integration). The work focusses not only on the SMT integration techniques, but also on methods for acquiring the linguistic resources that are necessary for the different tasks involved in workflows for terminology integration in SMT systems. The proposed methods have been evaluated using automatic and manual evaluation methods. The results show that both static and dynamic integration methods significantly improve translation quality. The results described in the thesis have been validated in several research projects and deployed in practical solutions. Keywords: statistical machine translation, terminology, cross-lingual information extraction

    The Generation of Compound Nominals to Represent the Essence of Text: The COMMIX System

    This thesis concerns the COMMIX system, which automatically extracts information on what a text is about, and generates that information in the highly compacted form of compound nominal expressions. The expressions generated are complex and may include novel terms which do not appear themselves in the input text. From the practical point of view, the work is driven by the need for better representations of content: for representations which are shorter and more concise than would appear in an abstract, yet more informative and representative of the actual aboutness than commonly occurs in indexing expressions and key terms. This additional layer of representation is referred to in this work as pertaining to the essence of a particular text. From a theoretical standpoint, the thesis shows how the compound nominal as a construct can be successfully employed in these highly informative representations. It involves an exploration of the claim that there is sufficient semantic information contained within the standard dictionary glosses for individual words to enable the construction of useful and highly representative novel compound nominal expressions, without recourse to standard syntactic and statistical methods. It shows how a shallow semantic approach to content identification which is based on lexical overlap can produce some very encouraging results. The methodology employed, and described herein, is domain-independent, and does not require the specification of templates with which the input text must comply. In these two respects, the methodology developed in this work avoids two of the most common problems associated with information extraction. As regards the evaluation of this type of work, the thesis introduces and utilises the notion of percentage attainment value, which is used in conjunction with subjects' opinions about the degree to which the aboutness terms succeed in indicating the subject matter of the texts for which they were generated.

    Classificação de Documentos [Document Classification]

    Dissertation presented at the Faculdade de Ciências e Tecnologia of the Universidade Nova de Lisboa to obtain the degree of Master in Informatics Engineering. This research work aims to automate the process of thematic classification of documents. Three term selection techniques were used, with three automatic classifiers and seven document representations: word, multi-word, pentagram, and strings of the first 4, 5 and 6 characters, each individually and all together. Among the term selection techniques is the measure of the Third Moment about the mean. This measure was recently proposed by Professor Joaquim Ferreira da Silva, and it was considered important to carry out a comparative study of its performance against other measures that are already well known and of proven applicability. The measures chosen were Chi-Square and Information Gain. Some term selection measures give better results depending on the classifier used, so the measures were tested with different classifiers: K-Nearest Neighbour, RIPPER and Support Vector Machines. These classifiers have shown good results in the classification field, and their performance was therefore evaluated with the different term selection measures. In the experimental results, for which the Reuters-21578 corpus was used, the performance obtained with the third moment technique is better than, or equivalent to, that obtained with the Chi-Square and Information Gain term selection measures. Using different document representations, it is possible to obtain, with the three classifiers, performance equivalent to that obtained with the word-based document representation.

    Mejoras en la usabilidad de la web a través de una estructura complementaria [Improving Web usability through a complementary structure]

    The Web has motivated the creation of tools that handle its content with varying degrees of sophistication and precision. To do so, they address a series of problems related to the imperfect and changing nature of all human activity, which shows up in stored texts as ambiguities, contradictions and errors. This thesis presents a proposal to complement Web content management and thereby support information retrieval. A prototype named Web Intelligent Handler (WIH) is introduced; it implements a set of basic algorithms that handle some morpho-syntactic features of Spanish texts and, based on them, produce a brief, alternative representation of their content. Within this framework, a new weighting metric is defined to reflect part of the morpho-syntactic essence of syntagms. A scheme of interaction between the modules is also defined to regulate the processing of texts. The thesis also explores the ability of the proposed algorithms to handle texts, treating them as collections of syntagms subject to factors such as contradictions, ambiguities and errors. Perhaps the main contribution of this thesis is the possibility of evaluating text styles and writing profiles mathematically and automatically. Three styles are proposed: literary, technical and message. Four profiles are also proposed: document, discussion forum, Web index and blog text. All three styles and four profiles were evaluated; they behave as different degrees on a scale of styles and of profiles, respectively, when evaluated with the morpho-syntactic metric defined here. In addition, the same metric makes possible an approximate, automatic assessment of the quality of any type of text. This score is invariant to word count, topic and profile, but is related to the style of the text in question.
    Facultad de Informática

    Exploitation de connaissances sémantiques externes dans les représentations vectorielles en recherche documentaire [Exploiting external semantic knowledge in vector representations for document retrieval]

    The work presented in this thesis deals with several problems encountered in information retrieval (IR), a task that can be summarised as identifying, in a collection of "documents", the subset of documents carrying the sought information, i.e. those relevant to a request expressed by a user. In the case of textual documents, to which we limit ourselves in this thesis, a significant part of the difficulty lies in the ambiguity inherent in human languages. Interaction with the user is also addressed in our work, through the study of a tool enabling natural language access to a database. Finally, some techniques for visualising large collections of documents are also presented. In this document we first describe the principal models of IR, highlighting their relations to some manual techniques of IR and document retrieval developed over the past centuries. We present the principle of document indexing, which allows documents to be represented in a multidimensional space, and the use of this representation by a vector space model. After reviewing the principal improvements made in recent years to vector-based retrieval systems, including the preprocessing of collections, the indexing mechanism and the measures of similarity between documents, we detail some recent uses of additional semantic resources (semantic dictionaries, thesauri, networks, ontologies) reported in the scientific literature for the indexing task. We then present in more detail the principle of semantic indexing of textual documents using a thesaurus, which consists in integrating into the documents' representation space at least part of the informational content of hierarchical semantic resources. We propose a general framework for describing and positioning the various possible techniques for semantic indexing, adapting where possible the specificity of the descriptions drawn from the semantic resources to the data to be represented. 
We use this framework to describe three families of criteria usable for semantic indexing, each with its own characteristics. For each of these families, we give the specific algorithms for computing the criteria. The first two families allow us to consider several criteria already known in feature selection. We show, however, that many of these criteria are not very effective for the task considered. The third family allows us to introduce a completely new criterion, the Minimum Redundancy Cut (MRC) criterion, built on information theory, which yields index terms whose probabilities of occurrence in the document collection are as well balanced as possible. Finally, we treat the case of a semantic index that is independent of the data (statically chosen), allowing the level of generality of the index terms to be parameterised. Some of the criteria suggested for semantic indexing have been empirically evaluated. To judge their relevance, we used a well-known vector-based system (the Smart IR system) and measured the IR performance obtained with various reference collections. Those collections were indexed on the basis of the studied criterion, taking into account the strongly structuring semantic relation of hypernymy/hyponymy (the "is-a" relation), given by two different semantic resources. By comparing the results obtained with the performance of traditional indexing (using the lemmas of the words as the representation space), we can show on the one hand the relevance of semantic indexing for document retrieval and on the other hand the quality of the proposed criterion (MRC). Concerning man-machine interaction, we present a general outline for building, in a relatively fast and systematic way, mixed-initiative systems that give the human user a large (and natural) latitude in the control of the dialogue. 
This outline is usable in typical database search applications (where the database is hidden from the user, but the user knows exactly which information they wish to find) as well as in advice applications, in which users do not necessarily have a precise idea of their needs and use the system not only to specify their wishes but also to obtain a set of propositions as a final result. We particularly stress the techniques for obtaining a robust system, able to deal with speech recognizer failures. Concerning the visualisation of large textual data collections, we present an application of correspondence analysis (which highlights similarities and oppositions between various groups of entities, built on the basis of additional features present in the database) to the case of patent data. In addition, we propose a method (based on the bootstrap replication principle) for determining a confidence interval for the relative positions of the various groups, thus making it possible to judge immediately the reliability of the visually apparent similarities or oppositions.