36 research outputs found

    Representation and Processing of Composition, Variation and Approximation in Language Resources and Tools

    In my habilitation dissertation, meant to validate my capacity and maturity for directing research activities, I present a panorama of several topics in computational linguistics, linguistics and computer science. Over the past decade, I was notably concerned with the phenomena of compositionality and variability of linguistic objects. I illustrate the advantages of a compositional approach to language in the domain of emotion detection and I explain how some linguistic objects, most prominently multi-word expressions (MWEs), defy the compositionality principles. I demonstrate that the complex properties of MWEs, notably variability, are partially regular and partially idiosyncratic. This fact places MWEs on the frontiers between different levels of linguistic processing, such as lexicon and syntax. I show the highly heterogeneous nature of MWEs by citing their two existing taxonomies. After an extensive state-of-the-art study of MWE description and processing, I summarize Multiflex, a formalism and a tool for high-quality lexical morphosyntactic description of multi-word units (MWUs). It uses a graph-based approach in which the inflection of an MWU is expressed as a function of the morphology of its components and of morphosyntactic transformation patterns. Thanks to unification, the inflection paradigms are represented compactly. Orthographic, inflectional and syntactic variants are treated within the same framework. The proposal is multilingual: it has been tested on six European languages of three different origins (Germanic, Romance and Slavic), and I believe that many others can also be successfully covered. Multiflex proves interoperable. It adapts to different morphological language models, token boundary definitions, and underlying modules for the morphology of single words. It has been applied to the creation and enrichment of linguistic resources, as well as to morphosyntactic analysis and generation.
It can be integrated into other NLP applications requiring the conflation of different surface realizations of the same concept. Another chapter of my activity concerns named entities, most of which are particular types of MWEs. Their rich semantic load has turned them into a hot topic in the NLP community, which is documented in my state-of-the-art survey. I present the main assumptions, processes and results of large annotation tasks at two levels (named entities and coreference), carried out as part of the construction of the National Corpus of Polish. I have also contributed to the development of both rule-based and probabilistic named entity recognition tools, and to an automated enrichment of Prolexbase, a large multilingual database of proper names, from open sources. With respect to multi-word expressions, named entities and coreference mentions, I pay special attention to nested structures. This problem sheds new light on the treatment of complex linguistic units in NLP. When these units start being modeled as trees (or, more generally, as acyclic graphs) rather than as flat sequences of tokens, long-distance dependencies, discontinuities, overlapping and other frequent linguistic properties become easier to represent. This calls for more complex processing methods which control larger contexts than is usual in sequential processing. Thus, both named entity recognition and coreference resolution come very close to parsing, and named entities or mentions with their nested structures are analogous to multi-word expressions with embedded complements. My parallel activity concerns finite-state methods for natural language and XML processing. My main contribution in this field, co-authored with two colleagues, is the first full-fledged method for tree-to-language correction, and more precisely for correcting XML documents with respect to a DTD.
We have also produced interesting results in incremental finite-state algorithmics, particularly relevant to data evolution contexts such as dynamic vocabularies or user updates. Multilingualism is the leitmotif of my research. I have applied my methods to several natural languages, most importantly to Polish, Serbian, English and French. I have been among the initiators of a highly multilingual European scientific network dedicated to parsing and multi-word expressions. I have used multilingual linguistic data in experimental studies. I believe that it is particularly worthwhile to design NLP solutions taking declension-rich (e.g. Slavic) languages into account, since this leads to more universal solutions, at least as far as nominal constructions (MWUs, NEs, mentions) are concerned. For instance, once Multiflex had been developed with Polish in mind, it could be applied as such to French, English, Serbian and Greek. Also, a French-Serbian collaboration led to substantial modifications in morphological modeling in Prolexbase in its early development stages. This allowed for its later application to Polish with very few adaptations of the existing model. Other researchers also stress the advantages of NLP studies on highly inflected languages since their morphology encodes much more syntactic information than is the case e.g. in English. In this dissertation I am also supposed to demonstrate my ability to play an active role in shaping the scientific landscape, on a local, national and international scale.
I describe my: (i) various scientific collaborations and supervision activities, (ii) roles in over 10 regional, national and international projects, (iii) responsibilities in collective bodies such as program and organizing committees of conferences and workshops, PhD juries, and the National University Council (CNU), (iv) activity as an evaluator and a reviewer of European collaborative projects. The issues addressed in this dissertation open interesting scientific perspectives, in which special emphasis is put on links among various domains and communities. These perspectives include: (i) integrating fine-grained language data into the linked open data, (ii) deep parsing of multi-word expressions, (iii) modeling multi-word expression identification in a treebank as a tree-to-language correction problem, and (iv) a taxonomy and an experimental benchmark for tree-to-language correction approaches

    Head-Driven Phrase Structure Grammar

    Head-Driven Phrase Structure Grammar (HPSG) is a constraint-based or declarative approach to linguistic knowledge, which analyses all descriptive levels (phonology, morphology, syntax, semantics, pragmatics) with feature value pairs, structure sharing, and relational constraints. In syntax it assumes that expressions have a single relatively simple constituent structure. This volume provides a state-of-the-art introduction to the framework. Various chapters discuss basic assumptions and formal foundations, describe the evolution of the framework, and go into the details of the main syntactic phenomena. Further chapters are devoted to non-syntactic levels of description. The book also considers related fields and research areas (gesture, sign languages, computational linguistics) and includes chapters comparing HPSG with other frameworks (Lexical Functional Grammar, Categorial Grammar, Construction Grammar, Dependency Grammar, and Minimalism)

    Neural Combinatory Constituency Parsing

    Tokyo Metropolitan University, Doctor of Information Science, doctoral thesis

    Term selection in information retrieval

    Systems trained on linguistically annotated data achieve strong performance for many language processing tasks. This encourages the idea that annotations can improve any language processing task if applied in the right way. However, despite widespread acceptance and availability of highly accurate parsing software, it is not clear that ad hoc information retrieval (IR) techniques using annotated documents and requests consistently improve search performance compared to techniques that use no linguistic knowledge. In many cases, retrieval gains made using language processing components, such as part-of-speech tagging and head-dependent relations, are offset by significant negative effects. This results in a minimal positive, or even negative, overall impact for linguistically motivated approaches compared to approaches that do not use any syntactic or domain knowledge. In some cases, it may be that syntax does not reveal anything of practical importance about document relevance. Yet without a convincing explanation for why linguistic annotations fail in IR, the intuitive appeal of search systems that ‘understand’ text can result in the repeated application, and mis-application, of language processing to enhance search performance. This dissertation investigates whether linguistics can improve the selection of query terms by better modelling the alignment process between natural language requests and search queries. It is the most comprehensive work on the utility of linguistic methods in IR to date. Term selection in this work focuses on identification of informative query terms of 1-3 words that both represent the semantics of a request and discriminate between relevant and non-relevant documents. Approaches to word association are discussed with respect to linguistic principles, and evaluated with respect to semantic characterization and discriminative ability. 
Analysis is organised around three theories of language that emphasize different structures for the identification of terms: phrase structure theory, dependency theory and lexicalism. The structures identified by these theories play distinctive roles in the organisation of language. Evidence is presented regarding the value of different methods of word association based on these structures, and the effect of method and term combinations. Two highly effective, novel methods for the selection of terms from verbose queries are also proposed and evaluated. The first method focuses on the semantic phenomenon of ellipsis with a discriminative filter that leverages diverse text features. The second method exploits a term ranking algorithm, PhRank, that uses no linguistic information and relies on a network model of query context. The latter focuses queries so that 1-5 terms in an unweighted model achieve better retrieval effectiveness than weighted IR models that use up to 30 terms. In addition, unlike models that use a weighted distribution of terms or subqueries, the concise terms identified by PhRank are interpretable by users. Evaluation with newswire and web collections demonstrates that PhRank-based query reformulation significantly improves the performance of verbose queries, by up to 14% compared to highly competitive IR models, and is at least as good for short keyword queries with the same models. Results illustrate that linguistic processing may help with the selection of word associations but does not necessarily translate into improved IR performance. Statistical methods are necessary to overcome the limits of syntactic parsing and word adjacency measures for ad hoc IR. As a result, probabilistic frameworks that discover, and make use of, many forms of linguistic evidence may deliver small improvements in IR effectiveness, but methods that use simple features can be substantially more efficient and equally, or more, effective.
Various explanations for this finding are suggested, including the probabilistic nature of grammatical categories, a lack of homomorphism between syntax and semantics, the impact of lexical relations, variability in collection data, and systemic effects in language systems

    Proceedings

    Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories. Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti. NEALT Proceedings Series, Vol. 9 (2010), 268 pages. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15891

    Computational models for multilingual negation scope detection

    Negation is a common property of languages, in that there are few languages, if any, that lack means to reverse the truth-value of a statement. A challenge to cross-lingual studies of negation lies in the fact that languages encode and use it in different ways. Although this variation has been extensively researched in linguistics, little has been done in automated language processing. In particular, we lack computational models of processing negation that can be generalized across languages. We even lack knowledge of what the development of such models would require. Such models can nevertheless be built by means of existing cross-lingual resources, even when annotated data for a language other than English is not available. This thesis shows this in the context of detecting string-level negation scope, i.e. the set of tokens in a sentence whose meaning is affected by a negation marker (e.g. ‘not’). Our contribution has two parts. First, we investigate the scenario where annotated training data is available. We show that Bi-directional Long Short Term Memory (BiLSTM) networks are state-of-the-art models whose features can be generalized across languages. We also show that these models suffer from genre effects and that, for most of the corpora we have experimented with, high performance is simply an artifact of the annotation styles, where negation scope is often a span of text delimited by punctuation. Second, we investigate the scenario where annotated data is available in only one language, experimenting with model transfer. To test our approach, we first build NEGPAR, a parallel corpus annotated for negation, where pre-existing annotations on English sentences have been edited and extended to Chinese translations.
We then show that transferring a model for negation scope detection across languages is possible by means of structured neural models where negation scope is detected on top of a cross-linguistically consistent representation, Universal Dependencies. On the other hand, we found that cross-lingual lexical information contributes very little to performance. Finally, error analysis shows that performance is better when a negation marker is in the same dependency substructure as its scope and that some of the phenomena related to negation scope requiring lexical knowledge are still not captured correctly. In the conclusions, we tie together the contributions of this thesis and point future work towards also representing negation scope across languages at the level of logical form

    Can human association norm evaluate latent semantic analysis?

    This paper presents a comparison of a word association norm created by a psycholinguistic experiment with association lists generated by algorithms operating on text corpora. We compare lists generated by the Church and Hanks algorithm with lists generated by the LSA algorithm. An argument is presented on how those automatically generated lists reflect real semantic relations

    Automatically identifying facet roles from comparative structures to support biomedical text summarization

    Within the context of biomedical scholarly articles, comparison sentences represent a rhetorical structure commonly used to communicate findings. More generally, comparison sentences are rich with information about how the properties of one or more entities relate to one another. So far, in the biomedical domain, the emphasis has been on recognizing comparative sentences in the text. This dissertation goes beyond sentence-level recognition and aims to automate the identification of the integral parts of a comparison sentence, which are called comparative facets and include: the compared entities, the basis or endpoint of the comparison, and the result or relationship that binds the entities and the basis. Only the sentences that contain each of the four facets are of interest in this thesis. With respect to the first compared entity, the system achieves an average F1 of 0.65 on a random sample of short sentences (11 to 21 words), 0.70 on medium sentences (22 to 28 words), 0.60 on long sentences (29 to 36 words) and 0.60 on very long sentences (more than 36 words). With respect to the prediction of the basis of comparison (the endpoint), the average F1 measure was 0.66 on short, 0.57 on medium, 0.56 on long, and 0.50 on very long sentences. The average F1 achieved with respect to the second compared entity was 0.91 on short, 0.85 on medium, 0.81 on long and 0.72 on very long sentences. In the area of semantic relation identification, the performance achieved was also sensitive to sentence length: the average F1 measure was 0.80 on short sentences, and 0.71, 0.56, and 0.51 on medium, long, and very long sentences respectively. Thus, the methods developed in this dissertation work better on sentences that are shorter (<= 28 words) and on those that do not contain multiple claims or disjunctive conjunctions.
When applied to a previously unseen collection of breast cancer articles, the performance achieved with respect to the identification of compared entities and the endpoint was comparable to the results achieved on the collection that was used for building and testing the models. This result is promising with respect to the potential of this model being applied on other collections of scholarly articles in the biomedical sciences

    Cross-lingual Semantic Parsing with Categorial Grammars

    Humans communicate using natural language. We need to make sure that computers can understand us so that they can act on our spoken commands or independently gain new insights from knowledge that is written down as text. A “semantic parser” is a program that translates natural-language sentences into computer commands or logical formulas, that is, something a computer can work with. Despite much recent progress on semantic parsing, most research focuses on English, and semantic parsers for other languages cannot keep up with the developments. My thesis aims to help close this gap. It investigates “cross-lingual learning” methods by which a computer can automatically adapt a semantic parser to another language, such as Dutch. The computer learns by looking at example sentences and their translations, e.g., “She likes to read books”/“Ze leest graag boeken”. Even with many such examples, learning which word means what and how word meanings combine into sentence meanings is a challenge, because translations are rarely word-for-word. They exhibit grammatical differences and non-literalities. My thesis presents a method for tackling these challenges based on the grammar formalism Combinatory Categorial Grammar. It shows that this is a suitable formalism for this purpose, that many structural differences between sentences and their translations can be dealt with in this framework, and that a (rudimentary) semantic parser for Dutch can be learned cross-lingually based on one for English. I also investigate methods for building large corpora of texts annotated with logical formulas to further study and improve semantic parsers