162 research outputs found

    Complexity of Lexical Descriptions and its Relevance to Partial Parsing

    In this dissertation, we have proposed novel methods for robust parsing that integrate the flexibility of linguistically motivated lexical descriptions with the robustness of statistical techniques. Our thesis is that the computation of linguistic structure can be localized if lexical items are associated with rich descriptions (supertags) that impose complex constraints in a local context. However, increasing the complexity of descriptions makes the number of different descriptions for each lexical item much larger and hence increases the local ambiguity for a parser. This local ambiguity can be resolved by using supertag co-occurrence statistics collected from parsed corpora. We have explored these ideas in the context of the Lexicalized Tree-Adjoining Grammar (LTAG) framework, wherein supertag disambiguation provides a representation that is an almost parse. We have used the disambiguated supertag sequence in conjunction with a lightweight dependency analyzer to compute noun groups, verb groups, dependency linkages and even partial parses. We have shown that a trigram-based supertagger achieves an accuracy of 92.1% on Wall Street Journal (WSJ) texts. Furthermore, we have shown that the lightweight dependency analysis on the output of the supertagger identifies 83% of the dependency links accurately. We have exploited the representation of supertags with Explanation-Based Learning to improve parsing efficiency. In this approach, parsing in limited domains can be modeled as a Finite-State Transduction. We have implemented such a system for the ATIS domain which improves parsing efficiency by a factor of 15. We have used the supertagger in a variety of applications to provide lexical descriptions at an appropriate granularity. In an information retrieval application, we show that the supertag-based system performs at higher levels of precision compared to a system based on part-of-speech tags.
In an information extraction task, supertags are used in specifying extraction patterns. For language modeling applications, we view supertags as syntactically motivated class labels in a class-based language model. The distinction between recursive and non-recursive supertags is exploited in a sentence simplification application.
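
The idea of resolving local supertag ambiguity with co-occurrence statistics can be illustrated with a small sketch. The abstract describes a trigram-based supertagger; the code below is a deliberately simplified bigram Viterbi decoder, not the dissertation's implementation. The supertag labels (`DET`, `N_head`, `N_mod`, `V_intrans`), the probability tables and the `FLOOR` constant are all hypothetical toy values chosen for illustration.

```python
import math

# All supertag labels and probabilities below are hypothetical toy values.
LEXICON = {
    "the": ["DET"],
    "price": ["N_head", "N_mod"],  # ambiguous: head noun vs. noun modifier
    "rose": ["V_intrans"],
}
EMIT = {  # P(word | supertag)
    ("the", "DET"): 1.0,
    ("price", "N_head"): 0.5,
    ("price", "N_mod"): 0.5,
    ("rose", "V_intrans"): 1.0,
}
TRANS = {  # P(supertag | previous supertag), i.e. co-occurrence statistics
    ("<s>", "DET"): 0.8,
    ("DET", "N_head"): 0.6,
    ("DET", "N_mod"): 0.4,
    ("N_head", "V_intrans"): 0.5,
    ("N_mod", "V_intrans"): 0.01,
}
FLOOR = 1e-6  # crude floor for unseen events, standing in for real smoothing


def log_p(table, key):
    """Log-probability of an event, with a floor for unseen events."""
    return math.log(table.get(key, FLOOR))


def viterbi_supertag(words):
    """Return the most probable supertag sequence under the bigram model."""
    # best[i][tag] = (log-probability of the best path ending in tag, previous tag)
    best = [{t: (log_p(TRANS, ("<s>", t)) + log_p(EMIT, (words[0], t)), None)
             for t in LEXICON[words[0]]}]
    for i in range(1, len(words)):
        column = {}
        for t in LEXICON[words[i]]:
            score, prev = max(
                (best[i - 1][p][0] + log_p(TRANS, (p, t)), p) for p in best[i - 1]
            )
            column[t] = (score + log_p(EMIT, (words[i], t)), prev)
        best.append(column)
    # Follow the back-pointers from the best final supertag.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    tags = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        tags.append(tag)
    return tags[::-1]


print(viterbi_supertag(["the", "price", "rose"]))
# ['DET', 'N_head', 'V_intrans']
```

In this toy example the word "price" is ambiguous between two supertags, and the transition statistics favour the head-noun reading before an intransitive verb. A trigram model would condition each supertag on the two preceding ones instead of one.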

    Extracting Temporal and Causal Relations between Events

    Structured information resulting from temporal information processing is crucial for a variety of natural language processing tasks, for instance to generate timeline summaries of events from news documents, or to answer temporal or causal questions about events. In this thesis we present a framework for an integrated temporal and causal relation extraction system. We first develop a robust extraction component for each type of relation, i.e. temporal order and causality. We then combine the two extraction components into an integrated relation extraction system, CATENA (CAusal and Temporal relation Extraction from NAtural language texts), by utilizing the presumption about event precedence in causality, namely that causing events must happen BEFORE resulting events. Several resources and techniques to improve our relation extraction systems are also discussed, including word embeddings and training data expansion. Finally, we report our efforts to adapt temporal information processing to languages other than English, namely Italian and Indonesian. Comment: PhD Thesis
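
The precedence presumption (a causing event must happen BEFORE its resulting event) can be sketched as a simple consistency rule between two sets of extracted links. This is only an illustration of the constraint, not CATENA's actual integration procedure, which the abstract does not detail. The function name `apply_causal_precedence`, the link representation and the example events are all assumptions made for this sketch.

```python
def apply_causal_precedence(causal_links, temporal_links):
    """Derive BEFORE links from causal links and report contradictions.

    causal_links:   list of (cause, effect) event pairs (hypothetical format)
    temporal_links: dict mapping (event1, event2) to a label such as "BEFORE"
    Returns (merged temporal links, list of conflicting causal pairs).
    """
    merged = dict(temporal_links)
    conflicts = []
    for cause, effect in causal_links:
        existing = merged.get((cause, effect))
        reverse = merged.get((effect, cause))
        # A conflict arises if the effect is said to precede the cause, or if
        # the pair already carries a temporal label other than BEFORE.
        if reverse == "BEFORE" or existing not in (None, "BEFORE"):
            conflicts.append((cause, effect))
        else:
            merged[(cause, effect)] = "BEFORE"
    return merged, conflicts


# Toy input: one causal link is compatible, one contradicts a temporal link.
causal = [("earthquake", "tsunami"), ("flood", "evacuation")]
temporal = {("evacuation", "flood"): "BEFORE"}

merged, conflicts = apply_causal_precedence(causal, temporal)
print(merged)
print(conflicts)
```

Here the causal link from "earthquake" to "tsunami" yields a new BEFORE link, while the link from "flood" to "evacuation" is flagged because the temporal component placed the evacuation first. How such conflicts are resolved is a design decision the sketch leaves open.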

    Recognition and normalization of temporal expressions in Serbian medical narratives

    The temporal dimension is one of the essential concepts in medicine, providing a basis for the proper interpretation and understanding of medically relevant information, which is often recorded only in unstructured texts. Automatic processing of temporal expressions involves their identification and formalization in a language understandable to computers. This paper aims to apply the existing system for automatic processing of temporal expressions in Serbian natural language texts to medical narrative texts, to evaluate the system’s efficiency in recognition and normalization of temporal expressions, and to determine the degree of necessary adaptation according to the characteristics and requirements of the medical domain.

    List Construction in Finland-Swedish Sign Language

    Finland-Swedish Sign Language (FinSSL) is an endangered minority signed language used by approximately 90 deaf and 100 hearing persons in Finland and a smaller group of users in Sweden. Finland-Swedish Sign Language is in need of revitalization, and this study contributes to this with a detailed description of the form and usage of list construction in informational monologues published between 2014 and 2019. This study examines the use of list constructions in FinSSL. In list construction and its basic form, the non-dominant hand ‘counts’ and its fingertips are associated with entities while the dominant hand is used for pointing at these non-dominant hand’s fingers. In previous studies, list constructions have been called, for example, digital enumeration, finger(tip) loci or enumeration, and list buoys. List constructions have been described as a simultaneous expression involving the use of a numeral sign, since the non-dominant list hand often borrows its handshape from a corresponding numeral sign. The list hand can be held in place throughout a stretch of discourse (perseverating) or the hand can alternate between perseveration, simultaneous presentation of list fingers, sequential presentation of the list fingers, and various mixed versions of these. This study focuses on how FinSSL signers use list constructions in informational videos published on Teckeneko (www.teckeneko.fi). Teckeneko is a web-based information channel and broadcast service administered by the association Finlandssvenska teckenspråkiga rf and the media company Moxio AB. The data for this study consists of 48 videos (2 hours and 16 minutes) where the list construction was used 241 times by seven different signers. The data was first annotated with the ELAN annotation program, and then the usage-events were analyzed by using Cognitive Grammar as the theoretical framework. In this analysis, list constructions consist of a list hand and a pointing device. 
The list hand and its fingers represent the different listed entities. The other hand acts as the pointing device and directs attention to the referents on the list hand fingers. The result of this study is a detailed description of list construction usage in informational videos signed in FinSSL. The signers were found to use list construction, e.g., in enumerating topics in the video in question or for a project, events, dates, program numbers, participants, and organizations. The signers also used list construction for grouping the enumerated entities and for referring to the group of entities instead of individual entities. The results provide a more nuanced understanding of the use of list constructions in FinSSL and in signed languages in general, but also show a need for further research on list constructions in other types of data.
Listakonstruktio suomenruotsalaisessa viittomakielessä (translated from Finnish). This doctoral study examines the list construction in Finland-Swedish Sign Language. Finland-Swedish Sign Language is one of the two sign languages used in Finland. It is a severely endangered language with about one hundred native users in Finland. This is the first doctoral study of the grammar of Finland-Swedish Sign Language. The study is descriptive, and its theoretical framework is Cognitive Grammar. The theoretical part of the work brings together a wide range of descriptions of the list construction, its use, and research on it in other sign languages around the world. In a list construction, the signer extends one to five fingers of one hand (the list hand) and, with the other hand (the pointer), points at one or more of the extended fingers. The list construction is used when a signer lists items and locates the listed items on the fingers of the list hand. The number of extended fingers depends on how many items are on the signer's list. The required fingers can be extended either all at once (a simultaneous list) or one at a time as the list progresses (a sequential list). The list hand can also either remain visible, list fingers extended, for the whole time the list construction is produced (a persistent list), or take part in signing the listed items and resume the list form when it is time for the next point at the list fingers. The dissertation describes how users of Finland-Swedish Sign Language employ the list construction in informational monologues published on the Teckeneko website (teckeneko.fi) and finds that the use is versatile and creative. The handshape of the pointing hand and its movement often differ from prototypical pointing, which is done with the index finger alone and touches the fingertip of one list-hand finger. The pointing handshape can include both the index and middle fingers, making it possible to touch two fingertips of the list hand at once and thus refer to two list items simultaneously. Instead of a nearly straight movement and a small touch, the pointing hand can also make a circling movement around the extended fingers of the list hand, or a line-like movement across or beside the extended fingertips. With this circling or line-like movement the signer refers to the list items as one group; or, if the line-like movement does not touch all the extended fingertips, the signer can divide the list items into two groups: those that were touched and those that were not.
Listkonstruktionen i finlandssvenskt teckenspråk (translated from Swedish). The doctoral dissertation deals with the list construction in Finland-Swedish Sign Language, one of the two sign languages in Finland. Finland-Swedish Sign Language is a severely endangered language with about one hundred native users in Finland. This is the first doctoral dissertation to focus on the grammar of Finland-Swedish Sign Language. The study is descriptive and uses cognitive linguistics as its theoretical framework. The theoretical part of the dissertation gathers descriptions of how the list construction has been described and how it is used in several sign languages around the world. When a person signs a list construction, they show one to five extended fingers on one hand (the list hand) and point with the other hand (the pointing hand) at one or more of the list hand's fingers. The list construction is used when listing different things or entities, and these list items are placed on the fingers of the list hand. The number of extended list fingers depends on the number of items on the list. The fingers can be extended either all at once (a simultaneous list) or one after another as the list proceeds (a sequential list). The list hand can also be held in the list form for the whole time the list construction is produced (a permanent list), or it can lose the list form while it takes part in signing the listed items and resume the list form when it is time for the next point at the list fingers. This dissertation describes how signers of Finland-Swedish Sign Language use the list construction in informational monologues published on Teckeneko (teckeneko.fi). The study shows that the use is varied and creative: the handshape of the pointing hand and the movement it makes often differ from prototypical pointing. Prototypical pointing is done with one index finger, and the movement is directed toward a list finger and ends in contact between one of the list hand's fingers and the pointing hand. The study shows that the handshape of the pointing hand can include both the index and middle fingers, which makes it possible to touch two list fingers at the same time and thereby refer to two list items simultaneously. The pointing hand can also make a circular movement around the extended list fingers, or a nearly straight linear movement across or near them, instead of a movement toward one finger. With this circular or linear movement the signer can refer to the listed items as a group, or divide the items into two groups if some finger is left outside the linear movement: those the pointing hand touched and those it did not.

    Computational linguistics in the Netherlands 1996 : papers from the 7th CLIN meeting, November 15, 1996, Eindhoven


    Adjectivization in Russian: Analyzing participles by means of lexical frequency and constraint grammar

    This dissertation explores the factors that restrict and facilitate adjectivization in Russian, an affixless part-of-speech change leading to ambiguity between participles and adjectives. I develop a theoretical framework based on major approaches to adjectivization, and assess the effect of the factors on ambiguity in the empirical data. I build a linguistic model using the Constraint Grammar formalism. The model utilizes the factors of adjectivization and corpus frequencies as formal constraints for differentiating between participles and adjectives in a disambiguation task. The main question that is explored in this dissertation is which linguistic factors allow for the differentiation between adjectivized and unambiguous participles. Another question concerns which factors, syntactic or morphological, predict ambiguity in the corpus data and resolve it in the disambiguation model. In the theoretical framework, the syntactic context signals whether a participle is adjectivized, whereas internal morphosemantic properties (that is, tense, voice, and lexical meaning) cause or prevent adjectivization. The exploratory analysis of these factors in the corpus data reveals diverse results. The syntactic factor, the adverb of measure and degree očenʹ ‘very’, which is normally used with adjectives, also combines with participles, and is strongly associated with semantic classes of their base verbs. Nonetheless, the use of očenʹ with a participle only indicates ambiguity when other syntactic factors of adjectivization are in place. The lexical frequency (including the ranks of base verbs and the ratios of participles to other verbal forms) and several morphological types of participles strongly predict ambiguity. Furthermore, past passive and transitive perfective participles not only have the highest mean ratios among the other morphological types of participles, but are also strong predictors of ambiguity. 
The linguistic model using weighted syntactic rules shows the highest accuracy in disambiguation compared to the models with weighted morphological rules or the rule based on weights only. All of the syntactic, morphological, and weighted rules combined show the best performance results. Weights are the most effective for removing residual ambiguity (similar to the statistical baseline model), but are outperformed by the models that use factors of adjectivization as constraints.

    Eesti keele üldvaldkonna tekstide laia kattuvusega automaatne sündmusanalüüs (Broad-coverage automatic event analysis of Estonian general-domain texts)

    (Translated from Estonian) With the large-scale digitisation of texts and the ever wider spread of digital text production, a vast number of natural language texts have become, and are becoming, machine-readable. Machine-readability has the potential to make large text collections easier for people to manage, for example by enabling applications such as automatic summarisation and question answering over texts, but current automatic analysis does not reach a real understanding of text content. It is hypothesised that event analysis brings us closer to automatic analysis that understands text content: since many texts have a narrative structure and can be interpreted as "descriptions of events", extracting events from texts and representing them formally should provide a basis for building language technology applications that require "text understanding". This dissertation investigates to what extent event analysis of Estonian texts can be treated as an automatic linguistic analysis task covering an open set of events and general-domain texts. The problem is approached from a perspective that is new in the context of automatic analysis of Estonian, one that focuses on the temporal semantics of events. The work adapts the TimeML annotation framework to Estonian and creates an automatic temporal expression tagger based on the framework, along with a text corpus annotated for temporal semantics (event mentions, temporal expressions, and temporal relations); it analyses, on the basis of the corpus, the agreement among human annotators in determining event mentions and temporal relations; and finally it explores ways of extending temporal-semantics-centred event analysis to generic event analysis, using coreference resolution of event-denoting expressions as an example. The work offers guidelines for further developing the annotation of temporal semantics and event structure, and the language resources created make it possible both to test concrete end-user applications (e.g. automatic answering of temporal questions) and to further develop automatic annotation tools.
    Due to massive-scale digitalisation and a switch from traditional to digital written communication, vast amounts of human language texts are becoming machine-readable. Machine-readability holds potential for easing the human effort of searching and organising large text collections, allowing applications such as automatic text summarisation and question answering. However, current tools for automatic text analysis do not reach the level of text understanding required to make these applications generic. It is hypothesised that automatic analysis of events in texts leads us closer to this goal, as many texts can be interpreted as stories or narratives that are decomposable into events. This thesis explores event analysis as a broad-coverage, general-domain automatic language analysis problem in Estonian, and provides an investigation starting from time-oriented event analysis and tending towards generic event analysis. We adapt the TimeML framework to Estonian, and create an automatic temporal expression tagger and a news corpus manually annotated for temporal semantics (event mentions, temporal expressions, and temporal relations) for the language; we analyse the consistency of human annotation of event mentions and temporal relations, and, finally, provide a preliminary study on event coreference resolution in Estonian news.
The current work also makes suggestions on how future research can improve Estonian event and temporal semantic annotation, and the language resources developed in this work will allow future experimentation with end-user applications (such as automatic answering of temporal questions) as well as provide a basis for developing automatic semantic analysis tools.