27 research outputs found

    Advances in Manipulation and Recognition of Digital Ink

    Handwriting is one of the most natural ways for a human to record knowledge. Recently, this type of human-computer interaction has received increasing attention due to the rapid evolution of touch-based hardware and software. While hardware support for digital ink has reached maturity, algorithms for recognition of handwriting in certain domains, including mathematics, lack robustness. At the same time, users may possess several pen-based devices, and sharing training data in an adaptive recognition setting can be challenging. In addition, the resolution of pen-based devices keeps improving, making the ink cumbersome to process and store. This thesis develops several advances for efficient processing, storage and recognition of handwriting, which are applicable to classification methods based on functional approximation. In particular, we propose improvements to the classification of isolated characters and groups of rotated characters, as well as symbols of substantially different size. We then develop an algorithm for adaptive classification of a user's handwritten mathematical characters. The adaptive algorithm can be especially useful in the cloud-based recognition framework, which is described further in the thesis. We investigate whether the training data available in the cloud can be useful to a new writer during the training phase by extracting styles of individuals with similar handwriting and recommending styles to the writer. We also perform factorial analysis of the algorithm for recognition of n-grams of rotated characters. Finally, we show a fast method for compression of linear pieces of handwritten strokes and compare it with an enhanced version of the algorithm based on functional approximation of strokes. Experimental results demonstrate the validity of the theoretical contributions, which form a solid foundation for next-generation handwriting recognition systems.

    Un réseau local intelligent (An Intelligent Local Area Network)


    Creating Digital Editions for Corpus Linguistics: The case of Potage Dyvers, a family of six Middle English recipe collections

    This thesis presents a corpus-linguistically oriented digital documentary edition of six 15th-century culinary recipe collections, known as the Potage Dyvers family, with an introduction to its historical context and an analysis of its dialectal and structural features, and defines an editorial framework for producing such editions for the purposes of corpus linguistic research. Traditionally, historical corpora have been compiled from printed editions not originally designed to serve as corpus linguistic data. Recently, both the digitalisation of textual editing and the turning of corpus compilers towards original sources have blurred the boundaries between these two crafts, placing corpus compilers into an editorial role. Although traditional editorial approaches have been recognised as largely incompatible with the needs of linguistic research, and the established methods of corpus encoding do not satisfactorily represent the documentary context of manuscript texts, no explicitly linguistic editorial approach has so far been designed for editing manuscript sources for use in corpora. Even most digital editions, despite their advanced representational capabilities, are literary or historical in orientation and thus do not provide an adequate model. The editorial framework described here and the edition based on it have been explicitly designed to answer the needs of historical corpus linguistics. First, the framework aims at faithfully modelling the manuscript as a historical artefact, including both its textual content and its visual and material paratext, whose communicative importance has also been recognised by many historical linguists. Second, it presents this model in a form which not only allows the study of both text and paratext using corpus linguistic methods, but also allows resulting analytical metadata to be linked back to the edition, shared with other scholars, and used as the basis for further study.
The edition itself is provided as a digital appendix to the thesis in the form of both a digital data archive encoded in TEI XML and three editorial presentations of this data, and not only serves as a demonstration of the editorial approach, but also provides a valuable new research resource. The choice of material is based on the insight that utilitarian texts like recipes provide valuable material especially for historical pragmatics and discourse studies. As one of the first vernacular text types, recipes also provide an excellent opportunity to study the diachronic development of a single textual genre. The Potage Dyvers family is the second largest known family of Middle English recipe collections, surviving in six physically diverse manuscripts. Of these, four were edited in 1888 by conflating them into two collections, but their complex interrelationships have so far escaped systematic study. The structural analysis of the six Potage Dyvers versions indicates that the family, containing a total of 371 unique recipes, in fact consists of three sibling pairs of MSS. Two of these contain largely the same material but in a different order, while the third shares only a core of 89 recipes with the others, deriving a large number of recipes from other sources. In terms of their language, all of the six versions exhibit mainly Midlands forms and combine dialectally unmarked forms with more local variants from different areas, reflecting the 15th-century loss of dialectal distinctions which has not yet reached orthographic or morphological uniformity, and indicating possible metropolitan associations.
This dissertation provides a corpus-linguistically oriented digital text edition of six similar 15th-century English-language culinary recipe collections, known as Potage Dyvers. The dissertation includes an introduction to the historical context of the texts, as well as analyses of their probable origin and interrelationships based on dialectal features and textual structure. The dissertation maps the requirements that historical linguistic research places on a manuscript edition and defines detailed guidelines for meeting them. Historical text corpora have traditionally been compiled by digitising printed text editions that were not designed as linguistic data. In recent years, the digitalisation of text editions and corpus linguists' increased interest in original documentary sources have blurred the boundary between textual editing and the compilation of language corpora. Although both the shortcomings of traditional editing methods with respect to linguistic research and the tendency of earlier historical corpora to ignore the material context of manuscript texts have been recognised as problematic, hardly any methods have been developed for representing historical manuscript sources in text corpora. The edition contained in and described by the dissertation has been designed specifically for the needs of historical corpus linguistics. It aims to model the manuscript as a historical artefact, digitally recording not only the text but also the material context essential to its communicative meaning. This model is presented in a form that enables not only the study of both the text and its material context using corpus methods, but also the linking of metadata produced by research back to the original edition and its use as a basis for later research. The edition itself, which serves not only as an example of the use of the editing method but also as valuable research material in its own right, is included in the digital appendices of the dissertation both as a digital data archive in TEI XML format and in three different presentation forms.
Medieval recipe texts were chosen as the material for the edition because practical texts of this kind are valuable material for fields such as historical pragmatics and discourse studies. As one of the oldest vernacular text types, recipes also offer an excellent opportunity to study the historical development of a single text type. Potage Dyvers, which survives in six different versions, is the second largest group of Middle English recipe collections. Four of its versions exist in an edition published in 1888, in which they were presented as two separate texts, but the complex relationships between the versions have not been studied systematically. Structural analysis of the versions shows that this group, containing a total of 371 unique recipes, in fact consists of three pairs of mutually similar collections. Two of these pairs contain mostly the same recipes but in a very different order, while the third shares only 89 recipes with the others, combining them with a large number of recipes from other sources. Linguistically, all six versions mainly represent the language of the Midlands area, but the preference for dialectally unmarked forms and their combination with local features from several different areas reflects the 15th-century levelling of dialectal differences that preceded the standardisation of the written language, and is particularly typical of the language of the London metropolitan area.

    Representation and Processing of Composition, Variation and Approximation in Language Resources and Tools

    In my habilitation dissertation, meant to validate my capacity and maturity for directing research activities, I present a panorama of several topics in computational linguistics, linguistics and computer science. Over the past decade, I was notably concerned with the phenomena of compositionality and variability of linguistic objects. I illustrate the advantages of a compositional approach to language in the domain of emotion detection and I explain how some linguistic objects, most prominently multi-word expressions (MWEs), defy the compositionality principles. I demonstrate that the complex properties of MWEs, notably variability, are partially regular and partially idiosyncratic. This fact places MWEs on the frontiers between different levels of linguistic processing, such as lexicon and syntax. I show the highly heterogeneous nature of MWEs by citing their two existing taxonomies. After an extensive state-of-the-art study of MWE description and processing, I summarize Multiflex, a formalism and a tool for high-quality lexical morphosyntactic description of multi-word units (MWUs). It uses a graph-based approach in which the inflection of a MWU is expressed as a function of the morphology of its components and of morphosyntactic transformation patterns. Thanks to unification, the inflection paradigms are represented compactly. Orthographic, inflectional and syntactic variants are treated within the same framework. The proposal is multilingual: it has been tested on six European languages of three different origins (Germanic, Romance and Slavic), and I believe that many others can also be successfully covered. Multiflex proves interoperable. It adapts to different morphological language models, token boundary definitions, and underlying modules for the morphology of single words. It has been applied to the creation and enrichment of linguistic resources, as well as to morphosyntactic analysis and generation. 
It can be integrated into other NLP applications requiring the conflation of different surface realizations of the same concept. Another chapter of my activity concerns named entities, most of which are particular types of MWEs. Their rich semantic load turned them into a hot topic in the NLP community, which is documented in my state-of-the-art survey. I present the main assumptions, processes and results of large annotation tasks at two levels (named entities and coreference), carried out as part of the construction of the National Corpus of Polish. I have also contributed to the development of both rule-based and probabilistic named entity recognition tools, and to an automated enrichment of Prolexbase, a large multilingual database of proper names, from open sources. With respect to multi-word expressions, named entities and coreference mentions, I pay special attention to nested structures. This problem sheds new light on the treatment of complex linguistic units in NLP. When these units start being modeled as trees (or, more generally, as acyclic graphs) rather than as flat sequences of tokens, long-distance dependencies, discontinuities, overlapping and other frequent linguistic properties become easier to represent. This calls for more complex processing methods which control larger contexts than what usually happens in sequential processing. Thus, both named entity recognition and coreference resolution come very close to parsing, and named entities or mentions with their nested structures are analogous to multi-word expressions with embedded complements. My parallel activity concerns finite-state methods for natural language and XML processing. My main contribution in this field, co-authored with two colleagues, is the first full-fledged method for tree-to-language correction, and more precisely for correcting XML documents with respect to a DTD. 
We have also produced interesting results in incremental finite-state algorithmics, particularly relevant to data evolution contexts such as dynamic vocabularies or user updates. Multilingualism is the leitmotif of my research. I have applied my methods to several natural languages, most importantly to Polish, Serbian, English and French. I have been among the initiators of a highly multilingual European scientific network dedicated to parsing and multi-word expressions. I have used multilingual linguistic data in experimental studies. I believe that it is particularly worthwhile to design NLP solutions taking declension-rich (e.g. Slavic) languages into account, since this leads to more universal solutions, at least as far as nominal constructions (MWUs, NEs, mentions) are concerned. For instance, once Multiflex had been developed with Polish in mind, it could be applied as such to French, English, Serbian and Greek. Also, a French-Serbian collaboration led to substantial modifications in morphological modeling in Prolexbase in its early development stages. This allowed for its later application to Polish with very few adaptations of the existing model. Other researchers also stress the advantages of NLP studies on highly inflected languages, since their morphology encodes much more syntactic information than is the case e.g. in English. In this dissertation I am also supposed to demonstrate my ability to play an active role in shaping the scientific landscape, on a local, national and international scale. 
I describe my: (i) various scientific collaborations and supervision activities, (ii) roles in over 10 regional, national and international projects, (iii) responsibilities in collective bodies such as program and organizing committees of conferences and workshops, PhD juries, and the National University Council (CNU), (iv) activity as an evaluator and a reviewer of European collaborative projects. The issues addressed in this dissertation open interesting scientific perspectives, in which special emphasis is put on links among various domains and communities. These perspectives include: (i) integrating fine-grained language data into the linked open data, (ii) deep parsing of multi-word expressions, (iii) modeling multi-word expression identification in a treebank as a tree-to-language correction problem, and (iv) a taxonomy and an experimental benchmark for tree-to-language correction approaches.

    Proceedings of the 15th ISWC workshop on Ontology Matching (OM 2020)

    15th International Workshop on Ontology Matching, co-located with the 19th International Semantic Web Conference (ISWC 2020)