
    Sequence-Labeling RoBERTa Model for Dependency-Parsing in Classical Chinese and Its Application to Vietnamese and Thai

    2023 8th International Conference on Business and Industrial Research (ICBIR), Bangkok, Thailand, 18-19 May 2023. The author and his colleagues have been developing a classical Chinese treebank using Universal Dependencies. We also developed the RoBERTa-Classical-Chinese model, pre-trained on 1.7 billion characters of classical Chinese text. In this paper we describe how to fine-tune a sequence-labeling RoBERTa model for dependency parsing in classical Chinese. We introduce “goeswith”-labeled edges into the directed acyclic graphs of Universal Dependencies in order to resolve the mismatch between the token length of RoBERTa-Classical-Chinese and the word length in classical Chinese. We utilize the [MASK] token of the RoBERTa model to handle outgoing edges and to produce the adjacency matrices for the graphs of Universal Dependencies. Our RoBERTa-UDgoeswith model outperforms other dependency parsers in classical Chinese on LAS/MLAS/BLEX benchmark scores. We then apply our methods to other isolating languages. For Vietnamese we introduce “goeswith”-labeled edges to separate words into space-separated syllables, and fine-tune RoBERTa and PhoBERT models. For Thai we try three kinds of tokenizers, a character-wise tokenizer, a quasi-syllable tokenizer, and SentencePiece, to produce RoBERTa models.
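
    The [MASK]-based adjacency-matrix idea can be sketched in a few lines of Python. The sketch below masks each token in turn and reads the token-classification labels at every position as that token's outgoing edges; the checkpoint name and the label conventions are assumptions made for illustration, not necessarily the released model.

        import torch
        from transformers import AutoTokenizer, AutoModelForTokenClassification

        # Assumed checkpoint name; the paper's released model may differ.
        MODEL = "KoichiYasuoka/roberta-classical-chinese-base-ud-goeswith"
        tokenizer = AutoTokenizer.from_pretrained(MODEL)
        model = AutoModelForTokenClassification.from_pretrained(MODEL)

        sentence = "不入虎穴不得虎子"  # classical Chinese, tokenized character-wise
        ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
        n = len(ids)

        # Row i of the adjacency matrix: mask token i, classify every position,
        # and read the predicted labels as the outgoing edges of token i.
        rows = []
        for i in range(1, n - 1):  # skip [CLS] and [SEP]
            masked = ids.clone()
            masked[i] = tokenizer.mask_token_id
            with torch.no_grad():
                logits = model(masked.unsqueeze(0)).logits[0]
            rows.append(logits.argmax(dim=-1)[1 : n - 1])

        adjacency = torch.stack(rows)  # one relation label per (head, dependent) pair
        print([model.config.id2label[int(v)] for v in adjacency[0].tolist()])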

    The MARCELL Legislative Corpus


    Building a Mbyá Treebank

    This presentation relates the ongoing construction of a multilayer corpus of Mbyá (Tupi-Guarani: Argentina, Brazil, Paraguay). It discusses (i) corpus composition, (ii) ethical, linguistic, and technological issues in corpus design and annotation, and (iii) the corpus's usefulness for leveraging legacy texts in documenting language variation and recent evolution.

    Commerce Numérique: traffic signals for crossroads between cultures.

    Commerce is a French literary journal, founded by Princess Margherita Caetani, which relied on the collaboration of three prestigious writers: Paul Valéry, Léon-Paul Fargue, and Valéry Larbaud. The journal comprises twenty-nine volumes published between 1924 and 1932. Each volume includes different literary material, such as poems and novels, written by both well-known and unknown writers, who also translated important authors like Joyce, T.S. Eliot, Pirandello, Ungaretti, Saint-John Perse, Rilke, and Hofmannsthal. Considering the historical, literary, and cultural importance of the journal Commerce, our project “Commerce numérique” aims to digitize the journal and make its contents freely available online to both the general public and the research community. This article describes the way in which the journal was encoded, with particular attention to the encoding of the poems in Commerce. Some poems are given in the original language accompanied by their French translation; others appear only in French translation, without the original text. In order to express these phenomena and their structures fully and accurately, we adopted some aspects of the TEI framework, which are explained in detail. The French translation of a 13th-century Moroccan Arabic poem is also considered. The original Arabic poem is interesting because it presents aspects of both the Moroccan dialect and the oral text. The study and encoding of the Arabic poem in parallel with its translation highlight some important structural differences between Arabic poetry and Western poetry.
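
    As a toy illustration of the parallel-text encoding discussed above, the following Python sketch serializes one original line and its French translation in TEI and links the pair with a linkGrp. The element choices and the sample line pair are a hypothetical simplification, not the project's actual schema or data.

        import xml.etree.ElementTree as ET

        TEI = "http://www.tei-c.org/ns/1.0"
        XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
        XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
        ET.register_namespace("", TEI)

        body = ET.Element(f"{{{TEI}}}body")
        orig = ET.SubElement(body, f"{{{TEI}}}div", {"type": "poem", XML_LANG: "en"})
        trans = ET.SubElement(body, f"{{{TEI}}}div", {"type": "translation", XML_LANG: "fr"})
        links = ET.SubElement(body, f"{{{TEI}}}linkGrp", {"type": "translation"})

        # Hypothetical line pair; real entries would come from the digitized journal.
        pairs = [("The winter evening settles down", "Le soir d'hiver s'installe")]
        for i, (src, fr) in enumerate(pairs, 1):
            ET.SubElement(orig, f"{{{TEI}}}l", {XML_ID: f"o{i}"}).text = src
            ET.SubElement(trans, f"{{{TEI}}}l", {XML_ID: f"t{i}"}).text = fr
            ET.SubElement(links, f"{{{TEI}}}link", {"target": f"#o{i} #t{i}"})

        print(ET.tostring(body, encoding="unicode"))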

    DESQ: Frequent Sequence Mining with Subsequence Constraints

    Frequent sequence mining methods often make use of constraints to control which subsequences should be mined. A variety of such subsequence constraints has been studied in the literature, including length, gap, span, regular-expression, and hierarchy constraints. In this paper, we show that many subsequence constraints, including and beyond those considered in the literature, can be unified in a single framework. A unified treatment allows researchers to study many types of subsequence constraints jointly (instead of each one individually) and helps to improve the usability of pattern mining systems for practitioners. In more detail, we propose a set of simple and intuitive "pattern expressions" to describe subsequence constraints and explore algorithms for efficiently mining frequent subsequences under such general constraints. Our algorithms translate pattern expressions to compressed finite state transducers, which we use as a computational model, and simulate these transducers in a way suitable for frequent sequence mining. Our experimental study on real-world datasets indicates that our algorithms, although more general, are competitive with existing state-of-the-art algorithms. Comment: Long version of the paper accepted at the IEEE ICDM 2016 conference.
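
    To make the notion of a subsequence constraint concrete, here is a small self-contained Python sketch that enumerates subsequences under joint length and gap constraints and counts their support. It illustrates the constraint types named above by brute force, not DESQ's actual transducer-based algorithms.

        from collections import Counter
        from itertools import combinations

        def constrained_subsequences(seq, max_len=3, max_gap=1):
            """Subsequences of length <= max_len whose matched positions
            skip at most max_gap items between consecutive picks."""
            out = set()
            for k in range(1, max_len + 1):
                for idx in combinations(range(len(seq)), k):
                    if all(b - a - 1 <= max_gap for a, b in zip(idx, idx[1:])):
                        out.add(tuple(seq[i] for i in idx))
            return out

        db = [["a", "b", "c", "a"], ["a", "c", "b"], ["b", "a", "c"]]
        # Support = number of input sequences containing the pattern at least once.
        support = Counter(p for s in db for p in constrained_subsequences(s))
        print([(p, c) for p, c in support.items() if c >= 2])  # minimum support 2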

    When linguistics meets web technologies. Recent advances in modelling linguistic linked data

    This article provides an up-to-date and comprehensive survey of models (including vocabularies, taxonomies, and ontologies) used for representing linguistic linked data (LLD). It focuses on the latest developments in the area and both builds upon and complements previous works covering similar territory. The article begins with an overview of recent trends which have had an impact on linked data models and vocabularies, such as the growing influence of the FAIR guidelines, the funding of several major projects in which LLD is a key component, and the increasing importance of the relationship of the digital humanities with LLD. Next, we give an overview of some of the best-known vocabularies and models in LLD. After this we look at some of the latest developments in community standards and initiatives, such as OntoLex-Lemon, as well as recent work carried out on corpora and annotation in LLD, including a discussion of the LLD metadata vocabularies META-SHARE and lime, and of language identifiers. In the following part of the paper we look at work realised in a number of recent projects which has a significant impact on LLD vocabularies and models.
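
    As a concrete taste of the OntoLex-Lemon model mentioned above, the following Python sketch builds a minimal lexical entry with rdflib. The example lexicon URI and entry names are hypothetical; the ontolex namespace and properties are those of the published vocabulary.

        from rdflib import Graph, Literal, Namespace, RDF, URIRef

        ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
        EX = Namespace("http://example.org/lexicon#")  # hypothetical lexicon URI

        g = Graph()
        g.bind("ontolex", ONTOLEX)

        # A lexical entry for "cat" with a canonical form and a denotation.
        entry, form = EX.cat_entry, EX.cat_form
        g.add((entry, RDF.type, ONTOLEX.LexicalEntry))
        g.add((entry, ONTOLEX.canonicalForm, form))
        g.add((form, RDF.type, ONTOLEX.Form))
        g.add((form, ONTOLEX.writtenRep, Literal("cat", lang="en")))
        g.add((entry, ONTOLEX.denotes, URIRef("http://dbpedia.org/resource/Cat")))

        print(g.serialize(format="turtle"))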

    Using the Web Infrastructure for Real Time Recovery of Missing Web Pages

    Given the dynamic nature of the World Wide Web, missing web pages, or 404 "Page Not Found" responses, are part of our web browsing experience. It is our intuition that information on the web is rarely completely lost; it is just missing. In whole or in part, content often moves from one URI to another and hence just needs to be (re-)discovered. We evaluate several methods for a just-in-time approach to web page preservation. We investigate the suitability of lexical signatures and web page titles to rediscover missing content. It is understood that web pages change over time, which implies that the performance of these two methods depends on the age of the content. We therefore conduct a temporal study of the decay of lexical signatures and titles and estimate their half-life. We further propose the use of tags that users have created to annotate pages, as well as the most salient terms derived from a page's link neighborhood. We utilize the Memento framework to discover previous versions of web pages and to execute the above methods. We provide a workflow, including a set of parameters, that is most promising for the (re-)discovery of missing web pages. We introduce Synchronicity, a web browser add-on that implements this workflow. It works while the user is browsing and detects the occurrence of 404 errors automatically. When activated by the user, Synchronicity offers a total of six methods to either rediscover the missing page at its new URI or discover an alternative page that satisfies the user's information need. Synchronicity depends on user interaction, which enables it to provide results in real time.
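
    Two of the building blocks described above lend themselves to a short Python sketch: deriving a lexical signature as a page's top TF-IDF terms, and asking a Memento TimeGate for an archived copy of a missing URI. The aggregator endpoint below is the public TimeTravel service and is an assumption about deployment, not part of Synchronicity itself.

        import requests
        from sklearn.feature_extraction.text import TfidfVectorizer

        def lexical_signature(page_text, corpus, k=5):
            """Top-k terms of the page ranked by TF-IDF against a background corpus."""
            vec = TfidfVectorizer(stop_words="english")
            tfidf = vec.fit_transform(corpus + [page_text])
            scores = tfidf[-1].toarray()[0]
            terms = vec.get_feature_names_out()
            return [terms[i] for i in scores.argsort()[::-1][:k]]

        def find_memento(uri):
            """Ask a Memento TimeGate for any archived version of the missing URI."""
            r = requests.get(f"http://timetravel.mementoweb.org/timegate/{uri}",
                             allow_redirects=False, timeout=10)
            return r.headers.get("Location")  # URI of a memento, if one exists

        sig = lexical_signature("text of the missing page ...",
                                ["background document one", "background document two"])
        print(" ".join(sig))  # this string can be fed to a search engine as a query
        print(find_memento("http://example.com/gone"))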