14 research outputs found

    #OrdinaryMeaning: Using Twitter as a Corpus in Statutory Analysis


    Parallel Corpus Research and Target Language Representativeness: The Contrastive, Typological, and Translation Mining Traditions

    This paper surveys the strategies that the Contrastive, Typological, and Translation Mining parallel corpus traditions rely on to deal with the issue of target language representativeness of translations. On the basis of a comparison of the corpus architectures and research designs of the three traditions, we argue that they have each developed their own representativeness strategies: (i) monolingual control corpora (Contrastive tradition), (ii) limits on the scope of research questions (Typological tradition), and (iii) parallel control corpora (Translation Mining tradition). We introduce normalized pointwise mutual information (NPMI) as a bi-directional measure of cross-linguistic association, allowing for an easy comparison of the outcomes of different traditions and the impact of the monolingual and parallel control corpus representativeness strategies. We further argue that corpus size has a major impact on the reliability of the monolingual control corpus strategy and that a sequential parallel control corpus strategy is preferable for smaller corpora.
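To make the NPMI measure named in the abstract above concrete, here is a minimal sketch using the standard definition, NPMI(x, y) = PMI(x, y) / -log p(x, y). The abstract does not say how the paper estimates probabilities, so the count-based estimation, the function name `npmi`, and the idea of counting over aligned sentence pairs are all assumptions made for illustration.

```python
import math


def npmi(joint_count, count_x, count_y, total):
    """Normalized pointwise mutual information from raw counts.

    Hypothetical setup: `total` aligned sentence pairs, of which
    `count_x` contain source form x, `count_y` contain target form y,
    and `joint_count` contain both. The result lies in [-1, 1].
    """
    if joint_count == 0:
        # x and y never co-occur: the limiting value of NPMI is -1.
        return -1.0
    p_xy = joint_count / total
    if p_xy == 1.0:
        # x and y occur together in every pair: complete association.
        return 1.0
    p_x = count_x / total
    p_y = count_y / total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(npmi(joint_count=40, count_x=50, count_y=60, total=1000))
```

The measure is symmetric in x and y, which is what lets it work as a bi-directional association score: swapping `count_x` and `count_y` leaves the result unchanged.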

    A Model for Building a Learner Corpus (Model izrade učeničkog korpusa)

    Corpora today undoubtedly occupy an important place as a tool in many kinds of linguistic research. Interest in foreign language learning has prompted further specialization and the development of learner corpora. The larger world languages quickly recognized their importance, and accordingly learner corpora of English are the most numerous. The publication of the International Corpus of Learner English opened the door to the development of learner corpora, and many research papers and new corpora have been based on it. In addition to learner corpora for other major languages such as French, Spanish, and German, the last few years have seen the development of learner corpora for smaller Slavic languages closer to us, such as Czech and Slovene, and eventually a corpus for Croatian itself. However, if the many corpora that have emerged recently are to serve as a basis for useful linguistic analyses that contribute to the understanding of interlanguage and the process of foreign language learning, a number of questions must be considered before construction begins. Some of these questions concern the learners themselves (age, sex, native language), while others concern the type of text being collected and the way it is collected and processed. A common mistake among linguists and other researchers involved in building learner corpora is skipping the step of determining the essential variables in the corpus design phase, which results in the lack of a formal construction procedure. The usability of a corpus depends directly on the attention paid to determining the variables and the method of collecting material. For this reason, opportunistically collected corpora can hardly serve as a basis for relevant linguistic analyses and comparisons of learner and native-speaker language. Although all of the above shows that standardizing the procedure for building and annotating corpora is of crucial importance for further development in the field, a generally accepted standard according to which all new corpora would be built and annotated still does not exist.

    Naturalistic Emotional Speech Corpora with Large Scale Emotional Dimension Ratings

    The investigation of the emotional dimensions of speech is dependent on large sets of reliable data. Existing work has been carried out on the creation of emotional speech corpora and the acoustic analysis of emotional speech, and this research seeks to build upon this work while suggesting new methods and areas of potential. A review of the literature determined that a two-dimensional emotional model of activation and evaluation was the ideal method for representing the emotional states expressed in speech. Two case studies were carried out to investigate methods of obtaining natural underlying emotional speech in a high quality audio environment, the results of which were used to design a final experimental procedure to elicit natural underlying emotional speech. The speech obtained in this experiment was used in the creation of a speech corpus that was underpinned by a persistent backend database that incorporated a three-tiered annotation methodology. This methodology was used to comprehensively annotate the metadata, acoustic data and emotional data of the recorded speech. Structuring the three levels of annotation and the assets in a persistent backend database allowed interactive web-based tools to be developed; a web-based listening tool was developed to obtain a large number of ratings for the assets, which were then written back to the database for analysis. Once a large number of ratings had been obtained, statistical analysis was used to determine the dimensional rating for each asset. Acoustic analysis of the underlying emotional speech was then carried out and determined that certain acoustic parameters were correlated with the activation dimension of the dimensional model. This substantiated some of the findings in the literature review and further determined that spectral energy was strongly correlated with the activation dimension in relation to underlying emotional speech.
The lack of a correlation for certain acoustic parameters in relation to the evaluation dimension was also determined, again substantiating some of the findings in the literature. The work contained in this thesis makes a number of contributions to the field: the development of an experimental design to elicit natural underlying emotional speech in a high quality audio environment; the development and implementation of a comprehensive three-tiered corpus annotation methodology; the development and implementation of large scale web-based listening tests to rate the emotional dimensions of emotional speech; the determination that certain acoustic parameters are correlated with the activation dimension of a dimensional emotional model in relation to natural underlying emotional speech; and the determination that certain acoustic parameters are not correlated with the evaluation dimension of a two-dimensional emotional model in relation to natural underlying emotional speech.

    Knowledge Organization and Terminology: application to Cork

    This PhD thesis aims to prove the relevance of texts within the conceptual strand of terminological work. Our methodology serves to demonstrate how linguists can infer knowledge information from texts and subsequently systematise it, either through semiformal or formal representations. We mainly focus on the terminological analysis of specialised corpora resorting to semi-automatic tools for text analysis to systematise lexical-semantic relationships observed in specialised discourse context and subsequent modelling of the underlying conceptual system. The ultimate goal of this methodology is to propose a typology that can help lexicographers to write definitions. Based on the double dimension of Terminology, we hypothesise that text and logic modelling do not go hand in hand since the latter does not directly relate to the former. We highlight that knowledge and language are crucial for knowledge systematisation, albeit keeping in mind that they pertain to different levels of analysis, for they are not isomorphic. To meet our goals, we resorted to specialised texts produced within the industry of cork. These texts provide us with a test bed made of knowledge-rich data which enable us to demonstrate our deductive mechanisms employing the Aristotelian formula: X=Y+DC through the linguistic and conceptual analysis of the semi-automatically extracted textual data. To explore the corpus, we resorted to text mining strategies where regular expressions play a central role. The final goal of this study is to create a terminological resource for the cork industry, where two types of resources interlink, namely the CorkCorpus and the OntoCork. TermCork is a project that stems from the organisation of knowledge in the specialised field of cork. For that purpose, a terminological knowledge database is being developed to feed an e-dictionary. 
This e-dictionary is designed as a multilingual and multimodal product, where several resources, namely linguistic and conceptual ones, are paired. OntoCork is a micro domain-ontology where the concepts are enriched with natural language definitions and complemented with images, either annotated with metainformation or enriched with hyperlinks to additional information, such as a lexicographic resource. This type of e-dictionary embodies what we consider a useful terminological tool in the current digital information society: accounting for its main features, along with an electronic format that can be integrated into the Semantic Web due to its interoperable data format. This aspect emphasises its contribution to reducing ambiguity as much as possible and to increasing effective communication between experts of the domain, future experts, and language professionals.
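The Knowledge Organization and Terminology abstract above mentions regular-expression text mining and the formula X=Y+DC. As a rough illustration only, the sketch below reads X as the defined term, Y as its superordinate concept (genus), and DC as the distinguishing characteristics; that reading, the pattern itself, the function name `extract_definitions`, and the sample sentence are all assumptions, not the thesis's actual method or data.

```python
import re

# Hypothetical definitional pattern: "<term> is a/an/the <genus> that|which <DC>."
# Real specialised corpora would need many more patterns than this one.
DEFINITION = re.compile(
    r"(?P<term>[A-Z][\w -]*?)\s+is\s+(?:an|a|the)\s+"
    r"(?P<genus>[\w -]+?)\s+"
    r"(?P<differentia>(?:that|which)\b[^.]*)\."
)


def extract_definitions(text):
    """Return one dict (term, genus, differentia) per matched sentence."""
    return [match.groupdict() for match in DEFINITION.finditer(text)]


if __name__ == "__main__":
    # Made-up sample sentence, not taken from the CorkCorpus.
    sample = "Cork is a natural material that is harvested from the bark of the cork oak."
    for definition in extract_definitions(sample):
        print(definition)
```

A pattern like this only surfaces candidate definitional contexts; deciding which candidates reflect the underlying conceptual system remains an analytical step for the terminologist.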

    Tools and Search Engines for Searching Corpora (Alati i tražilice za pretraživanje korpusa)

    This paper presents some of the available architectures, that is, tools and search engines for searching corpora. The first part of the paper presents the theoretical background of corpora, their general definition, classification, and historical development. This is followed by a detailed overview of selected corpus search tools and search engines, along with examples of corpora that use them. This part also lists some Croatian corpora and tools for which these serve as the operational basis. The third part of the paper describes a survey on the use of corpora for educational or research purposes by users active in some of the language fields.