10 research outputs found

    Dependency Parsing using Prosody Markers from a Parallel Text

    Published in: Proceedings of the Ninth International Workshop on Treebanks and Linguistic Theories, ed. Markus Dickinson, Kaili Müürisep and Marco Passarotti. NEALT Proceedings Series, Vol. 9 (2010), pp. 127-138. © 2010 the editors and contributors. Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/15891

    Open-source resources and standards for Arabic word structure analysis: Fine grained morphological analysis of Arabic text corpora

    Morphological analyzers are preprocessors for text analysis; many text-analytics applications need them to perform their tasks. The aim of this thesis is to develop standards, tools and resources that widen the scope of Arabic word structure analysis, particularly morphological analysis, to process Arabic text corpora of different domains, formats and genres, of both vowelized and non-vowelized text. We want to morphologically tag our Arabic corpus, but evaluation of existing morphological analyzers has highlighted shortcomings and shown that more research is required. Tag assignment is significantly more complex for Arabic than for many languages. The morphological analyzer should add the appropriate linguistic information to each part or morpheme of the word (proclitic, prefix, stem, suffix and enclitic); in effect, instead of a tag for a word, we need a subtag for each part. Very fine-grained distinctions may cause problems for automatic morphosyntactic analysis, particularly for probabilistic taggers that require training data, if some words can change grammatical tag depending on function and context; on the other hand, fine-grained distinctions may actually help to disambiguate other words in the local context. The SALMA Tagger is a fine-grained morphological analyzer that depends mainly on linguistic information extracted from traditional Arabic grammar books and on prior knowledge from a broad-coverage lexical resource, the SALMA ABCLexicon. More fine-grained tag sets may be more appropriate for some tasks. The SALMA Tag Set is a standard for encoding which captures long-established traditional fine-grained morphological features of Arabic in a notation format intended to be compact yet transparent. The SALMA Tagger has been used to lemmatize the 176-million-word Arabic Internet Corpus. It has been proposed as a language-engineering toolkit for Arabic lexicography and for phonetically annotating the Qur’an with syllable and primary stress information, as well as for fine-grained morphological tagging.
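    The part-by-part tagging idea described above (a subtag for each proclitic, prefix, stem, suffix and enclitic, rather than one tag per word) can be illustrated with a minimal sketch. The segmentation, tag labels and example word below are hypothetical illustrations, not the actual SALMA Tag Set notation:

```python
# Minimal sketch of per-morpheme tagging for an Arabic word.
# The segmentation and tag labels are illustrative only; they do
# not reproduce the SALMA Tag Set's compact notation.

def tag_word(segments):
    """Assign a subtag to each morpheme instead of one tag per word."""
    return [(morpheme, tag) for morpheme, tag in segments]

# "wasayaktubuunahaa" ~ "and they will write it" (illustrative analysis)
analysis = tag_word([
    ("wa",     "proclitic:conjunction"),
    ("sa",     "prefix:future"),
    ("yaktub", "stem:verb-imperfect"),
    ("uuna",   "suffix:3rd-masc-plural"),
    ("haa",    "enclitic:object-pronoun"),
])

# The word-level tag is the composition of the per-part subtags.
word_tag = "+".join(tag for _, tag in analysis)
print(word_tag)
```

    The point of the representation is that downstream tools can query any single position (e.g. only the stem tag, or only the enclitic) without re-analyzing the word.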

    Digital Classical Philology

    The buzzwords “Information Society” and “Age of Access” suggest that information is now universally accessible without any form of hindrance. Indeed, the German constitution calls for all citizens to have open access to information. Yet in reality, there are multifarious hurdles to information access, whether physical, economic, intellectual, linguistic, political, or technical. Thus, while new methods and practices for making information accessible arise on a daily basis, we are nevertheless confronted by limitations to information access in various domains. This new book series assembles academics and professionals in various fields in order to illuminate the various dimensions of information's inaccessibility. While the series discusses principles and techniques for transcending the hurdles to information access, it also addresses necessary boundaries to accessibility. This book describes the state of the art of digital philology with a focus on ancient Greek and Latin. It addresses problems such as the accessibility of information about Greek and Latin sources, data entry, and the collection and analysis of Classical texts, and describes the fundamental role of libraries in building digital catalogs and developing machine-readable citation systems.

    Sentiment analysis and resources for informal Arabic text on social media

    Online content posted by Arab users on social networks does not generally abide by grammatical and spelling rules. These posts, or comments, are valuable because they contain users’ opinions towards different objects such as products, policies, institutions, and people. These opinions constitute important material for commercial and governmental institutions. Commercial institutions can use these opinions to steer marketing campaigns, optimize their products and learn the weaknesses and/or strengths of their products. Governmental institutions can use social network posts to gauge public opinion before or after legislating a new policy or law and to learn about the main issues that concern citizens. However, the huge size of online data and its noisy nature can hinder the manual extraction and classification of the opinions present in online comments. Given the irregularity of dialectal Arabic (or informal Arabic), tools developed for formally correct Arabic are of limited use. This is specifically the case in sentiment analysis (SA), where the target of the analysis is social media content. This research implemented a system that addresses this challenge. The work can be roughly divided into three blocks: building a corpus for SA and manually tagging it to check the performance of the constructed lexicon-based (LB) classifier; building a sentiment lexicon that consists of three different sets of patterns (negative, positive, and spam); and finally implementing a classifier that employs the lexicon to classify Facebook comments. In addition to providing resources for dialectal Arabic SA and classifying Facebook comments, this work categorises the reasons behind incorrect classification, provides preliminary solutions for some of them with a focus on negation, and uses regular expressions to detect the presence of lexemes. This work also illustrates how the constructed classifier works, along with its different levels of reporting. Moreover, it compares the performance of the LB classifier against a Naïve Bayes classifier and addresses how NLP tools such as POS tagging and named entity recognition can be employed in SA. In addition, the work studies the performance of the implemented LB classifier and the developed sentiment lexicon when used to classify other corpora from the literature, and the performance of lexicons from the literature when used to classify the corpora constructed in this research. With minor changes, the classifier can be used for domain classification of documents (sports, science, news, etc.). The work ends with a discussion of research questions arising from the research reported.
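    A lexicon-based classifier of the kind described, with regular-expression patterns for lexemes and simple negation handling, can be sketched as follows. The English toy patterns, the negation markers and the scoring rule are placeholder assumptions for illustration, not the dialectal-Arabic lexicon actually built in this work:

```python
import re

# Toy pattern sets standing in for the positive/negative lexicons;
# the real work used dialectal-Arabic patterns (plus a spam set).
POSITIVE = [r"\bgood\b", r"\bgreat\b", r"\blove\w*\b"]
NEGATIVE = [r"\bbad\b", r"\bterrible\b", r"\bhate\w*\b"]
NEGATORS = [r"\bnot\b", r"\bnever\b"]  # simplistic negation markers

def classify(comment):
    """Score a comment by lexicon hits; flip polarity if a negator
    appears shortly before a hit (a crude window heuristic)."""
    text = comment.lower()
    score = 0
    for polarity, patterns in ((1, POSITIVE), (-1, NEGATIVE)):
        for pat in patterns:
            for m in re.finditer(pat, text):
                window = text[max(0, m.start() - 15):m.start()]
                negated = any(re.search(n, window) for n in NEGATORS)
                score += -polarity if negated else polarity
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(classify("I love this product"))  # direct lexicon hit
print(classify("not good at all"))      # negated positive lexeme
```

    The fixed-width negation window is the weakest assumption here; the thesis itself treats negation as a major source of misclassification deserving more careful handling.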

    Collaborative Research Practices and Shared Infrastructures for Humanities Computing

    This volume collects the proceedings of the 2nd Annual Conference of the Italian Association for Digital Humanities (AIUCD 2013), which took place at the Department of Information Engineering of the University of Padua, 11-12 December 2013. The general theme of AIUCD 2013 was “Collaborative Research Practices and Shared Infrastructures for Humanities Computing”, so we particularly welcomed submissions on interdisciplinary work and new developments in the field, encouraging proposals relating to the theme of the conference, or more specifically: interdisciplinarity and multidisciplinarity; legal and economic issues; tools and collaborative methodologies; measurement and impact of collaborative methodologies; methods and approaches for sharing and collaboration; cultural institutions and collaborative facilities; infrastructures and digital libraries as collaborative environments; and the sharing of data resources and technologies.

    Subjektin sijamuoto 700- ja 800-luvun toscanalaisten asiakirjojen latinassa [The case marking of the subject in the Latin of eighth- and ninth-century Tuscan charters]

    The object of this study is the case marking of the subject in early medieval charter Latin. The work explores whether and how the nominative/accusative-type morphosyntactic alignment changed into a semantically motivated (active/inactive) alignment in Late Latin before the disappearance of the case system. It is known that the accusative, originally the case of the direct object, extended in Late Latin to the subject function, in which Classical Latin allowed only the nominative. On this basis, it has been postulated that in Late Latin the nominative/accusative contrast was (re)semanticized so that the nominative came to encode all the Agent-like arguments and the accusative all the Patient-like arguments. The study examines which semantic and syntactic factors determine the selection of the subject case in each subject/finite verb combination in the Late Latin Charter Treebank (LLCT). The LLCT is an annotated corpus of Latin charter texts (c. 200,000 words) written in Tuscany between AD 714 and 869. The central result of the study is that the Latin of the LLCT shows a semantically based morphosyntactic alignment in those parts of nominal declension where the morphological contrast between nominative- and accusative-based forms is morphophonologically intact. The following picture of the intransitivity split emerges: the low-animacy subjects of the LLCT occur more often in the accusative than do the agentive, high-animacy subjects. Likewise, the accusative percentage of SO subject constructions is higher than that of A/SA subject constructions. The common denominator of the examined semantic variables is likely to be the control exercised by the subject over the verbal process. Syntactic factors seem to influence the case distribution pattern as well. For example, the immediate preverbal position of the subject implies a high retention of the nominative. In an SV(O) language, the immediate preverbal position is the canonical subject position, where syntactic complexity measured as dependency length is at its lowest and the cohesion of the verbal nucleus at its highest; in this position the nominative, by then already a marked form, is still realized. The control of the subject over the verbal process (a semantic variable) and the cohesion of the verbal nucleus (a syntactic variable) may be partly conflated, i.e., both may affect the subject case selection in certain conditions.
    In Classical Latin, the case of the subject was the nominative and that of the direct object the accusative. In Late Latin, the accusative also began to be used as a subject case. Using an early medieval corpus of charters, this study traces how the nominative/accusative-based case-marking system of Classical Latin gradually changed into a semantically motivated active/inactive system. In semantic case marking, the nominative specializes in marking all arguments whose semantic role is that of an Agent, and the accusative those whose role is that of a Patient. Despite the semantic basis of the system, syntactic factors are also assumed to affect case marking. I investigate which semantic and syntactic variables best explain the case of the subject in each subject/finite verb combination of the Late Latin Charter Treebank (LLCT). The LLCT is an annotated corpus of Latin charter texts (c. 200,000 words) comprising 519 charters written in Tuscany, Italy, between AD 714 and 869. The study shows that the Latin of the LLCT had a semantically motivated case-marking system in those declension classes that preserved the morphophonological distinction between nominative- and accusative-based forms: subjects with inanimate referents occur in the accusative more often than animate, agentive subjects. Correspondingly, the SO subjects of unaccusative constructions are more often in the accusative than the A/SA subjects of unergative and transitive constructions. What seems to unite the semantic variables examined is the degree to which the subject controls the state or action expressed by the predicate verb. Syntactic factors also appear to influence the case distribution of the LLCT subjects. For example, the nominative is especially common in subjects immediately preceding the predicate verb. In an SV(O) language, the preverbal position is the canonical subject position, in which syntactic complexity measured as dependency length is at its minimum and the cohesion between the verb and its arguments at its strongest. Such an environment presumably allows the realization of the nominative, which had by then changed from an unmarked to a marked form. The control exercised by the subject over the verbal process (a semantic variable) and the cohesion between the verb and its arguments (a syntactic variable) appear to have partially merged.

    A computational approach to Latin verbs: new resources and methods

    This thesis presents the application of computational methods to the study of Latin verbs. In particular, we describe the creation of a subcategorization lexicon automatically extracted from annotated corpora; we also present a probabilistic model for the acquisition of selectional preferences from annotated corpora and an ontology (Latin WordNet). Finally, we report the results of a diachronic, quantitative study of Latin spatial preverbs.

    Building a dynamic lexicon from a digital library

    We describe here in detail our work toward creating a dynamic lexicon from the texts in a large digital library. By leveraging a small structured knowledge source (a 30,457-word treebank), we are able to extract selectional preferences for words from a 3.5-million-word Latin corpus. This is promising news for low-resource languages and for digital collections seeking to leverage a small human investment into a much larger gain. The library architecture in which this work is developed allows us to query customized subcorpora to report on lexical usage by author, genre or era, and allows us to continually update the lexicon as new texts are added to the collection.
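    The kind of selectional-preference extraction described above can be sketched in miniature: given (verb, relation, noun) triples such as a treebank-trained parser would yield, count co-occurrences and rank each verb's preferred object nouns. The toy triples and the relative-frequency scoring below are illustrative assumptions, not the paper's actual algorithm:

```python
from collections import Counter, defaultdict

def selectional_preferences(triples):
    """Rank object nouns per verb by relative co-occurrence frequency."""
    counts = defaultdict(Counter)
    for verb, relation, noun in triples:
        if relation == "obj":
            counts[verb][noun] += 1
    prefs = {}
    for verb, nouns in counts.items():
        total = sum(nouns.values())
        prefs[verb] = {n: c / total for n, c in nouns.items()}
    return prefs

# Toy Latin lemma triples standing in for parsed corpus output.
triples = [
    ("lego", "obj", "liber"),      # "read a book"
    ("lego", "obj", "liber"),
    ("lego", "obj", "epistula"),   # "read a letter"
    ("scribo", "obj", "epistula"), # "write a letter"
]
prefs = selectional_preferences(triples)
print(prefs["lego"])  # liber ranks above epistula for "lego"
```

    In a dynamic-lexicon setting, the same counting can be rerun over any customized subcorpus (by author, genre or era) and refreshed as new parsed texts enter the collection.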