
    Discourse analysis through 'concordancing' techniques: theoretical and methodological aspects

    The seminar aims at providing theoretical and methodological background to Corpus Linguistics research, in terms of corpus creation, annotation and analysis. A corpus is a collection of naturally occurring language texts, chosen to characterize a state or variety of a language and put together for linguistic analysis. Corpus-based approaches to language analysis are used to expound, test or exemplify theories and descriptions that were formulated before large corpora became available to inform language study. Corpus-driven linguists are strictly committed to the integrity of the data as a whole: theoretical statements are fully consistent with, and reflect directly, the evidence provided by the corpus. Corpus mark-up is the system of standard codes inserted into a document stored in electronic form to provide information about the text itself. The most widely used mark-up schemes are TEI (Text Encoding Initiative) and CES (Corpus Encoding Standard). Annotation makes extracting information easier and faster, and enables human analysts to exploit and retrieve analyses of which they are not themselves capable. Annotated corpora are reusable resources. Corpus annotation records a linguistic analysis explicitly and provides a standard reference resource, a stable base of linguistic analyses, so that successive studies can be compared and contrasted. There are different types of corpora: parallel corpora (source texts plus translations), which can be either unidirectional (from La to Lb or from Lb to La alone) or bidirectional (from La to Lb and from Lb to La); comparable corpora (monolingual subcorpora designed using the same sampling techniques); general corpora (BNC, AMC); specialised corpora (MICASE); monitor corpora (Bank of English); reference corpora. Corpora can be used for a wide variety of language analyses.
These range from lexicography/terminology to (computational) Linguistics, from dictionaries and grammars to (Critical) Discourse Analysis, from Translation practice and theory to Language teaching and learning. Basic notions of Corpus Linguistics methodology include: Concordance / Concordancer, Collocation (Lexis), Colligation (Grammar), Semantic Preference (Semantics), Discourse Prosody (Pragmatics), Paradigmatic and Syntagmatic Dimensions, the lexico-grammar approach, the idiom principle vs. open-choice principle. To know a word is to know how to use it, since certain grammar attracts certain words. For example, grammatical words like "a" and "the" are often used in phrases rather than being used independently; compare: "A free hand" vs. "her free hand", "Hurt his leg" vs. "hit someone in the leg", "Turn her face" vs. "a slap in the face". During the seminar different software tools were presented, highlighting their similarities and differences. These include Xaira, WordSmith Tools, AntConc and ConcGram, as well as web resources.
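    The concordancing facility at the heart of tools like AntConc or WordSmith can be approximated in a few lines. Below is a minimal KWIC (key word in context) sketch in Python; the sample sentence and the `width` parameter are illustrative, not taken from the seminar materials.

```python
import re

def concordance(text, node, width=30):
    """Return KWIC (key word in context) lines for every occurrence of `node`."""
    lines = []
    for m in re.finditer(r"\b%s\b" % re.escape(node), text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Pad left and right context so the node word lines up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines

sample = "She raised her free hand. A free hand was offered to the new editor."
for line in concordance(sample, "hand"):
    print(line)
```

Aligning the node word in a column, as real concordancers do, is what makes recurrent collocates (here "free") visually obvious.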

    ON MONITORING LANGUAGE CHANGE WITH THE SUPPORT OF CORPUS PROCESSING

    One of the fundamental characteristics of language is that it can change over time. One way to monitor this change is by observing corpora: structured language documentation. Recent developments in technology, especially in the field of Natural Language Processing, allow robust linguistic processing, which supports the description of diverse historical changes in the corpora. The involvement of a human linguist is inevitable, as it determines the gold standard, but computer assistance provides considerable support by incorporating computational approaches in exploring the corpora, especially historical corpora. This paper proposes a model for corpus development in which corpora are annotated to support further computational operations such as lexicogrammatical pattern matching, automatic retrieval and extraction. The corpus processing operations are performed by local-grammar-based corpus processing software on a contemporary Indonesian corpus. This paper concludes that data collection and data processing in a corpus are of equally crucial importance for monitoring language change, and neither can be set aside
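    The lexicogrammatical pattern matching the abstract refers to can be sketched over a POS-annotated corpus. The word/POS format and the tag names below are hypothetical, chosen only to illustrate matching a simple local pattern (adjective followed by noun); they are not the paper's actual annotation scheme.

```python
# Toy annotated corpus: each token is word/POS (hypothetical tag set).
tagged = "the/DET new/ADJ word/NOUN spreads/VERB quickly/ADV in/PREP social/ADJ media/NOUN"

def adj_noun_pairs(tagged_text):
    """Extract (adjective, noun) bigrams: a minimal lexicogrammatical pattern."""
    tokens = [t.rsplit("/", 1) for t in tagged_text.split()]
    pairs = []
    for (w1, t1), (w2, t2) in zip(tokens, tokens[1:]):
        if t1 == "ADJ" and t2 == "NOUN":
            pairs.append((w1, w2))
    return pairs

print(adj_noun_pairs(tagged))  # → [('new', 'word'), ('social', 'media')]
```

Run over corpora sampled at different points in time, frequency shifts in patterns extracted this way are one concrete signal of language change.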

    A Data-Oriented Approach to Semantic Interpretation

    In Data-Oriented Parsing (DOP), an annotated language corpus is used as a stochastic grammar. The most probable analysis of a new input sentence is constructed by combining sub-analyses from the corpus in the most probable way. This approach has been successfully used for syntactic analysis, using corpora with syntactic annotations such as the Penn Treebank. If a corpus with semantically annotated sentences is used, the same approach can also generate the most probable semantic interpretation of an input sentence. The present paper explains this semantic interpretation method and summarizes the results of a preliminary experiment. Semantic annotations were added to the syntactic annotations of most of the sentences of the ATIS corpus. A data-oriented semantic interpretation algorithm was successfully tested on this semantically enriched corpus. Comment: 10 pages, Postscript; to appear in Proceedings of the Workshop on Corpus-Oriented Semantic Analysis, ECAI-96, Budapest
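    The probabilistic core of DOP — estimating each corpus fragment's probability as its relative frequency among fragments with the same root category, and scoring a derivation as the product of its fragment probabilities — can be sketched as follows. The fragments and counts are invented for illustration and are not drawn from the ATIS experiment.

```python
from collections import Counter
from math import prod

# Hypothetical fragment counts, as if extracted from an annotated corpus.
# Each fragment is (root category, yield).
fragment_counts = Counter({
    ("S", "NP VP"): 8, ("S", "NP V NP"): 2,
    ("NP", "show flights"): 1, ("NP", "flights"): 4, ("NP", "I"): 5,
    ("VP", "list flights"): 3, ("VP", "depart"): 7,
})

def fragment_prob(frag):
    """Relative frequency of a fragment among fragments sharing its root."""
    root = frag[0]
    total = sum(c for f, c in fragment_counts.items() if f[0] == root)
    return fragment_counts[frag] / total

def derivation_prob(fragments):
    """A derivation's probability is the product of its fragments' probabilities."""
    return prod(fragment_prob(f) for f in fragments)

p = derivation_prob([("S", "NP VP"), ("NP", "I"), ("VP", "depart")])
print(p)  # → 0.8 * 0.5 * 0.7 = 0.28
```

Choosing the most probable *analysis* (rather than derivation) additionally requires summing over all derivations that yield the same tree, which is what makes full DOP computationally demanding.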

    Irish treebanking and parsing: a preliminary evaluation

    Language resources are essential for linguistic research and the development of NLP applications. Low-density languages, such as Irish, therefore lack significant research in this area. This paper describes the early stages in the development of new language resources for Irish – namely the first Irish dependency treebank and the first Irish statistical dependency parser. We present the methodology behind building our new treebank and the steps we take to leverage the few existing resources. We discuss language-specific choices made when defining our dependency labelling scheme, and describe interesting Irish language characteristics such as prepositional attachment, copula and clefting. We manually develop a small treebank of 300 sentences based on an existing POS-tagged corpus and report an inter-annotator agreement of 0.7902. We train MaltParser to achieve preliminary parsing results for Irish and describe a bootstrapping approach for further stages of development
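    Inter-annotator agreement figures like the 0.7902 reported here are often computed with a chance-corrected measure such as Cohen's kappa (the abstract does not specify which measure the authors used). A minimal kappa sketch over hypothetical dependency labels:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    ca, cb = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label at random.
    expected = sum(ca[lab] * cb[lab] for lab in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical dependency labels from two annotators on four tokens.
ann_a = ["subj", "obj", "subj", "det"]
ann_b = ["subj", "obj", "obj", "det"]
print(round(cohen_kappa(ann_a, ann_b), 3))  # → 0.636
```

On real treebank annotation, the same computation is typically run over attachment decisions and labels jointly (labeled attachment agreement) rather than labels alone.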

    Annotating patient clinical records with syntactic chunks and named entities: the Harvey corpus

    The free text notes typed by physicians during patient consultations contain valuable information for the study of disease and treatment. These notes are difficult to process by existing natural language analysis tools since they are highly telegraphic (omitting many words), and contain many spelling mistakes, inconsistencies in punctuation, and non-standard word order. To support information extraction and classification tasks over such text, we describe a de-identified corpus of free text notes, a shallow syntactic and named entity annotation scheme for this kind of text, and an approach to training domain specialists with no linguistic background to annotate the text. Finally, we present a statistical chunking system for such clinical text with a stable learning rate and good accuracy, indicating that the manual annotation is consistent and that the annotation scheme is tractable for machine learning
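    Shallow syntactic chunkers of this kind commonly represent chunks with BIO tags (B-X begins a chunk of type X, I-X continues it, O is outside any chunk). A minimal sketch of recovering chunk spans from such tags; the clinical-style tokens and the tagging scheme details are invented for illustration, not taken from the Harvey corpus.

```python
def bio_chunks(tokens, tags):
    """Recover (label, text) chunk spans from parallel token and BIO tag lists."""
    chunks, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                chunks.append(current)
            current = (tag[2:], [tok])          # start a new chunk
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(tok)              # continue the open chunk
        else:
            if current:                         # O tag (or stray I-): close chunk
                chunks.append(current)
            current = None
    if current:
        chunks.append(current)
    return [(label, " ".join(words)) for label, words in chunks]

# Telegraphic clinical-style note: "pt c/o chest pain since yesterday".
tokens = ["pt", "c/o", "chest", "pain", "since", "yesterday"]
tags   = ["O", "O", "B-NP", "I-NP", "O", "B-NP"]
print(bio_chunks(tokens, tags))  # → [('NP', 'chest pain'), ('NP', 'yesterday')]
```

A statistical chunker learns to predict the tag sequence; once the tags are predicted, span recovery is exactly this deterministic pass.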