    CORPUS LINGUISTICS, LANGUAGE CORPORA AND LANGUAGE TEACHING

    This study surveys corpus linguistics, language corpora, and language teaching, a field that is relatively new and closely tied to technology. It covers several areas: the main components of a corpus, the information a corpus can yield, four defining characteristics of a corpus, types of corpora, the use of corpora in language teaching, related strands of corpus research, and the direct and indirect applications of corpus linguistics to language teaching. In this branch of applied linguistics, researchers analyse large collections of written and spoken texts that have been carefully designed to represent specific domains of language use, such as informal speech or academic writing.

    Learning and teaching of connectors of counter-argumentation in the Spanish language. The use of the student corpus

    This article is drawn from an M.A. thesis completed and defended at the Institute of Linguistics at Adam Mickiewicz University in Poznań (June 2006). Its main focus is the application of learner-corpus analysis to examining the mastery of connectors of counter-argumentation. The work is based on a corpus of 300 pages compiled by the author from the forum La ruta de la lengua Española, where messages are exchanged between Spaniards – future teachers of their native language – and learners of Spanish from abroad. The text offers an introduction to corpus linguistics, describes the methodology used to build the corpus, and presents an analysis and comparison against a standard corpus of the Spanish language.

    What Counts as Data?

    We live in an age of information. But whether information counts as data depends on the questions we put to it. The same bit of information can constitute important data for some questions, but be irrelevant to others. And even when relevant, the same bit of data can speak to one aspect of our question while having little to say about another. Knowing what counts as data, and what it is data of, makes or breaks a data-driven approach. Yet that need for clarity sometimes gets ignored or assumed away. In this essay, I examine what counts as data in legal corpus linguistics, a method of interpretation that uses large datasets of actual language use to give empirical heft to claims about how “ordinary people” would use or understand legal terminology—claims that pervade legal interpretation. Unlike corpus linguistics in the field of linguistics, however, legal corpus linguistic analysis tends not to articulate or examine just what its datasets can reveal. Practitioners are thus liable to make large claims on the basis of materials that don’t support them—materials that provide information, but do not constitute data that answers the questions legal corpus linguistics poses. This essay undertakes a more careful parsing of what the corpora preferred by legal corpus linguistics can, and cannot, reveal. Although I conclude that legal corpus linguistics currently faces a mismatch between information and aspiration, I also suggest areas of legal work where it can be of real use.

    Using a combined approach of ontology construction and corpus linguistics analysis to build a course on printmaking terminology

    In this paper a combined approach to ESP teaching is proposed, one that unites corpus linguistics techniques with ontology construction. Corpus linguistics techniques have long been used in ESP teaching to provide the teacher with authentic samples of language usage to guide and monitor classroom practice. Yet an English language teacher who is not a subject specialist needs guidance on the most salient concepts in the field being taught, the breadth of a given topic, and the depth of specialization required. Traditionally, such information could only be provided by a subject specialist. This paper outlines a method for constructing a special-language ontology using corpus linguistics techniques and appropriate software. Such a method could prove an indispensable tool for any ESP teacher and provide valuable insight into the labyrinth of specialized subject knowledge. The special language presented as a case study concerns the art form of printmaking.
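
    The paper's own ontology-building procedure is not reproduced here, but a common corpus-linguistics starting point for identifying a field's salient concepts is keyword (keyness) analysis: comparing word frequencies in a specialized corpus against a reference corpus. The sketch below is a minimal illustration of that general technique, not the paper's actual method; the file names are placeholders.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase word tokenizer; a real study would use a proper tokenizer."""
    return re.findall(r"[a-z]+", text.lower())

def keyness(specialized: Counter, reference: Counter) -> dict[str, float]:
    """Log-likelihood keyness of each word in the specialized corpus
    relative to the reference corpus."""
    n_spec = sum(specialized.values())
    n_ref = sum(reference.values())
    scores = {}
    for word, a in specialized.items():
        b = reference.get(word, 0)
        # Expected frequencies under the null hypothesis of no difference.
        e1 = n_spec * (a + b) / (n_spec + n_ref)
        e2 = n_ref * (a + b) / (n_spec + n_ref)
        scores[word] = 2 * (a * math.log(a / e1)
                            + (b * math.log(b / e2) if b else 0))
    return scores

# Hypothetical file names; any domain corpus and reference corpus would do.
spec = Counter(tokenize(open("printmaking_corpus.txt", encoding="utf-8").read()))
ref = Counter(tokenize(open("reference_corpus.txt", encoding="utf-8").read()))
for word, ll in sorted(keyness(spec, ref).items(), key=lambda kv: -kv[1])[:20]:
    print(f"{word:20s} LL={ll:8.1f}")
```

    High-scoring words are candidate ontology concepts that can then be organized into classes and relations, ideally with a subject specialist's review.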

    Scaling out for extreme scale corpus data

    Much of the previous work in Big Data has focussed on numerical sources of information. However, with the 'narrative turn' in many disciplines gathering pace and commercial organisations beginning to realise the value of their textual assets, natural language data is fast catching up as an exploitable source of information for decision making. With vast quantities of unstructured textual data on the web, in social media, and in newly digitised historical document archives, the 5Vs (Volume, Velocity, Variety, Value and Veracity) apply equally well, if not more so, to big textual data. Corpus linguistics, the computer-aided study of large collections of naturally occurring language data, has been dealing with big data for fifty years. Corpus linguistics methods impose complex requirements on the retrieval, annotation and analysis of text: displaying narrow contexts for each occurrence of a word or linguistic feature being studied, and counting co-occurrences with other words or features to determine significant patterns in language. This, coupled with the distribution of language features in accordance with Zipf's Law, poses complex challenges for data models and corpus software dealing with extreme-scale language data. A related issue is the non-random nature of language and the 'burstiness' of word occurrences, or what we might call, in Big Data terms, a sixth 'V': Viscosity. We report experiments examining and comparing the capabilities of two NoSQL databases in clustered configurations for the indexing, retrieval and analysis of billion-word corpora, since this size is the current state of the art in corpus linguistics. We find that modern DBMSs (Database Management Systems) can handle this extreme-scale corpus data for simple queries, but are limited when queries involve more frequent words or greater complexity.
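
    As a small illustration of the Zipfian skew the abstract refers to (a sketch under assumed inputs, not code from the paper), the following counts word frequencies and prints the rank-frequency relation; for natural language, the product of rank and frequency stays roughly constant.

```python
import re
from collections import Counter

# Placeholder input: in practice a billion-word corpus would be
# streamed from a clustered database, not read into memory at once.
tokens = re.findall(r"\w+", open("corpus.txt", encoding="utf-8").read().lower())
freqs = Counter(tokens)

# Zipf's Law: frequency is roughly proportional to 1/rank, so
# rank * frequency should stay in the same order of magnitude.
for rank, (word, freq) in enumerate(freqs.most_common(10), start=1):
    print(f"rank {rank:2d}  {word:<12s} freq={freq:8d}  rank*freq={rank * freq}")
```

    That skew is also why frequent words are the hard case for indexing: a handful of word types accounts for a huge share of all occurrences, so queries on them touch far more stored data than queries on rare words, consistent with the limits the paper reports.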

    Compiling of Phonetic Database Structure

    A voice corpus is an essential linguistic resource, and at its core lies a phonetic database: a structured, software-accessible collection of speech fragments. Phonetic databases, or voice corpora, have become a key element in speech technology, and much research has been devoted to them, driven largely by the development of speech recognition systems. Considerable experience has now accumulated in preparing phonetic databases. The current state of speech technology and the growing power of computing make it possible to investigate a wide range of language material and to conduct large-scale statistical phonetic research; these developments in linguistics are examined in this article. Speech corpora are a valuable source of information for phonological research and the study of sound patterns, yet their study is still in its infancy compared with other fields of linguistics, and existing speech corpora cover only part of the world's languages, far from fully representing all dialects and speech forms phonologically. The article analyses the history, structure, and importance of speech corpus development, a branch of corpus linguistics that has grown in recent years, and lists the main features to be considered in designing a speech corpus.

    From corpus-based collocation frequencies to readability measure

    This paper provides a broad overview of three separate but related areas of research. Firstly, corpus linguistics is a growing discipline that applies analytical results from large language corpora to a wide variety of problems in linguistics and related disciplines. Secondly, readability research, as the name suggests, seeks to understand what makes texts more or less comprehensible to readers, and aims to apply this understanding to issues such as text rating and the matching of texts to readers. Thirdly, collocation is a language feature that occurs when particular words are used together frequently for other than purely grammatical reasons. The intersection of these three areas provides the basis for ongoing research within the Department of Computer and Information Sciences at the University of Strathclyde and is the motivation for this overview. Specifically, we aim, through analysis of collocation frequencies in major corpora, to gain valuable insight into the content of texts, which we believe will, in turn, provide a novel basis for estimating text readability.
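
    As a rough illustration of the kind of collocation statistic such research builds on (a sketch only; the window size, threshold and scoring choice are assumptions, not the Strathclyde method), pointwise mutual information scores a word pair by how much more often it co-occurs than chance would predict:

```python
import math
import re
from collections import Counter

def collocations(text: str, window: int = 4, min_count: int = 5):
    """Score co-occurring word pairs with pointwise mutual information:
    PMI(x, y) = log2(p(x, y) / (p(x) * p(y)))."""
    tokens = re.findall(r"[a-z']+", text.lower())
    n = len(tokens)
    unigrams = Counter(tokens)
    pairs = Counter()
    for i, w in enumerate(tokens):
        # Pair each word with its neighbours in a forward window.
        for v in tokens[i + 1 : i + 1 + window]:
            pairs[(w, v)] += 1
    total_pairs = sum(pairs.values())
    scored = []
    for (w, v), c in pairs.items():
        if c < min_count:
            continue  # skip rare pairs, whose PMI is unreliable
        pmi = math.log2((c / total_pairs)
                        / ((unigrams[w] / n) * (unigrams[v] / n)))
        scored.append(((w, v), pmi))
    return sorted(scored, key=lambda kv: -kv[1])

# Placeholder corpus file; any plain-text corpus would do.
text = open("corpus.txt", encoding="utf-8").read()
for (w, v), pmi in collocations(text)[:10]:
    print(f"{w} {v}\tPMI={pmi:.2f}")
```

    The paper's premise is that frequency information of this kind about a text's collocations can, in turn, feed an estimate of the text's readability.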

    Steps for Creating two Persian Specialized Corpora

    Currently, most linguistic studies benefit from valid linguistic data available in corpora, and compiling corpora is a common practice in linguistic research. The present study introduces two specialized corpora in Persian; a specialized corpus is one used to study a particular type of language or language variety. To build the corpora, a set of texts was first compiled according to pre-established sampling criteria (the mode, type, domain, language or language variety, and date of the texts). The corpora are specialized because they include technical terms from information processing and management, librarianship, linguistics, computational linguistics, thesaurus building, management, policy-making, natural language processing, information technology, information retrieval, ontology, and other related interdisciplinary domains. After the data and metadata were compiled, the texts were preprocessed (normalized and tokenized) and annotated (automated POS tagging); finally, the tags were checked manually. Each corpus includes more than four million words. Since few specialized corpora have been built for Persian, these corpora can serve as valuable resources for researchers interested in studying linguistic variation in Persian interdisciplinary texts.
    https://dorl.net/dor/20.1001.1.20088302.2022.20.4.14.
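
    The normalize, tokenize, tag, then hand-check sequence described in the abstract is a standard corpus-building pipeline. A minimal sketch of that flow follows; the regexes and the dummy tagger are placeholders rather than the study's actual tools, which for Persian would typically involve a dedicated toolkit such as Hazm and a trained POS tagger.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Unicode normalization plus whitespace cleanup. Real Persian
    normalization also unifies Arabic/Persian character variants."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text: str) -> list[str]:
    """Crude word/punctuation split; a placeholder for a
    language-aware tokenizer."""
    return re.findall(r"\w+|[^\w\s]", text)

def pos_tag(tokens: list[str]) -> list[tuple[str, str]]:
    """Placeholder tagger that labels every token 'UNK'. In the study
    this step used an automated POS tagger, followed by manual checks."""
    return [(tok, "UNK") for tok in tokens]

raw = "نمونه‌ای از متن فارسی."  # a short sample Persian sentence
print(pos_tag(tokenize(normalize(raw))))
```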

    Discourse analysis through ‘concordancing’ techniques: theoretical and methodological aspects

    The seminar aims to provide the theoretical and methodological background to Corpus Linguistics research, in terms of corpus creation, annotation and analysis. A corpus is a collection of naturally occurring language text, chosen to characterize a state or variety of a language: a collection of texts representative of a given language, put together for linguistic analysis. Corpus-based approaches to language analysis are used to expound, test or exemplify theories and descriptions that were formulated before large corpora became available to inform language study. Corpus-driven linguists, by contrast, are strictly committed to the integrity of the data as a whole: theoretical statements are fully consistent with, and reflect directly, the evidence provided by the corpus.

    Corpus mark-up is the system of standard codes inserted into a document stored in electronic form to provide information about the text itself. The most widely used mark-up schemes are TEI (Text Encoding Initiative) and CES (Corpus Encoding Standard). Annotation makes extracting information easier and faster, and enables human analysts to exploit and retrieve analyses they could not produce themselves. Annotated corpora are reusable resources: corpus annotation records a linguistic analysis explicitly and provides a standard reference resource, a stable base of linguistic analyses, so that successive studies can be compared and contrasted.

    There are different types of corpora: parallel corpora (source texts plus translations), which can be either unidirectional (from La to Lb or from Lb to La alone) or bidirectional (from La to Lb and from Lb to La); comparable corpora (monolingual subcorpora designed using the same sampling techniques); general corpora (BNC, AMC); specialised corpora (MICASE); monitor corpora (Bank of English); and reference corpora. Corpora can be used for a wide variety of language analyses, ranging from lexicography/terminology to (computational) linguistics, from dictionaries and grammars to (Critical) Discourse Analysis, and from translation practice and theory to language teaching and learning.

    Basic notions of Corpus Linguistics methodology include: concordance/concordancer, collocation (lexis), colligation (grammar), semantic preference (semantics), discourse prosody (pragmatics), the paradigmatic and syntagmatic dimensions, the lexico-grammar approach, and the idiom principle vs. the open-choice principle. To know a word is to know how to use it, since certain grammar attracts certain words. For example, grammatical words like "a" and "the" are typically used in phrases rather than independently; compare "a free hand" vs. "her free hand", "hurt his leg" vs. "hit someone in the leg", "turn her face" vs. "a slap in the face". During the seminar different software tools were presented, highlighting their similarities and differences; these include Xaira, WordSmith Tools, AntConc and ConcGram, as well as web resources.
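
    As a minimal illustration of the concordancing technique the seminar centres on (a sketch only; real tools such as AntConc or WordSmith Tools offer sorting, regex queries and collocation statistics), a key-word-in-context (KWIC) display aligns every occurrence of a node word with its surrounding co-text:

```python
import re

def kwic(text: str, node: str, width: int = 30):
    """Print a key-word-in-context concordance for `node`: each
    occurrence centred, with `width` characters of co-text per side."""
    for m in re.finditer(rf"\b{re.escape(node)}\b", text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        print(f"{left:>{width}} [{m.group()}] {right:<{width}}")

sample = ("She waved her free hand. He was given a free hand "
          "to reorganize the office. A free hand is rarely free.")
kwic(sample, "hand")
```

    Sorting such lines by the words immediately to the left or right of the node is what surfaces patterns like "a free hand" vs. "her free hand" mentioned above.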

    Modelling the flow of discourse in a corpus of written academic English

    Discourse studies attempt to describe how context affects text, and how text progresses from one sentence to the next. Systemic Functional Linguistics (SFL) offers a model of language for describing how information flow varies according to context and co-text through the Textual metafunction, especially via the functions of Participant Identification and Tracking, Theme, and Information Structure. These systems were evaluated by assembling a corpus of academic texts and assessing their information flow. Analysis of the three grammatical systems in the Textual metafunction reveals significant patterns, or unmarked choices, where the participant, thematic and information systems combine to powerful effect. Where the systems are not aligned, there is a recognisable effect on the flow of information.