
    Reflections on documentary corpora

    For decades, language documentation proponents have argued for the separability of LD as its own sub-discipline. Many corpus linguists have made this same claim; thus, corpus linguistics shares the ethos of data over theorizing, whereby primary data represent authentic, connected discourse that is natural (not elicited), broadly sampled (across speakers, generations, dialects), and balanced (reflecting different usage contexts and genres). Nevertheless, many misconceptions remain about what a language corpus is, how it is formatted, how big or balanced it needs to be, and most importantly, how it is queried. In this reflection, I dispel some of these misconceptions, while reassuring community members and field linguists alike that a corpus is an exceedingly powerful tool for guiding the expansion of the documentary record, keeping precious language data in circulation, and helping to produce the classic descriptive by-products of LD such as dictionaries, phrasebooks, and grammars. Above all, the less-familiar but more direct by-products of corpus interrogation, such as word lists, frequency counts, concordance lines, N-grams, collocations, distribution, and dispersion plots, are so immediately interpretable by and useful to speakers, learners, and linguists that LD should give corpus linguistic training the same attention as project planning, ethics, recording, transcription, annotation, metadata, and archiving.
    National Foreign Language Resource Cente

    Community-based corpus-building: Three case studies

    We describe three ongoing projects involving different First Peoples’ languages of Canada (Cree/nehiyawewin, Dene Sųłiné, and Nakoda/Stoney) that centre around the recording, transcription, compilation, and analysis of spontaneous oral language use (some narrative, some conversation) using freely available, Unicode-savvy corpus software (in this case, AntConc [Anthony 2014]) and little to no up-front annotation or translation into English. Because these languages are all polysynthetic, lemmatization and POS tagging are either unachievable or excessively time-draining and indeterminate activities. Nevertheless, corpus creation can still continue apace and reap huge benefits using the most basic of corpus tools. These projects are consonant with a growing ethos in language documentation circles that advocates for the value of corpus development alongside more traditional documentary activities (cf. McEnery & Ostler 2000, Woodbury 2003, Crowley 2007, Cox 2011, Mosel 2014, Vinogradov 2016). Each corpus is at a different stage of development, yet we hope to persuade community-based colleagues of the enormous benefits that ensue from the deliberate creation and use of a corpus of naturally occurring language data for language analysis and teaching. Direct benefits include ready-to-hand word lists; authentic sample utterances for exemplifying dictionaries, phrasebooks, and grammatical sketches; and a conscientious focus on recording many speakers across different demographic categories, discursive situations, and registers in order to achieve a broad range of usage conditions. A focus on wide and balanced sampling clearly strengthens the data pool from which analyses can follow. But it also results in a closer connection by speakers/learners to important and recurring phenomena in their language rather than to descriptions of phenomena that may have emerged through bilingual situations with a handful of speakers under the direct control of non-speaking linguists (who may have been guided by theoretical concerns unrelated to actual language use). Our demonstration corpora vary in size and composition, but each is already useful in revealing frequency, collocational, and distributional information about lexical items and morphosyntactic devices that may have received scant prior attention. We discuss the basics of corpus creation from scratch and the role of strategic metadata and file-naming practices, and we illustrate the types of immediately interpretable analyses that standard corpus tools can provide with monolingual, untagged transcripts. Best of all, once the central principles and logistics of corpus creation are mastered, the corpus can grow in a natural and incremental way, involving an expanding group of participants. Ultimately, a broadly sampled corpus can provide a solid empirical basis for the study of lexico-syntactic phenomena, not to mention a lasting, reusable, and shareable record of actual language use.

    References
    Anthony, L. 2014. AntConc (Version 3.4.1m) [Computer Software]. Tokyo: Waseda University. Available from http://www.laurenceanthony.net/.
    Cox, C. 2011. Corpus linguistics and language documentation: Challenges for collaboration. In Newman, J., R. H. Baayen, & S. Rice (eds.), Corpus-Based Studies in Language Use, Language Learning, and Language Documentation, 239-264. Amsterdam: Brill.
    Crowley, T. 2007. Field Linguistics: A Beginner’s Guide. Oxford: Oxford University Press.
    McEnery, T. & N. Ostler. 2000. A new agenda for corpus linguistics – working with all of the world’s languages. Literary and Linguistic Computing 15(4): 403-420.
    Mosel, U. 2014. Corpus linguistic and documentary approaches in writing a grammar of a previously undescribed language. Language Documentation and Conservation 8: 135-157.
    Vinogradov, I. 2016. Linguistic corpora of understudied languages: Do they make sense? Káñina 40(1): 127-141.
    Woodbury, T. 2003. Defining documentary linguistics. Language Documentation and Description 1(1): 35-51.

    Unlikely Lexical Entries

    Proceedings of the Fourteenth Annual Meeting of the Berkeley Linguistics Society (1988), pp. 202-21

    Towards a Transitive Prototype: Evidence from Some Atypical English Passives

    Proceedings of the Thirteenth Annual Meeting of the Berkeley Linguistics Society (1987), pp. 422-43

    Defining and implementing domains with multiple types using mesodata modelling techniques

    The integration of data from different sources often leads to the adoption of schemata that entail a loss of information in respect of one or more of the data sets being combined. The coercion of data to conform to the type of the unified attribute is one of the major reasons for this information loss. We argue that for maximal information retention it would be useful to be able to define attributes over domains capable of accommodating multiple types, that is, domains that potentially allow an attribute to take its values from more than one base type. Mesodata is a concept that provides an intermediate conceptual layer between the definition of a relational structure and that of attribute definition to aid the specification of complex domain structures within the database. Mesodata modelling techniques involve the use of data types and operations for common data structures defined in the mesodata layer to facilitate accurate modelling of complex data domains, so that any commonality between similar domains used for different purposes can be exploited. This paper shows how the mesodata concept can be extended to facilitate the creation of domains defined over multiple base types, and also allow the same set of base values to be used for domains with different semantics. Using an example domain containing values representing three different types of incomplete knowledge about the data item (coarse granularity, vague terms, or intervals) we show how operations and data structures for types already existing within the mesodata can simplify the task of developing a new intelligent domain.
    Sydney, NS

    A Corpus Investigation of English Cognition Verbs and their Effect on the Incipient Epistemization of Physical Activity Verbs

    In the spirit of NSM accounts that attempt to build up a language’s full expressivity from a small set of lexical primitives, we have investigated the usage in English of basic verbs of ideation (think, know) and physical activity (strike, hit, go, run) as they take on new epistemic meanings and functions, all the while calcifying in their inflectional range. It is well known that certain verbs of cognition in English such as remember, forget, and think are grammaticalizing into pragmatic particles of epistemic stance and, consequently, 1st person singular (1sg) forms account for the majority of usages. Likewise, we have carried out systematic queries and hand-tagging of corpus returns and have found that many verbs and phrasal expressions, ideational or not, seem to be associated with rather narrow collocational patterning, argument structure, and inflectional marking in almost idiom-like and constructional fashion. Moreover, we find that expressions associated with 1sg and 2nd person “cognizers” are, to a large extent, in complementary distribution, giving rise to fairly strong semantic differences in how I and you “ideate”. In this study, we demonstrate the extent of inflectional and collocational specificity for verbs of cognition and physical activity and discuss implications this lexico-syntactic idiosyncrasy has for cognitive linguistics.

    Event-Packing: The Case of Object Incorporation in English

    Proceedings of the Seventeenth Annual Meeting of the Berkeley Linguistics Society: General Session and Parasession on The Grammar of Event Structure (1991), pp. 283-29

    The Glucose Transporter GLUT4 and the Aminopeptidase vp165 Colocalise in Tubulo-Vesicular Elements in Adipocytes and Cardiomyocytes

    The aminopeptidase vp165 is one of the major polypeptides enriched in GLUT4-containing vesicles immuno-isolated from adipocytes. In the present study we have confirmed and quantified the high degree of colocalisation between GLUT4 and vp165 using double label immuno-electron microscopy on vesicles isolated from adipocytes and heart. The percentage of vp165-containing vesicles that also contained GLUT4 was 91%, 76%, and 86% in rat adipocytes, 3T3-L1 adipocytes, and rat heart, respectively. Internalisation of a transferrin/HRP (Tf/HRP) conjugate by 3T3-L1 adipocytes, followed by diaminobenzidine treatment in intact cells, resulted in ablation of only 41% and 45% of GLUT4 and vp165, respectively, whereas endosomal markers were almost quantitatively ablated. Using immuno-electron microscopy on cryosections it was determined that in atrial cardiomyocytes GLUT4 and vp165 colocalised in a population of tubulo-vesicular (T-V) elements that were often found close to the plasma membrane. Double label immunocytochemistry indicated a high degree of overlap in these T-V elements between GLUT4 and vp165. However, in atrial cardiomyocytes a large proportion of GLUT4 was also present in secretory granules containing atrial natriuretic factor (ANF). In contrast, very little vp165 was detected in ANF granules. These data indicate that GLUT4 and vp165 are colocalised in an intracellular, post-endocytic, tubulo-vesicular compartment in adipocytes and cardiomyocytes, suggesting that both proteins are sorted in a similar manner in these cells. However, GLUT4 but not vp165 is additionally localised in the regulated secretory pathway in atrial cardiomyocytes.
