23 research outputs found

    AFRILEX-ALASA 2009 Conference Book

    Predicting IELTS ratings using vocabulary measures

    This thesis addresses the relationship between vocabulary measures and IELTS ratings. The research questions focus on the relationship between measures of lexical richness and teacher ratings. The specific question the thesis seeks to address is: which measures of lexical richness best predict the ratings? This question has been central to vocabulary measurement research in recent decades, particularly in relation to IELTS, one of the most popular exams in the world. If a model can predict IELTS scores from vocabulary measures, it could be used as a predictive tool by teachers and researchers worldwide. The research was carried out through two studies, Study 1 and Study 2, and the resulting model was then tested in a third, smaller study. Study 1 was a small pilot study which looked at both oral and written data. Study 2 focused on written data only. Measures of both lexical diversity and lexical sophistication were chosen for both studies. Both studies followed similar methodologies, with an extra variable added in the second study. For the first study, data was collected from 42 IELTS learners, whereas for the second study an existing corpus was used. The measures investigated in both studies were Tokens, TTR, D, Guiraud, Types, Guiraud Advanced and P_Lex. The first four are measures of lexical diversity; the other three are measures of lexical sophistication. However, all of these are measures of breadth of vocabulary. For the second study, a formulaic count was added. This is an aspect of depth of vocabulary, included to check whether results would improve with this addition. Formulaic sequences were counted in each essay using Martinez and Schmitt’s (2012) PHRASE List of the 505 most frequent non-transparent multiword expressions in English.
The main findings show that all the measures correlate with the ratings, but Tokens has the highest correlation of all lexical diversity measures and Types has the highest correlation of all lexical sophistication measures. TTR, Guiraud and P_Lex can explain 52.8% of the variability in the lexical ratings. In addition, holistic ratings can be predicted by the same two lexical diversity measures (TTR and Guiraud) but with a different measure of lexical sophistication, Guiraud Advanced. The model consisting of these three measures can explain 49.2% of the variability in the holistic ratings. The formulaic count did not seem to improve the model’s predictive validity, but further analysis from a qualitative angle seemed to explain this behaviour. In Study 3, the holistic ratings model was tested using a small sample of real IELTS data, and the examiners’ comments were used for a more qualitative analysis. This revealed that the model underestimated the scores, since the range of ratings in the IELTS data was wider than the range in the Study 2 data on which the model was based. This proved to be a major hindrance to the study. However, the qualitative analysis confirmed the argument that vocabulary accounts for a high percentage of variance in ratings and provided insights into other aspects that may influence raters, which could be added to the model in future research. The issues and limitations of the study and the current findings contribute to the field by stimulating further research into producing a predictive tool that could inform students of their predicted rating before they decide to take the IELTS exam. This could have potential financial benefits for students.
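As an illustration of the simpler measures named in this abstract, the sketch below computes Tokens, Types, TTR (types divided by tokens) and Guiraud's index (types divided by the square root of tokens) for a single text. It is not the thesis's own procedure: the regex tokenizer and the function names `tokenize` and `lexical_diversity` are assumptions made for this example, and D, Guiraud Advanced and P_Lex are omitted because they need word lists or curve fitting that the abstract does not specify.

```python
import math
import re


def tokenize(text):
    # Assumption: tokens are lowercase runs of letters and apostrophes.
    # A real study would likely use a more careful tokenizer or lemmatizer.
    return re.findall(r"[a-z']+", text.lower())


def lexical_diversity(text):
    """Return Tokens, Types, TTR and Guiraud's index for one text."""
    tokens = tokenize(text)
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    if n_tokens == 0:
        # Avoid division by zero on empty input.
        return {"tokens": 0, "types": 0, "ttr": 0.0, "guiraud": 0.0}
    return {
        "tokens": n_tokens,
        "types": n_types,
        "ttr": n_types / n_tokens,                  # type-token ratio
        "guiraud": n_types / math.sqrt(n_tokens),   # Guiraud's index
    }


if __name__ == "__main__":
    print(lexical_diversity("The cat sat on the mat"))
```

TTR falls as texts get longer, which is one reason length-corrected measures such as Guiraud's index are used alongside it.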

    Research on Phraseology Across Continents

    The second volume of the IDP series contains papers by phraseologists from five continents: Europe, Australia, North America, South America and Asia. The papers were written within the framework of the project Intercontinental Dialogue on Phraseology, prepared and coordinated by Joanna Szerszunowicz and conducted by the University of Bialystok in cooperation with Kwansei Gakuin University in Japan. The book consists of the following parts: Dialogue on Phraseology, General and Corpus Linguistics & Phraseology, Lexicography & Phraseology, Contrastive Linguistics, Translation & Phraseology, Literature, Cultural Studies, Education & Phraseology. Dialogue contains two papers written by widely recognised phraseologists: Professor Anita Naciscione from Latvia and Professor Irine Goshkheteliani. The volume has been financed by the Philological Department of the University of Bialysto

    Algorithmic-software adapted method for automated trend detection in text job vacancy advertisements

    This master's dissertation is devoted to the research and development of an adapted algorithmic and software method for automated trend detection in text job vacancy advertisements. It includes research on the problem of identifying trends in text data. The work studies the specifics of vacancy announcements posted online and compiles a list of their characteristics. The trend detection process is analyzed with a view to taking the specifics of job descriptions into account and, as a result, a modification of the input-data pre-processing stage of the proposed basic trend detection method is proposed.
An adapted algorithmic and software method for identifying trends in job descriptions posted online has been developed, with the possibility of taking into account the features of the test data descriptions of these vacancies. On the basis of this research, the theoretical material describing the proposed method is set out, and the method is implemented in software. The dissertation presents the results of the proposed method and a detailed analysis of them.

    An examination of how loanwords in a corpus of spoken and written contemporary isiXhosa are incorporated into the noun class system of isiXhosa

    Lexical change is a natural phenomenon for all of the world’s languages. This change can be viewed in terms of language contact, technological innovation and the adoption of new lifestyles. Whereas in the past isiXhosa, a Nguni language spoken in South Africa, borrowed words from both English and Afrikaans, contemporary speakers rely more on the English lexicon, with some previous adoptions from Afrikaans being replaced by those from English. This study focusses specifically on contemporary borrowed nouns, or loanword nouns, in isiXhosa, which are brought into the noun class system of the language via a number of different noun class prefixes. The aim of this study is to understand whether there are any features or properties, whether morphological or semantic, that predispose loanword nouns to fall into a particular noun class. In this thesis I therefore analyse a corpus of new data from conversations and interviews I conducted with contemporary isiXhosa speakers, as well as from written translation activities. After providing a general background to the semantic content of isiXhosa noun classes, I analyse the new data, draw conclusions as to which noun class prefix is the most productive for loanwords, and argue for the existence of a significant amount of variation in the prefixes used. The study concludes that most loanword nouns are assigned to Noun Class 9, but some speakers also use Noun Classes 1a, 5 and 7 as alternatives to Class 9 under certain morphological and semantic conditions. Even Noun Class 3 was found to contain a number of loanword nouns, suggesting that speakers are able to manipulate the grammar of isiXhosa, and particularly its noun class system, to accommodate words from other languages.

    A Cross-Sectional Study of English-Major Students’ Receptive and Productive Vocabulary Knowledge

    This study explores the relationship between receptive and productive vocabulary knowledge. The relationship between productive and receptive vocabulary can be framed as dichotomous (with two separate stores) or developmental (with words that start as part of the receptive state moving to the productive state). This study draws on both understandings. The relationship was investigated across frequency levels and different years of study. The study also makes a distinction between controlled productive and free productive knowledge. Receptive knowledge was analysed using the first four categories (a word-recognition task and a translation task) of the Vocabulary Knowledge Scale (VKS) (Paribakht and Wesche, 1997). Controlled productive use was investigated using the fifth category of the VKS (a sentence-writing task). Free productive use data was collected with an argumentative essay-writing task by Laufer and Nation (1995). To ensure consistency of the analysis, the same words and the same scoring systems were applied in these tests. The words produced in the free productive test were lemmatised, grouped by frequency level, and graded for correctness of usage in order to facilitate comparison with the other data sets. The data was quantitatively analysed within both the dichotomous and the developmental understandings of the relationship between receptive and productive vocabulary knowledge. Within the dichotomous approach, a three-scale scoring system was used to grade the correctness of the translations and the words used in the tests. Within the developmental approach, I tracked how the participants' word knowledge changed by adopting Paribakht and Wesche's (1997) five-scale scoring. The data showed that all forms of vocabulary knowledge were affected by frequency levels and years of study. The same data also showed that the knowledge moved forward and backward on a continuum. The findings were triangulated with qualitative analysis.
Overall, the findings suggest that words cannot be simply classified into receptive or productive vocabulary stores. The study shows that we need a more sophisticated view of vocabulary knowledge that allows for different patterns of development for different aspects of vocabulary knowledge. Word knowledge gradually moves along the cline, with its aspects moving to receptive or productive states to different degrees and at different times.

    A distributional investigation of German verbs

    This dissertation provides an empirical investigation of German verbs conducted on the basis of statistical descriptions acquired from a large corpus of German text. In a brief overview of the linguistic theory pertaining to the lexical semantics of verbs, I outline the idea that verb meaning is composed of argument structure (the number and types of arguments that co-occur with a verb) and aspectual structure (properties describing the temporal progression of an event referenced by the verb).
I then produce statistical descriptions of verbs according to these two distinct facets of meaning. In particular, I examine verbal subcategorisation, selectional preferences, and aspectual type. All three of these modelling strategies are evaluated on a common task, automatic verb classification. I demonstrate that automatically acquired features capturing verbal lexical aspect are beneficial for an application that concerns argument structure, namely semantic role labelling. Furthermore, I demonstrate that features capturing verbal argument structure perform well on the task of classifying a verb for its aspectual type. These findings suggest that these two facets of verb meaning are related in an underlying way.

    APPLICATION OF LINK GRAMMAR IN SEMI-SUPERVISED NAMED ENTITY RECOGNITION FOR ACCIDENT DOMAIN

    Accident documents typically contain crucial information that can be useful for analysis in future accident investigations, namely the date and time when the accident happened, the location where it occurred and the persons involved. These documents are largely available as free text, in the form of newswire articles or accident reports. Although it is possible to identify the information manually, the high volume of data involved makes this task time consuming and prone to error. Information Extraction (IE) has been identified as a potential solution to this problem. IE can extract crucial information from unstructured texts and convert it into a more structured representation. This research explores Named Entity Recognition (NER), one of the important tasks in IE research, which aims to identify entities in text documents and classify them into predefined categories. Numerous related research works on IE and NER have been published and commercialized. However, to the best of our knowledge, only a handful of IE research works focus on the accident domain. In addition, none of these works has attempted to explore or focus on NER, which is the main motivation for this research. The work presented in this thesis proposes an NER approach for accident documents that applies syntactical and word features in combination with the Self-Training algorithm. To satisfy the research objectives, this thesis makes three main contributions. The first contribution is the identification of the entity boundary. Entity segmentation, or identification of the entity boundary, is required because a named entity may consist of one or more words. We adopted the Stanford Part-of-Speech (POS) tagger for the word POS tags and connectors from the Link Grammar (LG) parser to determine the starting and stopping words.
The second contribution is the construction of extraction patterns. Each named entity candidate is assigned an extraction pattern constructed from a set of word and syntactical features. Current NER systems use restricted syntactical features, which are associated with a number of limitations. It is therefore a great challenge to propose a new NER approach using syntactical features that can capture all the syntactical structure in a sentence. For the third contribution, we applied the Self-Training algorithm, which is one of the semi-supervised machine learning techniques. The algorithm is used to predict labels for a large set of unlabelled data, given a small amount of labelled data. In our research, extraction patterns from the first module are fed to this algorithm and used to predict the category of each named entity candidate. The Self-Training algorithm greatly benefits semi-supervised learning, as it allows classification of entities given only a small amount of labelled data. The algorithm reduces the training effort and generates results almost similar to those of conventional supervised learning techniques. The proposed system was tested on 100 accident news articles from Reuters to recognize three different named entities: date, person and location, which are universally accepted categories in most NER applications. The Exact Match evaluation method, which consists of three evaluation metrics (precision, recall and F-measure), was used to measure the performance of the proposed system against three existing NER systems. The proposed system successfully outperformed one of those systems with an overall F-measure of approximately 9%, but on the other hand it showed a slight decrease compared with the other two systems identified in our benchmarking. However, we believe that this difference is due to the different nature and techniques of the three systems.
We consider our semi-supervised approach a promising method even though only two kinds of features are used: syntactical and word features. Further manual inspection during the experiments suggested that using complete word and syntactical features, or combining these features with other features such as semantic features, would yield an improved result.
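The Exact Match metrics mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the thesis's evaluation code: it assumes each entity is represented as a hypothetical `(start, end, label)` tuple and that a prediction counts as correct only when all three fields match a gold entity. The function name `exact_match_scores` is likewise an assumption.

```python
def exact_match_scores(gold, predicted):
    """Precision, recall and F-measure under exact-match scoring.

    gold and predicted are iterables of (start, end, label) tuples
    (an assumed representation); a predicted entity is a true positive
    only if an identical tuple appears in gold.
    """
    gold_set = set(gold)
    pred_set = set(predicted)
    true_positives = len(gold_set & pred_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        # Harmonic mean of precision and recall (balanced F-measure).
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    gold = [(0, 2, "person"), (5, 6, "location"), (8, 9, "date")]
    predicted = [
        (0, 2, "person"),
        (5, 7, "location"),   # boundary error: not an exact match
        (8, 9, "date"),
        (10, 11, "person"),   # spurious entity
    ]
    print(exact_match_scores(gold, predicted))
```

Under this scoring a boundary error is penalised twice, once as a missed gold entity and once as a spurious prediction, which is why entity boundary identification matters so much for the reported figures.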

    A corpus linguistic analysis of phraseology and collocation in the register of current European Union administrative French

    The French administrative language of the European Union is an emerging discourse: it is less than fifty years old, and has its origins in the French administrative register of the middle of the twentieth century. This thesis has two main objectives. The first is descriptive: using the flourishing methodology of corpus linguistics, and a specially compiled two-million-word corpus of texts, it aims to describe the current discourse of EU French in terms of its phraseology and collocational patterning, in particular in relation to its French national counterpart. The description confirms the phraseological specificity of EU language but shows that not all of this can be ascribed to semantic or pragmatic factors. The second objective of this thesis is therefore explanatory: given the phraseological differences evident between the two discourses, and by means of a diachronic comparison, it asks how the EU discourse has developed in relation to the national discourse. A detailed analysis is provided of differences between the administrative language as a whole and other registers of French, and indeed of genre-based variation within the administrative register. Three main types of phraseological patterning are investigated: phraseology which is the creation of administrators themselves; phraseological elements which are part of the general language heritage adopted by the administrative register; and collocational patterning which, as a statistical notion, is the creation of the corpus. The thesis then seeks to identify the most significant influences on the discourse. The data indicates that, contrary to expectations, English, nowadays the most commonly used official language of the EU institutions, has had relatively little influence. More importantly, the translation process itself has acted as a conservative influence on the EU discourse. This corresponds with linguistic findings about the nature of translated text.