5 research outputs found

    Improving Statistical Language Model Performance with Automatically Generated Word Hierarchies

    An automatic word classification system has been designed which processes word unigram and bigram frequency statistics extracted from a corpus of natural language utterances. The system implements a binary top-down form of word clustering which employs an average class mutual information metric. Resulting classifications are hierarchical, allowing variable class granularity. Words are represented as structural tags: unique n-bit numbers, the most significant bit-patterns of which incorporate class information. Access to a structural tag immediately provides access to all classification levels for the corresponding word. The classification system has successfully revealed some of the structure of English, from the phonemic to the semantic level. The system has been compared, directly and indirectly, with other recent word classification systems. Class-based interpolated language models have been constructed to exploit the extra information supplied by the classifications, and some experiments have shown that the new models improve model performance. Comment: 17 Page Paper. Self-extracting PostScript Fil
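    The structural-tag idea in this abstract can be illustrated with a minimal sketch. It assumes only what the abstract states: each word has an n-bit tag whose most significant bits carry class information, so a coarser class is a shorter bit prefix. The function names, the 4-bit width, and the example tags are hypothetical and are not taken from the paper.

```python
def class_at_level(tag: int, n_bits: int, level: int) -> int:
    """Return the class of a word at a given depth of the binary hierarchy.

    The class at depth `level` is read off as the `level` most significant
    bits of the n-bit structural tag (an assumption based on the abstract).
    """
    if not 0 <= level <= n_bits:
        raise ValueError("level must be between 0 and n_bits")
    return tag >> (n_bits - level)


def same_class(tag_a: int, tag_b: int, n_bits: int, level: int) -> bool:
    """True if two words fall in the same class at the given depth."""
    return class_at_level(tag_a, n_bits, level) == class_at_level(tag_b, n_bits, level)


# Hypothetical 4-bit tags, invented purely for illustration.
TAGS = {"monday": 0b1010, "tuesday": 0b1011, "run": 0b0110}

# Words sharing a long prefix stay together at fine granularity.
print(same_class(TAGS["monday"], TAGS["tuesday"], 4, 3))
# Words that differ in the top bit are separated at the first split.
print(same_class(TAGS["monday"], TAGS["run"], 4, 1))
```

    Because every level is a right shift of the same integer, one stored tag gives access to all classification levels, which is the property the abstract highlights.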

    Corpus linguistics and language learning: bootstrapping linguistic knowledge and resources from text

    This submission for the award of the degree of PhD by published work must: “make a contribution to knowledge in a coherent and related subject area; demonstrate originality and independent critical ability; satisfy the examiners that it is of sufficient merit to qualify for the award of the degree of PhD.” It includes a selection of my work as a Lecturer (and later, Senior Lecturer) at Leeds University, from 1984 to the present. The overall theme of my research has been bootstrapping linguistic knowledge and resources from text. A persistent strand of interest has been unsupervised and semi-supervised machine learning of linguistic knowledge from textual sources; the attraction of this approach is that I could start with English, but go on to apply analogous techniques to other languages, in particular Arabic. This theme covers a broad range of research over more than 20 years at Leeds University which I have divided into 8 sub-topics: A: Constituent-Likelihood statistical modelling of English grammar; B: Machine Learning of grammatical patterns from a corpus; C: Detecting grammatical errors in English text; D: Evaluation of English grammatical annotation models; E: Machine Learning of semantic language models; F: Applications in English language teaching; G: Arabic corpus linguistics; H: Applications in Computing teaching and research. The first section builds on my early years as a lecturer at Leeds University, when my research was essentially a progression from my previous work at Lancaster University on the LOB Corpus Part-of-Speech Tagging project (which resulted in the Tagged LOB Corpus, a resource for Corpus Linguistics research still in use today); I investigated a range of ideas for extending and/or applying techniques related to Part-of-Speech tagging in Corpus Linguistics. 
    The second section covers a range of co-authored papers representing grant-funded research projects in Corpus Linguistics; in this mode of research, I had to come up with the original ideas and guide the project, but much of the detailed implementation was down to research assistant staff. Another highly productive mode of research has been supervision of research students, leading to further jointly-authored research papers. I helped formulate the research plans, and guided and advised the students; as with research-grant projects, the detailed implementation of the research has been down to the research students. The third section includes a few of the most significant of these jointly-authored Corpus Linguistics research papers. A “standard” PhD generally includes a survey of the field to put the work in context; so as a fourth section, I include some survey papers aimed at introducing new developments in corpus linguistics to a wider audience.

    Voicing imperial subjects in British literature: a corpus analysis of literary dialect, 1768-1929

    This study investigates nonstandard dialect as it is used in fictional dialogue. The works included in it were produced by British authors between 1768 and 1929 – a period marking the expansion and height of the British Empire. One of the project’s aims is to examine the connections between dialect representation and the imperial project, to investigate how ventriloquizing African diasporic, Chinese, and Indian characters works with related forms of characterization to encode ideologies and relations of power. A related aim is to explore the emergence and evolution of these literary dialects over time and to compare their structures as they are used to impersonate different communities of speakers. In order to track such patterns of representation, a corpus was constructed from the dialogue of 126 novels, plays, and short stories. That dialogue was then annotated for more than 200 lexical, morphological, orthographic, and phonological features. Those data enable statistical analyses that model variation in the voicing of speakers and how those voicings change over time. This modeling demonstrates, for example, an increase in the frequency of phonological features for African diasporic dialogue and a countervailing decrease in the frequency and complexity of coded features generally for Indian dialogue. Trends like these that are surfaced through quantitative methods are further contextualized using qualitative, archival data. The analysis ultimately rests on connecting patterns of representation to changes in the imperial political economy, evolving language ideologies that circulate in the Anglophone world, and shifts in sociocultural anxieties that crosscut race and empire. The combined quantitative and qualitative analyses, therefore, expose representational systems – the apparatuses that propagate structures and the social attitudes that accrue to those structures. They further demonstrate that in such propagation, structures and attitudes are complementary.