Improving Statistical Language Model Performance with Automatically Generated Word Hierarchies
An automatic word classification system has been designed which processes
word unigram and bigram frequency statistics extracted from a corpus of natural
language utterances. The system implements a binary top-down form of word
clustering which employs an average class mutual information metric. Resulting
classifications are hierarchical, allowing variable class granularity. Words
are represented as structural tags: unique n-bit numbers whose most
significant bit-patterns encode class information. Access to a
structural tag immediately provides access to all classification levels for the
corresponding word. The classification system has successfully revealed some of
the structure of English, from the phonemic to the semantic level. The system
has been compared --- directly and indirectly --- with other recent word
classification systems. Class based interpolated language models have been
constructed to exploit the extra information supplied by the classifications
and some experiments have shown that the new models improve performance.
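The two core ideas in the abstract, the average class mutual information metric that drives the clustering and the structural tags whose bit-prefixes expose every classification level, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 16-bit tag width, the maximum-likelihood estimation, and all names are assumptions.

```python
from collections import Counter
from math import log2

TAG_BITS = 16  # assumed tag width; the abstract says only "n-bit"

def class_at_level(tag: int, level: int) -> int:
    """Class of a word at a given hierarchy level: the top `level` bits
    of its structural tag. Coarser and finer classes are read directly
    from the same tag, with no per-level lookup table."""
    return tag >> (TAG_BITS - level)

def average_class_mi(bigrams, word2class):
    """Average mutual information between adjacent class labels,
    estimated from word-bigram counts, as a sketch of the metric the
    clustering optimises (maximum-likelihood estimates assumed)."""
    total = sum(bigrams.values())
    joint, left, right = Counter(), Counter(), Counter()
    for (w1, w2), n in bigrams.items():
        c1, c2 = word2class[w1], word2class[w2]
        joint[c1, c2] += n   # class-bigram count
        left[c1] += n        # class count in first position
        right[c2] += n       # class count in second position
    mi = 0.0
    for (c1, c2), n in joint.items():
        p = n / total
        mi += p * log2(p / ((left[c1] / total) * (right[c2] / total)))
    return mi
```

With this representation, moving to a coarser granularity is a single right-shift of the tag rather than a traversal of the hierarchy.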
Domain-independent information extraction in unstructured text
Extracting information from unstructured text has become an important research area in recent years due to the large amount of text now electronically available. This status report describes the findings and work done during the second year of a two-year Laboratory Directed Research and Development Project. Building on the first year's work of identifying important entities, this report details techniques used to group words into semantic categories and to output templates containing selective document content. Using word profiles and category clustering derived during a training run, the time-consuming knowledge-building task can be avoided. Though the output still lacks completeness compared to systems with domain-specific knowledge bases, the results look promising. The two approaches are compatible and could complement each other within the same system. Domain-independent approaches retain their appeal, as a system that adapts and learns will soon outpace a system with any amount of a priori knowledge.
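The report does not specify how word profiles are built or how categories are derived; one plausible reading is sketched below, with context-count profiles and a deliberately simple greedy cosine-similarity grouping. All function names and the threshold are hypothetical, not the project's actual method.

```python
from collections import Counter, defaultdict
from math import sqrt

def word_profiles(tokens, window=2):
    """Context-count profile for each word: counts of the words
    appearing within +/- `window` positions of it."""
    profiles = defaultdict(Counter)
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                profiles[w][tokens[j]] += 1
    return profiles

def cosine(p, q):
    """Cosine similarity of two Counter profiles."""
    dot = sum(p[k] * q[k] for k in p)
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def greedy_categories(profiles, threshold=0.5):
    """Assign each word to the first category whose seed profile is
    similar enough, else start a new category. A stand-in for whatever
    clustering the project actually used."""
    cats = []  # list of (seed_profile, member_words)
    for w, p in profiles.items():
        for seed, members in cats:
            if cosine(p, seed) >= threshold:
                members.append(w)
                break
        else:
            cats.append((p, [w]))
    return [members for _, members in cats]
```

Words with similar contexts end up in the same category without any hand-built knowledge base, which is the property the report's domain-independent approach relies on.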
Corpus linguistics and language learning: bootstrapping linguistic knowledge and resources from text
This submission for the award of the degree of PhD by published work must: “make a contribution to knowledge in a coherent and related subject area; demonstrate originality and independent critical ability; satisfy the examiners that it is of sufficient merit to qualify for the award of the degree of PhD.” It includes a selection of my work as a Lecturer (and later, Senior Lecturer) at Leeds University,
from 1984 to the present. The overall theme of my research has been bootstrapping linguistic knowledge and resources from text. A persistent strand of interest has been
unsupervised and semi-supervised machine learning of linguistic knowledge from textual sources; the attraction of this approach is that I could start with English, but
go on to apply analogous techniques to other languages, in particular Arabic. This theme covers a broad range of research over more than 20 years at Leeds University
which I have divided into 8 sub-topics: A: Constituent-Likelihood statistical modelling of English grammar; B: Machine Learning of grammatical patterns from a corpus; C: Detecting grammatical errors in English text; D: Evaluation of English grammatical annotation models; E: Machine Learning of semantic language models; F: Applications in English language teaching; G: Arabic corpus linguistics; H:
Applications in Computing teaching and research. The first section builds on my early years as a lecturer at Leeds University, when my research was essentially a progression from my previous work at Lancaster University on the LOB Corpus Part-of-Speech Tagging project (which resulted in the Tagged LOB Corpus, a resource for Corpus Linguistics research still in use today); I investigated a range of
ideas for extending and/or applying techniques related to Part-of-Speech tagging in Corpus Linguistics. The second section covers a range of co-authored papers representing grant-funded research projects in Corpus Linguistics; in this mode of research, I had to come up with the original ideas and guide the project, but much of the detailed implementation was down to research assistant staff. Another highly productive mode of research has been supervision of research students, leading to
further jointly-authored research papers. I helped formulate the research plans, and guided and advised the students; as with research-grant projects, the detailed
implementation of the research has been down to the research students. The third section includes a few of the most significant of these jointly-authored Corpus
Linguistics research papers. A “standard” PhD generally includes a survey of the field to put the work in context; so as a fourth section, I include some survey papers
aimed at introducing new developments in corpus linguistics to a wider audience.
Voicing imperial subjects in British literature: a corpus analysis of literary dialect, 1768-1929
This study investigates nonstandard dialect as it is used in fictional dialogue. The works included in it were produced by British authors between 1768 and 1929, a period marking the expansion and height of the British Empire. One of the project’s aims is to examine the connections between dialect representation and the imperial project, to investigate how ventriloquizing African diasporic, Chinese, and Indian characters works with related forms of characterization to encode ideologies and relations of power. A related aim is to explore the emergence and evolution of these literary dialects over time and to compare their structures as they are used to impersonate different communities of speakers. In order to track such patterns of representation, a corpus was constructed from the dialogue of 126 novels, plays, and short stories. That dialogue was then annotated for more than 200 lexical, morphological, orthographic, and phonological features. These data enable statistical analyses that model variation in the voicing of speakers and how those voicings change over time. This modeling demonstrates, for example, an increase in the frequency of phonological features for African diasporic dialogue and a countervailing decrease in the frequency and complexity of coded features generally for Indian dialogue. Trends like these that are surfaced through quantitative methods are further contextualized using qualitative, archival data. The analysis ultimately rests on connecting patterns of representation to changes in the imperial political economy, evolving language ideologies circulating in the Anglophone world, and shifts in sociocultural anxieties that crosscut race and empire. The combined quantitative and qualitative analyses therefore expose representational systems: the apparatuses that propagate structures and the social attitudes that accrue to those structures. The analysis further demonstrates that in such propagation, structures and attitudes are complementary.
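The kind of over-time tabulation behind the trends described above can be sketched as a normalised feature-frequency count per period. The record layout, period width, and all names here are assumptions for illustration; the thesis's actual data model is not given in the abstract.

```python
from collections import defaultdict

def feature_rate_by_period(records, feature, period_years=20):
    """records: iterable of (year, speaker_group, features, n_words)
    tuples, one per annotated passage of dialogue (assumed layout).
    Returns {period_start_year: occurrences of `feature` per 1,000
    words of dialogue in that period}."""
    counts = defaultdict(int)
    words = defaultdict(int)
    for year, _group, features, n_words in records:
        period = year - (year % period_years)  # bucket into 20-year spans
        counts[period] += features.count(feature)
        words[period] += n_words
    return {p: 1000 * counts[p] / words[p] for p in counts}
```

Normalising by dialogue length rather than raw counts is what makes frequencies comparable across periods with very different amounts of surviving text.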