A lexical database tool for quantitative phonological research
A lexical database tool tailored for phonological research is described.
Database fields include transcriptions, glosses and hyperlinks to speech files.
Database queries are expressed using HTML forms, and these permit regular
expression search on any combination of fields. Regular expressions are passed
directly to a Perl CGI program, enabling the full flexibility of Perl extended
regular expressions. The regular expression notation is extended to better
support phonological searches, such as search for minimal pairs. Search results
are presented in the form of HTML or LaTeX tables, where each cell is either a
number (representing frequency) or a designated subset of the fields. Tables
have up to four dimensions, with an elegant system for specifying which
fragments of which fields should be used for the row/column labels. The tool
offers several advantages over traditional methods of analysis: (i) it supports
a quantitative method of doing phonological research; (ii) it gives universal
access to the same set of informants; (iii) it enables other researchers to
hear the original speech data without having to rely on published
transcriptions; (iv) it makes the full power of regular expression search
available, and search results are full multimedia documents; and (v) it enables
the early refutation of false hypotheses, shortening the
analysis-hypothesis-test loop. A life-size application to an African tone
language (Dschang) is used for exemplification throughout the paper. The
database contains 2200 records, each with approximately 15 fields. Running on a
PC laptop with a stand-alone web server, the `Dschang HyperLexicon' has already
been used extensively in phonological fieldwork and analysis in Cameroon.
Comment: 7 pages, uses ipamacs.st
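The kind of query described in this abstract (regular expression search over a transcription field, plus a search for minimal pairs) can be illustrated with a minimal Python sketch. This is not the Perl CGI program from the paper: the toy lexicon, the function names `search`, `is_minimal_pair` and `minimal_pairs`, and the simplification that one character equals one segment are all assumptions made for illustration.

```python
import re
from itertools import combinations

# Hypothetical toy lexicon of (transcription, gloss) records.
# These are NOT records from the Dschang HyperLexicon.
LEXICON = [
    ("pat", "gloss-1"),
    ("bat", "gloss-2"),
    ("pit", "gloss-3"),
    ("pats", "gloss-4"),
    ("tap", "gloss-5"),
]


def search(pattern, records):
    """Return the records whose transcription matches the regular expression."""
    rx = re.compile(pattern)
    return [rec for rec in records if rx.search(rec[0])]


def is_minimal_pair(a, b):
    """True if two transcriptions have equal length and differ in exactly one
    position. Assumption: one character stands for one segment."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def minimal_pairs(records):
    """Return all unordered pairs of records that form a minimal pair."""
    return [
        (a, b)
        for a, b in combinations(records, 2)
        if is_minimal_pair(a[0], b[0])
    ]
```

With this toy data, `search(r"^p", LEXICON)` returns the records for `pat`, `pit` and `pats`, and `minimal_pairs(LEXICON)` pairs `pat` with `bat` and `pat` with `pit`. A real phonological lexicon would need multi-character segments and tone marks, which this sketch does not handle.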
Automated tone transcription
In this paper I report on an investigation into the problem of assigning
tones to pitch contours. The proposed model is intended to serve as a tool for
phonologists working on instrumentally obtained pitch data from tone languages.
Motivation and exemplification for the model are provided by data taken from my
fieldwork on Bamileke Dschang (Cameroon). Following recent work by Liberman and
others, I provide a parametrised F_0 prediction function P which generates F_0
values from a tone sequence, and I explore the asymptotic behaviour of
downstep. Next, I observe that transcribing a sequence X of pitch (i.e. F_0)
values amounts to finding a tone sequence T such that P(T) ≈ X. This is a
combinatorial optimisation problem, for which two non-deterministic search
techniques are provided: a genetic algorithm and a simulated annealing
algorithm. Finally, two implementations---one for each technique---are
described and then compared using both artificial and real data for sequences
of up to 20 tones. These programs can be adapted to other tone languages by
adjusting the F_0 prediction function.
Comment: 12 pages, 4 postscript figures, uses examples.sty, newapa.sty,
latex-acl.sty, ipamacs.st
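The search problem stated in this abstract, finding a tone sequence T such that P(T) ≈ X, can be sketched with a small simulated annealing loop in Python. Everything numeric here is an assumption for illustration: the toy `predict` function (two tones decaying toward a floor) is not the parametrised model from the paper, and the constants, the linear cooling schedule and the names `cost` and `anneal` are hypothetical.

```python
import math
import random

# Assumed toy parameters, chosen only for illustration (not from the paper).
TARGET_HZ = {"H": 140.0, "L": 100.0}  # initial F0 targets per tone
FLOOR_HZ = 80.0                       # value the targets decay toward
DECAY = 0.9                           # per-position decay factor


def predict(tones):
    """Toy prediction function P: map a tone sequence to a list of F0 values."""
    return [
        FLOOR_HZ + (TARGET_HZ[t] - FLOOR_HZ) * DECAY ** i
        for i, t in enumerate(tones)
    ]


def cost(tones, observed):
    """Squared error between predicted and observed F0 values."""
    return sum((p - x) ** 2 for p, x in zip(predict(tones), observed))


def anneal(observed, steps=5000, start_temp=500.0, seed=0):
    """Search for a tone sequence whose predicted F0 values approximate `observed`."""
    rng = random.Random(seed)
    current = [rng.choice("HL") for _ in observed]
    current_cost = cost(current, observed)
    best, best_cost = list(current), current_cost
    for step in range(steps):
        # Linear cooling; the small constant avoids division by zero.
        temp = start_temp * (1 - step / steps) + 1e-9
        # Propose a neighbour by flipping one randomly chosen tone.
        candidate = list(current)
        i = rng.randrange(len(candidate))
        candidate[i] = "L" if candidate[i] == "H" else "H"
        delta = cost(candidate, observed) - current_cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, current_cost + delta
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost
```

Calling `anneal(predict(list("HLHHL")))` on noise-free artificial data should recover the generating sequence. Swapping in a different `predict` is the only change needed to try another model of F0, which mirrors the adaptation route the abstract mentions.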
Strategies for Representing Tone in African Writing Systems
Tone languages provide some interesting challenges for the designers of new orthographies.
One approach is to omit tone marks, just as stress is not marked in English (zero marking).
Another approach is to do phonemic tone analysis and then make heavy use of diacritic
symbols to distinguish the `tonemes' (exhaustive marking). While orthographies based on
either system have been successful, this may be thanks to our ability to manage inadequate
orthographies rather than to any intrinsic advantage which is afforded by one or the other
approach. In many cases, practical experience with both kinds of orthography in sub-Saharan
Africa has shown that people have not been able to attain the level of reading and writing
fluency that we know to be possible for the orthographies of non-tonal languages. In some
cases this can be attributed to a sociolinguistic setting which does not favour vernacular
literacy. In other cases, the orthography itself might be to blame. If the orthography of a tone
language is difficult to use or to learn, then a good part of the reason, I believe, is that the
designer either has not paid enough attention to the function of tone in the language, or has
not ensured that the information encoded in the orthography is accessible to the ordinary
(non-linguist) user of the language. If the writing of tone is not going to continue to be a
stumbling block to literacy efforts, then a fresh approach to tone orthography is required, one
which assigns high priority to these two factors.
This article describes the problems with orthographies that use too few or too many tone
marks, and critically evaluates a wide range of creative intermediate solutions. I review the
contributions made by phonology and reading theory, and provide some broad methodological
principles to guide someone who is seeking to represent tone in a writing system. The tone
orthographies of several languages from sub-Saharan Africa are presented throughout the
article, with particular emphasis on some tone languages of Cameroon.
Orthography and Identity in Cameroon
The tone languages of sub-Saharan Africa
raise challenging questions for the design
of new writing systems. Marking too much or too little tone can have
grave consequences for the usability of an orthography.
Orthography development, past and present, rests on a
raft of sociolinguistic issues having little to do with the
technical phonological concerns that usually preoccupy orthographers.
Some of these issues
are familiar from the spelling reforms which have taken place
in European languages. However, many of the issues faced in
sub-Saharan Africa are
different, being concerned with the creation of new writing systems
in a multi-ethnic context: residual colonial influences, the
construction of new nation-states, detribalization versus
culture preservation and language reclamation, and so on.
Language development projects which crucially rely on creating
or revising orthographies may founder if they do not attend to
the various layers of identity that are indexed by orthography:
whether colonial, national, ethnic, local or individual identity.
In this study, I review the history and politics
of orthography in Cameroon, with a focus on tone marking.
The paper concludes by calling present-day orthographers to
a deeper and broader understanding of orthographic issues.
Annotation graphs as a framework for multidimensional linguistic data analysis
In recent work we have presented a formal framework for linguistic annotation
based on labeled acyclic digraphs. These `annotation graphs' offer a simple yet
powerful method for representing complex annotation structures incorporating
hierarchy and overlap. Here, we motivate and illustrate our approach using
discourse-level annotations of text and speech data drawn from the CALLHOME,
COCONUT, MUC-7, DAMSL and TRAINS annotation schemes. With the help of domain
specialists, we have constructed a hybrid multi-level annotation for a fragment
of the Boston University Radio Speech Corpus which includes the following
levels: segment, word, breath, ToBI, Tilt, Treebank, coreference and named
entity. We show how annotation graphs can represent hybrid multi-level
structures which derive from a diverse set of file formats. We also show how
the approach facilitates substantive comparison of multiple annotations of a
single signal based on different theoretical models. The discussion shows how
annotation graphs open the door to wide-ranging integration of tools, formats
and corpora.
Comment: 10 pages, 10 figures, Towards Standards and Tools for Discourse
Tagging, Proceedings of the Workshop. pp. 1-10. Association for Computational
Linguistics
A Formal Framework for Linguistic Annotation
`Linguistic annotation' covers any descriptive or analytic notations applied
to raw language data. The basic data may be in the form of time functions --
audio, video and/or physiological recordings -- or it may be textual. The added
notations may include transcriptions of all sorts (from phonetic features to
discourse structures), part-of-speech and sense tagging, syntactic analysis,
`named entity' identification, co-reference annotation, and so on. While there
are several ongoing efforts to provide formats and tools for such annotations
and to publish annotated linguistic databases, the lack of widely accepted
standards is becoming a critical problem. Proposed standards, to the extent
they exist, have focussed on file formats. This paper focuses instead on the
logical structure of linguistic annotations. We survey a wide variety of
existing annotation formats and demonstrate a common conceptual core, the
annotation graph. This provides a formal framework for constructing,
maintaining and searching linguistic annotations, while remaining consistent
with many alternative data structures and file formats.
Comment: 49 pages
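As a rough illustration of the annotation graph idea (a labeled directed graph whose arcs carry annotations and whose nodes may be anchored to time offsets), here is a minimal Python sketch. The class and field names (`Node`, `Arc`, `AnnotationGraph`, `kind`, `label`), the `overlapping` query and the example data are assumptions made for this sketch. They are not the formal definition or any file format from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    ident: int
    time: Optional[float] = None  # offset in seconds; None if unanchored


@dataclass(frozen=True)
class Arc:
    start: Node
    end: Node
    kind: str   # annotation level, e.g. "word" or "phrase"
    label: str  # annotation content


@dataclass
class AnnotationGraph:
    arcs: List[Arc] = field(default_factory=list)

    def add(self, start, end, kind, label):
        """Add a labeled arc between two nodes and return it."""
        arc = Arc(start, end, kind, label)
        self.arcs.append(arc)
        return arc

    def of_kind(self, kind):
        """Return all arcs on one annotation level."""
        return [a for a in self.arcs if a.kind == kind]

    def overlapping(self, arc, kind):
        """Return arcs of `kind` whose time span overlaps `arc`.
        Arcs with unanchored end points are skipped."""
        if arc.start.time is None or arc.end.time is None:
            return []
        return [
            a for a in self.of_kind(kind)
            if a.start.time is not None and a.end.time is not None
            and a.start.time < arc.end.time and arc.start.time < a.end.time
        ]


# Hypothetical example: three word arcs and one phrase arc over a shared timeline.
n0, n1, n2, n3 = Node(0, 0.0), Node(1, 0.4), Node(2, 0.9), Node(3, 1.5)
g = AnnotationGraph()
g.add(n0, n1, "word", "hello")
g.add(n1, n2, "word", "there")
g.add(n2, n3, "word", "world")
phrase = g.add(n1, n3, "phrase", "X")
```

Here `g.overlapping(phrase, "word")` returns the arcs labeled `there` and `world`. Because levels share nodes rather than nesting inside one another, hierarchy and overlap between levels can coexist in one structure.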
Many uses, many annotations for large speech corpora: Switchboard and TDT as case studies
This paper discusses the challenges that arise when large speech corpora
receive an ever-broadening range of diverse and distinct annotations. Two case
studies of this process are presented: the Switchboard Corpus of telephone
conversations and the TDT2 corpus of broadcast news. Switchboard has undergone
two independent transcriptions and various types of additional annotation, all
carried out as separate projects that were dispersed both geographically and
chronologically. The TDT2 corpus has also received a variety of annotations,
but all directly created or managed by a core group. In both cases, issues
arise involving the propagation of repairs, consistency of references, and the
ability to integrate annotations having different formats and levels of detail.
We describe a general framework whereby these issues can be addressed
successfully.
Comment: 7 pages, 2 figures
