2,299 research outputs found
Data formats for phonological corpora
The goal of the present chapter is to explore the possibility of providing
the research (but also the industrial) community that commonly uses spoken
corpora with a stable portfolio of well-documented standardised formats that
allow a high re-use rate of annotated spoken resources and, as a consequence,
better interoperability across tools used to produce or exploit such resources.Comment: Handbook of Corpus Phonology Oxford University Press (Ed.) (2012
Enroller: an experiment in aggregating resources
This chapter describes a collaborative project between e-scientists and humanists working to create an online repository of linguistic data sets and tools. Corpora, dictionaries, and a thesaurus are brought together to enable a new method of research. It combines our most advanced knowledge in both computing and linguistic research techniques
A new web interface to facilitate access to corpora: development of the ASLLRP data access interface
A significant obstacle to broad utilization of corpora is the difficulty in gaining access to the specific subsets of data and annotations that may be relevant for particular types of research. With that in mind, we have developed a web-based Data Access Interface (DAI), to provide access to the expanding datasets of the American Sign Language Linguistic Research Project (ASLLRP). The DAI facilitates browsing the corpora, viewing videos and annotations, searching for phenomena of interest, and downloading selected materials from the website. The web interface, compared to providing videos and annotation files off-line, also greatly increases access by people that have no prior experience in working with linguistic annotation tools, and it opens the door to integrating the data with third-party applications on the desktop and in the mobile space. In this paper we give an overview of the available videos, annotations, and search functionality of the DAI, as well as plans for future enhancements. We also summarize best practices and key lessons learned that are crucial to the success of similar projects
The Validation of Speech Corpora
1.2 Intended audience........................
A Formal Framework for Linguistic Annotation
`Linguistic annotation' covers any descriptive or analytic notations applied
to raw language data. The basic data may be in the form of time functions --
audio, video and/or physiological recordings -- or it may be textual. The added
notations may include transcriptions of all sorts (from phonetic features to
discourse structures), part-of-speech and sense tagging, syntactic analysis,
`named entity' identification, co-reference annotation, and so on. While there
are several ongoing efforts to provide formats and tools for such annotations
and to publish annotated linguistic databases, the lack of widely accepted
standards is becoming a critical problem. Proposed standards, to the extent
they exist, have focussed on file formats. This paper focuses instead on the
logical structure of linguistic annotations. We survey a wide variety of
existing annotation formats and demonstrate a common conceptual core, the
annotation graph. This provides a formal framework for constructing,
maintaining and searching linguistic annotations, while remaining consistent
with many alternative data structures and file formats.Comment: 49 page
Annotation Graphs and Servers and Multi-Modal Resources: Infrastructure for Interdisciplinary Education, Research and Development
Annotation graphs and annotation servers offer infrastructure to support the
analysis of human language resources in the form of time-series data such as
text, audio and video. This paper outlines areas of common need among empirical
linguists and computational linguists. After reviewing examples of data and
tools used or under development for each of several areas, it proposes a common
framework for future tool development, data annotation and resource sharing
based upon annotation graphs and servers.Comment: 8 pages, 6 figure
Phonology and intonation
The encoding standards for phonology and intonation are designed to facilitate consistent annotation of the phonological and intonational aspects of information structure, in languages across a range ofprosodic types. The guidelines are designed with the aim that a nonspecialist in phonology can both implement and interpret the resulting annotation
EXMARaLDA - Creating, Analysing and Sharing Spoken Language Corpora for Pragmatic Research
This paper presents EXMARaLDA, a system for the computer-assisted creation and analysis of spoken
language corpora. The first part contains some general observations about technological and methodological requirements for doing corpus-based pragmatics. The second part explains the systems architecture and gives an overview of its most important software components a transcription editor, a corpus management tool and a corpus query tool. The last part presents some corpora which have been or are currently being compiled with the help of EXMARaLDA
- …