The Scottish Corpus of Texts and Speech (SCOTS) Project at Glasgow University aims to make available over the Internet a 4-million-word multimedia corpus of texts in the languages of Scotland. Twenty percent of this final total will comprise spoken language, in a combination of audio and video material. Versions of SCOTS have been accessible on the Internet since November 2004, and regular additions are made to the Corpus as texts are processed and functionality is improved. While the Corpus is a valuable resource for research, our target users also include the general public, and this has important implications for the nature of the Corpus and website.
This paper will begin with a general introduction to the SCOTS Project, and in particular to the nature of our data. The main part of the paper will then present the approach taken to spoken texts. Transcriptions are made using Praat (Boersma and Weenink, University of Amsterdam), which produces a time-based transcription and allows for multiple speakers through independent tiers. This output is then processed to produce a turn-based transcription with overlap and non-linguistic noises indicated. As this transcription is synchronised with the source audio/video material, it allows users direct access to any particular passage of the recording, possibly based upon a word query. This process and the end result will be demonstrated and discussed.
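The conversion from time-based tiers to a turn-based transcription can be illustrated with a minimal sketch. It is not the SCOTS Project's actual processing code: the `Interval` and `Turn` types, the functions `tiers_to_turns` and `find_word`, the speaker labels, and the sample utterances are all hypothetical. It also assumes that each speaker's tier has already been read from Praat into a list of timed intervals, and that a turn overlaps when it starts before the previous speaker's turn has ended.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """One labelled stretch of a speaker's tier (times in seconds)."""
    start: float
    end: float
    text: str


@dataclass
class Turn:
    """One speaker turn in the turn-based transcription."""
    speaker: str
    start: float
    end: float
    text: str
    overlaps_previous: bool


def tiers_to_turns(tiers):
    """Merge per-speaker tiers into a single time-ordered list of turns."""
    # Flatten all non-empty intervals and order them by start time.
    items = sorted(
        (iv.start, iv.end, speaker, iv.text.strip())
        for speaker, intervals in tiers.items()
        for iv in intervals
        if iv.text.strip()
    )
    turns = []
    for start, end, speaker, text in items:
        prev = turns[-1] if turns else None
        if prev is not None and prev.speaker == speaker:
            # Same speaker continues: extend the current turn.
            prev.end = max(prev.end, end)
            prev.text += " " + text
        else:
            # Assumed overlap rule: the new turn begins before the
            # previous speaker's turn has finished.
            overlaps = prev is not None and start < prev.end
            turns.append(Turn(speaker, start, end, text, overlaps))
    return turns


def find_word(turns, word):
    """Return (speaker, start time) for every turn containing the word."""
    target = word.lower()
    return [
        (t.speaker, t.start)
        for t in turns
        if target in t.text.lower().split()
    ]


# Invented sample data: two speaker tiers, one non-linguistic noise.
tiers = {
    "F1": [
        Interval(0.0, 2.0, "well it was"),
        Interval(2.0, 3.0, ""),
        Interval(3.0, 5.0, "a long time ago"),
    ],
    "M1": [Interval(1.5, 2.8, "[laugh]")],
}
turns = tiers_to_turns(tiers)
hits = find_word(turns, "time")
```

Because every turn keeps its start time, a word query can hand a playback offset straight to an audio or video player, which is the kind of direct access to a passage described above.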
We shall end by considering the value which is added to an Internet-delivered Corpus by these means of treating spoken text. The advantages include the possibility of returning search results from both written texts and multimedia documents; the easy location of the relevant section of the audio file; and the production through Praat of a turn-based orthographic transcription, which is accessible to a general as well as an academic user. These techniques can also be extended to other research requirements, such as the mark-up of gesture in video texts.