New approaches to the analysis of writing process data: text histories and sentence histories
Peer-reviewed abstract, 3 pages, poster presentation
Extraction of transforming sequences and sentence histories from writing process data : a first step towards linguistic modeling of writing
Online first; part of the special issue "Methods for understanding writing process by analysis of writing timecourse"
Producing written texts is a non-linear process: in contrast to speech, writers are free to change already written text at any place and at any point in time. Linguistic considerations are likely to play an important role, but so far, no linguistic models of the writing process exist. We present an approach for the analysis of writing processes with a focus on linguistic structures, based on the novel concepts of transforming sequences, text history, and sentence history. Processing raw keystroke-logging data and applying natural language processing tools allows product and process data to be extracted, filtered, and stored in a hierarchical data structure. This structure is used to re-create and visualize the genesis and history of a text and its individual sentences. Focusing on sentences as the primary building blocks of written language and full texts, we aim to complement established writing process analyses and, ultimately, to interpret writing timecourse data with respect to linguistic structures. To enable researchers to explore this view, we provide a fully functional implementation of our approach as an open-source software tool, together with visualizations of the results. We report on a small-scale exploratory study in German in which we used our tool. The results indicate both the feasibility of the approach and that writers actually revise on a linguistic level. The latter confirms the need for modeling written text production from the perspective of linguistic structures beyond the word level.
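The hierarchical structure described above can be sketched in simplified form. The class and field names below are illustrative assumptions, not THETool's actual data model; the sketch only shows how successive versions of sentences could be aggregated into per-sentence histories within a text history.

```python
from dataclasses import dataclass, field

@dataclass
class SentenceVersion:
    text: str  # the sentence as it appeared in a text-produced-so-far

@dataclass
class SentenceHistory:
    versions: list = field(default_factory=list)

    def add_version(self, text: str) -> None:
        # Record a new version only when the sentence actually changed.
        if not self.versions or self.versions[-1].text != text:
            self.versions.append(SentenceVersion(text))

@dataclass
class TextHistory:
    sentences: dict = field(default_factory=dict)  # sentence id -> SentenceHistory

    def record(self, sent_id: int, text: str) -> None:
        self.sentences.setdefault(sent_id, SentenceHistory()).add_version(text)

history = TextHistory()
history.record(0, "Writing is linear.")
history.record(0, "Writing is a non-linear process.")
history.record(0, "Writing is a non-linear process.")  # unchanged, not stored again
print(len(history.sentences[0].versions))  # 2
```

The deduplication in `add_version` reflects the idea that a new sentence version exists only when a new text-produced-so-far differs from the previous one.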
CEASR : a corpus for evaluating automatic speech recognition
In this paper, we present CEASR, a Corpus for Evaluating ASR quality. It is a dataset derived from public speech corpora, containing manual transcripts enriched with metadata, along with transcripts generated by several modern state-of-the-art ASR systems. CEASR provides this data in a unified structure that is consistent across all corpora and systems, with normalised transcript texts and metadata.
We then use CEASR to evaluate the quality of ASR systems on the basis of their word error rate (WER). Our experiments show, among other results, a substantial quality difference between commercial and open-source ASR tools, and differences of up to a factor of ten for single systems on different corpora. Using CEASR, we obtained these results quickly and easily, which shows that our corpus enables researchers to perform ASR-related evaluations and various in-depth analyses with noticeably reduced effort, without the need to collect, process, and transcribe the speech data themselves.
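WER, the metric used in this evaluation, is the word-level edit distance between a reference transcript and an ASR hypothesis, normalised by the number of reference words. A minimal, self-contained illustration (not CEASR's actual implementation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion over six words
```

Production toolkits typically also normalise case and punctuation before scoring, which is part of what CEASR's unified transcript structure is designed to support.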
ZHAW-InIT at GermEval 2020 task 4 : low-resource speech-to-text
This paper presents the contribution of ZHAW-InIT to Task 4, "Low-Resource STT", at GermEval 2020. The goal of the task is to develop a system for translating Swiss German dialect speech into Standard German text in the domain of parliamentary debates. Our approach is based on Jasper, a CNN acoustic model, which we fine-tune on the task data. We enhance the base system with an extended language model containing in-domain data, apply speed perturbation, and run further experiments with post-processing. Our submission achieved first place with a final word error rate (WER) of 40.29%.
ZHAW-CAI : ensemble method for Swiss German speech to Standard German text
This paper presents the contribution of ZHAW-CAI to the shared task "Swiss German Speech to Standard German Text" at the SwissText 2021 conference. Our approach combines three models based on the Fairseq, Jasper, and Wav2vec architectures, trained on multilingual, German, and Swiss German data. We apply an ensembling algorithm to the predictions of the three models in order to retrieve the most reliable candidate among the provided translations for each spoken utterance. With the ensembling output, we achieved a BLEU score of 39.39 on the private test set, which placed us third out of four contributors in the competition.
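The abstract does not spell out the ensembling algorithm. One common consensus heuristic for this setting picks the candidate that is most similar on average to all other candidates; the sketch below illustrates that idea only (function name and example data are invented, and the paper's actual method may differ):

```python
from difflib import SequenceMatcher

def pick_consensus(candidates):
    """Return the candidate most similar on average to the other candidates
    (a simple consensus heuristic, not necessarily the paper's algorithm)."""
    def avg_similarity(c):
        others = [o for o in candidates if o is not c]
        return sum(SequenceMatcher(None, c, o).ratio() for o in others) / len(others)
    return max(candidates, key=avg_similarity)

# Hypothetical predictions from three systems for one utterance:
predictions = [
    "der rat tagt heute nachmittag",
    "der rat tagt heute am nachmittag",
    "der rad tagt heute nachmittag",
]
print(pick_consensus(predictions))  # the first candidate agrees best with both others
```

The intuition is that an erroneous hypothesis from one system tends to disagree with the other systems, so the centroid-like candidate is usually the safest pick.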
SDS-200 : a Swiss German speech to Standard German text corpus
We present SDS-200, a corpus of Swiss German dialectal speech with Standard German text translations, annotated with dialect, age, and gender information of the speakers. The dataset allows for training speech translation, dialect recognition, and speech synthesis systems, among others. The data was collected using a web recording tool that is open to the public. Each participant was given a text in Standard German and asked to translate it into their Swiss German dialect before recording it. To increase corpus quality, recordings were validated by other participants. The data consists of 200 hours of speech by around 4,000 different speakers and covers a large part of the Swiss German dialect landscape. We release SDS-200 alongside a baseline speech translation model, which achieves a word error rate (WER) of 30.3 and a BLEU score of 53.1 on the SDS-200 test set. Furthermore, we use SDS-200 to fine-tune a pre-trained XLS-R model, achieving a WER of 21.6 and a BLEU score of 64.0.
Using the concept of transforming sequences to automatically extract and classify bursts
References:
- Allal L, Chanquoy L (2004) Introduction: Revision Revisited. In: Allal L, Chanquoy L, Largy P (eds) Revision. Cognitive and instructional processes, Studies in Writing, vol 13, Kluwer, Boston, Dordrecht, London, pp 1–7.
- Baaijen VM, Galbraith D, de Glopper K (2012) Keystroke Analysis. Written Communication 29(3):246–277, DOI 10.1177/0741088312451108.
- Bridwell LS (1980) Revising Strategies in Twelfth Grade Students’ Transactional Writing. Research in the Teaching of English 14(3):197–222, URL http://www.eric.ed.gov/ERICWebPortal/detail?accno=EJ236505
- Faigley L, Witte S (1981) Analyzing Revision. College Composition and Communication 32(4):400–414, DOI 10.2307/356602.
- Fitzgerald J (1987) Research on Revision in Writing. Review of Educational Research 57(4):481–506, DOI 10.2307/1170433.
- Galbraith D, Baaijen VM (2019) Aligning keystrokes with cognitive processes in writing. In: Lindgren E, Sullivan K (eds) Observing writing, Brill, Leiden, The Netherlands, pp 306–325.
- Kaufer DS, Hayes JR, Flower L (1986) Composing written sentences. Research in the Teaching of English 20(2):121–140, URL http://www.jstor.org/stable/40171073.
- Lindgren E (2005) Writing and revising: Didactic and Methodological Implications of Keystroke Logging. PhD thesis, Umeå Universitet, URL http://www.diva-portal.org/umu/abstract.xsql?dbid=534.
- Mahlow C, Ulasik MA, Tuggener D (2022) Extraction of transforming sequences and sentence histories from writing process data: a first step towards linguistic modeling of writing. Reading and Writing. DOI 10.1007/s11145-021-10234-6.
- Sommers N (1980) Revision Strategies of Student Writers and Experienced Adult Writers. College Composition and Communication 31(4):378–388, DOI 10.2307/356588.

The overall goal of our research is to understand the production of linguistic units in order to better support writers during revision and to help them effectively use structures considered essential for academic writing. With THETool (Text History Extraction Tool), we are able to automatically explore writing on a structural level (syntax in the broadest sense) and gain relevant insights (Mahlow et al. 2022). THETool parses keystroke-logging data and creates text and sentence histories for a particular writing session. A sentence history covers all events relevant to a sentence: in it, we can follow what the writer did, even when they came back to the sentence several times, and it contains all versions of the sentence produced during the writing session. A version of a particular sentence is created every time a specific text-produced-so-far (TPSF) can be detected and saved, which depends on the writer switching modes of producing (including deleting) text. We call the difference between two consecutive versions the transforming sequence (TS), which we consider an instantiation of a burst. A TS can be a deletion, an insertion, or an extension of the current sentence. Several attempts have been made to explore bursts and propose extensive classifications. Baaijen et al. (2012) and Galbraith and Baaijen (2019) work on modeling bursts (Kaufer et al. 1986), distinguishing production bursts from revision bursts. They rely on manual annotation of keystroke-logging data, and, to the best of our knowledge, no replication (even for writing in other languages) or extension of their approach has been carried out so far, as the manual annotation process is laborious.
With THETool, we aim to test their classification scheme automatically on large collections of writing sessions. For revision bursts, connections to taxonomies of revision activities could be drawn, but again, this is very laborious and has not been done automatically before. With THETool, however, we will be able to explicitly address and build on previous work: Bridwell (1980) and Sommers (1980) focus on observable revisions at the surface of the text and distinguish the syntactic structures involved. Faigley and Witte (1981), and later Fitzgerald (1987), aim at categorizing changes in the meaning of the text on the basis of observable structural changes. Allal and Chanquoy (2004) and Lindgren (2005) introduce the notions of pretextual and precontextual revisions, i.e., mental changes made before the author transcribes them; as such revisions are not directly observable through process and product data, we do not consider them here. We hope, however, to find evidence that allows us to construct dedicated experiments to further explore these aspects based on specific classes of bursts. We propose a taxonomy of bursts based on the surface (i.e., the language visible on screen or paper) and the syntactic structure of a transforming sequence. Additional features address (1) the actions and production mode (deletion, insertion, extension) before and after a specific revision, (2) the distance between the previous point of inscription in the text and the current one, (3) the distance between the current point of inscription and the current position in the text, (4) the grammatical status of the sentence before and after the revision action, and (5) the location of the previous action, the current point of inscription, and the following action.
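A minimal sketch of how a transforming sequence between two consecutive sentence versions could be classified into the three types named above. This is an illustrative simplification (it ignores equal-length substitutions and position information); THETool's actual classification is more elaborate:

```python
def classify_transforming_sequence(prev: str, curr: str) -> str:
    """Classify the difference between two consecutive sentence versions
    as an extension, insertion, or deletion (simplified illustration)."""
    if prev == curr:
        return "no change"
    if curr.startswith(prev):
        return "extension"   # new material appended at the end of the sentence
    if len(curr) > len(prev):
        return "insertion"   # material added somewhere inside the sentence
    return "deletion"        # material removed from the sentence

print(classify_transforming_sequence("The cat", "The cat sat"))                # extension
print(classify_transforming_sequence("The cat sat", "The black cat sat"))      # insertion
print(classify_transforming_sequence("The black cat sat", "The cat sat"))      # deletion
```

The additional features listed in the taxonomy (production mode before and after, distances between points of inscription, grammatical status) would be computed alongside this basic type.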
Development of a product-process corpus to support the acquisition of digital literacy competences (SPPC, Swiss Process-Product Corpus of Student Writing Development)
Invited talk at the CLARIN workshop.

Digital literacy can be understood as the competence to adequately grasp, reflect on, process, and develop digital multimodal communication in context, in order to build relationships and participate in discourse. The focus is often on "novel" forms of communication, yet writing remains one of the most important modes of communication.
Designing and implementing interventions and teaching sequences that support students in acquiring such digital competences and in participating successfully in academic and professional discourse is a demanding task. Students learn to write academic texts during their studies; the focus in teaching is shifting from the product to the process. So far, research and development in this area has been hampered by the difficulty of examining process and product simultaneously in order to gain a holistic understanding of the complexity of writing. This is due to a lack of (1) suitable methods and (2) suitable corpora.
Problem (1) can be solved by the concept of transforming sequences (Mahlow et al. 2022), based on changes in the production mode (Mahlow 2015). This makes it possible to extract text and sentence histories, to examine the development of texts on a linguistic level, and thus to relate the process to the product. This step towards a linguistic modeling and analysis of writing process data during writing in natural environments goes beyond analyses at the word level (Leijten et al. 2019, Leijten et al. 2012) and analyses based on manual linguistic annotation after a writing session (Cislaru and Olive 2018).
To address problem (2), we are building a new kind of writing corpus: the Swiss Process-Product Corpus of Student Writing Development (SPPC), which will contain both process and product data as well as feedback on drafts from supervisors, writing consultants, or peers. SPPC is compiled from student writing (drafts and submitted final texts) in the students' first language (L1), German.
Here, we present our approach of storing texts (in different versions), process data, and analyses produced with THETool (Mahlow et al. 2022) in XML format in a dedicated XML database, BaseX (https://basex.org). To link different versions of a text with feedback comments, we use TEI (https://tei-c.org).
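To illustrate the storage idea, the sketch below serialises a draft paragraph with an attached feedback comment as TEI-style XML, as one might before loading it into BaseX. The element and attribute choices here are assumptions for illustration, not the actual SPPC schema:

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)

# Build a minimal TEI-flavoured document: one paragraph of a draft,
# with a feedback note attributed to a (hypothetical) advisor.
root = ET.Element(f"{{{TEI_NS}}}TEI")
text = ET.SubElement(root, f"{{{TEI_NS}}}text")
body = ET.SubElement(text, f"{{{TEI_NS}}}body")
p = ET.SubElement(body, f"{{{TEI_NS}}}p")
p.text = "Erster Entwurf des Abschnitts."
note = ET.SubElement(p, f"{{{TEI_NS}}}note", {"type": "feedback", "resp": "#advisor"})
note.text = "Bitte die These präzisieren."

xml_doc = ET.tostring(root, encoding="unicode")
print(xml_doc)
```

Storing each version as such a document makes the version-feedback links queryable with XQuery inside BaseX.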
References
Georgetta Cislaru and Thierry Olive (2018) Le processus de textualisation. Analyse des unités linguistiques de performance écrite. De Boeck Supérieur, Louvain-la-Neuve.
Mariëlle Leijten, Eric Van Horenbeeck, and Luuk Van Waes (2019) Analysing keystroke logging data from a linguistic perspective. In Observing writing, Eva Lindgren and Kirk Sullivan (eds.). Brill, Leiden, 71–95. https://doi.org/10.1163/9789004392526_005
Mariëlle Leijten, Lieve Macken, Veronique Hoste, Eric Van Horenbeeck, and Luuk Van Waes (2012) From character to word level: Enabling the linguistic analyses of Inputlog process data. In Proceedings of the Second Workshop on Computational Linguistics and Writing (CL&W 2012): Linguistic and cognitive aspects of document creation and document engineering, 1–8. Retrieved from http://aclanthology.org/W12-0301
Cerstin Mahlow (2015) A definition of “version” for text production data and natural language document drafts. In Proceedings of the 3rd International Workshop on (Document) Changes: Modeling, detection, storage and visualization (DChanges 2015), 27–32. https://doi.org/10.1145/2881631.2881638
Cerstin Mahlow, Malgorzata Anna Ulasik, and Don Tuggener (2022) Extraction of transforming sequences and sentence histories from writing process data: a first step towards linguistic modeling of writing. Reading and Writing. DOI 10.1007/s11145-021-10234-6
Automated Extraction and Analysis of Sentences under Production: A Theoretical Framework and Its Evaluation
Sentences are generally understood to be essential communicative units in writing, built to express thoughts and meanings. Studying sentence production provides a valuable opportunity to shed new light on the writing process itself and on the underlying cognitive processes. Nevertheless, research on the production of sentences in writing remains scarce. We propose a theoretical framework and an open-source implementation that aim to facilitate the study of sentence production based on keystroke logs. We centre our approach around the notion of sentence history: all the versions of a given sentence during the production of a text. The implementation takes keystroke logs as input, extracts sentence versions, aggregates them into sentence histories, and evaluates the sentencehood of each sentence version. We provide a detailed evaluation of the implementation based on a manually annotated corpus of texts in French, German, and English. The implementation yields strong results on all three processing aspects.
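As a rough illustration of what evaluating "sentencehood" might involve, a naive heuristic could check initial capitalisation and terminal punctuation. This is only a toy stand-in; the framework's actual evaluation is linguistically far more elaborate:

```python
def is_sentence_like(version: str) -> bool:
    """Naive sentencehood heuristic: non-empty, starts with an uppercase
    letter, and ends with terminal punctuation (illustration only)."""
    v = version.strip()
    return bool(v) and v[0].isupper() and v[-1] in ".!?"

print(is_sentence_like("This is a complete sentence."))  # True
print(is_sentence_like("an unfinished fragment"))        # False
```

In practice, a sentence version produced mid-writing may be grammatically well formed yet lack final punctuation, which is exactly why a principled, evaluated definition of sentencehood is needed.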