
    A Survey of Available Corpora For Building Data-Driven Dialogue Systems: The Journal Version

    During the past decade, several areas of speech and language understanding have witnessed substantial breakthroughs from the use of data-driven models. In the area of dialogue systems, the trend is less obvious, and most practical systems are still built through significant engineering and expert knowledge. Nevertheless, several recent results suggest that data-driven approaches are feasible and quite promising. To facilitate research in this area, we have carried out a wide survey of publicly available datasets suitable for data-driven learning of dialogue systems. We discuss important characteristics of these datasets, how they can be used to learn diverse dialogue strategies, and their other potential uses. We also examine methods for transfer learning between datasets and the use of external knowledge. Finally, we discuss the appropriate choice of evaluation metrics for the learning objective.

    Analyzing Prosody with Legendre Polynomial Coefficients

    This investigation demonstrates the effectiveness of Legendre polynomial coefficients for representing prosodic contours in the context of two different tasks: nativeness classification and sarcasm detection. By making use of accurate representations of prosodic contours to answer fundamental linguistic questions, we contribute significantly to the body of research focused on analyzing prosody in linguistics as well as modeling prosody for machine learning tasks. Using Legendre polynomial coefficient representations of prosodic contours, we answer questions about differences in prosody between native English speakers and non-native English speakers whose first language is Mandarin. We also learn more about the prosodic qualities of sarcastic speech. We additionally perform machine learning classification for both tasks, achieving an accuracy of 72.3% for nativeness classification and 81.57% for sarcasm detection. We recommend that linguists looking to analyze prosodic contours make use of Legendre polynomial coefficient modeling; the accuracy and quality of the resulting prosodic contour representations make them highly interpretable for linguistic analysis.
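    As a minimal sketch of the technique, the snippet below fits a low-order Legendre series to a pitch contour with NumPy. The contour values, sampling, and polynomial degree are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np
from numpy.polynomial import legendre

# Hypothetical F0 (pitch) contour sampled at uniform time steps; in practice
# this would come from a pitch tracker.
f0 = np.array([180.0, 195.0, 210.0, 220.0, 215.0, 200.0, 185.0, 170.0])

# Legendre polynomials are orthogonal on [-1, 1], so rescale the time axis.
t = np.linspace(-1.0, 1.0, len(f0))

# Fit a low-order Legendre series; degree 4 (5 coefficients) is an assumed
# choice, used here only to illustrate the compact contour representation.
coeffs = legendre.legfit(t, f0, deg=4)

# The coefficients are interpretable: coeffs[0] tracks overall pitch level,
# coeffs[1] the global slope, coeffs[2] curvature, and so on.
print("Legendre coefficients:", np.round(coeffs, 2))

# Reconstruct the smoothed contour to check fit quality.
reconstruction = legendre.legval(t, coeffs)
rmse = np.sqrt(np.mean((f0 - reconstruction) ** 2))
print(f"Reconstruction RMSE: {rmse:.2f} Hz")
```

    The interpretability claim is visible directly in the coefficients: the first few capture level, slope, and curvature of the contour, which is what makes them usable for linguistic analysis as well as as classifier features.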

    Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018: 10-12 December 2018, Torino

    On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall “Cavallerizza Reale”. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.

    Tune your brown clustering, please

    Brown clustering, an unsupervised hierarchical clustering technique based on n-gram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone predominantly unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyper-parameter tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically on two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, the chosen number of classes, and the quality of the resulting clusters, which has implications for any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal.
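    The objective being tuned can be stated compactly: Brown clustering greedily merges classes to maximize the average mutual information (AMI) between adjacent class bigrams. Below is a minimal Python sketch of that quantity, assuming a whitespace-tokenized corpus and a precomputed word-to-class mapping (both purely illustrative); sweeping the number of classes and recomputing this value is one way to probe the size/quality trade-off the paper explores.

```python
import math
from collections import Counter

def brown_ami(tokens, word2class):
    """Average mutual information over adjacent class bigrams, the
    quantity that Brown clustering greedily maximizes."""
    classes = [word2class[w] for w in tokens]
    unigrams = Counter(classes)
    bigrams = Counter(zip(classes, classes[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    ami = 0.0
    for (c1, c2), count in bigrams.items():
        p_joint = count / n_bi
        p1, p2 = unigrams[c1] / n_uni, unigrams[c2] / n_uni
        ami += p_joint * math.log(p_joint / (p1 * p2))
    return ami

# Toy corpus and a two-class partition (both purely illustrative).
tokens = "the cat sat on the mat the dog sat on the rug".split()
word2class = {"the": 0, "on": 0, "cat": 1, "dog": 1,
              "sat": 1, "mat": 1, "rug": 1}
print(f"AMI: {brown_ami(tokens, word2class):.4f}")
```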

    Social Factors in the Production, Perception and Processing of Contact Varieties: Evidence from Bilingual Corpora, Nativeness Evaluations, and Real-time Processing (EEG) of Spanish-accented English

    Originating in the 1960s with the work of William Labov, the field of sociolinguistics has given rise to a rich literature that continues to uncover the many ways in which social factors influence how we produce, perceive, and process speech. Sociolinguistic research has burgeoned alongside increasing globalization and migration, which has, in the case of the U.S. at least, resulted in increased levels of bilingualism and more frequent interactions with non-native English speakers. My dissertation, which consists of three distinct chapters, combines insights from the sociolinguistic literature with methodologies from cognitive science in order to better understand the ways in which perceptions of identity and social attitudes towards nonstandard language varieties influence our everyday spoken interactions. More specifically, I investigate how several social factors (i.e. language background, dialect stigmatization, and speaker accent) may influence speech production, perception, and processing. The data presented come from over sixty fieldwork interviews, a series of corpus analyses, two online surveys, and one neurolinguistic experiment. In the first chapter, I identify how social factors appear to have influenced auxiliary verb choice among some Ecuadorian Spanish speakers. While the markedly frequent use of auxiliary ir, Sp. ‘to go’, in Ecuadorian Spanish has historically been traced to contact effects from Quichua, analysis of a present-day Ecuadorian Spanish corpus reveals that Quichua-Spanish bilinguals do not use the construction significantly more than Spanish monolinguals. Given that auxiliary ir may be marked as a slightly nonstandard alternative to the auxiliary estar, and that Quichua-Spanish bilinguals have long been denied linguistic prestige in the sociolinguistic stratification of Ecuadorian Spanish, I propose that language background and dialect stigmatization may explain the current distribution of auxiliary ir production among Ecuadorian Spanish speakers. In the second chapter, I investigate the relationship between speaker accents and American perceptions of nativeness. Specifically, I examined how young adult Midwesterners today perceive two main kinds of Spanish-influenced English varieties: L1 Latino English (as spoken in Chicago, U.S.) and L2 Spanish-accented English (as spoken in Santiago, Chile). Since Latinos have recently become the largest minoritized ethnic group in the U.S., the varieties of English that they speak are under increasing scrutiny, and cases of linguistic discrimination are on the rise. Results from an accent evaluation survey reveal that respondents distinguished the L1 Latino English speaker from the L2 Spanish-influenced English speaker, but still rated the former as slightly more foreign-sounding than L1 speakers with more established U.S. dialects (e.g. New York). In other words, native U.S. speakers perceived as “sounding Hispanic” were heard as sounding “almost American,” which suggests that what Midwesterners count as sounding American may be in the process of expanding to include U.S.-born Latinos. In the third chapter, I focus on the effect that speaker accent has on online word processing in the brain. Specifically, does Spanish-accented English speech increase activation of the Spanish lexicon in the minds of Spanish-English bilingual listeners? This is investigated via analysis of N400 responses from bilingual listeners when false cognates from Spanish were produced by a Spanish-accented English speaker relative to a Chinese-accented English speaker. Though more data are needed for a clear answer, preliminary results from the EEG experiment suggest that speaker accent may modulate bilingual lexical activation.
    PhD, Linguistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/168117/1/emsabo_1.pd

    Multi-dialect Arabic broadcast speech recognition

    Dialectal Arabic speech research suffers from a lack of labelled resources and standardised orthography. There are three main challenges in dialectal Arabic speech recognition: (i) finding labelled dialectal Arabic speech data, (ii) training robust dialectal speech recognition models from limited labelled data, and (iii) evaluating speech recognition for dialects with no orthographic rules. This thesis makes the following three contributions.
    Arabic Dialect Identification: We deal mainly with Arabic speech without prior knowledge of the spoken dialect. Arabic dialects are sufficiently diverse that one can argue they are different languages rather than dialects of the same language. We make two contributions here. First, we use crowdsourcing to annotate a multi-dialectal speech corpus collected from the Al Jazeera TV channel: from almost 1,000 hours, we obtained utterance-level dialect labels for 57 hours of high-quality speech covering four major varieties of dialectal Arabic (DA): Egyptian, Levantine, Gulf (Arabian Peninsula), and North African (Moroccan). Second, we build an Arabic dialect identification (ADI) system. We explore two main groups of features, namely acoustic features and linguistic features. For the linguistic features, we look at a wide range of features addressing words, characters, and phonemes. With respect to acoustic features, we look at raw features such as mel-frequency cepstral coefficients combined with shifted delta cepstra (MFCC-SDC), bottleneck features, and the i-vector as a latent variable. We study both generative and discriminative classifiers, in addition to deep learning approaches, namely the deep neural network (DNN) and the convolutional neural network (CNN). We propose a five-class Arabic dialect challenge comprising the four dialects mentioned above as well as Modern Standard Arabic.
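    To make the MFCC-SDC acoustic features concrete, here is a minimal Python sketch using librosa for the MFCCs and NumPy for the shifted delta cepstra. The 7-1-3-7 configuration and the file name are assumptions for illustration, not the thesis's exact recipe.

```python
import numpy as np
import librosa

# Load audio and extract cepstral features. Using 7 MFCCs matches the
# common 7-1-3-7 SDC configuration; this choice is an assumption here.
y, sr = librosa.load("utterance.wav", sr=16000)  # placeholder file name
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=7).T  # shape: (frames, 7)

def sdc(cepstra, d=1, p=3, k=7):
    """Shifted delta cepstra: stack k delta blocks taken p frames apart,
    each delta computed over a +/- d frame spread."""
    n = len(cepstra)
    padded = np.pad(cepstra, ((d, d + (k - 1) * p), (0, 0)), mode="edge")
    blocks = []
    for i in range(k):
        off = i * p
        # delta_i(t) = c(t + i*p + d) - c(t + i*p - d), in padded coordinates
        blocks.append(padded[off + 2 * d:off + 2 * d + n] - padded[off:off + n])
    return np.hstack(blocks)  # shape: (frames, 7 * k)

features = np.hstack([mfcc, sdc(mfcc)])  # MFCC-SDC, shape: (frames, 56)
print(features.shape)
```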
    Arabic Speech Recognition: We introduce our effort in building Arabic automatic speech recognition (ASR) systems and in creating an open research community to advance the field. This part has two main goals. First, we create a framework for Arabic ASR that is publicly available for research, describing our effort in building two multi-genre broadcast (MGB) challenges: MGB-2 focuses on broadcast news, using more than 1,200 hours of speech and 130M words of text collected from the broadcast domain, while MGB-3 focuses on dialectal multi-genre data, with a limited amount of non-orthographic speech collected from YouTube and special attention paid to transfer learning. Second, we build a robust Arabic ASR system and report a competitive word error rate (WER) so that it can serve as a benchmark for advancing the state of the art in Arabic ASR. Our overall system is a combination of five acoustic models (AMs): unidirectional long short-term memory (LSTM), bidirectional LSTM (BLSTM), time delay neural network (TDNN), TDNN layers followed by LSTM layers (TDNN-LSTM), and TDNN layers followed by BLSTM layers (TDNN-BLSTM). The AMs are purely sequence-trained neural networks using lattice-free maximum mutual information (LFMMI). The generated lattices are rescored using a four-gram language model (LM) and a recurrent neural network with maximum entropy (RNNME) LM. Our official WER is 13%, the lowest reported on this task.
    Evaluation: The third part of the thesis addresses our effort in evaluating dialectal speech with no orthographic rules. Our methods learn from multiple transcribers and align the speech hypotheses to overcome the lack of a standard orthography. Our multi-reference WER (MR-WER) approach is similar to the BLEU score used in machine translation (MT). We also automate this process by learning different spelling variants from Twitter data: we mine a huge collection of tweets in an unsupervised fashion to build more than 11M n-to-m lexical pairs, and we propose a new evaluation metric, dialectal WER (WERd). Finally, we estimate the word error rate (e-WER) with no reference transcription, using decoding and language features, and show that our word error rate estimation is robust in many scenarios, with and without the decoding features.
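    The evaluation idea is easy to see in code. Below is a simplified Python sketch: standard WER via word-level edit distance, plus a multi-reference variant that scores a hypothesis against its closest transcription so that legitimate spelling variants from different transcribers are not penalized. The thesis's MR-WER performs a more careful multi-reference alignment; taking the minimum single-reference WER, as done here, only approximates the idea, and the example strings are invented.

```python
import numpy as np

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance via dynamic programming."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[len(ref), len(hyp)]

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def multi_reference_wer(refs, hyp):
    """Score the hypothesis against its closest reference, so spelling
    variants from different transcribers are not all counted as errors."""
    return min(wer(r, hyp) for r in refs)

# Toy example: two transcribers spelling dialectal words differently.
refs = ["inta min wain", "enta men wen"]
print(multi_reference_wer(refs, "enta min wen"))  # 1/3 instead of 2/3
```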

    ACQUIRING SYNTACTIC VARIATION: REGULARIZATION IN WH-QUESTION PRODUCTION

    Children are often exposed to language-internal variation. Studying the acquisition of variation allows us to understand more about children’s ability to acquire probabilistic input, their preferences at choice points, and the factors contributing to such preferences. Using wh-variation as a case study, this dissertation explores the acquisition of syntactic variation through corpus analyses, behavioral experiments, and computational simulation. In English and some other languages (e.g., French and Brazilian Portuguese), information-seeking wh-questions allow for at least two variants: a wh-in-situ variant and a fronted-wh variant. How do English-speaking children acquire wh-variation, and what factors condition their course of acquisition? Experimental results show that 3- to 5-year-old children regularize to fronted wh-questions in their production even in contexts that allow both variants to be used interchangeably. Based on the characteristics of the variants, two factors are identified as potentially contributing to the preference for fronted wh-questions: frequency and discourse restrictions. Two artificial language learning (ALL) experiments are then conducted so that the effect of discourse can be studied separately from frequency. The results show that learners prefer the variant with fewer or no discourse restrictions (i.e., the fronted-wh variant) when frequency is controlled. Thus, regularization in language acquisition is conditioned by both domain-general factors, such as frequency, and language-specific factors, such as discourse markedness. The dissertation also looks into the motivation for regularization. One prominent hypothesis is that regularization serves as a means to reduce the cognitive burden associated with learning multiple variants at once: instead of mastering all the variants, learners can simplify the learning process and minimize their chance of violating a constraint by producing the dominant variant. This work provides additional evidence for the hypothesis in three ways. First, we replicate the finding that tasks that are more cognitively taxing induce more regularization. Second, we present new evidence that participants with a lower composite working memory score tend to have a higher regularization rate. Third, we provide a computational simulation showing that regularization behavior emerges only when an intake limit (reflecting limited working memory capacity) and a parsimony bias to reduce the cognitive burden are incorporated into the model.
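    As a deliberately simplified illustration of the simulation's two ingredients, the toy Python sketch below has a learner estimate variant probabilities from a capped intake sample and then shift production toward the majority variant by a parsimony bias. All parameter values are invented, and the dissertation's actual model is more elaborate.

```python
import random

def simulate_learner(input_p=0.7, n_utterances=500, intake_limit=None,
                     parsimony_bias=0.0, seed=0):
    """Toy learner: estimate the variant distribution from (possibly
    capped) intake, then shift production toward the dominant variant."""
    rng = random.Random(seed)
    # Each exposure is True for the dominant variant (e.g. fronted-wh).
    exposure = [rng.random() < input_p for _ in range(n_utterances)]
    if intake_limit is not None:
        # Intake limit: only a subsample is processed (limited working memory).
        exposure = rng.sample(exposure, intake_limit)
    p_hat = sum(exposure) / len(exposure)
    # Parsimony bias: move production probability toward the majority variant.
    if p_hat >= 0.5:
        return p_hat + parsimony_bias * (1.0 - p_hat)
    return p_hat - parsimony_bias * p_hat

# Roughly probability-matching when neither ingredient is present...
print(simulate_learner())                                      # ~0.70
# ...versus regularization with a capped intake and an active bias.
print(simulate_learner(intake_limit=50, parsimony_bias=0.5))   # well above 0.70
```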