
    Homogenous Ensemble Phonotactic Language Recognition Based on SVM Supervector Reconstruction

    Acoustic spoken language recognition (SLR) and phonotactic SLR systems are currently the most widely used language recognition systems. To achieve better performance, researchers combine multiple subsystems, and the results are often much better than those of a single SLR system. Phonotactic SLR subsystems may vary in their acoustic feature vectors or include multiple language-specific phone recognizers and different acoustic models. These methods achieve good performance but usually incur a high computational cost. In this paper, a new diversification for phonotactic language recognition systems is proposed using vector space models by support vector machine (SVM) supervector reconstruction (SSR). In this architecture, the subsystems share the same feature extraction, decoding, and N-gram counting preprocessing steps, but each models in a different vector space by using the SSR algorithm, without significant additional computation. We term this a homogeneous ensemble phonotactic language recognition (HEPLR) system. The system integrates three different SVM supervector reconstruction algorithms: relative SVM supervector reconstruction, functional SVM supervector reconstruction, and perturbing SVM supervector reconstruction. All of the algorithms are incorporated using a linear discriminant analysis-maximum mutual information (LDA-MMI) backend to improve language recognition evaluation (LRE) accuracy. Evaluated on the National Institute of Standards and Technology (NIST) LRE 2009 task, the proposed HEPLR system achieves better performance than a baseline phone recognition-vector space modeling (PR-VSM) system with minimal extra computational cost. The HEPLR system yields equal error rates (EER) of 1.39%, 3.63%, and 14.79%, representing relative improvements of 6.06%, 10.15%, and 10.53% over the baseline system, for the 30-, 10-, and 3-s test conditions, respectively.

    NIST 2007 Language Recognition Evaluation: From the Perspective of IIR

    PACLIC / The University of the Philippines Visayas Cebu College Cebu City, Philippines / November 20-22, 200

    Subspace Gaussian Mixture Models for Language Identification and Dysarthric Speech Intelligibility Assessment

    In this thesis, we investigated how to efficiently apply subspace Gaussian mixture modeling techniques to two speech technology problems, namely automatic spoken language identification (LID) and automatic intelligibility assessment of dysarthric speech. One of the most important of these techniques was joint factor analysis (JFA). JFA is essentially a Gaussian mixture model where the mean of each component is expressed as a sum of low-dimension factors that represent different contributions to the speech signal. 
This factorization makes it possible to compensate for undesired sources of variability, like the channel. JFA was investigated as a final classifier and as a feature extractor. In the latter approach, a single subspace including all sources of variability is trained, and points in this subspace are known as i-Vectors. Thus, an i-Vector is a low-dimension representation of a single utterance, and i-Vectors are a very powerful feature for different machine learning problems. We investigated two different LID systems according to the type of features extracted from speech. First, we extracted acoustic features representing short-time spectral information. In this case, we observed relative improvements of up to 50% with i-Vectors with respect to JFA. We found that the channel subspace in a JFA model also contains language information, whereas i-Vectors do not discard any language information and, moreover, help to reduce mismatches between training and testing data. For classification, we modeled the i-Vectors of each language with a Gaussian distribution whose covariance matrix was shared among languages. This method is simple and fast, and it worked well without any post-processing. Second, we introduced the use of prosodic and formant information with the i-Vector system. Its performance was below that of the acoustic system, but the two were found to be complementary, and we obtained up to a 20% relative improvement with their fusion with respect to the acoustic system alone. Given the success in LID and the fact that i-Vectors capture all the information that is present in the data, we decided to use i-Vectors for other tasks, specifically the assessment of speech intelligibility in speakers with different types of dysarthria. Speech therapists are very interested in this technology because it would allow them to objectively and consistently rate the intelligibility of their patients. 
In this case, the input features were extracted from short-term spectral information, and the intelligibility was assessed from the i-Vectors calculated from a set of words uttered by the tested speaker. We found that the performance was clearly much better if training data were available from the person who would use the application. We think that this limitation could be relaxed if we had larger databases for training. However, the recording process is not easy for people with disabilities, and it is difficult to obtain large datasets of dysarthric speakers open to the research community. Finally, the same i-Vector-based system architecture for intelligibility assessment was used for predicting the accuracy that an automatic speech recognition (ASR) system would obtain with dysarthric speakers. The only difference between the two was the ground truth label set used for training. Predicting the performance of an ASR system would increase the confidence of speech therapists in these systems and would diminish health-related costs. The results were not as satisfactory as in the previous case, probably because an ASR system is complex and its accuracy can be very difficult to predict from acoustic information alone. Nonetheless, we think that we opened a door to an interesting research direction for the two problems.
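
The classification step described in this abstract (one Gaussian per language, with a covariance matrix shared among languages) can be illustrated with a minimal sketch. It is not the thesis implementation: the function names are hypothetical, the vectors stand in for i-Vectors that would be extracted elsewhere, equal language priors are assumed, and a small ridge term is added only to keep the covariance invertible.

```python
import numpy as np

def fit_shared_covariance_gaussians(vectors, labels):
    """Estimate one mean per class and a single pooled covariance matrix."""
    classes = np.unique(labels)  # sorted unique class labels
    means = np.stack([vectors[labels == c].mean(axis=0) for c in classes])
    # Center every vector on its own class mean, then pool the scatter.
    centered = vectors - means[np.searchsorted(classes, labels)]
    cov = centered.T @ centered / len(vectors)
    # Ridge term: an assumption of this sketch, not part of the source.
    cov += 1e-6 * np.eye(vectors.shape[1])
    return classes, means, cov

def log_likelihood_scores(vectors, means, cov):
    """Per-class Gaussian log-likelihoods, up to a class-independent constant."""
    precision = np.linalg.inv(cov)
    # With a shared covariance, the term quadratic in x is the same for
    # every class, so the remaining score is linear in x.
    linear = vectors @ precision @ means.T
    bias = -0.5 * np.einsum("kd,de,ke->k", means, precision, means)
    return linear + bias

def classify(vectors, classes, means, cov):
    """Pick the class with the highest score (equal priors assumed)."""
    scores = log_likelihood_scores(vectors, means, cov)
    return classes[np.argmax(scores, axis=1)]
```

In use, `fit_shared_covariance_gaussians` would be called once on training vectors with their language labels, and `classify` on held-out vectors. Because the covariance is shared, the decision boundaries between languages are linear, which is consistent with the abstract's description of the method as simple and fast.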

    The illocution-prosody relationship and the Information Pattern in spontaneous speech according to the Language into Act Theory (L-AcT)

    This paper addresses the definition of reference units for speech, under the necessary condition that they be an adequate and useful means for analyzing large spoken corpora. According to Language into Act Theory (L-AcT), the utterance is the proper reference unit and the counterpart of the speech act (Austin 1962), being demarcated by prosody within the flow of speech. The pragmatic foundations of the utterance and its information structure are described; both are closely connected to the role of prosody in their identification. Pragmatic and information analyses of English and Romance examples are presented, taken from representative spoken corpora (C-ORAL-ROM, C-ORAL-BRAZIL, S. Barbara Corpus). Regarding the information structure, the Comment unit is considered the core of the Information Pattern, and since its role is the expression of the illocution, it automatically conveys the new information. The Comment may be accompanied and supported by other optional information units, which are functionally differentiated. The Information Pattern is systematically demarcated by a Prosodic Pattern within an isomorphic correlation.

    Business Development and Marketing Strategy in Early-stage Technology Start-up Businesses: The Importance of Understanding the Customer

    The author sets out to explore the role of a marketer and the importance of the customer in an early-stage technology start-up business when exploring the commercial options for a new technology or product. The author sets learning objectives around the use of an academic model to explore the development of the enterprise and the role of a marketer within a start-up team. To reach these aims, the author compares three strategic marketing models and draws on insights from academic and practice-based literature to justify the use of Kotler and Armstrong’s Marketing Process Model. The author then implements Kotler’s model, detailing the practical elements of his role as the marketing and business development lead across three different projects, exploring the commercial potential of three different technologies or products. The author recommends the use of Kotler and Armstrong’s Marketing Process Model for early-stage start-up business teams that are exploring commercial options for a new technology or product. He recommends a customer-led approach to marketing within a technology start-up team. The author recognizes the importance of a marketer’s role in establishing, maintaining and nurturing relationships with potential customers in order to drive and inform product development.

    Variation in form and meaning across the Japonic language family: With a focus on the Ryukyuan languages


    Integration of Action and Language Knowledge: A Roadmap for Developmental Robotics

    This position paper proposes that the study of embodied cognitive agents, such as humanoid robots, can advance our understanding of the cognitive development of complex sensorimotor, linguistic, and social learning skills. This in turn will benefit the design of cognitive robots capable of learning to handle and manipulate objects and tools autonomously, to cooperate and communicate with other robots and humans, and to adapt their abilities to changing internal, environmental, and social conditions. Four key areas of research challenges are discussed, specifically for the issues related to the understanding of: 1) how agents learn and represent compositional actions; 2) how agents learn and represent compositional lexica; 3) the dynamics of social interaction and learning; and 4) how compositional action and language representations are integrated to bootstrap the cognitive system. The review of specific issues and progress in these areas is then translated into a practical roadmap based on a series of milestones. These milestones provide a possible set of cognitive robotics goals and test scenarios, thus acting as a research roadmap for future work on cognitive developmental robotics.

    The effects of motor practice on coarticulatory interactions in the speech of children and adults

    The current study was designed to elucidate the role of practice in speech production. Specifically, this investigation examined the effects of a distributed practice schedule on speech production in young children and adults. Unlike the practice period used in previous studies, the practice in this investigation was distributed over one week, during which participants were required to practice on three different occasions before being retested. The purpose of this investigation was therefore to examine the notion of a developmental trend of coarticulation in children by verifying whether speech production strategies, as exhibited by coarticulatory interactions, are influenced by a distributed practice schedule. Participants were three-year-olds, eight-year-olds, and adults who were pre-tested, trained for one week, and post-tested. The data substantiate the developmental coarticulatory effects across age groups and demonstrate that this coarticulation can be affected by practice.

    Language In My Mouth: Linguistic Variation in the Nmbo Speech Community of Southern New Guinea

    This thesis is a mixed-methods investigation into the question of the sociolinguistics of linguistic diversity in Papua New Guinea. Social and cultural traits of New Guinean speech communities have been hypothesised as conducive to language differentiation and diversification (Laycock 1991, Thurston 1987, 1992, Foley 2000, Ross 2001); however, there have been few empirical studies to support these hypotheses. In this thesis I investigate linguistic micro-variations within a contemporary New Guinean speech community, with the goal of identifying socio-cultural pressures that affect language variation and change. The community under investigation is the Nmbo speech community located in the Morehead area of Southern New Guinea. It is a highly multilingual community in the middle of the Nambu branch dialect chain, and consists primarily of the three villages Govav, Bevdvn, and Arovwe. The ideologically licensed speakers of Nmbo are the Kerake tribe people, but due to the practice of marriage exogamy, a large portion of non-Kerake people speak Nmbo as an additional language learnt from their parents or spouse. This thesis embraces the complexities of the multilingual ecology by including data from Kerake women who have married out of the Nmbo villages into the neighbouring Nen language village of Bimadbn. The empirical investigations bring data from three directions. First are the qualitative descriptions based on my own ethnographic fieldwork, supported by prior ethnographic descriptions. The picture to emerge is of an egalitarian multilingual speech community. The qualitative descriptions also provide basic facts about the demographics and social structures of the community. Second is the linguistic description of the Nmbo language. Nmbo is an under-described language without substantial prior description, and this thesis contains a sketch grammar covering the basic aspects of Nmbo grammar. Finally, there are three quantitative studies of variation. 
The vowel sociophonetic study and the word-initial [h]-drop study are classic Labovian variationist studies that investigate patterns of variation across a sample of speakers. The former is based on elicited word list data, and the latter on naturalistic speech data. The third quantitative study takes a grammaticalisation approach to an emergent topic marker in a topicalising construction from a relative clause construction. This is the first thesis ever produced providing qualitative, descriptive, and quantitative data from a New Guinean speech community within a language ecology of vital indigenous multilingualism. The contributions of the thesis are twofold. Firstly, this thesis brings grammatical and sociolinguistic descriptions from an under-studied language. It is a socio-grammar (Nagy 2009) that considers language ecology, sociolinguistics, and grammatical description. Secondly, this thesis contributes empirical data on the sociolinguistics of small-scale speech communities. The classic sociolinguistic variable of gender is not found to be particularly significant in the variables studied, despite the community being highly gendered in other social domains. Village, however, shows some significance. As far as the three variables are concerned, Nmbo speakers show little community-internal variation and paint a picture of a tight-knit society of intimates (Trudgill 2011). The conclusion to the question of the sociolinguistics of diversification is that while there is some evidence of sociolinguistic differentiation within the Nmbo speech community, the most important social groups to orient against are the other sister language groups in the Morehead area. The nascent variation within the Nmbo speech community, combined with the ethnographic evidence of a cluster of dense and multiplex social networks, suggests that should the social need to differentiate between other Kerake arise, linguistic differentiation may occur rapidly.