54 research outputs found

    Neural approaches to spoken content embedding

    Comparing spoken segments is a central operation in speech processing. Traditional approaches in this area have favored frame-level dynamic programming algorithms, such as dynamic time warping, because they require no supervision, but they are limited in performance and efficiency. As an alternative, acoustic word embeddings (fixed-dimensional vector representations of variable-length spoken word segments) have begun to be considered for such tasks as well. However, the current space of such discriminative embedding models, training approaches, and their application to real-world downstream tasks is limited. We start by considering "single-view" training losses, where the goal is to learn an acoustic word embedding model that separates same-word and different-word spoken segment pairs. Then, we consider "multi-view" contrastive losses. In this setting, acoustic word embeddings are learned jointly with embeddings of character sequences to generate acoustically grounded embeddings of written words, or acoustically grounded word embeddings. In this thesis, we contribute new discriminative acoustic word embedding (AWE) and acoustically grounded word embedding (AGWE) approaches based on recurrent neural networks (RNNs). We improve model training in terms of both efficiency and performance. We take these developments beyond English to several low-resource languages and show that multilingual training improves performance when labeled data is limited. We apply our embedding models, both monolingual and multilingual, to the downstream tasks of query-by-example speech search and automatic speech recognition. Finally, we show how our embedding approaches compare with and complement more recent self-supervised speech models. Comment: PhD thesis
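
The contrast drawn in the abstract above, between frame-level dynamic time warping and fixed-dimensional embeddings, can be illustrated with a small sketch. This is not the thesis's method: `dtw_distance` is a textbook DTW, and `mean_pool` is a hypothetical stand-in for a learned acoustic word embedding model (the thesis uses RNN-based models, which are not reproduced here). The toy frames are made up for illustration.

```python
import math


def dtw_distance(a, b):
    """Textbook dynamic time warping cost between two sequences of frames.

    Runs in O(len(a) * len(b)) time, which is the efficiency limitation
    mentioned in the abstract.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # advance in a only
                                 cost[i][j - 1],      # advance in b only
                                 cost[i - 1][j - 1])  # advance in both
    return cost[n][m]


def mean_pool(frames):
    """Hypothetical stand-in for an embedding model: average the frames.

    Any variable-length segment maps to one vector of the frame dimension.
    """
    dim = len(frames[0])
    return tuple(sum(f[k] for f in frames) / len(frames) for k in range(dim))


def cosine_distance(u, v):
    """Cosine distance between two fixed-dimensional vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / (norm_u * norm_v)


# Toy segments: `slow` is a time-stretched copy of `fast`.
fast = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
slow = [(0.0, 1.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (2.0, 1.0)]

dtw_cost = dtw_distance(fast, slow)
emb_cost = cosine_distance(mean_pool(fast), mean_pool(slow))
```

With embeddings, each segment is encoded once and every later comparison is a single vector operation, whereas DTW repeats the quadratic alignment for every pair of segments.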

    Internationalization prospects of Finnish language technology SMEs in rural India

    Rural India is an emerging business area with a population of over 800 million people. Despite two decades of strong economic growth, most of these people have to cope with a poor selection of both private and public services due to consumer limitations and deficiencies in service infrastructure. Mobile services are rapidly becoming one important exception. Mobile phones are enabling rural people to access various services, from banking to agriculture and from healthcare to education, and this consequently creates large-scale business opportunities for international mobile service developers. In multilingual India, services have to be scaled to various languages, and they have to overcome the obstacle of illiteracy in order to reach the entire rural audience. The use of language technology is one way to deal with both issues cost-effectively. This thesis takes a novel approach to internationalization research by examining the prospects that Finnish language technology companies have in the commercial development of multilingual mobile services in rural India through a case study of six SMEs. The results suggest that the prospects are characterized by the internationalization orientation and knowledge orientation of the company, and that Finnish language technology companies are prone to reactive internationalization at best when it comes to developing areas.

    LOW RESOURCE HIGH ACCURACY KEYWORD SPOTTING

    Keyword spotting (KWS) is the task of automatically detecting keywords of interest in continuous speech, and it has been an active research topic for over 40 years. Recently, there has been a rising demand for KWS techniques in resource-constrained conditions. For example, as of 2016, the USC Shoah Foundation covers audio-visual testimonies from survivors and other witnesses of the Holocaust in 63 countries and 39 languages, and providing search capability for those testimonies requires substantial KWS technology in low-language-resource conditions, since for most languages the resources for developing KWS systems are not as rich as those for English. Although KWS has been in the literature for a long time, KWS techniques in resource-constrained conditions have not been researched extensively. In this dissertation, we improve KWS performance in two low-resource conditions: the low-language-resource condition, where language-specific data is inadequate, and the low-computation-resource condition, where KWS runs on computation-constrained devices. For low-language-resource KWS, we focus on applications for speech data mining, where large vocabulary continuous speech recognition (LVCSR)-based KWS techniques are widely used. Keyword spotting for those applications is also known as keyword search (KWS) or spoken term detection (STD). A key issue for this type of KWS technique is the out-of-vocabulary (OOV) keyword problem. LVCSR-based KWS can only search for words that are defined in the LVCSR's lexicon, which is typically very small in a low-language-resource condition. To alleviate the OOV keyword problem, we propose a technique named "proxy keyword search" that enables us to search for OOV keywords with regular LVCSR-based KWS systems. We also develop a technique that expands the LVCSR's lexicon automatically by adding hallucinated words, which increases keyword coverage and therefore improves KWS performance.
Finally, we explore the possibility of building LVCSR-based KWS systems with a limited lexicon, or even without an expert pronunciation lexicon. For low-computation-resource KWS, we focus on wake-word applications, which usually run on computation-constrained devices such as mobile phones or tablets. We first develop a deep neural network (DNN)-based keyword spotter, which is lightweight and accurate enough that we are able to run it on devices continuously. This keyword spotter typically requires a pre-defined keyword, such as "Okay Google". We then propose a long short-term memory (LSTM)-based feature extractor for query-by-example KWS, which enables users to define their own keywords.

    The problem of codifying linguistic knowledge in two translations of Shakespeare's sonnets: a corpus-based study

    Thesis (doctorate) - Universidade Federal de Santa Catarina, Centro de Comunicação e Expressão, Programa de Pós-Graduação em Letras/Inglês e Literatura Correspondente, Florianópolis, 2012. Abstract: The present study deals with the problem of codifying linguistic knowledge in a parallel corpus, in other words, the process of corpus annotation. The purpose of the study was to test the identification of four types of translational correspondence, as defined by Thunes (2011), in a parallel corpus made up of 45 of Shakespeare's Sonnets and two distinct translations into Brazilian Portuguese. The results show that Thunes' model can be considered effective when applied to classify alignment units in a parallel corpus of translated poetry, but it needs some adjustments in order to cope with some translational pairs that did not fit properly into any of the four categories. The advantage of Thunes' proposal is that it establishes very clear criteria for analysing the complexity involved in the translation process. Abstract (translated from the Portuguese): This study addresses the problem of codifying linguistic knowledge in a parallel corpus, in other words, the process of corpus annotation. The aim of the study was to test the identification of the four types of translational correspondence described by Thunes (2011) in a parallel corpus consisting of 45 of Shakespeare's sonnets and two distinct translations into Portuguese. The results show that Thunes' model can be considered effective when used to classify alignment units in a parallel corpus of translated poetry, but it needs some adaptations in order to deal with some translational pairs that did not fit properly into any of the four proposed categories. The model proposed by Thunes can be considered advantageous because it establishes very clear criteria for analysing the complexity involved in the translation process.

    Spaces, borders, histories: Identity construction in colonial Goalpara (India).

    The thesis traces the construction of a regional identity in the historically transitional area of Goalpara, located on the western borders of the colonial province of Assam, in the late nineteenth and early twentieth century. The relationship between the emergence of new concepts of political space and changes in the political economy informs this work, which begins with the entry of the colonial state into the region and its transition from an initially hesitant power, relying on the symbolic memory of previous empires, to a more confident and decisively interventionist one, dependent on rent collection and a centralized and effective apparatus of control. The second chapter locates the emergence of a cultural identity in a region of overlapping and multiple sovereignties, as new concepts of territoriality and sovereignty were imposed under colonial rule. It studies the subsequent displacement of indigenous concepts of space and the refashioning of social relationships between local groups. It explores the attempt at a construction of communities into singular, substantive entities in a region where, despite increasing sedentarisation, the adoption of sedentary or non-sedentary lifestyles was far from rigid and determined. Discussed here is the relationship between topography and politics. The narrative is then carried on to the third chapter, on the colonial state's determination of social and political space through the discourse of mapping and the creation of a centralized, integrated structure and system of social action, evident in the role of colonial law. The thesis does not argue for a seamless hegemony of the colonial state. Rather, it views colonial projects as being shaped in varied encounters with the colonised which involved a frequent circumvention and contestation of the state's claims to superiority.
The significance of the autonomy and agency of the colonised becomes evident in history writing, a discourse of legitimacy used by both the colonial state and Assamese nationalism. The fourth chapter explores the ways in which the delimiting of different forms of space in the modern colonial district of Goalpara was both reinforced and resisted through the narrative structure of history. It recognises the role of history writing in the imagining of a 'Goalparia' identity and views such writing, which resisted singular narratives of Assamese nationalism, as discourses that always exist marginally in certain areas, challenging, destabilising and displacing the dominant discourses. The last chapter looks at similar resistance and imagining of a collective identity by Goalpara's traditional elite within the realm of language. The educated middle class, who spoke in a rational and liberal voice, offered better potential for political investment for the colonial state than the traditional powers, but the framework of colonial law still allowed for a continuance of aspects of the 'old regime'. This chapter studies the concerns of this marginalised traditional elite and explores their reinvention of roles within the newly emerging and expanding public sphere, which centred around producing a political consciousness through a contest over the use of language.

    Using Comparable Corpora to Augment Statistical Machine Translation Models in Low Resource Settings

    Previously, statistical machine translation (SMT) models have been estimated from parallel corpora, or pairs of translated sentences. In this thesis, we directly incorporate comparable corpora into the estimation of end-to-end SMT models. In contrast to parallel corpora, comparable corpora are pairs of monolingual corpora that have some cross-lingual similarities, for example topic or publication date, but that do not necessarily contain any direct translations. Comparable corpora are more readily available in large quantities than parallel corpora, which require significant human effort to compile. We use comparable corpora to estimate machine translation model parameters and show that doing so improves performance in settings where a limited amount of parallel data is available for training. The major contributions of this thesis are the following:
    * We release ‘language packs’ for 151 human languages, which include bilingual dictionaries, comparable corpora of Wikipedia document pairs, comparable corpora of time-stamped news text that we harvested from the web, and, for non-Roman script languages, dictionaries of name pairs, which are likely to be transliterations.
    * We present a novel technique for using a small number of example word translations to learn a supervised model for bilingual lexicon induction which takes advantage of a wide variety of signals of translation equivalence that can be estimated over comparable corpora.
    * We show that using comparable corpora to induce new translations and estimate new phrase table feature functions improves end-to-end statistical machine translation performance for low resource language pairs as well as domains.
    * We present a novel algorithm for composing multiword phrase translations from multiple unigram translations and then use comparable corpora to prune the large space of hypothesis translations. We show that these induced phrase translations improve machine translation performance beyond that of component unigrams.
    This thesis focuses on critical low resource machine translation settings, where insufficient parallel corpora exist for training statistical models. We experiment with both low resource language pairs and low resource domains of text. We present results from our novel error analysis methodology, which show that most translation errors in low resource settings are due to unseen source language words and phrases and unseen target language translations. We also find room for fixing errors due to how different translations are weighted, or scored, in the models. We target both error types; we use comparable corpora to induce new word and phrase translations and estimate novel translation feature scores. Our experiments show that augmenting baseline SMT systems with new translations and features estimated over comparable corpora improves translation performance significantly. Additionally, our techniques expand the applicability of statistical machine translation to those language pairs for which zero parallel text is available.

    Application of pre-training and fine-tuning AI models to machine translation: a case study of multilingual text classification in Baidu

    With the development of international information technology, we are producing a huge amount of information all the time. The ability to process information in various languages is gradually replacing information itself as the scarcer resource. How to obtain the most useful information from such a large and complex body of multilingual text is a major goal of multilingual information processing. Multilingual text classification helps users break the language barrier, accurately locate the information they need, and triage it. At the same time, the rapid development of the Internet has accelerated communication among users of various languages, giving rise to a large number of multilingual texts, such as book and movie reviews, online chats, product introductions and other forms, which contain a large amount of valuable implicit information and urgently need automated tools to categorize and process them. This work describes the Natural Language Processing (NLP) sub-task known as Multilingual Text Classification (MTC), performed within the context of Baidu, a leading Chinese AI company with a strong Internet base, whose NLP division led the industry in bringing deep learning technology online in Machine Translation (MT) and search. Multilingual text classification is an important module in NLP machine translation and a basic module in NLP tasks. It can be applied in many fields, such as fake review detection, news headline categorization, and analysis of positive and negative reviews. In the following work, we first define the AI model paradigm of 'pre-training and fine-tuning' in deep learning in the Baidu NLP department. We then investigate the application scenarios of multilingual text classification. Most of the text classification systems currently available in the Chinese market are designed for a single language, such as Alibaba's text classification system. If users need to classify texts of the same category in multiple languages, they need to train multiple single-language text classification systems and then classify the texts one by one. However, many internationalized products do not have a single text language, such as the AliExpress cross-border e-commerce business and the Airbnb B&B business. Industry needs to understand and classify users' reviews in various languages and has carried out in-depth statistical analysis and marketing strategy development, so multilingual text classification is particularly important in this scenario. Therefore, we focus on interpreting the methodology of the multilingual text classification model for machine translation in the Baidu NLP department. We collect multilingual datasets of reviews, news headlines and other data for manual classification and labeling, use the labeling results to fine-tune the multilingual text classification model, and output quality evaluation data for the Baidu multilingual text classification model after fine-tuning. We discuss whether pre-training and fine-tuning of the large model can substantially improve the quality and performance of multilingual text classification. Finally, based on the machine translation multilingual text classification model, we derive the application method of the pre-training and fine-tuning paradigm in current cutting-edge deep learning AI models under the NLP system and verify the generality and state-of-the-art standing of the pre-training and fine-tuning paradigm in the deep learning intelligent search field.
    Abstract (translated from the Portuguese): With the development of international information technology, we are constantly producing a huge amount of information, and the scarcest resource is no longer information but the ability to process information in each language. Most multilingual information is expressed in the form of text. How to obtain the most useful information from such a considerable and complex amount of multilingual textual information is one of the main goals of multilingual information processing. Multilingual text classification helps users break the language barrier, accurately locate the information they need, and classify it. At the same time, the rapid development of the Internet has accelerated communication between users of various languages, giving rise to a large number of multilingual texts, such as book and film reviews, chats, product introductions and other kinds of text, which contain a large amount of valuable implicit information and urgently need automated tools to categorize and process them. This work describes the Natural Language Processing (NLP) sub-task known as Multilingual Text Classification (MTC), carried out in the context of Baidu, a leading Chinese AI company whose NLP team led the industry in neural-learning-based technology, standing out in Machine Translation (MT) and scientific research. Multilingual text classification is an important module in NLP machine translation and a basic module in NLP tasks. MTC can be applied in many fields, such as multilingual sentiment analysis, news categorization, and filtering of unwanted content (spam), among others. In this work, we first define the 'pre-training and fine-tuning' AI model paradigm in deep learning at Baidu's NLP department. We then survey other products on the market with text classification capability, namely the text classification carried out by Alibaba. After this survey, we find that most text classification systems currently available on the Chinese market are designed for a single language, such as Alibaba's text classification system. If users need to classify texts of the same category in several languages, they need to apply a separate text classification system for each language and then classify the texts one by one. However, many internationalized products do not have a single text language, such as the AliExpress cross-border e-commerce business and the Airbnb B&B business. Industry needs to understand and classify user reviews in various languages. This need has led to in-depth development of statistics and marketing strategies, and multilingual text classification is particularly important in this scenario. We therefore concentrate on interpreting the methodology of the multilingual text classification model for machine translation at Baidu's NLP department. To that end, we collect multilingual datasets of comments and reviews, news headlines and other data for manual classification, use the results of that classification to fine-tune the multilingual text classification model, and produce quality evaluation data for Baidu's multilingual text classification model. We discuss whether pre-training and fine-tuning the model can substantially improve the quality and performance of multilingual text classification. Finally, based on the machine translation multilingual text classification model, we derive how to apply the pre-training and fine-tuning paradigm to current state-of-the-art deep learning AI models within the NLP system, and we verify the robustness and positive results of the pre-training and fine-tuning paradigm in the field of deep learning research.