11 research outputs found

    Utilizing Review Summarization in a Spoken Recommendation System

    In this paper we present a framework for spoken recommendation systems. To provide reliable recommendations to users, we incorporate a review summarization technique which extracts informative opinion summaries from grass-roots users' reviews. The dialogue system then utilizes these review summaries to support both quality-based opinion inquiry and feature-specific entity search. We propose a probabilistic language generation approach to automatically creating recommendations in spoken natural language from the text-based opinion summaries. A user study in the restaurant domain shows that the proposed approaches can effectively generate reliable and helpful recommendations in human-computer conversations. T-Party Project. Quanta Computer (Firm)

    Sound collection and visualization system enabled participatory and opportunistic sensing approaches

    This paper presents a sound collection system to visualize environmental sounds that are collected using a crowdsourcing approach. Sound properties are usually analyzed through their physical features; however, human beings not only analyze sounds but also connect to them emotionally. To visualize sounds according to the characteristics of the listener, we need to collect not only the raw sound but also the subjective feelings associated with it. For this purpose, we developed a sound collection system that uses crowdsourcing to collect physical sounds, their statistics, and subjective evaluations simultaneously. We then conducted a sound collection experiment with ten participants using the developed system. We collected 6,257 samples of equivalent loudness levels and their locations, and 516 samples of sounds and their locations. Subjective evaluations by the participants are also included in the data. Next, we visualized the sounds on a map: the loudness levels are shown as a color map, and the sounds are shown as icons that indicate the sound type. Finally, we conducted a sound discrimination experiment in order to implement automatic conversion from sounds to appropriate icons. The classifier is trained using the GMM-UBM (Gaussian Mixture Model and Universal Background Model) method. Experimental results show that the F-measure is 0.52 and the AUC is 0.79.
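The abstract above reports an F-measure and an AUC without giving the underlying precision, recall, or classifier scores. As a reading aid, the following minimal sketch shows the standard definitions of both metrics. The function names and the example values are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: weighted harmonic mean of precision and recall.

    With beta = 1 this is the usual F1 score.
    """
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0  # convention when both precision and recall are zero
    return (1 + beta ** 2) * precision * recall / denom


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive example scores
    higher than a random negative example (ties count as one half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical values, for illustration only.
example_f1 = f_measure(precision=0.5, recall=0.5)
example_auc = auc(pos_scores=[0.9, 0.8], neg_scores=[0.1, 0.85])
```

The pairwise AUC is quadratic in the number of samples, which is fine for a data set of a few hundred sounds; a rank-based implementation would be preferable for larger sets.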

    Sound collection systems using a crowdsourcing approach to construct sound map based on subjective evaluation

    This paper presents a sound collection system that uses crowdsourcing to gather information for visualizing area characteristics. First, we developed a sound collection system to simultaneously collect physical sounds, their statistics, and subjective evaluations. We then conducted a sound collection experiment with 14 participants using the developed system. We collected 693,582 samples of equivalent A-weighted loudness levels and their locations, and 5,935 samples of sounds and their locations. The data also include subjective evaluations by the participants. In addition, we analyzed the changes in sound properties of some areas before and after the opening of a large-scale shopping mall in a city. Next, we implemented visualizations on the server system to attract users' interest. Finally, we published the system, which can receive sounds from any Android smartphone user. The sound data were continuously collected and achieved a specified result.

    Multimodal speech interfaces for map-based applications

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (p. 71-73). This thesis presents the development of multimodal speech interfaces for mobile and vehicle systems. Multimodal interfaces have been shown to increase input efficiency in comparison with their purely speech or text-based counterparts. To date, much of the existing work has focused on desktop or large tablet-sized devices. The advent of the smartphone and its ability to handle both speech and touch inputs in combination with a screen display has created a compelling opportunity for deploying multimodal systems on smaller-sized devices. We introduce a multimodal user interface designed for mobile and vehicle devices, and system enhancements for a dynamically expandable point-of-interest database. The mobile system is evaluated using Amazon Mechanical Turk and the vehicle-based system is analyzed through in-lab usability studies. Our experiments show encouraging results for multimodal speech adoption. By Sean Liu. M.Eng.

    Harvesting and summarizing user-generated content for advanced speech-based human-computer interaction

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 155-164). Many assistant applications on mobile devices help people obtain rich Web content such as user-generated data (e.g., reviews, posts, blogs, and tweets). However, online communities and social networks are expanding rapidly, and it is impossible for people to browse and digest all the information via a simple search interface. To help users obtain information more efficiently, both the interface for data access and the information representation need to be improved. An intuitive and personalized interface, such as a dialogue system, could be an ideal assistant, which engages a user in a continuous dialogue to garner the user's interest and capture the user's intent, and assists the user via speech-navigated interactions. In addition, there is a great need for a type of application that can harvest data from the Web, summarize the information in a concise manner, and present it in an aggregated yet natural way such as direct human dialogue. This thesis, therefore, aims to conduct research on a universal framework for developing a speech-based interface that can aggregate user-generated Web content and present the summarized information via speech-based human-computer interaction. To accomplish this goal, several challenges must be met. Firstly, how can users' intentions be interpreted correctly from their spoken input? Secondly, how can the semantics and sentiment of user-generated data be interpreted and aggregated into structured yet concise summaries? Lastly, how can a dialogue modeling mechanism be developed to handle discourse and present the highlighted information via natural language? This thesis explores plausible approaches to tackle these challenges. We will explore a lexicon modeling approach for semantic tagging to improve spoken language understanding and query interpretation. We will investigate a parse-and-paraphrase paradigm and a sentiment scoring mechanism for information extraction from unstructured user-generated data. We will also explore sentiment-involved dialogue modeling and corpus-based language generation approaches for dialogue and discourse. Multilingual prototype systems in multiple domains have been implemented for demonstration. By Jingjing Liu. Ph.D.

    Multi-level acoustic modeling for automatic speech recognition

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 183-192). Context-dependent acoustic modeling is commonly used in large-vocabulary Automatic Speech Recognition (ASR) systems as a way to model coarticulatory variations that occur during speech production. Typically, the local phoneme context is used as a means to define context-dependent units. Because the number of possible context-dependent units can grow exponentially with the length of the contexts, many units will not have enough training examples to train a robust model, resulting in a data sparsity problem. For nearly two decades, this data sparsity problem has been dealt with by a clustering-based framework which systematically groups different context-dependent units into clusters such that each cluster can have enough data. Although it addresses the data sparsity issue, the clustering-based approach also makes all context-dependent units within a cluster have the same acoustic score, resulting in a quantization effect that can potentially limit the performance of the context-dependent model. In this work, a multi-level acoustic modeling framework is proposed to address both the data sparsity problem and the quantization effect. Under the multi-level framework, each context-dependent unit is associated with classifiers that target multiple levels of contextual resolution, and the outputs of the classifiers are linearly combined for scoring during recognition. By choosing the classifiers judiciously, both the data sparsity problem and the quantization effect can be dealt with. The proposed multi-level framework can also be integrated into existing large-vocabulary ASR systems, such as FST-based ASR systems, and is compatible with state-of-the-art error reduction techniques for ASR systems, such as discriminative training methods. Multiple sets of experiments have been conducted to compare the performance of the clustering-based acoustic model and the proposed multi-level model. In a phonetic recognition experiment on TIMIT, the multi-level model has about 8% relative improvement in terms of phone error rate, showing that the multi-level framework can help improve phonetic prediction accuracy. In a large-vocabulary transcription task, combining the proposed multi-level modeling framework with discriminative training can provide more than 20% relative improvement over a clustering baseline model in terms of Word Error Rate (WER), showing that the multi-level framework can be integrated into existing large-vocabulary decoding frameworks and that it combines well with discriminative training methods. In a speaker-adaptive transcription task, the multi-level model has about 14% relative WER improvement, showing that the proposed framework can adapt better to new speakers, and potentially to new environments, than the conventional clustering-based approach. By Hung-An Chang. Ph.D.
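The abstract above describes scoring each context-dependent unit by linearly combining classifiers at several levels of contextual resolution, and it reports results as relative error-rate improvements. The sketch below illustrates both ideas under stated assumptions: the four level names, the weights, the rule of skipping a level with no trained classifier, and all numeric values are hypothetical and are not the thesis's actual configuration.

```python
from typing import Dict, Tuple

Triphone = Tuple[str, str, str]  # (left context, center phone, right context)


def level_keys(unit: Triphone) -> Dict[str, tuple]:
    """Map a triphone to its key at each (assumed) level of context resolution."""
    left, center, right = unit
    return {
        "triphone": (left, center, right),
        "left_biphone": (left, center),
        "right_biphone": (center, right),
        "monophone": (center,),
    }


def multi_level_score(
    unit: Triphone,
    scores: Dict[str, Dict[tuple, float]],
    weights: Dict[str, float],
) -> float:
    """Linearly combine per-level classifier outputs for one unit.

    Levels with no trained classifier for this unit are skipped
    (an assumption made for this sketch), so sparse fine-grained
    contexts still get a score from the coarser levels.
    """
    total = 0.0
    for level, key in level_keys(unit).items():
        level_scores = scores.get(level, {})
        if key in level_scores:
            total += weights[level] * level_scores[key]
    return total


def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Relative error reduction, e.g. 0.2 means a 20% relative improvement."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical classifier outputs (log-likelihood-style scores).
scores = {
    "triphone": {("a", "b", "c"): -1.0},
    "monophone": {("b",): -2.0},
}
weights = {"triphone": 0.5, "left_biphone": 0.25,
           "right_biphone": 0.25, "monophone": 0.25}

seen = multi_level_score(("a", "b", "c"), scores, weights)
unseen = multi_level_score(("x", "b", "c"), scores, weights)
```

In this toy setup the two triphones share a center phone but receive different scores, which is the contrast with a clustering-based model, where every unit in a cluster gets the same acoustic score.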

    Crowd-supervised training of spoken language systems

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 155-166). Spoken language systems are often deployed with static speech recognizers. Only rarely are parameters in the underlying language, lexical, or acoustic models updated on-the-fly. In the few instances where parameters are learned in an online fashion, developers traditionally resort to unsupervised training techniques, which are known to be inferior to their supervised counterparts. These realities make the development of spoken language interfaces a difficult and somewhat ad-hoc engineering task, since models for each new domain must be built from scratch or adapted from a previous domain. This thesis explores an alternative approach that makes use of human computation to provide crowd-supervised training for spoken language systems. We explore human-in-the-loop algorithms that leverage the collective intelligence of crowds of non-expert individuals to provide valuable training data at a very low cost for actively deployed spoken language systems. We also show that in some domains the crowd can be incentivized to provide training data for free, as a byproduct of interacting with the system itself. Through the automation of crowdsourcing tasks, we construct and demonstrate organic spoken language systems that grow and improve without the aid of an expert. Techniques that rely on collecting data remotely from non-expert users, however, are subject to the problem of noise. This noise can sometimes be heard in audio collected from poor microphones or muddled acoustic environments. Alternatively, noise can take the form of corrupt data from a worker trying to game the system - for example, a paid worker tasked with transcribing audio may leave transcripts blank in hopes of receiving a speedy payment. We develop strategies to mitigate the effects of noise in crowd-collected data and analyze their efficacy. This research spans a number of different application domains of widely-deployed spoken language interfaces, but maintains the common thread of improving the speech recognizer's underlying models with crowd-supervised training algorithms. We experiment with three central components of a speech recognizer: the language model, the lexicon, and the acoustic model. For each component, we demonstrate the utility of a crowd-supervised training framework. For the language model and lexicon, we explicitly show that this framework can be used hands-free, in two organic spoken language systems. By Ian C. McGraw. Ph.D.

    Crowd-powered systems

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. Cataloged from PDF version of thesis. Includes bibliographical references (p. 217-237). Crowd-powered systems combine computation with human intelligence, drawn from large groups of people connecting and coordinating online. These hybrid systems enable applications and experiences that neither crowds nor computation could support alone. Unfortunately, crowd work is error-prone and slow, making it difficult to incorporate crowds as first-order building blocks in software systems. I introduce computational techniques that decompose complex tasks into simpler, verifiable steps to improve quality, and optimize work to return results in seconds. These techniques develop crowdsourcing as a platform so that it is reliable and responsive enough to be used in interactive systems. This thesis develops these ideas through a series of crowd-powered systems. The first, Soylent, is a word processor that uses paid micro-contributions to aid writing tasks such as text shortening and proofreading. Using Soylent is like having access to an entire editorial staff as you write. The second system, Adrenaline, is a camera that uses crowds to help amateur photographers capture the exact right moment for a photo. It finds the best smile and catches subjects in mid-air jumps, all in realtime. Moving beyond generic knowledge and paid crowds, I introduce techniques to motivate a social network that has specific expertise, and techniques to data mine crowd activity traces in support of a large number of uncommon user goals. These systems point to a future where social and crowd intelligence are central elements of interaction, software, and computation. By Michael Scott Bernstein. Ph.D.

    Incorporating Weak Statistics for Low-Resource Language Modeling

    Automatic speech recognition (ASR) requires a strong language model to guide the acoustic model and favor likely utterances. While many tasks enjoy billions of language model training tokens, many domains which require ASR do not have readily available electronic corpora. The only source of useful language modeling data is expensive and time-consuming human transcription of in-domain audio. This dissertation seeks to quickly and inexpensively improve low-resource language modeling for use in automatic speech recognition. It first considers efficient use of non-professional human labor to best improve system performance, and demonstrates that it is better to collect more data, despite higher transcription error, than to redundantly transcribe data to improve quality. In the process of developing procedures to collect such data, this work also presents an efficient rating scheme to detect poor transcribers without gold standard data. As an alternative to this process, automatic transcripts are generated with an ASR system, and this work explores how to efficiently combine these low-quality transcripts with a small amount of high-quality transcripts. Standard n-gram language models are sensitive to the quality of the highest-order n-gram and are unable to exploit accurate weaker statistics. Instead, a log-linear language model is introduced, which elegantly incorporates a variety of background models through MAP adaptation. This work introduces marginal class constraints which effectively capture knowledge of transcriber error and improve performance over n-gram features. Finally, this work constrains the language modeling task to keyword search of words unseen in the training text. While overall system performance is good, these words suffer the most due to a low probability in the language model. Semi-supervised learning effectively extracts likely n-grams containing these new keywords from a large corpus of audio. By using a search metric that favors recall over precision, this method captures over 80% of the potential gain.

    Ritual Language and Traditional Power of the Rongga Ethnic Group (BAHASA RITUAL DAN KEKUASAAN TRADISIONAL ETNIK RONGGA)

    This paper describes traditional power in the context of the contemporary life of the Rongga ethnic group in Flores, East Nusa Tenggara (NTT). The study focuses on socio-ethnolinguistic aspects of ritual language, covering: (1) the linguistic and non-linguistic forms relevant to values of power; (2) the cultural value system associated with the values of power embedded in them; and (3) the processes by which that power was and is acquired, inherited, and maintained, as well as its future prospects within the sociopolitical dynamics of both East Manggarai and Indonesia. The aim is to determine the extent to which the (ritual) language of the Rongga interacts with power. This interaction is examined along two dimensions, traditional and contemporary, in terms of its dynamics in relation to efforts to conserve marginalized minority languages and cultures (Arka, 2013; 2015). The research is descriptive-qualitative with an ethnographic approach and continues earlier research on Rongga language and culture (Arka, 2010; Sumitri, 2015). The study's innovation lies in its proposed approach, which treats linguistic capital as part of other forms of capital (sociocultural and economic) (Morrison and Lui, 2000; Bourdieu, 1997). Data were collected through observation, interviews, document study, recording, and note-taking. Findings: Linguistically, the utterance units of the ritual language are distinctive: they are poetic and archaic, set in rhyming patterns with a high degree of difficulty in form and rhythm. Ethnolinguistically, the ritual language carries messages and meanings rich in the sociocultural values and knowledge of the Rongga. The ritual language is also supported by bodily behavior that reinforces the meaning of the message conveyed. The relation between power and ritual language arises naturally through a number of highly valued personal qualities that earn recognition of a person's position in the social hierarchy, such as ability, skill, and sensitivity in mastering esteemed customary (adat) knowledge, expressed in highly complex linguistic forms; these constitute a person's linguistic and cultural capital. The formation of high linguistic and cultural capital in a person is a complex process, combining personal qualities, verbal-linguistic talent, and disposition with the person's legitimacy. All of this is acquired traditionally through experience and is also genealogical (carrying spiritual authority), tied to a particular custom, ethnic group, or clan; together these become potential resources that accumulate into influence and the power to command the obedience and respect of other community members. Although traditional power has declined, which can be well explained from the perspective of shifts in (language/cultural) ideology and the wider ecology of power, its function and role have not disappeared altogether. The paper argues that the dynamics of traditional power should be well documented and understood, and put to the best possible use for contemporary purposes. Its legitimacy has been eroded by the arrival of Indonesia's modern system of government and bureaucracy (which replaced the kedaluan system in the 1960s). Even so, the inheritance of traditional power still follows a line of power toward people who hold linguistic-cultural capital, generally influential customary leaders who command the ritual language and can draw on customary knowledge and the energy of customary institutions for various purposes, both ritual/traditional and contemporary. The full paper will further examine, comparatively, the positive and negative effects of the separation of leadership (recruitment) and power at the local level (traditional/customary vs. modern), in the context of wider cultural/linguistic capital in Indonesia.