
    Spoken content retrieval: A survey of techniques and technologies

    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
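
    The survey's framing of SCR as speech processing feeding an IR back end can be illustrated with a minimal sketch: hypothetical ASR transcripts are indexed with a toy inverted index and queried by keyword. The transcripts, tokenizer, and term-frequency scoring below are simplifying assumptions for illustration, not components described in the survey.

        # Minimal sketch of a spoken content retrieval pipeline:
        # (assumed) 1-best ASR transcripts -> inverted index -> keyword query.
        from collections import defaultdict

        # Hypothetical ASR output keyed by recording id.
        asr_transcripts = {
            "talk_001": "the meeting covered budget planning for next year",
            "talk_002": "informal conversation about weekend travel plans",
            "talk_003": "budget discussion and travel reimbursement policy",
        }

        def tokenize(text):
            return text.lower().split()

        # Inverted index: term -> {recording id: term frequency}.
        index = defaultdict(lambda: defaultdict(int))
        for doc_id, transcript in asr_transcripts.items():
            for term in tokenize(transcript):
                index[term][doc_id] += 1

        def search(query):
            """Rank recordings by summed term frequency of the query words."""
            scores = defaultdict(int)
            for term in tokenize(query):
                for doc_id, tf in index.get(term, {}).items():
                    scores[doc_id] += tf
            return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

        print(search("budget travel"))  # talk_003 matches both query terms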

    Automatic Text Summarization of Newswire: Lessons Learned from the Document Understanding Conference

    Since 2001, the Document Understanding Conferences have been the forum for researchers in automatic text summarization to compare methods and results on common test sets. Over the years, several types of summarization tasks have been addressed: single-document summarization, multi-document summarization, summarization focused by question, and headline generation. This paper is an overview of the results achieved across these types of summarization tasks. We compare both the broader classes of summarizers (baselines, systems, and humans) and individual pairs of summarizers (both human and automatic). An analysis of variance model is fitted, with summarizer and input set as independent variables and the coverage score as the dependent variable, and simulation-based multiple comparisons are performed. The results document the progress of the field as a whole, rather than focusing on a single system, and can thus serve as a reference for the work done to date, as well as a starting point in the formulation of future tasks. Results also indicate that most progress in the field has been achieved in generic multi-document summarization and that the most challenging task is producing a focused summary in answer to a question/topic.
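
    The analysis-of-variance setup described above can be sketched in a few lines, assuming the coverage scores are available in tabular form. The toy scores, system names, and the use of pandas and statsmodels are assumptions for illustration, and the simulation-based multiple comparisons are not reproduced here.

        # Sketch of the two-way ANOVA: coverage score modeled as a function of
        # summarizer and input set. The toy scores below are invented; real
        # DUC coverage data would be loaded instead.
        import pandas as pd
        import statsmodels.api as sm
        import statsmodels.formula.api as smf

        scores = pd.DataFrame({
            "summarizer": ["sysA", "sysA", "sysB", "sysB", "human", "human"],
            "input_set":  ["d01",  "d02",  "d01",  "d02",  "d01",   "d02"],
            "coverage":   [0.31,   0.28,   0.35,   0.30,   0.52,    0.49],
        })

        # Summarizer and input set enter as categorical factors.
        model = smf.ols("coverage ~ C(summarizer) + C(input_set)", data=scores).fit()
        print(sm.stats.anova_lm(model, typ=2))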

    Detecting (Un)Important Content for Single-Document News Summarization

    We present a robust approach for detecting intrinsic sentence importance in news by training on two corpora of document-summary pairs. When used for single-document summarization, our approach, combined with the "beginning of document" heuristic, outperforms a state-of-the-art summarizer and the beginning-of-article baseline in both automatic and manual evaluations. These results represent an important advance because, in the absence of cross-document repetition, single-document summarizers for news have not been able to consistently outperform the strong beginning-of-article baseline. Comment: Accepted by EACL 201
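
    A minimal sketch of how a learned sentence-importance score might be blended with the beginning-of-document heuristic is given below. The frequency-based scorer, the blend weight, and the toy article are stand-ins for the trained model and data, which are not reproduced here.

        # Sketch: blend a content-based importance score with a position prior.
        from collections import Counter

        def importance_score(sentence, doc_word_counts):
            """Stand-in for a learned importance model: mean word weight."""
            words = sentence.lower().split()
            if not words:
                return 0.0
            return sum(doc_word_counts[w] for w in words) / len(words)

        def summarize(sentences, n=2, position_weight=0.3):
            counts = Counter(w for s in sentences for w in s.lower().split())
            scored = []
            for i, sent in enumerate(sentences):
                content = importance_score(sent, counts)
                position = 1.0 / (i + 1)  # beginning-of-document prior
                scored.append(((1 - position_weight) * content
                               + position_weight * position, i, sent))
            top = sorted(scored, reverse=True)[:n]
            # Return the selected sentences in document order.
            return [sent for _, _, sent in sorted(top, key=lambda t: t[1])]

        doc = [
            "The city council approved the new transit budget on Monday.",
            "Officials said the plan expands bus service to three districts.",
            "Critics argued the vote was rushed.",
            "The budget takes effect in January.",
        ]
        print(summarize(doc))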

    Automatic Detection of Contrastive Elements in Spontaneous Speech

    In natural speech, people use different levels of prominence to signal which parts of an utterance are especially important. Contrastive elements are often produced with stronger than usual prominence, and their presence modifies the meaning of the utterance in subtle but important ways. We use a richly annotated corpus of conversational speech to study the acoustic characteristics of contrastive elements and the differences between them and words at other levels of prominence. We report our results for automatic detection of contrastive elements based on acoustic and textual features, finding that a baseline which predicts nouns and adjectives as contrastive performs on par with the best combination of features. We achieve much better performance on a modified task of detecting contrastive elements among words that are predicted to bear pitch accent.
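
    The part-of-speech baseline mentioned above (predicting every noun and adjective as contrastive) can be sketched as follows. The tagged tokens and gold labels are invented, whereas the paper's experiments use acoustic and textual features from an annotated conversational corpus.

        # Sketch of the POS baseline: mark every noun or adjective as
        # contrastive and score against (invented) gold labels.
        tokens = [
            # (word, POS tag, gold_is_contrastive)
            ("I",      "PRP", False),
            ("wanted", "VBD", False),
            ("red",    "JJ",  True),
            ("wine",   "NN",  False),
            ("not",    "RB",  False),
            ("white",  "JJ",  True),
        ]

        def pos_baseline(tag):
            """Predict 'contrastive' for nouns (NN*) and adjectives (JJ*)."""
            return tag.startswith("NN") or tag.startswith("JJ")

        predictions = [pos_baseline(tag) for _, tag, _ in tokens]
        gold = [label for _, _, label in tokens]

        tp = sum(p and g for p, g in zip(predictions, gold))
        fp = sum(p and not g for p, g in zip(predictions, gold))
        fn = sum(g and not p for p, g in zip(predictions, gold))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        print(f"precision={precision:.2f} recall={recall:.2f}")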

    Predicting the Fluency of Text with Shallow Structural Features: Case Studies of Machine Translation and Human-Written Text

    Sentence fluency is an important component of overall text readability, but few studies in natural language processing have sought to understand the factors that define it. We report the results of an initial study into the predictive power of surface syntactic statistics for this task, using fluency assessments originally produced for evaluating machine translation. We find that these features are weakly but significantly correlated with fluency. Machine and human translations can be distinguished with accuracy over 80%. Performance on pairwise comparison of fluency is also very high: over 90% for a multi-layer perceptron classifier. We also test the hypothesis that the learned models capture general fluency properties applicable to human-written text. The results do not support this hypothesis: prediction accuracy on the new data is only 57%. This finding suggests that developing a dedicated, task-independent corpus of fluency judgments would be beneficial for further investigations of the problem.
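
    A rough sketch of pairwise fluency comparison with shallow features and a multi-layer perceptron is shown below. The three surface features, the toy sentence pairs, and the use of scikit-learn are simplified assumptions, not the feature set used in the study.

        # Sketch: represent a sentence pair by the difference of shallow
        # feature vectors and train an MLP to pick the more fluent sentence.
        import numpy as np
        from sklearn.neural_network import MLPClassifier

        def shallow_features(sentence):
            words = sentence.split()
            return np.array([
                len(words),                                        # sentence length
                sum(len(w) for w in words) / max(len(words), 1),   # mean word length
                sentence.count(","),                               # phrase-boundary proxy
            ])

        # (more fluent, less fluent) toy pairs.
        pairs = [
            ("The committee approved the proposal yesterday.",
             "Committee the proposal approved yesterday it was."),
            ("She quickly finished reading the report.",
             "She the report reading finished , quickly quickly."),
            ("Prices rose sharply in the last quarter.",
             "In last the quarter sharply prices , rose rose rose."),
        ]
        X = np.array([shallow_features(a) - shallow_features(b) for a, b in pairs]
                     + [shallow_features(b) - shallow_features(a) for a, b in pairs])
        y = np.array([1] * len(pairs) + [0] * len(pairs))  # 1 = first sentence more fluent

        clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                            random_state=0).fit(X, y)
        print(clf.predict(X))  # sanity check on the training pairs themselves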

    Creating Local Coherence: An Empirical Assessment

    Two of the mechanisms for creating natural transitions between adjacent sentences in a text, and thereby local coherence, are discourse relations and switches of the focus of attention between discourse entities. These two aspects of local coherence have traditionally been discussed and studied separately, but several empirical studies have provided strong evidence that it is necessary to understand how the two types of coherence-creating devices interact. Here we present a joint corpus study of discourse relations and entity coherence in news texts from the Wall Street Journal and test several hypotheses about their interaction put forward in earlier work.
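
    One of the two devices discussed here, switches of focus between discourse entities, can be illustrated with a crude sketch that checks whether adjacent sentences share an entity mention. Treating capitalized words as entities is an assumption made only for this sketch; the study itself relies on annotated Wall Street Journal data.

        # Sketch: detect entity continuation vs. switch between adjacent sentences.
        def entity_mentions(sentence):
            # Crude proxy: capitalized tokens stand in for entity mentions.
            return {w.strip(".,") for w in sentence.split() if w[0].isupper()}

        text = [
            "Microsoft reported strong quarterly earnings.",
            "Microsoft also announced a new cloud partnership.",
            "Analysts expect the sector to keep growing.",
        ]
        for prev, curr in zip(text, text[1:]):
            shared = entity_mentions(prev) & entity_mentions(curr)
            status = "CONTINUE" if shared else "SWITCH"
            print(status, sorted(shared))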

    Improving the Estimation of Word Importance for News Multi-Document Summarization - Extended Technical Report

    In this paper, we propose a supervised model for ranking word importance that incorporates a rich set of features. Our model is superior to prior approaches at identifying words used in human summaries. Moreover, we show that an extractive summarizer which incorporates our estimate of word importance produces summaries comparable to the state of the art under automatic evaluation.
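
    The evaluation angle described above, checking whether highly ranked words are those used in human summaries, can be sketched as follows. The frequency-based ranker, the stopword list, and the toy texts are stand-ins for the supervised model and its rich feature set.

        # Sketch: rank document words by a simple importance score and measure
        # how many of the top-ranked words appear in a human summary.
        from collections import Counter

        document = ("the storm closed major highways across the region "
                    "officials said the storm damage could exceed two million dollars "
                    "power crews worked overnight to restore electricity").split()
        human_summary_words = set(
            "storm damage closed highways power outages region".split())

        stopwords = {"the", "a", "to", "of", "said", "could"}
        ranked = [w for w, _ in
                  Counter(w for w in document if w not in stopwords).most_common()]

        top_k = ranked[:5]
        hits = [w for w in top_k if w in human_summary_words]
        print("top words:", top_k)
        print(f"precision@5 = {len(hits) / 5:.2f}")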