2,985 research outputs found
Spoken content retrieval: A survey of techniques and technologies
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight on how these fields are integrated to support research and development, thus addressing the core challenges of SCR
Identifying Relationships Among Sentences in Court Case Transcripts Using Discourse Relations
Case Law has a significant impact on the proceedings of legal cases.
Therefore, the information that can be obtained from previous court cases is
valuable to lawyers and other legal officials when performing their duties.
This paper describes a methodology of applying discourse relations between
sentences when processing text documents related to the legal domain. In this
study, we developed a mechanism to classify the relationships that can be
observed among sentences in transcripts of United States court cases. First, we
defined relationship types that can be observed between sentences in court case
transcripts. Then we classified pairs of sentences according to the
relationship type by combining a machine learning model and a rule-based
approach. The results obtained through our system were evaluated using human
judges. To the best of our knowledge, this is the first study where discourse
relationships between sentences have been used to determine relationships among
sentences in legal court case transcripts.Comment: Conference: 2018 International Conference on Advances in ICT for
Emerging Regions (ICTer
Extraction of Keyphrases from Text: Evaluation of Four Algorithms
This report presents an empirical evaluation of four algorithms for automatically extracting keywords and keyphrases from documents. The four algorithms are compared using five different collections of documents. For each document, we have a target set of keyphrases, which were generated by hand. The target keyphrases were generated for human readers; they were not tailored for any of the four keyphrase extraction algorithms. Each of the algorithms was evaluated by the degree to which the algorithms keyphrases matched the manually generated keyphrases. The four algorithms were (1) the AutoSummarize feature in Microsofts Word 97, (2) an algorithm based on Eric Brills part-of-speech tagger, (3) the Summarize feature in Veritys Search 97, and (4) NRCs Extractor algorithm. For all five document collections, NRCs Extractor yields the best match with the manually generated keyphrases
Learning to Extract Keyphrases from Text
Many academic journals ask their authors to provide a list of about five to fifteen key words, to appear on the first page of each article. Since these key words are often phrases of two or more words, we prefer to call them keyphrases. There is a surprisingly wide variety of tasks for which keyphrases are useful, as we discuss in this paper. Recent commercial software, such as Microsoft?s Word 97 and Verity?s Search 97, includes algorithms that automatically extract keyphrases from documents. In this paper, we approach the problem of automatically extracting keyphrases from text as a supervised learning task. We treat a document as a set of phrases, which the learning algorithm must learn to classify as positive or negative examples of keyphrases. Our first set of experiments applies the C4.5 decision tree induction algorithm to this learning task. The second set of experiments applies the GenEx algorithm to the task. We developed the GenEx algorithm specifically for this task. The third set of experiments examines the performance of GenEx on the task of metadata generation, relative to the performance of Microsoft?s Word 97. The fourth and final set of experiments investigates the performance of GenEx on the task of highlighting, relative to Verity?s Search 97. The experimental results support the claim that a specialized learning algorithm (GenEx) can generate better keyphrases than a general-purpose learning algorithm (C4.5) and the non-learning algorithms that are used in commercial software (Word 97 and Search 97)
Identifying Purpose Behind Electoral Tweets
Tweets pertaining to a single event, such as a national election, can number
in the hundreds of millions. Automatically analyzing them is beneficial in many
downstream natural language applications such as question answering and
summarization. In this paper, we propose a new task: identifying the purpose
behind electoral tweets--why do people post election-oriented tweets? We show
that identifying purpose is correlated with the related phenomenon of sentiment
and emotion detection, but yet significantly different. Detecting purpose has a
number of applications including detecting the mood of the electorate,
estimating the popularity of policies, identifying key issues of contention,
and predicting the course of events. We create a large dataset of electoral
tweets and annotate a few thousand tweets for purpose. We develop a system that
automatically classifies electoral tweets as per their purpose, obtaining an
accuracy of 43.56% on an 11-class task and an accuracy of 73.91% on a 3-class
task (both accuracies well above the most-frequent-class baseline). Finally, we
show that resources developed for emotion detection are also helpful for
detecting purpose
- …