1,160 research outputs found

Finding Relevant Answers in Software Forums

Author: GOTTOPATI Swapna
JIANG Jing
LO David
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2011

Abstract—Online software forums provide a huge amount of valuable content. Developers and users often ask questions and receive answers from such forums. The availability of a vast amount of thread discussions in forums provides ample opportunities for knowledge acquisition and summarization. For a given search query, current search engines use traditional information retrieval approach to extract webpages containin

CiteSeerX

Institutional Knowledge at Singapore Management University

A Comparative analysis: QA evaluation questions versus real-world queries

Author: Leveling Johannes
Publication date: 22/05/2010

This paper presents a comparative analysis of user queries to a web search engine, questions to a Q&A service (answers.com), and questions employed in question answering (QA) evaluations at TREC and CLEF. The analysis shows that user queries to search engines contain mostly content words (i.e. keywords) but lack structure words (i.e. stopwords) and capitalization. Thus, they resemble natural language input after case folding and stopword removal. In contrast, topics for QA evaluation and questions to answers.com mainly consist of fully capitalized and syntactically well-formed questions. Classification experiments using a naïve Bayes classifier show that stopwords play an important role in determining the expected answer type. A classification based on stopwords is considerably more accurate (47.5% accuracy) than a classification based on all query words (40.1% accuracy) or on content words (33.9% accuracy). To simulate user input, questions are preprocessed by case folding and stopword removal. Additional classification experiments aim at reconstructing the syntactic wh-word frame of a question, i.e. the embedding of the interrogative word. Results indicate that this part of questions can be reconstructed with moderate accuracy (25.7%), but for a classification problem with a much larger number of classes compared to classifying queries by expected answer type (2096 classes vs. 130 classes). Furthermore, eliminating stopwords can lead to multiple reconstructed questions with a different or with the opposite meaning (e.g. if negations or temporal restrictions are included). In conclusion, question reconstruction from short user queries can be seen as a new realistic evaluation challenge for QA systems.
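
The preprocessing step this abstract mentions (case folding followed by stopword removal) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the stopword list and the function name `simulate_user_query` are hypothetical, chosen only to show how a well-formed question collapses into a keyword-style query and how a negation can be lost along the way.

```python
# Hypothetical, deliberately tiny stopword list (an assumption for this sketch);
# a real experiment would use a fuller, language-specific list.
STOPWORDS = {"what", "is", "the", "of", "in", "who", "was", "not", "a", "an"}


def simulate_user_query(question: str) -> str:
    """Case-fold a question, strip edge punctuation, and drop stopwords."""
    tokens = question.lower().split()            # case folding + whitespace split
    cleaned = [t.strip("?.,!") for t in tokens]  # remove leading/trailing punctuation
    return " ".join(t for t in cleaned if t and t not in STOPWORDS)


print(simulate_user_query("What is the capital of Ireland?"))  # capital ireland
print(simulate_user_query("Who was not elected in 2004?"))     # elected 2004
```

In the second call the word "not" is removed along with the other stopwords, so the resulting query no longer distinguishes the question from its opposite. This is the kind of ambiguity the abstract describes for negations.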

Irish Universities

DCU Online Research Access Service

A Systematic Review on Stopword Removal Algorithms

Author: Jashanjot Kaur, Preetpal Kaur Buttar
Publication venue: Auricle Global Society of Education and Research
Publication date: 30/04/2018

Stopwords, also known as noise words, are words that carry little information and are usually not required. The concept of stopwords was introduced by H.P. Luhn in 1958. In the domain of information retrieval, effective indexing can be achieved by removing stopwords. Indexing is a technique for connecting or tagging documents with different search terms or criteria. The main motive for eliminating stopwords is to increase execution speed and accuracy. Removal not only shrinks the vector space but also helps to improve overall performance and to reduce the size of the text. To date, techniques for automatic stopword removal have been developed for languages such as English, Sanskrit, Arabic, and Chinese. In this paper, we discuss the different techniques that researchers have used to construct automated stopword lists in different languages.

International Journal on Future Revolution in Computer Science & Communication Engineering

Answering English queries in automatically transcribed Arabic speech

Author: Nwesri A
Scholer F
Tahaghoghi S
Publication venue: IEEE (USA)
Publication date: 01/01/2007

There are several well-known approaches to parsing Arabic text in preparation for indexing and retrieval. Techniques such as stemming and stopping have been shown to improve search results on written newswire dispatches, but few comparisons are available on other data sources. In this paper, we apply several alternative stemming and stopping approaches to Arabic text automatically extracted from the audio soundtrack of news video footage, and compare these with approaches that rely on machine translation of the underlying text. Using the TRECVID video collection and queries, we show that normalisation, stopword removal, and light stemming increase retrieval precision, but that heavy stemming and trigrams have a negative effect. We also show that the choice of machine translation engine plays a major role in retrieval effectiveness.

RMIT Research Repository

Text Mining Infrastructure in R

Author: David Meyer
Ingo Feinerer
Kurt Hornik

During the last decade text mining has become a widely used discipline utilizing statistical and machine learning methods. We present the tm package which provides a framework for text mining applications within R. We give a survey on text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis methods, text clustering, text classification and string kernels.

Research Papers in Economics

Experiments in terabyte searching, genomic retrieval and novelty detection for TREC 2004

Author: Blott Stephen
Boydell Oisín
Camous Fabrice
Ferguson Paul
Gaughan Georgina
Gurrin Cathal
Jones Gareth J.F.
Murphy Noel
O'Connor Noel E.
Smeaton Alan F.
Smyth Barry
Wilkins Peter
Publication venue: 'University of Aden - Faculty of Economics and Administration'
Publication date: 01/11/2004

In TREC2004, Dublin City University took part in three tracks, Terabyte (in collaboration with University College Dublin), Genomic and Novelty. In this paper we will discuss each track separately and present separate conclusions from this work. In addition, we present a general description of a text retrieval engine that we have developed in the last year to support our experiments into large scale, distributed information retrieval, which underlies all of the track experiments described in this document

Irish Universities

DCU Online Research Access Service

DEXTER: A workbench for automatic term extraction with specialized corpora

Author: Periñán-Pascual Carlos
Publication venue: 'Cambridge University Press (CUP)'
Publication date: 01/01/2018

Automatic term extraction has become a priority area of research within corpus processing. Despite the extensive literature in this field, there are still some outstanding issues that should be dealt with during the construction of term extractors, particularly those oriented to support research in terminology and terminography. In this regard, this article describes the design and development of DEXTER, an online workbench for the extraction of simple and complex terms from domain-specific corpora in English, French, Italian and Spanish. In this framework, three issues contribute to placing the most important terms in the foreground. First, unlike the elaborate morphosyntactic patterns proposed by most previous research, shallow lexical filters have been constructed to discard term candidates. Second, a large number of common stopwords are automatically detected by means of a method that relies on the IATE database together with the frequency distribution of the domain-specific corpus and a general corpus. Third, the term-ranking metric, which is grounded on the notions of salience, relevance and cohesion, is guided by the IATE database to display an adequate distribution of terms.

Financial support for this research has been provided by the DGI, Spanish Ministry of Education and Science, grant FFI2014-53788-C3-1-P.

Periñán-Pascual, C. (2018). DEXTER: A workbench for automatic term extraction with specialized corpora. Natural Language Engineering. 24(2):163-198. https://doi.org/10.1017/S1351324917000365

RiuNet