1,087 research outputs found
Identifying Unclear Questions in Community Question Answering Websites
Thousands of complex natural language questions are submitted to community
question answering websites on a daily basis, making them one of the most
important information sources available today. However, submitted questions
are often unclear and cannot be answered without further clarification
questions from expert community members. This study is the first to
investigate the complex task of classifying a question as clear or unclear,
i.e., determining whether it requires further clarification. We construct a
novel dataset and propose a classification approach based on the notion of
similar questions. This approach is compared to state-of-the-art text
classification baselines. Our main finding is that the similar questions
approach is a viable alternative that can be used as a stepping stone towards
the development of supportive user interfaces for question formulation.
Comment: Proceedings of the 41st European Conference on Information Retrieval
(ECIR '19), 2019
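The abstract's similar-questions idea can be sketched as nearest-neighbour label transfer: label a new question by the labels of the most similar already-answered questions. The TF-IDF similarity and the toy labelled set below are illustrative assumptions, not the paper's exact method or data.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build simple TF-IDF vectors (term -> weight) for tokenized docs."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify_by_similar_questions(query, labelled, k=3):
    """Label a question by majority vote over its k most similar labelled questions."""
    docs = [q.lower().split() for q, _ in labelled]
    vecs = tfidf_vectors(docs + [query.lower().split()])
    qv = vecs[-1]
    sims = sorted(((cosine(qv, v), lab) for v, (_, lab) in zip(vecs[:-1], labelled)),
                  reverse=True)
    top = [lab for _, lab in sims[:k]]
    return max(set(top), key=top.count)
```

A supportive interface could run this at question-submission time and warn the asker when the nearest neighbours are mostly "unclear".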
A Study on Dialog Act Recognition using Character-Level Tokenization
Dialog act recognition is an important step for dialog systems since it
reveals the intention behind the uttered words. Most approaches to the task use
word-level tokenization. In contrast, this paper explores the use of
character-level tokenization. This is relevant since there is information at
the sub-word level that is related to the function of the words and, thus,
their intention. We also explore the use of different context windows around
each token, which are able to capture important elements, such as affixes.
Furthermore, we assess the importance of punctuation and capitalization. We
performed experiments on both the Switchboard Dialog Act Corpus and the DIHANA
Corpus. In both cases, the experiments not only show that character-level
tokenization leads to better performance than the typical word-level
approaches, but also that both approaches are able to capture complementary
information. Thus, the best results are achieved by combining tokenization at
both levels.
Comment: 11 pages, 2 figures, 4 tables, AIMSA 2018
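The character-level tokenization with context windows described above can be sketched as follows; the window width and padding symbol are illustrative choices, not taken from the paper.

```python
def char_windows(utterance, width=3, pad="#"):
    """Character-level tokens with a symmetric context window around each
    character. A window of a few characters can capture sub-word cues such
    as affixes (e.g. '-ing', 'un-') that word-level tokens discard."""
    padded = pad * width + utterance + pad * width
    windows = []
    for i in range(width, width + len(utterance)):
        left = padded[i - width:i]            # context before the character
        right = padded[i + 1:i + 1 + width]   # context after the character
        windows.append((left, padded[i], right))
    return windows
```

Each (left, center, right) triple would then be embedded and fed to the dialog act classifier in place of (or alongside) word tokens.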
Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity
A high degree of topical diversity is often considered to be an important
characteristic of interesting text documents. A recent proposal for measuring
topical diversity identifies three elements for assessing diversity: words,
topics, and documents as collections of words. Topic models play a central role
in this approach. Using standard topic models for measuring diversity of
documents is suboptimal due to generality and impurity. General topics only
include common information from a background corpus and are assigned to most of
the documents in the collection. Impure topics contain words that are not
related to the topic; impurity lowers the interpretability of topic models and
impure topics are likely to get assigned to documents erroneously. We propose a
hierarchical re-estimation approach for topic models to combat generality and
impurity; the proposed approach operates at three levels: words, topics, and
documents. Our re-estimation approach for measuring documents' topical
diversity outperforms the state of the art on the PubMed dataset, which is
commonly used for diversity experiments.
Comment: Proceedings of the 39th European Conference on Information Retrieval
(ECIR 2017)
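To make the dependence of diversity measurement on the topic model concrete: one simple proxy (not the paper's actual measure, which combines words, topics, and documents) is the entropy of a document's topic distribution. A document dominated by a single general topic scores near zero; one spread evenly over many topics scores high.

```python
import math

def topic_entropy(p):
    """Shannon entropy (bits) of a document's topic distribution p,
    a simple proxy for topical diversity."""
    return -sum(x * math.log(x, 2) for x in p if x > 0)
```

Under this proxy, the generality problem is visible directly: if a background topic is assigned to most documents, their distributions all flatten and diversity scores lose discriminative power, which is what the hierarchical re-estimation is meant to combat.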
Gold Standard Online Debates Summaries and First Experiments Towards Automatic Summarization of Online Debate Data
Usage of online textual media is steadily increasing. Daily, more and more
news stories, blog posts and scientific articles are added to the online
volumes. These are all freely accessible and have been employed extensively in
multiple research areas, e.g. automatic text summarization, information
retrieval, information extraction, etc. Meanwhile, online debate forums have
recently become popular but remain largely unexplored, and sufficient
annotated debate data for research in this genre is not yet available. In
this paper, we collected and annotated debate data for an automatic
summarization task. As in extractive gold-standard summary generation, our
data contains sentences worthy of inclusion in a summary. Five human
annotators performed this task. Inter-annotator
agreement, based on semantic similarity, is 36% for Cohen's kappa and 48% for
Krippendorff's alpha. Moreover, we also implement an extractive summarization
system for online debates and discuss prominent features for the task of
summarizing online debate data automatically.
Comment: accepted and presented at CICLing 2017, the 18th International
Conference on Intelligent Text Processing and Computational Linguistics
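The Cohen's kappa figure reported above is a chance-corrected agreement statistic; it can be computed as below. The two toy annotation sequences are illustrative, not the paper's data.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(a) == len(b) and a
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n                    # observed
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)    # by chance
    return (po - pe) / (1 - pe)
```

Note the paper computes agreement over a semantic-similarity relaxation of exact label matches, so plugging raw labels into this function would not reproduce the 36% figure directly.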
Introducing a framework to assess newly created questions with Natural Language Processing
Statistical models such as those derived from Item Response Theory (IRT)
enable the assessment of students on a specific subject, which can be useful
for several purposes (e.g., learning path customization, drop-out prediction).
However, the questions have to be assessed as well and, although it is possible
to estimate with IRT the characteristics of questions that have already been
answered by several students, this technique cannot be used on newly generated
questions. In this paper, we propose a framework to train and evaluate models
for estimating the difficulty and discrimination of newly created Multiple
Choice Questions by extracting meaningful features from the text of the
question and of the possible choices. We implement one model using this
framework and test it on a real-world dataset provided by CloudAcademy, showing
that it outperforms previously proposed models, reducing by 6.7% the RMSE for
difficulty estimation and by 10.8% the RMSE for discrimination estimation. We
also present the results of an ablation study performed to support our choice
of features and to show the effects of different characteristics of the
questions' text on difficulty and discrimination.
Comment: Accepted at the International Conference on Artificial Intelligence
in Education
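The difficulty and discrimination parameters being estimated are those of IRT's standard two-parameter logistic (2PL) model, in which the probability of a correct answer is a logistic function of the gap between student ability and item difficulty, scaled by discrimination:

```python
import math

def prob_correct(theta, a, b):
    """2PL IRT item response function: probability that a student with
    ability theta answers correctly an item with discrimination a and
    difficulty b.  P = 1 / (1 + exp(-a * (theta - b)))."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))
```

When theta equals the difficulty b, the probability is exactly 0.5; a larger discrimination a makes the curve steeper around b. The paper's contribution is predicting a and b for brand-new questions from text features, since the usual IRT fit needs response data that new questions do not yet have.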
Locating bugs without looking back
Bug localisation is a core program comprehension task in software maintenance: given the observation of a bug, e.g. via a bug report, where is it located in the source code? Information retrieval (IR) approaches see the bug report as the query, and the source code files as the documents to be retrieved, ranked by relevance. Such approaches have the advantage of not requiring expensive static or dynamic analysis of the code. However, current state-of-the-art IR approaches rely on project history, in particular previously fixed bugs or previous versions of the source code. We present a novel approach that directly scores each current file against the given report, thus not requiring past code and reports. The scoring method is based on heuristics identified through manual inspection of a small sample of bug reports. We compare our approach to eight others, using their own five metrics on their own six open source projects. Out of 30 performance indicators, we improve 27 and equal 2. Over the projects analysed, on average we find one or more affected files in the top 10 ranked files for 76% of the bug reports. These results show the applicability of our approach to software projects without history.
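The report-as-query, files-as-documents setup can be sketched as below. The two heuristics shown (lexical overlap plus a boost when a file's name appears in the report) are illustrative stand-ins, not the paper's actual heuristics, and the example report and files are hypothetical.

```python
def score_file(report, path, content):
    """Heuristic relevance of one source file to a bug report:
    term overlap, plus a strong boost if the file's base name is
    mentioned verbatim in the report text."""
    report_lower = report.lower()
    overlap = len(set(report_lower.split()) & set(content.lower().split()))
    base = path.rsplit("/", 1)[-1].rsplit(".", 1)[0].lower()
    bonus = 10 if base in report_lower else 0
    return overlap + bonus

def rank_files(report, files):
    """Rank (path, content) pairs by descending score, IR-style."""
    return sorted(files, key=lambda f: score_file(report, f[0], f[1]),
                  reverse=True)
```

Because scoring needs only the current snapshot of the code and the report, this style of approach works on projects with no issue-tracker or version history, which is the paper's point.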
Word-Graph Based Applications for Handwriting Documents: Impact of Word-Graph Size on Their Performances
Computer Assisted Transcription of Text Images (CATTI)
and Key-Word Spotting (KWS) applications aim at transcribing and
indexing handwritten documents, respectively. Both are approached
by means of Word Graphs (WGs) obtained using segmentation-free handwritten
text recognition technology based on N-gram language models
and hidden Markov models. A large WG contains most of the relevant
information of the original text (line) image needed for CATTI and
KWS but, if it is too large, the computational cost of generating and
using it can become unaffordable. Conversely, if it is too small, relevant
information may be lost, leading to a reduction in CATTI/KWS accuracy.
We study the trade-off between WG size and CATTI and KWS performance in
terms of effectiveness and efficiency. Results show that small,
computationally cheap WGs can be used without losing the excellent
CATTI/KWS performance achieved with huge WGs.
Work partially supported by the Spanish MICINN project STraDA
(TIN2012-37475-C02-01) and by the EU 7th FP tranScriptorium project
(Ref: 600707).
Toselli, A.H.; Romero Gómez, V.; Vidal Ruiz, E. (2015). Word-Graph Based
Applications for Handwriting Documents: Impact of Word-Graph Size on Their
Performances. In: Pattern Recognition and Image Analysis, pp. 253-261.
Springer. https://doi.org/10.1007/978-3-319-19390-8_29
Measuring Accuracy of Automated Parsing and Categorization Tools and Processes in Digital Investigations
This work presents a method for the measurement of the accuracy of evidential
artifact extraction and categorization tasks in digital forensic
investigations. Instead of focusing on the measurement of accuracy and errors
in the functions of digital forensic tools, this work proposes the application
of information retrieval measurement techniques that allow the incorporation of
errors introduced by tools and analysis processes. This method uses a `gold
standard' that is the collection of evidential objects determined by a digital
investigator from suspect data with an unknown ground truth. This work proposes
that the accuracy of tools and investigation processes can be evaluated
against the derived gold standard using common precision and recall measures.
Two example case studies are presented showing the measurement of the accuracy
of automated analysis tools as compared to an in-depth analysis by an expert.
It is shown that such measurement can allow investigators to determine changes
in accuracy of their processes over time, and determine if such a change is
caused by their tools or knowledge.
Comment: 17 pages, 2 appendices, 1 figure, 5th International Conference on
Digital Forensics and Cyber Crime; Digital Forensics and Cyber Crime, pp.
147-169, 201
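The precision and recall comparison against the investigator-derived gold standard amounts to set arithmetic over artifact identifiers; the artifact ids in the example are hypothetical.

```python
def precision_recall(retrieved, gold):
    """Precision and recall of a tool's extracted artifacts against the
    investigator's gold-standard set (both given as collections of ids)."""
    retrieved, gold = set(retrieved), set(gold)
    tp = len(retrieved & gold)                      # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall
```

Tracking these two numbers per case over time is what lets an investigator tell whether a drop in accuracy comes from a tool change or from their own analysis process.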
Word Embeddings for Entity-annotated Texts
Learned vector representations of words are useful tools for many information
retrieval and natural language processing tasks due to their ability to capture
lexical semantics. However, while many such tasks involve or even rely on named
entities as central components, popular word embedding models have so far
failed to include entities as first-class citizens. While it seems intuitive
that annotating named entities in the training corpus should result in more
intelligent word features for downstream tasks, performance issues arise when
popular embedding approaches are naively applied to entity annotated corpora.
Not only are the resulting entity embeddings less useful than expected, but one
also finds that the performance of the non-entity word embeddings degrades in
comparison to those trained on the raw, unannotated corpus. In this paper, we
investigate approaches to jointly train word and entity embeddings on a large
corpus with automatically annotated and linked entities. We discuss two
distinct approaches to the generation of such embeddings, namely the training
of state-of-the-art embeddings on raw-text and annotated versions of the
corpus, as well as node embeddings of a co-occurrence graph representation of
the annotated corpus. We compare the performance of annotated embeddings and
classical word embeddings on a variety of word similarity, analogy, and
clustering evaluation tasks, and investigate their performance in
entity-specific tasks. Our findings show that it takes more than training
popular word embedding models on an annotated corpus to create entity
embeddings with acceptable performance on common test cases. Based on these
results, we discuss how and when node embeddings of the co-occurrence graph
representation of the text can restore the performance.
Comment: This paper is accepted at the 41st European Conference on Information
Retrieval
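The co-occurrence graph representation mentioned above can be sketched as a weighted graph over a token stream in which linked entities appear as single tokens; the entity-token prefix `ENT:` and the window size are illustrative assumptions, not the paper's notation.

```python
from collections import defaultdict

def cooccurrence_graph(tokens, window=2):
    """Weighted co-occurrence graph over a token stream where linked
    entities are single tokens (e.g. 'ENT:Barack_Obama').  Edge weight
    counts how often two tokens appear within `window` positions."""
    graph = defaultdict(int)
    for i, t in enumerate(tokens):
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            u, v = sorted((t, tokens[j]))   # undirected edge, canonical order
            if u != v:
                graph[(u, v)] += 1
    return dict(graph)
```

A node-embedding method (e.g. a random-walk based one) would then be run on this graph to obtain the entity and word vectors that the paper compares against embeddings trained directly on the annotated text.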