1,087 research outputs found

    Identifying Unclear Questions in Community Question Answering Websites

    Thousands of complex natural language questions are submitted to community question answering websites on a daily basis, making them one of the most important information sources these days. However, submitted questions are often unclear and cannot be answered without further clarification questions by expert community members. This study is the first to investigate the complex task of classifying a question as clear or unclear, i.e., whether it requires further clarification. We construct a novel dataset and propose a classification approach that is based on the notion of similar questions. This approach is compared to state-of-the-art text classification baselines. Our main finding is that the similar questions approach is a viable alternative that can be used as a stepping stone towards the development of supportive user interfaces for question formulation. Comment: Proceedings of the 41st European Conference on Information Retrieval (ECIR '19), 201

    A Study on Dialog Act Recognition using Character-Level Tokenization

    Dialog act recognition is an important step for dialog systems since it reveals the intention behind the uttered words. Most approaches to the task use word-level tokenization. In contrast, this paper explores the use of character-level tokenization. This is relevant since there is information at the sub-word level that is related to the function of the words and, thus, their intention. We also explore the use of different context windows around each token, which are able to capture important elements, such as affixes. Furthermore, we assess the importance of punctuation and capitalization. We performed experiments on both the Switchboard Dialog Act Corpus and the DIHANA Corpus. In both cases, the experiments not only show that character-level tokenization leads to better performance than the typical word-level approaches, but also that both approaches are able to capture complementary information. Thus, the best results are achieved by combining tokenization at both levels. Comment: 11 pages, 2 figures, 4 tables, AIMSA 201

    Hierarchical Re-estimation of Topic Models for Measuring Topical Diversity

    A high degree of topical diversity is often considered to be an important characteristic of interesting text documents. A recent proposal for measuring topical diversity identifies three elements for assessing diversity: words, topics, and documents as collections of words. Topic models play a central role in this approach. Using standard topic models for measuring diversity of documents is suboptimal due to generality and impurity. General topics only include common information from a background corpus and are assigned to most of the documents in the collection. Impure topics contain words that are not related to the topic; impurity lowers the interpretability of topic models and impure topics are likely to get assigned to documents erroneously. We propose a hierarchical re-estimation approach for topic models to combat generality and impurity; the proposed approach operates at three levels: words, topics, and documents. Our re-estimation approach for measuring documents' topical diversity outperforms the state of the art on the PubMed dataset, which is commonly used for diversity experiments. Comment: Proceedings of the 39th European Conference on Information Retrieval (ECIR2017

    Gold Standard Online Debates Summaries and First Experiments Towards Automatic Summarization of Online Debate Data

    Usage of online textual media is steadily increasing. Daily, more and more news stories, blog posts and scientific articles are added to the online volumes. These are all freely accessible and have been employed extensively in multiple research areas, e.g. automatic text summarization, information retrieval, information extraction, etc. Meanwhile, online debate forums have recently become popular, but have remained largely unexplored. For this reason, there are insufficient resources of annotated debate data available for conducting research in this genre. In this paper, we collected and annotated debate data for an automatic summarization task. Similar to extractive gold standard summary generation, our data contains sentences worth including in a summary. Five human annotators performed this task. Inter-annotator agreement, based on semantic similarity, is 36% for Cohen's kappa and 48% for Krippendorff's alpha. Moreover, we also implement an extractive summarization system for online debates and discuss prominent features for the task of summarizing online debate data automatically. Comment: accepted and presented at the CICLING 2017 - 18th International Conference on Intelligent Text Processing and Computational Linguistic
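
    The abstract above reports inter-annotator agreement as Cohen's kappa. As a rough illustration only, the sketch below computes plain pairwise Cohen's kappa from exact label matches between two annotators. It is not the authors' procedure, which is based on semantic similarity and involves five annotators; the function name `cohens_kappa` and the toy labels are assumptions made for this example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: 1 = sentence belongs in the summary, 0 = it does not.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 0, 0, 0, 1, 1, 1, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

    With these toy labels the two annotators agree on 6 of 8 sentences while chance agreement is 0.5, so kappa comes out at 0.5.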

    Introducing a framework to assess newly created questions with Natural Language Processing

    Statistical models such as those derived from Item Response Theory (IRT) enable the assessment of students on a specific subject, which can be useful for several purposes (e.g., learning path customization, drop-out prediction). However, the questions have to be assessed as well and, although it is possible to estimate with IRT the characteristics of questions that have already been answered by several students, this technique cannot be used on newly generated questions. In this paper, we propose a framework to train and evaluate models for estimating the difficulty and discrimination of newly created Multiple Choice Questions by extracting meaningful features from the text of the question and of the possible choices. We implement one model using this framework and test it on a real-world dataset provided by CloudAcademy, showing that it outperforms previously proposed models, reducing the RMSE by 6.7% for difficulty estimation and by 10.8% for discrimination estimation. We also present the results of an ablation study performed to support our choice of features and to show the effects of different characteristics of the questions' text on difficulty and discrimination. Comment: Accepted at the International Conference of Artificial Intelligence in Educatio

    Locating bugs without looking back

    Bug localisation is a core program comprehension task in software maintenance: given the observation of a bug, e.g. via a bug report, where is it located in the source code? Information retrieval (IR) approaches see the bug report as the query, and the source code files as the documents to be retrieved, ranked by relevance. Such approaches have the advantage of not requiring expensive static or dynamic analysis of the code. However, current state-of-the-art IR approaches rely on project history, in particular previously fixed bugs or previous versions of the source code. We present a novel approach that directly scores each current file against the given report, thus not requiring past code and reports. The scoring method is based on heuristics identified through manual inspection of a small sample of bug reports. We compare our approach to eight others, using their own five metrics on their own six open source projects. Out of 30 performance indicators, we improve 27 and equal 2. Over the projects analysed, on average we find one or more affected files in the top 10 ranked files for 76% of the bug reports. These results show the applicability of our approach to software projects without history.
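
    The "top 10" figure quoted above is a hit rate at rank k: the share of bug reports for which at least one truly affected file appears among the first k ranked files. The sketch below shows one way to compute such a figure. The function name `hit_rate_at_k`, the data layout, and the file names are assumptions made for illustration, not the paper's evaluation code.

```python
def hit_rate_at_k(rankings, affected, k=10):
    """Fraction of bug reports with at least one affected file in the top k.

    rankings: dict mapping report id -> list of files, best match first.
    affected: dict mapping report id -> set of files actually changed by the fix.
    """
    if not rankings:
        raise ValueError("no bug reports to evaluate")
    hits = sum(
        1
        for report_id, ranked_files in rankings.items()
        if set(ranked_files[:k]) & affected[report_id]
    )
    return hits / len(rankings)


# Hypothetical data: two bug reports, three ranked files each.
rankings = {
    "bug-1": ["a.java", "b.java", "c.java"],
    "bug-2": ["x.java", "y.java", "z.java"],
}
affected = {"bug-1": {"b.java"}, "bug-2": {"z.java"}}
rate_at_2 = hit_rate_at_k(rankings, affected, k=2)
rate_at_3 = hit_rate_at_k(rankings, affected, k=3)
print(rate_at_2, rate_at_3)
```

    In this toy data only "bug-1" has an affected file within the top 2, so the rate at k=2 is 0.5, and it rises to 1.0 at k=3.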

    Word-Graph Based Applications for Handwriting Documents: Impact of Word-Graph Size on Their Performances

    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-19390-8_29. Computer Assisted Transcription of Text Images (CATTI) and Key-Word Spotting (KWS) applications aim at transcribing and indexing handwritten documents, respectively. Both are approached by means of Word Graphs (WG) obtained using segmentation-free handwritten text recognition technology based on N-gram Language Models and Hidden Markov Models. A large WG contains most of the relevant information of the original text (line) image needed for CATTI and KWS but, if it is too large, the computational cost of generating and using it can become unaffordable. Conversely, if it is too small, relevant information may be lost, leading to a reduction in CATTI/KWS accuracy. We study the trade-off between WG size and CATTI and KWS performance in terms of effectiveness and efficiency. Results show that small, computationally cheap WGs can be used without losing the excellent CATTI/KWS performance achieved with huge WGs. Work partially supported by the Spanish MICINN projects STraDA (TIN2012-37475-C02-01) and by the EU 7th FP tranScriptorium project (Ref:600707). Toselli, AH.; Romero Gómez, V.; Vidal Ruiz, E. (2015). Word-Graph Based Applications for Handwriting Documents: Impact of Word-Graph Size on Their Performances. In Pattern Recognition and Image Analysis. Springer. 253-261. https://doi.org/10.1007/978-3-319-19390-8_29

    Measuring Accuracy of Automated Parsing and Categorization Tools and Processes in Digital Investigations

    This work presents a method for the measurement of the accuracy of evidential artifact extraction and categorization tasks in digital forensic investigations. Instead of focusing on the measurement of accuracy and errors in the functions of digital forensic tools, this work proposes the application of information retrieval measurement techniques that allow the incorporation of errors introduced by tools and analysis processes. This method uses a 'gold standard' that is the collection of evidential objects determined by a digital investigator from suspect data with an unknown ground truth. This work proposes that the accuracy of tools and investigation processes can be evaluated compared to the derived gold standard using common precision and recall values. Two example case studies are presented showing the measurement of the accuracy of automated analysis tools as compared to an in-depth analysis by an expert. It is shown that such measurement can allow investigators to determine changes in accuracy of their processes over time, and determine if such a change is caused by their tools or knowledge. Comment: 17 pages, 2 appendices, 1 figure, 5th International Conference on Digital Forensics and Cyber Crime; Digital Forensics and Cyber Crime, pp. 147-169, 201
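
    The evaluation idea described above, comparing what a tool extracts against an investigator-derived gold standard using precision and recall, can be sketched in a few lines. This is a generic illustration of the two measures, not the paper's tooling; the function name `precision_recall` and the artifact names are hypothetical.

```python
def precision_recall(found, gold):
    """Precision and recall of a tool's output against a gold-standard set."""
    found, gold = set(found), set(gold)
    true_positives = len(found & gold)
    # Precision: share of reported artifacts that are in the gold standard.
    precision = true_positives / len(found) if found else 0.0
    # Recall: share of gold-standard artifacts that the tool reported.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical artifacts: the investigator's gold standard vs. one tool's output.
gold_standard = {"img_001.jpg", "chat_17.db", "doc_42.pdf", "mail_09.eml"}
tool_output = {"img_001.jpg", "chat_17.db", "tmp_88.dat"}
precision, recall = precision_recall(tool_output, gold_standard)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

    Here the tool reports three artifacts, two of which are in the gold standard of four, giving a precision of 2/3 and a recall of 0.5. Tracking these two values across cases is one way to observe the changes in accuracy over time that the abstract mentions.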

    On true language understanding

    Word Embeddings for Entity-annotated Texts

    Learned vector representations of words are useful tools for many information retrieval and natural language processing tasks due to their ability to capture lexical semantics. However, while many such tasks involve or even rely on named entities as central components, popular word embedding models have so far failed to include entities as first-class citizens. While it seems intuitive that annotating named entities in the training corpus should result in more intelligent word features for downstream tasks, performance issues arise when popular embedding approaches are naively applied to entity-annotated corpora. Not only are the resulting entity embeddings less useful than expected, but one also finds that the performance of the non-entity word embeddings degrades in comparison to those trained on the raw, unannotated corpus. In this paper, we investigate approaches to jointly train word and entity embeddings on a large corpus with automatically annotated and linked entities. We discuss two distinct approaches to the generation of such embeddings, namely the training of state-of-the-art embeddings on raw-text and annotated versions of the corpus, as well as node embeddings of a co-occurrence graph representation of the annotated corpus. We compare the performance of annotated embeddings and classical word embeddings on a variety of word similarity, analogy, and clustering evaluation tasks, and investigate their performance in entity-specific tasks. Our findings show that it takes more than training popular word embedding models on an annotated corpus to create entity embeddings with acceptable performance on common test cases. Based on these results, we discuss how and when node embeddings of the co-occurrence graph representation of the text can restore the performance. Comment: This paper is accepted in 41st European Conference on Information Retrieva