2,347 research outputs found

    Crowdsourcing Multiple Choice Science Questions

    We present a novel method for obtaining high-quality, domain-targeted multiple choice questions from crowd workers. Generating these questions can be difficult without trading away originality, relevance or diversity in the answer options. Our method addresses these problems by leveraging a large corpus of domain-specific text and a small set of existing questions. It produces model suggestions for document selection and answer distractor choice which aid the human question generation process. With this method we have assembled SciQ, a dataset of 13.7K multiple choice science exam questions (Dataset available at http://allenai.org/data.html). We demonstrate that the method produces in-domain questions by providing an analysis of this new dataset and by showing that humans cannot distinguish the crowdsourced questions from original questions. When using SciQ as additional training data to existing questions, we observe accuracy improvements on real science exams. Comment: accepted for the Workshop on Noisy User-generated Text (W-NUT) 201

    Predictive model for detecting fake reviews: Exploring the possible enhancements of using word embeddings

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science. Fake data contaminates the insights that can be obtained about a product or service and ultimately hurts both businesses and consumers. Being able to correctly identify truthful reviews will ensure that consumers can more effectively find products that suit their needs. This paper aims to develop a predictive model for detecting fake hotel reviews using Natural Language Processing techniques and various Machine Learning models. Current research in this area has primarily focused on sentiment analysis and the detection of fake reviews using various text mining methods, including bag of words, tokenization, POS tagging and TF-IDF. The research mostly looks at some combination of quantitative and qualitative information. The text component is analyzed only with regard to which words appear in the review, while the semantic relationship between them is ignored. This research attempts to achieve a higher level of performance by implementing pretrained word embeddings during the preprocessing of the text data. The goal is to introduce some context to the text data and see how each model's performance changes. Traditional text mining models were applied to the dataset to provide a benchmark. Subsequently, GloVe, Word2Vec and BERT word embeddings were implemented and the performance of eight models was reviewed. The analysis shows a somewhat lower performance obtained by the word embeddings. It seems that in short texts, the appearance of words is more indicative of a fake review than the semantic meaning of those words.
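To illustrate why the bag-of-words and TF-IDF baselines mentioned above capture only which words appear, here is a minimal sketch in plain Python. It is not the dissertation's pipeline: the function name `tfidf`, the whitespace tokenizer, the unsmoothed `log(N / df)` weighting and the two example reviews are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting: term frequency (count / document length) multiplied
    by an unsmoothed inverse document frequency, log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical reviews, not taken from the dissertation's dataset.
reviews = ["great hotel great staff", "terrible hotel rude staff"]
weights = tfidf(reviews)
```

Under this weighting, a term that occurs in every review (such as "hotel" here) gets weight zero, and shuffling the words of a review leaves its weights unchanged, which is the sense in which word order and semantic relationships are ignored.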

    Argumentation Mining in User-Generated Web Discourse

    The goal of argumentation mining, an evolving research field in computational linguistics, is to design methods capable of analyzing people's argumentation. In this article, we go beyond the state of the art in several ways. (i) We deal with actual Web data and take up the challenges given by the variety of registers, multiple domains, and unrestricted noisy user-generated Web discourse. (ii) We bridge the gap between normative argumentation theories and argumentation phenomena encountered in actual data by adapting an argumentation model tested in an extensive annotation study. (iii) We create a new gold standard corpus (90k tokens in 340 documents) and experiment with several machine learning methods to identify argument components. We offer the data, source codes, and annotation guidelines to the community under free licenses. Our findings show that argumentation mining in user-generated Web discourse is a feasible but challenging task. Comment: Cite as: Habernal, I. & Gurevych, I. (2017). Argumentation Mining in User-Generated Web Discourse. Computational Linguistics 43(1), pp. 125-17

    Traceability Links Recovery among Requirements and BPMN models

    Thesis by compendium. Throughout the pages of this document, I present the results of the research that was carried out in the context of my PhD studies. During this research, I studied the process of Traceability Links Recovery between natural language requirements and industrial software models. More precisely, due to their popularity and extensive usage, I studied the process of Traceability Links Recovery between natural language requirements and Business Process Models, also known as BPMN models. To carry out the research, I focused my work on two main objectives: (1) the development of Traceability Links Recovery techniques between natural language requirements and BPMN models, and (2) the validation and analysis of the results obtained by the developed techniques in industrial domain case studies. The results of the research have been written up and published in forums, conferences, and journals specialized in the topics and context of the research. This thesis document introduces the topics, context, and objectives of the research, presents the academic publications that have been published as a result of the work, and then discusses the outcomes of the investigation. Lapeña Martí, R. (2020). Traceability Links Recovery among Requirements and BPMN models [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/149391

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims to provide a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain in other domains. Comment: Knowledge-based System

    Optimisation Method for Training Deep Neural Networks in Classification of Non-functional Requirements

    Non-functional requirements (NFRs) are regarded as critical to a software system's success. The majority of NFR detection and classification solutions have relied on supervised machine learning models. These solutions are hindered by the lack of labelled training data and require a significant amount of time for feature engineering. In this work we explore emerging deep learning techniques to reduce the burden of feature engineering. The goal of this study is to develop an autonomous system that can classify NFRs into multiple classes based on a labelled corpus. In the first section of the thesis, we standardise the NFR ontology and annotations to produce a corpus based on five attributes: usability, reliability, efficiency, maintainability, and portability. In the second section, we examine the design and implementation of four neural networks for classifying NFRs: an artificial neural network, a convolutional neural network (CNN), a long short-term memory network, and a gated recurrent unit. These models require a large corpus. To overcome this limitation, we propose a new approach to data augmentation. This method uses a sort-and-concatenate strategy to combine two phrases from the same class, resulting in a two-fold increase in data size while keeping the domain vocabulary intact. We compared our method to a baseline (no augmentation) and to an existing approach, Easy Data Augmentation (EDA), with pre-trained word embeddings. All training was performed under two data settings: augmentation of the entire dataset before the train/validation split versus augmentation of the train set only. Our findings show that, compared with EDA and the baseline, the NFR classification models improved greatly, and the CNN performed best, when trained using our suggested technique in the first setting. However, we saw only a slight boost in the second experimental setup, with train set augmentation alone. As a result, we conclude that augmentation of the validation set is required in order to achieve acceptable results with our proposed approach. We hope that our ideas will inspire new data augmentation techniques, whether generic or task-specific. Furthermore, it would be useful to implement this strategy in other languages.
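The abstract does not spell out how the sort-and-concatenate strategy orders or pairs phrases, so the following is only one possible reading, sketched under stated assumptions: phrases within a class are sorted by word count, and each phrase is concatenated with its successor in that order (wrapping around at the end). The function name `sort_and_concat_augment` and the example requirements are hypothetical.

```python
from collections import defaultdict


def sort_and_concat_augment(samples):
    """Return the original (text, label) pairs plus one synthetic pair per original.

    Assumptions (not confirmed by the source): phrases are sorted by word
    count within each class, and each phrase is joined with the next phrase
    in that order, wrapping around at the end of the class.
    """
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append(text)

    augmented = list(samples)
    for label, texts in by_label.items():
        if len(texts) < 2:
            continue  # a lone phrase has no same-class partner
        ordered = sorted(texts, key=lambda t: (len(t.split()), t))
        for i, text in enumerate(ordered):
            partner = ordered[(i + 1) % len(ordered)]
            augmented.append((f"{text} {partner}", label))
    return augmented


# Hypothetical NFR statements, for illustration only.
data = [
    ("The UI shall be intuitive", "usability"),
    ("Menus shall be consistent across pages", "usability"),
    ("The system shall recover from failure", "reliability"),
    ("Uptime shall exceed 99 percent", "reliability"),
]
augmented = sort_and_concat_augment(data)
```

Because each synthetic sample only reuses words from two phrases of the same class, the vocabulary is unchanged and the dataset doubles whenever every class has at least two phrases. Applying the function to the train split only, or to the full dataset before splitting, corresponds to the two experimental settings described above.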

    The Design of an Interactive Topic Modeling Application for Media Content

    Topic Modeling has been widely used by data scientists to analyze the increasing amount of text documents. Documents can be assigned to a distribution of topics with techniques like LDA or NMF, which are related to unsupervised soft clustering but consider text semantics. More recently, Interactive Topic Modeling (ITM) has been introduced to incorporate human expertise in the modeling process. This enables real-time hyperparameter optimization and topic manipulation at the document and keyword level. However, current ITM applications are mostly accessible to experienced data scientists, who lack domain knowledge. Domain experts, on the other hand, usually lack the data science expertise to build and use ITM applications. This thesis presents an Interactive Topic Modeling application accessible to non-technical data analysts in the broadcasting domain. The application allows domain experts, like journalists, to explore themes in various produced media content in a dynamic, intuitive and efficient manner. An interactive interface, with an embedded NMF topic model, enables users to filter on various data sources, configure and refine the topic model, interpret and evaluate the output through visualizations, and analyze the data in a wider context. This application was designed in collaboration with domain experts in focus group sessions, according to human-centered design principles. An evaluation study with ten participants shows that journalists and data analysts without any natural language processing knowledge agree that the application is not only usable, but also very user-friendly, effective and efficient. The application received a SUS score of 81, and the user experience and user perceptions of control questionnaires both received an average of 4.1 on a five-point Likert scale. The ITM application thus enables this specific user group to extract meaningful topics from their produced media content, and to use these results in a broader perspective to perform exploratory data analysis. The success of the final application design presented in this thesis shows that the knowledge gap between data scientists and domain experts in the broadcasting field has been filled. More broadly, machine learning applications can be made more accessible by translating hidden low-level details of complex models into high-level model interactions, presented in a user interface.
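For readers unfamiliar with how a figure such as the reported SUS score of 81 arises, the sketch below applies the standard System Usability Scale scoring rule to one participant's answers. The function name `sus_score` and the sample responses are hypothetical, and the thesis's raw questionnaire data is not reproduced here; a study-level score is typically the mean of the per-participant scores.

```python
def sus_score(responses):
    """Score one participant's ten SUS answers (each 1 to 5) on a 0 to 100 scale.

    Standard SUS rule: odd-numbered items contribute (answer - 1),
    even-numbered items contribute (5 - answer), and the sum is
    multiplied by 2.5.
    """
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs exactly ten answers, each between 1 and 5")
    total = 0
    for item, answer in enumerate(responses, start=1):
        total += (answer - 1) if item % 2 == 1 else (5 - answer)
    return total * 2.5


# Hypothetical answers from a single participant.
example = [4, 2, 5, 1, 4, 2, 4, 2, 5, 2]
score = sus_score(example)
```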