2,488 research outputs found

    People on Drugs: Credibility of User Statements in Health Communities

    Online health communities are a valuable source of information for patients and physicians. However, such user-generated resources are often plagued by inaccuracies and misinformation. In this work we propose a method for automatically establishing the credibility of user-generated medical statements and the trustworthiness of their authors by exploiting linguistic cues and distant supervision from expert sources. To this end we introduce a probabilistic graphical model that jointly learns user trustworthiness, statement credibility, and language objectivity. We apply this methodology to the task of extracting rare or unknown side-effects of medical drugs, one of the problems where large-scale non-expert data has the potential to complement expert medical knowledge. We show that our method can reliably extract side-effects and filter out false statements, while identifying trustworthy users that are likely to contribute valuable medical information.

    Mining Knowledge Bases for Question & Answers Websites

    We studied the problem of searching knowledge bases for answers to questions posted on a Question-and-Answer website. A number of research efforts have used Stack Overflow data, which is available to the public. Surprisingly, only a few papers have tried to improve the search for better answers. Furthermore, current approaches for searching a Question-and-Answer website are usually limited to the question database, which is usually the website's own content. We showed that it is feasible to use knowledge bases as sources for answers. We implemented both vector-space and topic-space representations for our datasets and compared these distinct techniques. Finally, we proposed a hybrid ranking approach that took advantage of a machine-learned classifier to incorporate tag information into the ranking, and showed that it improved retrieval performance.
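
As a rough illustration of the vector-space side of the comparison above, the sketch below ranks documents against a question by tf-idf weighted cosine similarity. It is not the authors' implementation: the whitespace tokenizer, the smoothed idf formula, and the names `build_tfidf` and `rank` are assumptions chosen for brevity, and the topic-space model and the tag-aware hybrid ranker are not shown.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real preprocessing.
    return text.lower().split()


def build_tfidf(docs):
    """Return (tf-idf vectors, idf table) for a list of document strings."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))  # count each term once per document
    # Smoothed idf (an assumption; other weighting variants are equally valid).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return document indices ordered from most to least similar to the query."""
    vectors, idf = build_tfidf(docs)
    # Query terms unseen in the collection get weight 0.
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]


docs = [
    "how to sort a list in python",
    "java string concatenation performance",
    "python list comprehension examples",
]
ranking = rank("sort python list", docs)
print(ranking)
```

A hybrid ranker of the kind the abstract describes would combine such a similarity score with a classifier's tag prediction; how the two are weighted is not stated in the abstract.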

    Crowdsource Annotation and Automatic Reconstruction of Online Discussion Threads

    Modern communication relies on electronic messages organized in the form of discussion threads. Emails, IMs, SMS, website comments, and forums are all composed of threads, which consist of individual user messages connected by metadata and discourse coherence to messages from other users. Threads are used to display user messages effectively in a GUI such as an email client, providing a background context for understanding a single message. Many messages are meaningless without the context provided by their thread. However, a number of factors may result in missing thread structure, ranging from user mistake (replying to the wrong message), to missing metadata (some email clients do not produce or save headers that fully encapsulate thread structure, and conversion of archived threads from one repository to another may also result in lost metadata), to covert use (users may avoid metadata to render discussions difficult for third parties to understand). In the field of security, law enforcement agencies may obtain vast collections of discussion turns that require automatic thread reconstruction to understand. For example, the Enron Email Corpus, obtained by the Federal Energy Regulatory Commission during its investigation of the Enron Corporation, has no inherent thread structure. In this thesis, we will use natural language processing approaches to reconstruct threads from message content. Reconstruction based on message content sidesteps the problem of missing metadata, permitting post hoc reorganization and discussion understanding. We will investigate corpora of email threads and Wikipedia discussions. However, there is a scarcity of annotated corpora for this task. For example, the Enron Emails Corpus contains no inherent thread structure. Therefore, we also investigate issues faced when creating crowdsourced datasets and learning statistical models of them.
Several of our findings are applicable for other natural language machine classification tasks, beyond thread reconstruction. We will divide our investigation of discussion thread reconstruction into two parts. First, we explore techniques needed to create a corpus for our thread reconstruction research. Like other NLP pairwise classification tasks such as Wikipedia discussion turn/edit alignment and sentence pair text similarity rating, email thread disentanglement is a heavily class-imbalanced problem, and although the advent of crowdsourcing has reduced annotation costs, the common practice of crowdsourcing redundancy is too expensive for class-imbalanced tasks. As the first contribution of this thesis, we evaluate alternative strategies for reducing crowdsourcing annotation redundancy for class-imbalanced NLP tasks. We also examine techniques to learn the best machine classifier from our crowdsourced labels. In order to reduce noise in training data, most natural language crowdsourcing annotation tasks gather redundant labels and aggregate them into an integrated label, which is provided to the classifier. However, aggregation discards potentially useful information from linguistically ambiguous instances. For the second contribution of this thesis, we show that, for four of five natural language tasks, filtering of the training dataset based on crowdsource annotation item agreement improves task performance, while soft labeling based on crowdsource annotations does not improve task performance. Second, we investigate thread reconstruction as divided into the tasks of thread disentanglement and adjacency recognition. We present the Enron Threads Corpus, a newly-extracted corpus of 70,178 multi-email threads with emails from the Enron Email Corpus. In the original Enron Emails Corpus, emails are not sorted by thread. 
To disentangle these threads, and as the third contribution of this thesis, we perform pairwise classification, using text similarity measures on non-quoted texts in emails. We show that i) content text similarity metrics outperform style and structure text similarity metrics in both a class-balanced and class-imbalanced setting, and ii) although feature performance is dependent on the semantic similarity of the corpus, content features are still effective even when controlling for semantic similarity. To reconstruct threads, it is also necessary to identify adjacency relations among pairs. For the forum of Wikipedia discussions, metadata is not available, and dialogue act typologies, helpful for other domains, are inapplicable. As our fourth contribution, via our experiments, we show that adjacency pair recognition can be performed using lexical pair features, without a dialogue act typology or metadata, and that this is robust to controlling for topic bias of the discussions. Yet, lexical pair features do not effectively model the lexical semantic relations between adjacency pairs. To model lexical semantic relations, and as our fifth contribution, we perform adjacency recognition using extracted keyphrases enhanced with semantically related terms. While this technique outperforms a most frequent class baseline, it fails to outperform lexical pair features or tf-idf weighted cosine similarity. Our investigation shows that this is the result of poor word sense disambiguation and poor keyphrase extraction causing spurious false positive semantic connections. In concluding this thesis, we also reflect on open issues and unanswered questions remaining after our research contributions, discuss applications for thread reconstruction, and suggest some directions for future work.
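
The agreement-based filtering mentioned in this abstract (the thesis's second contribution) can be pictured with a small sketch: redundant crowd labels are aggregated by majority vote, and items whose annotators disagree too much are dropped before training. The abstract does not give the exact filtering rule, so the majority-vote aggregation, the `min_agreement` threshold of 0.8, and the function names here are illustrative assumptions, not the thesis's procedure.

```python
from collections import Counter


def aggregate(labels):
    """Return (majority label, fraction of annotators who chose it)."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)


def filter_by_agreement(items, min_agreement=0.8):
    """Keep only items whose majority label reaches the agreement threshold.

    items: list of (instance, [labels from redundant annotators]).
    min_agreement: hypothetical cutoff; the thesis's value is not given here.
    """
    kept = []
    for instance, labels in items:
        label, agreement = aggregate(labels)
        if agreement >= min_agreement:
            kept.append((instance, label))
    return kept


# Toy data: each email pair was labeled by five annotators
# (1 = same thread, 0 = different thread).
crowd = [
    ("pair-a", [1, 1, 1, 1, 1]),  # full agreement
    ("pair-b", [1, 0, 1, 0, 1]),  # ambiguous, agreement 0.6
    ("pair-c", [0, 0, 0, 0, 1]),  # agreement 0.8
]
training_set = filter_by_agreement(crowd)
print(training_set)
```

Soft labeling, the alternative the abstract reports as not helping, would instead keep every item and pass the agreement fraction to the learner as a label weight.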

    A Search Engine for Finding and Reusing Architecturally Significant Code

    Architectural tactics are the building blocks of software architecture. They describe solutions for addressing specific quality concerns, and are prevalent across many software systems. Once a decision is made to utilize a tactic, the developer must generate a concrete plan for implementing the tactic in the code. Unfortunately, this is a non-trivial task even for experienced developers. Developers often resort to using search engines, crowd-sourcing websites, or discussion forums to find sample code snippets to implement a tactic. A fundamental problem in finding implementations of architectural patterns/tactics is the mismatch between the high-level intent reflected in the descriptions of these patterns and their low-level implementation details. To reduce this mismatch, we created a novel Tactic Search Engine called ArchEngine (ARCHitecture search ENGINE). ArchEngine can replace this manual Internet-based search process and help developers to reuse proper architectural knowledge and accurately implement tactics and patterns from a wide range of open source systems. ArchEngine helps developers find implementation examples of a tactic for a given technical context. It uses information retrieval and program analysis techniques to retrieve applications that implement these design concepts. Furthermore, the search engine lists the code snippets where the patterns/tactics are located. Our case study with 21 professional software developers shows that ArchEngine is more effective than other search engines (e.g., SourceForge and Koders) in helping programmers to quickly find implementations of architectural tactics/patterns.

    Technique Integration for Requirements Assessment

    In determining whether to permit a safety-critical software system to be certified and in performing independent verification and validation (IV&V) of safety- or mission-critical systems, the requirements traceability matrix (RTM) delivered by the developer must be assessed for accuracy. The current state of the practice is to perform this work manually, or with the help of general-purpose tools such as word processors and spreadsheets. Such work is error-prone and person-power intensive. In this paper, we extend our prior work in application of Information Retrieval (IR) methods for candidate link generation to the problem of RTM accuracy assessment. We build voting committees from five IR methods, and use a variety of voting schemes to accept or reject links from given candidate RTMs. We report on the results of two experiments. In the first experiment, we used 25 candidate RTMs built by human analysts for a small tracing task involving a portion of a NASA scientific instrument specification. In the second experiment, we randomly seeded faults in the RTM for the entire specification. Results of the experiments are presented.
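
The committee idea in this abstract can be sketched as follows: each IR method contributes the set of links it retrieved, and a candidate RTM link is accepted when enough methods vote for it. The abstract does not list the voting schemes or the five methods, so the threshold-based scheme, the link identifiers, and the name `assess_links` below are hypothetical.

```python
def assess_links(candidate_links, method_results, min_votes):
    """Split candidate RTM links into accepted and rejected by committee vote.

    candidate_links: iterable of (requirement_id, design_id) pairs from the RTM.
    method_results: one set of links per IR method (the links that method retrieved).
    min_votes: how many methods must retrieve a link for it to be accepted.
    """
    accepted, rejected = [], []
    for link in candidate_links:
        votes = sum(link in retrieved for retrieved in method_results)
        (accepted if votes >= min_votes else rejected).append(link)
    return accepted, rejected


# Toy candidate RTM and the links retrieved by five hypothetical IR methods.
rtm = [("R1", "D1"), ("R1", "D2"), ("R2", "D3")]
committee = [
    {("R1", "D1"), ("R2", "D3")},
    {("R1", "D1")},
    {("R1", "D1"), ("R1", "D2")},
    {("R1", "D1"), ("R2", "D3")},
    {("R2", "D3")},
]

# Simple majority: at least 3 of 5 methods must agree.
majority_accepted, majority_rejected = assess_links(rtm, committee, min_votes=3)
# Unanimity: all 5 methods must agree.
unanimous_accepted, _ = assess_links(rtm, committee, min_votes=5)
print(majority_accepted, majority_rejected, unanimous_accepted)
```

Varying `min_votes` trades precision against recall: a stricter committee rejects more faulty links but also more correct ones.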

    Social impact retrieval: measuring author influence on information retrieval

    The increased presence of technologies collectively referred to as Web 2.0 means that the entire process of new media production and dissemination has moved away from an author-centric approach. Casual web users and browsers are increasingly able to play a more active role in the information creation process. This means that the traditional ways in which information sources may be validated and scored must adapt accordingly. In this thesis we propose a new way in which to look at a user's contributions to the network in which they are present, using these interactions to provide a measure of authority and centrality to the user. This measure is then used to attribute a query-independent interest score to each of the contributions the author makes, enabling us to provide other users with relevant information which has been of greatest interest to a community of like-minded users. This is done through the development of two algorithms: AuthorRank and MessageRank. We present two real-world user experiments which focussed on multimedia annotation and browsing systems that we built; these systems were novel in themselves, bringing together video and text browsing, as well as free-text annotation. Using these systems as examples of real-world applications for our approaches, we then look at a larger-scale experiment based on the author and citation networks of a ten-year period of the ACM SIGIR conference on information retrieval, from 1997 to 2007. We use the citation context of SIGIR publications as a proxy for annotations, constructing large social networks between authors. Against these networks we show the effectiveness of incorporating user-generated content, or annotations, to improve information retrieval.

    Top Comment or Flop Comment? Predicting and Explaining User Engagement in Online News Discussions

    Comment sections below online news articles enjoy growing popularity among readers. However, the overwhelming number of comments makes it infeasible for the average news consumer to read all of them and hinders engaging discussions. Most platforms display comments in chronological order, which neglects that some of them are more relevant to users and are better conversation starters. In this paper, we systematically analyze user engagement in the form of the upvotes and replies that a comment receives. Based on comment texts, we train a model to distinguish comments that have either a high or low chance of receiving many upvotes and replies. Our evaluation on user comments from TheGuardian.com compares recurrent and convolutional neural network models, and a traditional feature-based classifier. Further, we investigate what makes some comments more engaging than others. To this end, we identify engagement triggers and arrange them in a taxonomy. Explanation methods for neural networks reveal which input words have the strongest influence on our model's predictions. In addition, we evaluate on a dataset of product reviews, which exhibit properties similar to those of user comments, such as featuring upvotes for helpfulness. Comment: Accepted at the International Conference on Web and Social Media (ICWSM 2020); 11 pages; code and data are available at https://hpi.de/naumann/projects/repeatability/text-mining.htm