6 research outputs found

    Detecting Machine-obfuscated Plagiarism

    Related dataset is at https://doi.org/10.7302/bewj-qx93 and also listed in the dc.relation field of the full item record. Research on academic integrity has identified online paraphrasing tools as a severe threat to the effectiveness of plagiarism detection systems. To enable the automated identification of machine-paraphrased text, we make three contributions. First, we evaluate the effectiveness of six prominent word embedding models in combination with five classifiers for distinguishing human-written from machine-paraphrased text. The best performing classification approach achieves an accuracy of 99.0% for documents and 83.4% for paragraphs. Second, we show that the best approach outperforms human experts and established plagiarism detection systems for these classification tasks. Third, we provide a Web application that uses the best performing classification approach to indicate whether a text underwent machine-paraphrasing. The data and code of our study are openly available. Peer Reviewed. https://deepblue.lib.umich.edu/bitstream/2027.42/152346/1/Foltynek2020_Paraphrase_Detection.pdf

    A Frame-Based NLP System for Cancer-Related Information Extraction.

    We propose a frame-based natural language processing (NLP) method that extracts cancer-related information from clinical narratives. We focus on three frames: cancer diagnosis, cancer therapeutic procedure, and tumor description. We utilize a deep learning-based approach, bidirectional Long Short-term Memory (LSTM) Conditional Random Field (CRF), which uses both character and word embeddings. The system consists of two constituent sequence classifiers: a frame identification (lexical unit) classifier and a frame element classifier. The classifier achieves an

    Generalized and Transferable Patient Language Representation for Phenotyping with Limited Data

    The paradigm of representation learning through transfer learning has the potential to greatly enhance clinical natural language processing. In this work, we propose a multi-task pre-training and fine-tuning approach for learning generalized and transferable patient representations from medical language. The model is first pre-trained with different but related high-prevalence phenotypes and further fine-tuned on downstream target tasks. Our main contribution focuses on the impact this technique can have on low-prevalence phenotypes, a challenging task due to the dearth of data. We validate the representation from pre-training, and fine-tune the multi-task pre-trained models on low-prevalence phenotypes including 38 circulatory diseases, 23 respiratory diseases, and 17 genitourinary diseases. We find multi-task pre-training increases learning efficiency and achieves consistently high performance across the majority of phenotypes. Most important, the multi-task pre-training is almost always either the best-performing model or performs tolerably close to the best-performing model, a property we refer to as robust. All these results lead us to conclude that this multi-task transfer learning architecture is a robust approach for developing generalized and transferable patient language representations for numerous phenotypes. Comment: Journal of Biomedical Informatics (in press)

    VaxInsight: an artificial intelligence system to access large-scale public perceptions of vaccination from social media

    Vaccination is considered one of the greatest public health achievements of the 20th century. A high vaccination rate is required to reduce the prevalence and incidence of vaccine-preventable diseases. However, in the last two decades, there has been a significant and increasing number of people who refuse or delay getting vaccinated and who prohibit their children from receiving vaccinations. Importantly, under-vaccination is associated with infectious disease outbreaks. A good understanding of public perceptions regarding vaccinations is important if we are to develop effective vaccination promotion strategies. Traditional methods of research, such as surveys, suffer limitations that impede our understanding of public perceptions, including resource costs and delays in data collection and analysis, especially in large samples. The popularity of social media (e.g. Twitter), combined with advances in artificial intelligence algorithms (e.g. natural language processing, deep learning), opens up new avenues for accessing large-scale data on public perceptions related to vaccinations. This dissertation reports on an original and systematic effort to develop artificial intelligence algorithms that will increase our ability to use Twitter discussions to understand vaccine-related perceptions and intentions. The research is framed within the perspectives offered by grounded behavior change theories. Tweets concerning the human papillomavirus (HPV) vaccine were used to accomplish three major aims: 1) Develop a deep learning-based system to better understand public perceptions of the HPV vaccine, using Twitter data and behavior change theories; 2) Develop a deep learning-based system to infer Twitter users’ demographic characteristics (e.g. gender and home location) and investigate demographic differences in public perceptions of the HPV vaccine; 3) Develop a web-based interactive visualization system to monitor real-time Twitter discussions of the HPV vaccine.
For Aim 1, the bi-directional long short-term memory (LSTM) network with attention mechanism outperformed traditional machine learning and competitive deep learning algorithms in mapping Twitter discussions to the theoretical constructs of behavior change theories. Domain-specific embedding trained on an HPV vaccine-related Twitter corpus with the fastText algorithm further improved performance on some tasks. Time series analyses revealed evolving trends of public perceptions regarding the HPV vaccine. For Aim 2, the character-based convolutional neural network model achieved favorable state-of-the-art performance in Twitter gender inference on a Public Author Profiling challenge. The trained models were then applied to the Twitter corpus, where they identified gender differences in public perceptions of the HPV vaccine. The findings on gender differences were largely consistent with previous survey-based studies. For the Twitter users’ home location inference, geo-tagging was framed as text classification tasks that resulted in a character-based recurrent neural network model. The model outperformed machine learning and deep learning baselines on home location tagging. Interstate variations in public perceptions of the HPV vaccine were also identified. For Aim 3, a prototype web-based interactive dashboard, VaxInsight, was built to synthesize HPV vaccine-related Twitter discussions in a comprehensible format. The usability test of VaxInsight showed high usability of the system. Notably, this may be the first study to use deep learning algorithms to understand Twitter discussions of the HPV vaccine within the perspective of grounded behavior change theories. VaxInsight is also the first system that allows users to explore public health beliefs about vaccine-related topics from Twitter. Thus, the present research makes original and systematic contributions to medical informatics by combining cutting-edge artificial intelligence algorithms and grounded behavior change theories.
This work also builds a foundation for the next generation of real-time public health surveillance and research.