252 research outputs found

    Fake Product Review Monitoring & Removal and Sentiment Analysis of Genuine Reviews

    An e-commerce website's reputation suffers when it sells a product with bad reviews, because users usually blame the website rather than the manufacturer. On some review sites, favorable reviews are posted by people from the product's own company in order to produce falsely positive ratings. To eliminate this kind of fake product review, we will create a system that uses machine learning to detect and remove fake reviews. We also remove reviews flooded in by marketing agencies to boost the ratings of a particular product. Finally, sentiment analysis is performed on the genuine reviews to classify them as positive or negative. We will use a bag-of-words approach to label individual words according to their sentiment.
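
As a rough illustration of the bag-of-words labeling this abstract mentions, here is a minimal Python sketch. The word lists `POSITIVE` and `NEGATIVE` and the function names are hypothetical and are not taken from the paper; a real system would use a much larger lexicon and would run only on reviews that have already passed fake-review filtering.

```python
import re
from collections import Counter

# Hypothetical toy lexicon (an assumption for illustration only).
POSITIVE = {"good", "great", "excellent", "love", "reliable"}
NEGATIVE = {"bad", "poor", "terrible", "broken", "refund"}

def bag_of_words(text):
    """Lowercase the text and count each word, ignoring word order."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def classify_review(text):
    """Label a review by summing the sentiment of its individual words.

    Ties (score == 0) are reported as "positive", an arbitrary choice.
    """
    bag = bag_of_words(text)
    score = sum(n for word, n in bag.items() if word in POSITIVE)
    score -= sum(n for word, n in bag.items() if word in NEGATIVE)
    return "positive" if score >= 0 else "negative"
```

Because the bag discards word order, a phrase such as "not good" still counts as positive here, which is one reason lexicon counting is usually only a baseline.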

    Data-Driven Techniques For Vulnerability Assessments

    Security vulnerabilities have been puzzling researchers and practitioners for decades. As highlighted by the recent WannaCry and NotPetya ransomware campaigns, which resulted in billions of dollars of losses, weaponized exploits against vulnerabilities remain one of the main tools for cybercrime. The upward trend in the number of vulnerabilities reported annually and technical challenges in the way of remediation lead to large exposure windows for the vulnerable populations. On the other hand, due to sustained efforts in application and operating system security, few vulnerabilities are exploited in real-world attacks. Existing metrics for severity assessments err on the side of caution and overestimate the risk posed by vulnerabilities, further affecting remediation efforts that rely on prioritization. In this dissertation we show that severity assessments can be improved by taking into account public information about vulnerabilities and exploits. The disclosure of vulnerabilities is followed by artifacts such as social media discussions, write-ups and proof-of-concepts, containing technical information related to the vulnerabilities and their exploitation. These artifacts can be mined to detect active exploits or predict their development. However, we first need to understand: What features are required for different tasks? What biases are present in public data and how are data-driven systems affected? What security threats do these systems face when deployed operationally? We explore the questions by first collecting vulnerability-related posts on social media and analyzing the community and the content of their discussions. This analysis reveals that victims of attacks often share their experience online, and we leverage this finding to build an early detector of exploits active in the wild. Our detector significantly improves on the precision of existing severity metrics and can detect active exploits a median of 5 days earlier than a commercial intrusion prevention product. Next, we investigate the utility of various artifacts in predicting the development of functional exploits. We engineer features causally linked to the ease of exploitation, highlight trade-offs between timeliness and predictive utility of various artifacts, and characterize the biases that affect the ground truth for exploit prediction tasks. Using these insights, we propose a machine learning-based system that continuously collects artifacts and predicts the likelihood of exploits being developed against these vulnerabilities. We demonstrate our system's practical utility through its ability to highlight critical vulnerabilities and predict imminent exploits. Lastly, we explore the adversarial threats faced by data-driven security systems that rely on inputs of unknown provenance. We propose a framework for defining algorithmic threat models and for exploring adversaries with various degrees of knowledge and capabilities. Using this framework, we model realistic adversaries that could target our systems, design data poisoning attacks to measure their robustness, and highlight promising directions for future defenses against such attacks.

    Sentiment Analysis in Digital Spaces: An Overview of Reviews

    Sentiment analysis (SA) is commonly applied to digital textual data, revealing insight into opinions and feelings. Many systematic reviews have summarized existing work, but often overlook discussions of validity and scientific practices. Here, we present an overview of reviews, synthesizing 38 systematic reviews, containing 2,275 primary studies. We devise a bespoke quality assessment framework designed to assess the rigor and quality of systematic review methodologies and reporting standards. Our findings show diverse applications and methods, limited reporting rigor, and challenges over time. We discuss how future research and practitioners can address these issues and highlight their importance across numerous applications. Comment: 44 pages, 4 figures, 6 tables, 3 appendices

    Utilizing public repositories to improve the decision process for security defect resolution and information reuse in the development environment

    Software systems contain security risks that could have been avoided if design choices had been analyzed using public information security data sources. Public security sources have been shown to contain more relevant and recent information on current technologies than any textbook or research article, and developers often use these sources to solve software-related problems. However, solutions copied from public discussion forums such as StackOverflow may have security implications when copied directly into the developer's environment. Several different methods to identify security bugs are being implemented, and recent efforts are looking into identifying security bugs from communication artifacts during the software development lifecycle as well as using public security information sources to support secure design and development. The primary goal of this thesis is to investigate how to utilize public information sources to reduce security defects in software artifacts by improving the decision process for defect resolution and information reuse in the development environment. We build a data collection tool for collecting data from public information security sources and public discussion forums, construct machine learning models for classifying discussion forum posts and bug reports as security-related or not, and build word embedding models for finding matches between public security sources and public discussion forum posts or bug reports. The results of this thesis demonstrate that using public information security sources can provide additional validation layers for defect classification models, as well as additional security context for public discussion forum posts. The contributions of this thesis are to provide an understanding of how public information security sources can better provide context for bug reports and discussion forums. Additionally, we provide data collection APIs for collecting datasets from these sources, and classification and word embedding models for recommending related security sources for bug reports and public discussion forum posts. Master's thesis in Software Development, in collaboration with HVL. PROG399. MAMN-PRO

    Sentiment classification with case-base approach

    The growth of social networks, blogs, and user review sites makes the Internet a huge source of data, especially about how people think, feel, and act toward different issues. People's opinions now play an important role in politics, industry, education, and other areas. Governments, large and small industries, academic institutes, companies, and individuals are therefore looking for automatic techniques to extract the information they need from large volumes of data. Sentiment analysis answers this need. It is an application of natural language processing and computational linguistics that uses advanced techniques such as machine learning and language models to capture positive, negative, or neutral evaluations, with or without their strength, from plain text. In this thesis we study a case-based, cross-domain approach to sentiment analysis at the document level. Our case-based algorithm generates a binary classifier that uses a set of processed cases and five different sentiment lexicons to extract the polarity, along with the corresponding scores, from the reviews. Because sentiment analysis is inherently a domain-dependent task, which makes it difficult and expensive, we use a cross-domain approach, training our classifier on six different domains instead of limiting it to one. To improve the accuracy of the classifier, we add negation detection to our algorithm. Moreover, to improve the performance of our approach, some innovative modifications are applied. Our approach also allows for further development by adding more sentiment lexicons and data sets in the future.
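
The abstract above combines sentiment lexicons with negation detection. The following Python sketch shows one common way to do that, flipping a word's polarity when a negator appears shortly before it. It is not the thesis's case-based algorithm: the `LEXICON` scores, the `NEGATORS` set, and the two-token `window` are all assumptions chosen for illustration.

```python
import re

# Hypothetical single lexicon; the thesis uses five real sentiment lexicons.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "awful": -2.0}
NEGATORS = {"not", "no", "never"}

def is_negator(token):
    """Treat explicit negators and contractions such as "isn't" as negation."""
    return token in NEGATORS or token.endswith("n't")

def polarity(text, window=2):
    """Sum lexicon scores, flipping the sign of a word that follows a negator.

    `window` is the number of preceding tokens checked for a negator.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0.0
    for i, token in enumerate(tokens):
        if token not in LEXICON:
            continue
        preceding = tokens[max(0, i - window):i]
        negated = any(is_negator(p) for p in preceding)
        score += -LEXICON[token] if negated else LEXICON[token]
    return score
```

A positive total can then be mapped to the positive class and a negative total to the negative class, which gives a binary decision at the document level.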

    The Utility of Large Language Models and Generative AI for Education Research

    The use of natural language processing (NLP) techniques in engineering education can provide valuable insights into the underlying processes involved in generating text. While accessing these insights can be labor-intensive if done manually, recent advances in NLP and large language models have made it a realistic option for individuals. This study explores and evaluates a combination of clustering, summarization, and prompting techniques to analyze over 1,000 student essays in which students discussed their career interests. The specific assignment prompted students to define and explain their career goals as engineers. Using text embedding representations of student responses, we clustered the responses together to identify thematically similar statements from students. The clustered responses were then summarized to quickly identify career interest themes. We also used a set of a priori codes about career satisfaction and sectors to demonstrate an alternative approach to using these generative text models to analyze student writing. The results of this study demonstrate the feasibility and usefulness of NLP techniques in engineering education research. By automating the initial analysis of student essays, researchers and educators can more efficiently and accurately identify key themes and patterns in student writing. The methods presented in this paper have broader applications for engineering education and research purposes beyond analyzing student essays. By explaining these methods to the engineering education community, readers can utilize them in their own contexts. Comment: 3 figures, 10 tables

    The Best Explanation: Beyond Right and Wrong in Question Answering

    Unsupervised Intrusion Detection with Cross-Domain Artificial Intelligence Methods

    Cybercrime is a major concern for corporations, business owners, governments and citizens, and it continues to grow in spite of increasing investments in security and fraud prevention. The main challenges in this research field are: being able to detect unknown attacks, and reducing the false positive ratio. The aim of this research work was to target both problems by leveraging four artificial intelligence techniques. The first technique is a novel unsupervised learning method based on skip-gram modeling. It was designed, developed and tested against a public dataset with popular intrusion patterns. A high accuracy and a low false positive rate were achieved without prior knowledge of attack patterns. The second technique is a novel unsupervised learning method based on topic modeling. It was applied to three related domains (network attacks, payments fraud, IoT malware traffic). A high accuracy was achieved in the three scenarios, even though the malicious activity significantly differs from one domain to the other. The third technique is a novel unsupervised learning method based on deep autoencoders, with feature selection performed by a supervised method, random forest. Obtained results showed that this technique can outperform other similar techniques. The fourth technique is based on an MLP neural network, and is applied to alert reduction in fraud prevention. This method automates manual reviews previously done by human experts, without significantly impacting accuracy

    Finding the online cry for help: automatic text classification for suicide prevention

    Successful prevention of suicide, a serious public health concern worldwide, hinges on the adequate detection of suicide risk. While online platforms are increasingly used for expressing suicidal thoughts, manually monitoring for such signals of distress is practically infeasible, given the information overload suicide prevention workers are confronted with. In this thesis, the automatic detection of suicide-related messages is studied. It presents the first classification-based approach to online suicidality detection, and focuses on Dutch user-generated content. In order to evaluate the viability of such a machine learning approach, we developed a gold standard corpus, consisting of message board and blog posts. These were manually labeled according to a newly developed annotation scheme, grounded in suicide prevention practice. The scheme provides for the annotation of a post's relevance to suicide, and the subject and severity of a suicide threat, if any. This allowed us to derive two tasks: the detection of suicide-related posts, and of severe, high-risk content. In a series of experiments, we sought to determine how well these tasks can be carried out automatically, and which information sources and techniques contribute to classification performance. The experimental results show that both types of messages can be detected with high precision. Therefore, the amount of noise generated by the system is minimal, even on very large datasets, making it usable in a real-world prevention setting. Recall is high for the relevance task, but at around 60%, it is considerably lower for severity. This is mainly attributable to implicit references to suicide, which often go undetected. We found a variety of information sources to be informative for both tasks, including token and character ngram bags-of-words, features based on LSA topic models, polarity lexicons and named entity recognition, and suicide-related terms extracted from a background corpus. 
To improve classification performance, the models were optimized using feature selection, hyperparameter optimization, or a combination of both. A distributed genetic algorithm approach proved successful in finding good solutions for this complex search problem, and resulted in more robust models. Experiments with cascaded classification of the severity task did not reveal performance benefits over direct classification (in terms of F1-score), but its structure allows the use of slower, memory-based learning algorithms that considerably improved recall. At the end of this thesis, we address a problem typical of user-generated content: noise in the form of misspellings, phonetic transcriptions and other deviations from the linguistic norm. We developed an automatic text normalization system, using a cascaded statistical machine translation approach, and applied it to normalize the data for the suicidality detection tasks. Subsequent experiments revealed that, compared to the original data, normalized data resulted in fewer and more informative features, and improved classification performance. This extrinsic evaluation demonstrates the utility of automatic normalization for suicidality detection, and more generally, text classification on user-generated content.
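
Among the feature types this abstract lists are character n-gram bags-of-words. The Python sketch below shows what such features look like and why they tolerate misspellings. The function names, the padding with spaces, and the overlap measure are assumptions made for illustration, not details taken from the thesis.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams, padding with spaces at both ends."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def shared_ngrams(a, b, n=3):
    """Number of n-gram occurrences two strings have in common."""
    return sum((char_ngrams(a, n) & char_ngrams(b, n)).values())
```

A word and a misspelled variant of it still share most of their character trigrams, whereas a token-level bag-of-words would treat them as two unrelated features.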