
    Ransomware note detection techniques using supervised machine learning

    This project is about the detection of ransomware by detecting ransomware notes using supervised machine learning. The goal of the project is to study old ransom note data to detect notes used in new ransomware campaigns. This is done by extracting the word combinations out of fifty-nine ransom notes and fifty-nine non-ransom notes to define a binary (is or is-not) system of text classification. The hypothesis posed by this project is: A machine learning model trained using ransom notes from past campaigns will be able to detect notes made in future campaigns. Two machine learning (ML) algorithms are studied: Decision Trees and Support Vector Machines (SVM). These ML algorithms were chosen for their ease of implementation and low data requirements. The studied dataset has fewer than sixty raw text documents, therefore models requiring a minimal amount of training data, such as SVM, are prioritized. After training and testing the ML models, the performance of the models is verified using a separate and newer dataset. Most of the project is implemented using Python for application logic and data manipulation, while scikit-learn (sklearn) is used for the training and analysis of the machine learning models. Data is stored using regular files. Incremental comparisons are made using varying levels of data cleaning and feature selection to study which methodologies produce ideal ML models capable of detecting ransomware notes with a low false positive rate. The results of this project are favorable to the goal: it is demonstrated that a single ML model can recognize a ransom note by checking as few as twenty features. Shorter notes tend to have fewer features to check and therefore require an ML model biased towards false positives for reliable detection. It is proposed to combine the output of multiple models in a stacked or "ensemble" configuration [1] to create a system for indicating how confident a positive detection is.
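
    The feature-extraction step described in this abstract can be illustrated with a minimal sketch. It assumes that "word combinations" means adjacent word pairs (bigrams) and that each feature is a binary present/absent flag; the abstract does not specify either. The two sample texts and the function names are hypothetical and are not taken from the project's dataset or code.

```python
import re

def word_pairs(text):
    """Return the set of adjacent lowercase word pairs (bigrams) in text.

    Assumption: pairs are formed over the whole text, so they may span
    sentence boundaries.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return set(zip(words, words[1:]))

def feature_vector(text, vocabulary):
    """Binary vector: 1 if the word pair occurs in text, else 0."""
    pairs = word_pairs(text)
    return [1 if pair in pairs else 0 for pair in vocabulary]

# Hypothetical toy documents, not taken from the project's dataset.
ransom_note = "Your files are encrypted. Pay bitcoin to decrypt your files."
benign_note = "Your meeting notes are attached. Pay attention to the agenda."

# The vocabulary is the sorted union of all word pairs seen in both texts.
vocabulary = sorted(word_pairs(ransom_note) | word_pairs(benign_note))
ransom_vec = feature_vector(ransom_note, vocabulary)
benign_vec = feature_vector(benign_note, vocabulary)
```

    Vectors of this kind are what a Decision Tree or SVM classifier would be trained on; the training step itself is omitted here.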

    Understanding Graph Data Through Deep Learning Lens

    Deep neural network models have established themselves as an unparalleled force in the domains of vision, speech and text processing applications in recent years. However, graphs have formed a significant component of data analytics including applications in Internet of Things, social networks, pharmaceuticals and bioinformatics. An important characteristic of these deep learning techniques is their ability to learn the important features which are necessary to excel at a given task, unlike traditional machine learning algorithms which are dependent on handcrafted features. However, there have been comparatively fewer efforts in deep learning to directly work on graph inputs. Various real-world problems can be easily solved by posing them as a graph analysis problem. Considering the direct impact of the success of graph analysis on business outcomes, the importance of studying such complex graph data has increased exponentially over the years. In this thesis, we make three contributions towards understanding graph data: (i) The first contribution seeks to find anomalies in graphs using graphical models; (ii) The second contribution uses deep learning with spatio-temporal random walks to learn representations of graph trajectories (paths) and shows great promise on standard graph datasets; and (iii) The third contribution proposes a novel deep neural network that implicitly models attention to allow for interpretation of graph classification.

    This Just In: Fake News Packs a Lot in Title, Uses Simpler, Repetitive Content in Text Body, More Similar to Satire than Real News

    The problem of fake news has gained a lot of attention as it is claimed to have had a significant impact on the 2016 US presidential election. Fake news is not a new problem and its spread in social networks is well-studied. Often an underlying assumption in fake news discussion is that it is written to look like real news, fooling the reader who does not check for reliability of the sources or the arguments in its content. Through a unique study of three data sets and features that capture the style and the language of articles, we show that this assumption is not true. Fake news in most cases is more similar to satire than to real news, leading us to conclude that persuasion in fake news is achieved through heuristics rather than the strength of arguments. We show that overall title structure and the use of proper nouns in titles are very significant in differentiating fake from real. This leads us to conclude that fake news is targeted at audiences who are not likely to read beyond titles and is aimed at creating mental associations between entities and claims. Comment: Published at The 2nd International Workshop on News and Public Opinion at ICWS

    Analyzing collaborative learning processes automatically

    In this article we describe the emerging area of text classification research focused on the problem of collaborative learning process analysis both from a broad perspective and more specifically in terms of a publicly available tool set called TagHelper tools. Analyzing the variety of pedagogically valuable facets of learners’ interactions is a time-consuming and effortful process. Improving automated analyses of such highly valued processes of collaborative learning by adapting and applying recent text classification technologies would make it a less arduous task to obtain insights from corpus data. This endeavor also holds the potential for enabling substantially improved on-line instruction both by providing teachers and facilitators with reports about the groups they are moderating and by triggering context-sensitive collaborative learning support on an as-needed basis. In this article, we report on an interdisciplinary research project, which has been investigating the effectiveness of applying text classification technology to a large CSCL corpus that has been analyzed by human coders using a theory-based multidimensional coding scheme. We report promising results and include an in-depth discussion of important issues such as reliability, validity, and efficiency that should be considered when deciding on the appropriateness of adopting a new technology such as TagHelper tools. One major technical contribution of this work is a demonstration that an important piece of the work towards making text classification technology effective for this purpose is designing and building linguistic pattern detectors, otherwise known as features, that can be extracted reliably from texts and that have high predictive power for the categories of discourse actions that the CSCL community is interested in.

    Classifying document types to enhance search and recommendations in digital libraries

    In this paper, we address the problem of classifying documents available from the global network of (open access) repositories according to their type. We show that the metadata provided by repositories enabling us to distinguish research papers, theses and slides are missing in over 60% of cases. While these metadata describing document types are useful in a variety of scenarios ranging from research analytics to improving search and recommender (SR) systems, this problem has not yet been sufficiently addressed in the context of the repositories infrastructure. We have developed a new approach for classifying document types using supervised machine learning based exclusively on text-specific features. We achieve a 0.96 F1-score using the random forest and AdaBoost classifiers, which are the best performing models on our data. By analysing the SR system logs of the CORE [1] digital library aggregator, we show that users are an order of magnitude more likely to click on research papers and theses than on slides. This suggests that using document types as a feature for ranking/filtering SR results in digital libraries has the potential to improve user experience. Comment: 12 pages, 21st International Conference on Theory and Practice of Digital Libraries (TPDL), 2017, Thessaloniki, Greece
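
    For readers unfamiliar with the metric reported in this abstract, the F1-score is the harmonic mean of precision and recall. The sketch below computes it for a single class from confusion counts. The counts are hypothetical and chosen only for illustration; the abstract does not report them, nor does it state how scores were averaged across the three document types.

```python
def f1_score(true_pos, false_pos, false_neg):
    """F1 for one class: harmonic mean of precision and recall."""
    if true_pos == 0:
        return 0.0  # avoids division by zero when nothing is correctly found
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, not taken from the paper: 96 documents correctly
# labeled as a given type, 4 wrongly given that label, 4 of that type missed.
example_f1 = f1_score(true_pos=96, false_pos=4, false_neg=4)
```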

    All mixed up? Finding the optimal feature set for general readability prediction and its application to English and Dutch

    Readability research has a long and rich tradition, but there has been too little focus on general readability prediction without targeting a specific audience or text genre. Moreover, though NLP-inspired research has focused on adding more complex readability features, there is still no consensus on which features contribute most to the prediction. In this article, we investigate in close detail the feasibility of constructing a readability prediction system for English and Dutch generic text using supervised machine learning. Based on readability assessments by both experts and a crowd, we implement different types of text characteristics ranging from easy-to-compute superficial text characteristics to features requiring deep linguistic processing, resulting in ten different feature groups. Both a regression and a classification setup are investigated, reflecting the two possible readability prediction tasks: scoring individual texts or comparing two texts. We show that going beyond correlation calculations for readability optimization using a wrapper-based genetic algorithm optimization approach is a promising task which provides considerable insights into which feature combinations contribute to the overall readability prediction. Since we also have gold standard information available for those features requiring deep processing, we are able to investigate the true upper bound of our Dutch system. Interestingly, we observe that the performance of our fully-automatic readability prediction pipeline is on par with the pipeline using golden deep syntactic and semantic information.
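
    The wrapper-based genetic search over feature groups mentioned in this abstract can be sketched roughly as follows. This is not the authors' implementation. The `evaluate` function is a hypothetical stand-in: a real wrapper would train and cross-validate a readability model on the selected feature groups and return its score. The population size, mutation rate and selection scheme are likewise assumptions.

```python
import random

N_GROUPS = 10  # the abstract mentions ten feature groups

def evaluate(mask):
    """Hypothetical stand-in for the wrapper step.

    A real wrapper would train and cross-validate a model on the selected
    feature groups. Here groups 1, 4 and 7 are arbitrarily treated as
    useful, and every selected group carries a small cost.
    """
    useful = {1, 4, 7}
    gain = sum(1.0 for i in useful if mask[i])
    cost = 0.1 * sum(mask)
    return gain - cost

def crossover(a, b, rng):
    """Single-point crossover of two 0/1 masks."""
    point = rng.randrange(1, N_GROUPS)
    return a[:point] + b[point:]

def mutate(mask, rng, rate=0.1):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in mask]

def genetic_search(generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(N_GROUPS)] for _ in range(pop_size)
    ]
    for _ in range(generations):
        population.sort(key=evaluate, reverse=True)
        parents = population[: pop_size // 2]  # keep the fitter half unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=evaluate)

# best[i] == 1 means feature group i is selected.
best = genetic_search()
```

    Because the fitter half of each generation is carried over unchanged, the best score never decreases from one generation to the next.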