114 research outputs found

    Content-based genre classification of large texts

    The advent of Natural Language Processing (NLP) and deep learning allows us to achieve tasks that sounded impossible about 10 years ago; one of those tasks is genre classification for large bodies of text. Movies, books, novels, and various other texts more often than not belong to one or more genres. The purpose of this research is to classify such texts into their genres while also calculating the weighted presence of each genre in them. Movies in particular are classified into genres mostly for marketing purposes, with no indication of which genre is the most dominant. In this thesis, we explore the possibility of using deep neural networks and NLP to classify movies using the contents of the movie script. We follow the philosophy that scenes make movies, and generate the final result from the classification of each individual scene. The results were obtained by training Convolutional Neural Networks (ConvNets or CNNs) and Hierarchical Attention Networks (HANs) and comparing their performance to the de facto architectures for NLP, namely Recurrent Neural Networks (RNNs) and attention models. The results on the validation dataset are comparable to those obtained by similar research, done mostly on sentiment analysis or rating prediction: the accuracy is about 85%, which is an acceptable figure in the literature. Part of our conclusion is dedicated to discussing how our models would perform on a larger dataset and what steps could be taken to increase the accuracy.
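    As an illustration of the scene-level philosophy described above, a minimal sketch follows: classify each scene independently, then average the per-scene genre probabilities to obtain both the predicted genres and their weighted presence in the script. The genre list and the `classify_scene` function are hypothetical placeholders, not taken from the thesis.

```python
# Illustrative sketch of scene-level genre aggregation; GENRES and
# `classify_scene` are hypothetical placeholders, not from the thesis.
from typing import Callable, Dict, List

GENRES = ["action", "comedy", "drama", "horror", "romance"]

def weighted_genre_presence(
    scenes: List[str],
    classify_scene: Callable[[str], List[float]],
) -> Dict[str, float]:
    """Average per-scene genre probabilities over the whole script."""
    totals = [0.0] * len(GENRES)
    for scene in scenes:
        for k, p in enumerate(classify_scene(scene)):
            totals[k] += p
    n = max(len(scenes), 1)
    return {g: t / n for g, t in zip(GENRES, totals)}

# The movie-level labels can then be taken as the genres whose averaged
# presence exceeds a chosen threshold, e.g. 0.3.
```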

    Multi-Modal Medical Imaging Analysis with Modern Neural Networks

    Medical imaging is an important non-invasive tool for diagnostic and treatment purposes in medical practice. However, interpreting medical images is a time-consuming and challenging task. Computer-aided diagnosis (CAD) tools have been used in clinical practice to assist medical practitioners in medical imaging analysis since the 1990s. Most of the current generation of CADs are built on conventional computer vision techniques, such as manually defined feature descriptors. Deep convolutional neural networks (CNNs) provide robust end-to-end methods that can automatically learn feature representations, making them a promising building block for next-generation CADs. However, applying CNNs to medical imaging analysis tasks is challenging. This dissertation addresses three major issues that obstruct the use of modern deep neural networks on medical image analysis tasks: lack of domain knowledge in architecture design, lack of labeled data in model training, and lack of uncertainty estimation in deep neural networks. We evaluated the proposed methods on six large, clinically relevant datasets. The results show that the proposed methods can significantly improve deep neural network performance on medical imaging analysis tasks.
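    As a sketch of the third issue, uncertainty estimation, the snippet below illustrates Monte Carlo dropout, one standard technique for estimating predictive uncertainty in deep networks. It is shown only as an example of the general idea; the dissertation's own method may differ.

```python
# Illustrative sketch of Monte Carlo dropout for uncertainty estimation
# (Gal & Ghahramani, 2016); not necessarily the dissertation's method.
import torch

def mc_dropout_predict(model: torch.nn.Module,
                       x: torch.Tensor,
                       n_samples: int = 20):
    """Mean prediction and per-class std over stochastic forward passes."""
    model.train()  # keeps dropout active; note this also affects BatchNorm
    with torch.no_grad():
        preds = torch.stack([
            torch.softmax(model(x), dim=-1) for _ in range(n_samples)
        ])
    return preds.mean(dim=0), preds.std(dim=0)
```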

    A multimodal deep learning approach for food tray recognition

    Bachelor's thesis (Treballs Finals de Grau d'Enginyeria Informàtica), Faculty of Mathematics, Universitat de Barcelona, Year: 2020, Supervisors: Marc Bolaños and Petia Radeva. Food recognition, i.e. object detection and classification applied to the food domain, is the main topic of this work. We have studied the problem of recognising food instances in tray images from self-service restaurants and propose a novel multimodal deep learning approach. From images and daily menus, the presented model combines two state-of-the-art models for object detection and classification with a multimodal neural network to make significantly refined predictions compared to the baseline object detection model, achieving a class-weighted average F1-score of 0.862. An ensemble model built from the proposed and baseline models, also presented in this work, improves the results further, achieving a class-weighted average F1-score of 0.877.
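    A minimal sketch of the ensembling and scoring described above, assuming the ensemble simply averages the two models' class probabilities (the actual combination scheme is not specified in the abstract); the arrays are placeholders for real model outputs and labels.

```python
# Illustrative sketch: probability-averaging ensemble and class-weighted
# F1 scoring; arrays stand in for real model outputs and labels.
import numpy as np
from sklearn.metrics import f1_score

def ensemble_probs(probs_multimodal: np.ndarray,
                   probs_baseline: np.ndarray) -> np.ndarray:
    """Average the per-class probabilities of the two models."""
    return (probs_multimodal + probs_baseline) / 2.0

def weighted_f1(y_true: np.ndarray, probs: np.ndarray) -> float:
    """Class-weighted average F1, the metric reported in the abstract."""
    return f1_score(y_true, probs.argmax(axis=1), average="weighted")
```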

    A history and theory of textual event detection and recognition


    A systematic survey of online data mining technology intended for law enforcement

    As an increasing amount of crime takes on a digital aspect, law enforcement bodies must tackle an online environment generating huge volumes of data. With manual inspection becoming increasingly infeasible, law enforcement bodies are optimising online investigations through data-mining technologies. Such technologies must be well designed and rigorously grounded, yet no survey of the online data-mining literature exists that examines their techniques, applications and rigour. This article remedies that gap through a systematic mapping study of online data-mining literature that visibly targets law enforcement applications, using evidence-based survey practices to produce a replicable analysis that can be methodologically examined for deficiencies.

    Detecting Abnormal Behavior in Web Applications

    The rapid advance of web technologies has made the Web an essential part of our daily lives. However, network attacks have exploited vulnerabilities in web applications and caused substantial damage to Internet users. Detecting network attacks is an important first step in network security, and a major branch of this area is anomaly detection. This dissertation concentrates on detecting abnormal behaviors in web applications using the following methodology: for a given web application, we conduct a set of measurements to reveal the existence of abnormal behaviors and observe the differences between normal and abnormal behaviors. By applying a variety of information-extraction methods, such as heuristic algorithms, machine learning, and information theory, we extract features useful for building a classification system that detects abnormal behaviors. In particular, we have studied four detection problems in web security. The first is detecting unauthorized hotlinking behavior that plagues hosting servers on the Internet. We analyze a group of common hotlinking attacks and the web resources targeted by them, and then present an anti-hotlinking framework for protecting materials on hosting servers. The second problem is detecting aggressive automation behavior on Twitter. Our work determines whether a Twitter user is a human, bot or cyborg based on its degree of automation. We observe the differences among the three categories in terms of tweeting behavior, tweet content, and account properties, and propose a classification system that combines features extracted from an unknown user to determine the likelihood of it being a human, bot or cyborg. Furthermore, we shift the detection perspective from automation to spam and introduce the third problem, detecting social spam campaigns on Twitter. Evolved from individual spammers, spam campaigns manipulate and coordinate multiple accounts to spread spam on Twitter, and display collective characteristics. We design an automatic classification system based on machine learning and apply multiple features to classifying spam campaigns. Complementary to conventional spam detection methods, our work brings efficiency and robustness. Finally, we extend our detection research into the blogosphere to capture blog bots. In this problem, detecting the human presence is an effective defense against the automatic posting ability of blog bots. We introduce behavioral biometrics, mainly mouse and keyboard dynamics, to distinguish between human and bot. By passively monitoring user browsing activities, this detection method does not require any direct user participation and improves the user experience.
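    As one concrete example of a feature that separates automated from human tweeting behavior, the sketch below computes the entropy of a user's inter-tweet intervals; highly regular, bot-like posting yields low entropy. The binning parameters are illustrative assumptions, not the dissertation's exact feature definition.

```python
# Illustrative sketch: Shannon entropy of inter-tweet intervals as an
# automation feature; the 60-second binning is an assumption.
import math
from collections import Counter
from typing import List

def interval_entropy(timestamps: List[float], bin_width: float = 60.0) -> float:
    """Entropy (bits) of inter-tweet intervals bucketed into fixed bins."""
    intervals = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if not intervals:
        return 0.0
    counts = Counter(int(iv // bin_width) for iv in intervals)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```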

    Extracting business performance signals from Twitter news

    Social media and social networks underpin a revolution in communication between people, with the particular feature that much of that communication is open to all. This provides a massive pool of data that researchers can exploit for a wide variety of applications. Data from Twitter is of particular interest in this sense, given its large global usage and the availability of APIs and other tools that enable easy access to the publicly available stream of tweets. Owing to the wide public penetration of Twitter, many businesses use it to share their latest news, effectively treating Twitter as a gateway to end-users, consumers and/or investors. In this thesis, we focus on the potential for extracting information from Twitter that is relevant to the financial and competitiveness status of a business. We consider a collection of well-regarded Twitter accounts that are known for communicating recent business news, and we investigate the automated analysis of the stream of tweets from these sources, with a view to learning business-relevant information about specific companies. A key aspect of our approach is the idea of extracting signals for specific areas of business performance; we explore three such areas: productivity, competitiveness, and industrial risk. We propose a two-step model which first classifies a tweet into one of these areas and then assigns it a sentiment value (on a positive/negative scale). The resulting sentiment values across specific aspects represent novel business indicators that could add significant value to the toolset used by business analysts. Our experiments are based on a new manually pre-classified data set (available from a URL provided). Additionally, we propose n-grams made from non-contiguous words as a novel feature to enhance performance in this context; experiments involving a range of feature selection methods show that these new features provide valuable benefits in comparison with standard n-gram features. We also introduce an extra layer added in front of the primary classifier, with the role of filtering out noisy tweets before they enter the system; we use a One-Class SVM for this purpose. Broadly, we show that the methods developed in this thesis achieve promising results in both topic and sentiment classification in the business performance context, suggesting that Twitter can indeed be a useful source of signals related to different aspects of business performance. We also find that our system can provide valuable insight into unseen test data. However, more research is needed to extract robust signals for industrial risk, and there seems to be considerable promise for further development.
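    A minimal sketch of two components mentioned above, under stated assumptions: non-contiguous word n-grams (skip-bigrams) as features, and a One-Class SVM acting as a noise filter in front of the primary classifier. The tokenisation, parameters, and example tweets are illustrative, not the thesis's exact setup.

```python
# Illustrative sketch: skip-bigram features and a One-Class SVM noise
# filter, under the assumptions stated in the text above.
from typing import List, Tuple

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

def skip_bigrams(tokens: List[str], max_skip: int = 2) -> List[Tuple[str, str]]:
    """All ordered word pairs with up to `max_skip` words between them."""
    pairs = []
    for i, left in enumerate(tokens):
        for j in range(i + 1, min(i + 2 + max_skip, len(tokens))):
            pairs.append((left, tokens[j]))
    return pairs

# Noise filtering: fit a One-Class SVM on vectorised in-domain tweets,
# then drop incoming tweets it flags as outliers (-1) before they reach
# the topic and sentiment classifiers. Example tweets are made up.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform([
    "company raises revenue outlook",
    "firm announces plant closure",
])
noise_filter = OneClassSVM(nu=0.1).fit(X_train)
is_relevant = noise_filter.predict(
    vectorizer.transform(["quarterly profits up"])) == 1
```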