863 research outputs found

    Drift Detection using Uncertainty Distribution Divergence

    Get PDF
    Data generated from naturally occurring processes tends to be non-stationary. For example, seasonal and gradual changes in climate data and sudden changes in financial data. In machine learning the degradation in classifier performance due to such changes in the data is known as concept drift and there are many approaches to detecting and handling it. Most approaches to detecting concept drift, however, make the assumption that true classes for test examples will be available at no cost shortly after classification and base the detection of concept drift on measures relying on these labels. The high labelling cost in many domains provides a strong motivation to reduce the number of labelled instances required to detect and handle concept drift. Triggered detection approaches that do not require labelled instances to detect concept drift show great promise for achieving this. In this paper we present Confidence Distribution Batch Detection (CDBD), an approach that provides a signal correlated to changes in concept without using labelled data. This signal combined with a trigger and a rebuild policy can maintain classifier accuracy which, in most cases, matches the accuracy achieved using classification error based detection techniques but using only a limited amount of labelled data

    Identification of Informativeness in Text using Natural Language Stylometry

    Get PDF
    In this age of information overload, one experiences a rapidly growing over-abundance of written text. To assist with handling this bounty, this plethora of texts is now widely used to develop and optimize statistical natural language processing (NLP) systems. Surprisingly, the use of more fragments of text to train these statistical NLP systems may not necessarily lead to improved performance. We hypothesize that those fragments that help the most with training are those that contain the desired information. Therefore, determining informativeness in text has become a central issue in our view of NLP. Recent developments in this field have spawned a number of solutions to identify informativeness in text. Nevertheless, a shortfall of most of these solutions is their dependency on the genre and domain of the text. In addition, most of them are not efficient regardless of the natural language processing problem areas. Therefore, we attempt to provide a more general solution to this NLP problem. This thesis takes a different approach to this problem by considering the underlying theme of a linguistic theory known as the Code Quantity Principle. This theory suggests that humans codify information in text so that readers can retrieve this information more efficiently. During the codification process, humans usually change elements of their writing ranging from characters to sentences. Examples of such elements are the use of simple words, complex words, function words, content words, syllables, and so on. This theory suggests that these elements have reasonable discriminating strength and can play a key role in distinguishing informativeness in natural language text. In another vein, Stylometry is a modern method to analyze literary style and deals largely with the aforementioned elements of writing. With this as background, we model text using a set of stylometric attributes to characterize variations in writing style present in it. We explore their effectiveness to determine informativeness in text. To the best of our knowledge, this is the first use of stylometric attributes to determine informativeness in statistical NLP. In doing so, we use texts of different genres, viz., scientific papers, technical reports, emails and newspaper articles, that are selected from assorted domains like agriculture, physics, and biomedical science. The variety of NLP systems that have benefitted from incorporating these stylometric attributes somewhere in their computational realm dealing with this set of multifarious texts suggests that these attributes can be regarded as an effective solution to identify informativeness in text. In addition to the variety of text genres and domains, the potential of stylometric attributes is also explored in some NLP application areas---including biomedical relation mining, automatic keyphrase indexing, spam classification, and text summarization---where performance improvement is both important and challenging. The success of the attributes in all these areas further highlights their usefulness

    Handling Concept Drift in the Context of Expensive Labels

    Get PDF
    Machine learning has been successfully applied to a wide range of prediction problems, yet its application to data streams can be complicated by concept drift. Existing approaches to handling concept drift are overwhelmingly reliant on the assumption that it is possible to obtain the true label of an instance shortly after classification at a negligible cost. The aim of this thesis is to examine, and attempt to address, some of the problems related to handling concept drift when the cost of obtaining labels is high. This thesis presents Decision Value Sampling (DVS), a novel concept drift handling approach which periodically chooses a small number of the most useful instances to label. The newly labelled instances are then used to re-train the classifier, an SVM with a linear kernel, to handle any change in concept that might occur. In this way, only the instances that are required to keep the classifier up-to-date are labelled. The evaluation of the system indicates that a classifier can be kept up-to-date with changes in concept while only requiring 15% of the data stream to be labelled. In a data stream with a high throughput this represents a significant reduction in the number of labels required. The second novel concept drift handling approach proposed in this thesis is Confidence Distribution Batch Detection (CDBD). CDBD uses a heuristic based on the distribution of an SVM’s confidence in its predictions to decide when to rebuild the clas- sifier. The evaluation shows that CDBD can be used to reliably detect when a change in concept has taken place and that concept drift can be handled if the classifier is rebuilt when CDBD sig- nals a change in concept. The evaluation also shows that CDBD obtains a considerable labels saving as it only requires labelled data when a change in concept has been detected. The two concept drift handling approaches deal with concept drift in a different manner, DVS continuously adapts the clas- sifier, whereas CDBD only adapts the classifier when a sizeable change in concept is suspected. They reflect a divide also found in the literature, between continuous rebuild approaches (like DVS) and triggered rebuild approaches (like CDBD). The final major contribution in this thesis is a comparison between continuous and triggered rebuild approaches, as this is an underexplored area. An empirical comparison between representative techniques from both types of approaches shows that triggered rebuild works slightly better on large datasets where the changes in concepts occur infrequently, but in general a continuous rebuild approach works the best

    A Comprehensive Survey of Data Mining-based Fraud Detection Research

    Full text link
    This survey paper categorises, compares, and summarises from almost all published technical and review articles in automated fraud detection within the last 10 years. It defines the professional fraudster, formalises the main types and subtypes of known fraud, and presents the nature of data evidence collected within affected industries. Within the business context of mining the data to achieve higher cost savings, this research presents methods and techniques together with their problems. Compared to all related reviews on fraud detection, this survey covers much more technical articles and is the only one, to the best of our knowledge, which proposes alternative data and solutions from related domains.Comment: 14 page

    Spam Detection Using Machine Learning and Deep Learning

    Get PDF
    Text messages are essential these days; however, spam texts have contributed negatively to the success of this communication mode. The compromised authenticity of such messages has given rise to several security breaches. Using spam messages, malicious links have been sent to either harm the system or obtain information detrimental to the user. Spam SMS messages as well as emails have been used as media for attacks such as masquerading and smishing ( a phishing attack through text messaging), and this has threatened both the user and service providers. Therefore, given the waves of attacks, the need to identify and remove these spam messages is important. This dissertation explores the process of text classification from data input to embedded representation of the words in vector form and finally the classification process. Therefore, we have applied different embedding methods to capture both the linguistic and semantic meanings of words. Static embedding methods that are used include Word to Vector (Word2Vec) and Global Vectors (GloVe), while for dynamic embedding the transfer learning of the Bidirectional Encoder Representations from Transformers (BERT) was employed. For classification, both machine learning and deep learning techniques were used to build an efficient and sensitive classification model with good accuracy and low false positive rate. Our result established that the combination of BERT for embedding and machine learning for classification produced better classification results than other combinations. With these results, we developed models that combined the self-feature extraction advantage of deep learning and the effective classification of machine learning. These models were tested on four different datasets, namely: SMS Spam dataset, Ling dataset, Spam Assassin dataset and Enron dataset. BERT+SVC (hybrid model) produced the result with highest accuracy and lowest false positive rate

    Security Enhancements in Voice Over Ip Networks

    Get PDF
    Voice delivery over IP networks including VoIP (Voice over IP) and VoLTE (Voice over LTE) are emerging as the alternatives to the conventional public telephony networks. With the growing number of subscribers and the global integration of 4/5G by operations, VoIP/VoLTE as the only option for voice delivery becomes an attractive target to be abused and exploited by malicious attackers. This dissertation aims to address some of the security challenges in VoIP/VoLTE. When we examine the past events to identify trends and changes in attacking strategies, we find that spam calls, caller-ID spoofing, and DoS attacks are the most imminent threats to VoIP deployments. Compared to email spam, voice spam will be much more obnoxious and time consuming nuisance for human subscribers to filter out. Since the threat of voice spam could become as serious as email spam, we first focus on spam detection and propose a content-based approach to protect telephone subscribers\u27 voice mailboxes from voice spam. Caller-ID has long been used to enable the callee parties know who is calling, verify his identity for authentication and his physical location for emergency services. VoIP and other packet switched networks such as all-IP Long Term Evolution (LTE) network provide flexibility that helps subscribers to use arbitrary caller-ID. Moreover, interconnecting between IP telephony and other Circuit-Switched (CS) legacy telephone networks has also weakened the security of caller-ID systems. We observe that the determination of true identity of a calling device helps us in preventing many VoIP attacks, such as caller-ID spoofing, spamming and call flooding attacks. This motivates us to take a very different approach to the VoIP problems and attempt to answer a fundamental question: is it possible to know the type of a device a subscriber uses to originate a call? By exploiting the impreciseness of the codec sampling rate in the caller\u27s RTP streams, we propose a fuzzy rule-based system to remotely identify calling devices. Finally, we propose a caller-ID based public key infrastructure for VoIP and VoLTE that provides signature generation at the calling party side as well as signature verification at the callee party side. The proposed signature can be used as caller-ID trust to prevent caller-ID spoofing and unsolicited calls. Our approach is based on the identity-based cryptography, and it also leverages the Domain Name System (DNS) and proxy servers in the VoIP architecture, as well as the Home Subscriber Server (HSS) and Call Session Control Function (CSCF) in the IP Multimedia Subsystem (IMS) architecture. Using OPNET, we then develop a comprehensive simulation testbed for the evaluation of our proposed infrastructure. Our simulation results show that the average call setup delays induced by our infrastructure are hardly noticeable by telephony subscribers and the extra signaling overhead is negligible. Therefore, our proposed infrastructure can be adopted to widely verify caller-ID in telephony networks
    corecore