9 research outputs found

    Semi-supervised learning approach using modified self-training algorithm to counter burst header packet flooding attack in optical burst switching network

    Burst header packet flooding is an attack on optical burst switching (OBS) networks that may cause denial of service. The application of machine learning techniques to detect malicious nodes in OBS networks is relatively new. Because finding a sufficient amount of labeled data to perform supervised learning is difficult, semi-supervised methods of learning (SSML) can be leveraged. In this paper, we study the classical self-training algorithm (ST), which uses the SSML paradigm. Generally, in ST, the available true-labeled data (L) is used to train a base classifier, which then predicts the labels of unlabeled data (U). A portion of the newly labeled data is removed from U based on prediction confidence and combined with L. The resulting data is then used to re-train the classifier, and this process is repeated until convergence. This paper proposes a modified self-training method (MST): we trained multiple classifiers on L in two stages and leveraged agreement among those classifiers to determine labels. The performance of MST was compared with ST on several datasets, and significant improvement was found. We applied MST to a simulated OBS network dataset and achieved very high accuracy with a small number of labeled data. Finally, we compared this work with related work.
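The classical self-training loop described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the base classifier here is a hand-rolled nearest-centroid model, and the confidence measure (margin between the two nearest class centroids) and threshold are assumptions for the sketch.

```python
# Sketch of the classical self-training (ST) loop: train on L, label
# confident points from U, fold them into L, repeat until convergence.

def centroid_fit(X, y):
    """Return per-class centroids of the labeled data."""
    cents = {}
    for label in set(y):
        pts = [x for x, l in zip(X, y) if l == label]
        cents[label] = [sum(c) / len(pts) for c in zip(*pts)]
    return cents

def centroid_predict(cents, x):
    """Predict a label and a confidence score (distance margin)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, c)) ** 0.5, label)
        for label, c in cents.items()
    )
    best, runner_up = dists[0], dists[1]
    conf = runner_up[0] - best[0]      # margin between two nearest classes
    return best[1], conf

def self_train(L_X, L_y, U, threshold=0.5, max_rounds=10):
    """Repeatedly label confident points from U and fold them into L."""
    U = list(U)
    for _ in range(max_rounds):
        model = centroid_fit(L_X, L_y)
        confident, rest = [], []
        for x in U:
            label, conf = centroid_predict(model, x)
            (confident if conf >= threshold else rest).append((x, label))
        if not confident:              # convergence: nothing confident left
            break
        for x, label in confident:
            L_X.append(x)
            L_y.append(label)
        U = [x for x, _ in rest]
    return centroid_fit(L_X, L_y)

# Toy data: two labeled points, four unlabeled points near the clusters.
L_X = [[0.0, 0.0], [5.0, 5.0]]
L_y = ["benign", "malicious"]
U = [[0.2, 0.1], [4.8, 5.1], [0.1, 0.3], [5.2, 4.9]]
model = self_train(L_X, L_y, U)
print(centroid_predict(model, [0.0, 0.2])[0])   # → benign
```

The paper's MST variant replaces the single base classifier with multiple classifiers trained in two stages, using their agreement (rather than a single model's confidence) to decide which labels to accept.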

    Disagreement-Based Co-training


    Semi-Supervised Self-Training for Sentence Subjectivity Classification

    Recent natural language processing (NLP) research shows that identifying and extracting subjective information from texts can benefit many NLP applications. In this paper, we address a semi-supervised learning approach, self-training, for sentence subjectivity classification. In self-training, the confidence degree, which depends on the ranking of class membership probabilities, is commonly used as the selection metric that ranks and selects the unlabeled instances for the next round of training of the underlying classifier. Naive Bayes (NB) is often used as the underlying classifier because its class membership probability estimates have good ranking performance. The first contribution of this paper is a study of the performance of self-training using decision tree models, such as C4.5, C4.4, and naive Bayes tree (NBTree), as the underlying classifiers. The second contribution is an adapted Value Difference Metric (VDM), proposed as the selection metric in self-training, which does not depend on class membership probabilities. Based on the Multi-Perspective Question Answering (MPQA) corpus, a set of experiments was designed to compare the performance of self-training with different underlying classifiers using different selection metrics under various conditions. The experimental results show that the performance of self-training is improved by using VDM instead of the confidence degree, and that self-training with NBTree and VDM outperforms self-training with other combinations of underlying classifiers and selection metrics. The results also show that the self-training approach can achieve performance comparable to supervised learning models. NRC publication: Ye
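The confidence-degree selection step described above can be sketched as follows. This is an illustration only: the class-membership probabilities come from a stand-in dictionary rather than an actual NB or NBTree classifier, and the paper's adapted VDM metric is not reproduced here.

```python
# Sketch of confidence-degree selection: rank unlabeled instances by
# their top class-membership probability and keep the most confident
# ones for the next training round of the underlying classifier.

def select_by_confidence(unlabeled, class_probs, k=2):
    """Return the k instances whose top class probability is highest."""
    scored = []
    for inst in unlabeled:
        probs = class_probs[inst]                  # {label: P(label | inst)}
        label, conf = max(probs.items(), key=lambda kv: kv[1])
        scored.append((conf, inst, label))
    scored.sort(reverse=True)
    return [(inst, label) for _, inst, label in scored[:k]]

# Hypothetical posterior estimates for four unlabeled sentences.
class_probs = {
    "s1": {"subjective": 0.95, "objective": 0.05},
    "s2": {"subjective": 0.55, "objective": 0.45},
    "s3": {"subjective": 0.20, "objective": 0.80},
    "s4": {"subjective": 0.51, "objective": 0.49},
}
picked = select_by_confidence(["s1", "s2", "s3", "s4"], class_probs)
print(picked)   # → [('s1', 'subjective'), ('s3', 'objective')]
```

The paper's point is that this ranking depends entirely on the quality of the probability estimates, which is why the authors propose VDM as a selection metric that avoids class membership probabilities altogether.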

    APPLICATION OF LINK GRAMMAR IN SEMI-SUPERVISED NAMED ENTITY RECOGNITION FOR ACCIDENT DOMAIN

    Accident documents typically contain crucial information that might be useful for future accident investigation, i.e. the date and time when the accident happened, the location where it occurred, and the people involved. Such documents are largely available as free text, in the form of newswire articles or accident reports. Although it is possible to identify this information manually, the high volumes of data involved make the task time-consuming and prone to error. Information Extraction (IE) has been identified as a potential solution to this problem: IE can extract crucial information from unstructured texts and convert it into a more structured representation. This research explores Named Entity Recognition (NER), one of the important tasks in IE research, which aims to identify and classify entities in text documents into predefined categories. Numerous related research works on IE and NER have been published and commercialized. However, to the best of our knowledge, only a handful of IE research works really focus on the accident domain, and none of them has attempted to explore or focus on NER, which is the main motivation for this research. The work presented in this thesis proposes an NER approach for accident documents that applies syntactical and word features in combination with the Self-Training algorithm. To satisfy the research objectives, this thesis makes three main contributions. The first contribution is the identification of the entity boundary. Entity segmentation, or identification of the entity boundary, is required because a named entity may consist of one or more words. We adopted the Stanford Part-of-Speech (POS) tagger for the word POS tags and connectors from the Link Grammar (LG) parser to determine the starting and stopping words.
The second contribution is the construction of extraction patterns. Each named-entity candidate is assigned an extraction pattern constructed from a set of word and syntactical features. Current NER systems use restricted syntactical features, which are associated with a number of limitations. It is therefore a great challenge to propose a new NER approach using syntactical features that can capture the full syntactical structure of a sentence. For the third contribution, we applied the Self-Training algorithm, a semi-supervised machine learning technique. The algorithm is used to predict labels for a large set of unlabeled data given a small number of labelled data. In our research, extraction patterns from the first module are fed to this algorithm and used to predict the category of each named-entity candidate. The Self-Training algorithm greatly benefits semi-supervised learning, allowing classification of entities given only a small amount of labelled data. It reduces the training effort and generates results nearly similar to those of conventional supervised learning techniques. The proposed system was tested on 100 accident news articles from Reuters to recognize three different named entities: date, person, and location, which are universally accepted categories in most NER applications. The Exact Match evaluation method, which consists of three evaluation metrics (precision, recall, and F-measure), was used to measure the proposed system's performance against three existing NER systems. The proposed system outperforms one of those systems by an overall F-measure of approximately 9%, but shows a slight decrease compared to the other two systems identified in our benchmarking. However, we believe that this difference is due to the different nature and techniques used in the three systems.
We consider our semi-supervised approach a promising method even though only two features are utilized: syntactical and word features. Further manual inspection during the experiments suggested that using complete word and syntactical features, or combining these features with others such as semantic features, would yield improved results.
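The idea of pairing each named-entity candidate with an extraction pattern built from word and syntactical features can be sketched as below. This is a toy illustration: the POS tags come from a tiny hand-written lexicon, and the pattern shape (preceding word, POS sequence, following word) is an assumption; the thesis itself uses the Stanford POS tagger and Link Grammar connectors, which are not reproduced here.

```python
# Illustrative extraction-pattern construction for an entity candidate,
# combining a word feature (context words) with a syntactical feature
# (the POS sequence of the candidate span).

POS = {"the": "DT", "crash": "NN", "occurred": "VBD", "in": "IN",
       "Kuala": "NNP", "Lumpur": "NNP", "on": "IN", "Monday": "NNP"}

def extraction_pattern(tokens, start, stop):
    """Pattern = word before, POS sequence of the candidate, word after."""
    before = tokens[start - 1] if start > 0 else "<S>"
    after = tokens[stop + 1] if stop + 1 < len(tokens) else "</S>"
    pos_seq = " ".join(POS.get(t, "UNK") for t in tokens[start:stop + 1])
    return (before, pos_seq, after)

tokens = "the crash occurred in Kuala Lumpur on Monday".split()
# Candidate "Kuala Lumpur" spans token indices 4..5.
print(extraction_pattern(tokens, 4, 5))   # → ('in', 'NNP NNP', 'on')
```

In the thesis's pipeline, patterns like these would then be fed to the self-training stage to predict each candidate's category (date, person, or location) from a small labelled seed set.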

    Personal Email Spam Filtering with Minimal User Interaction

    This thesis investigates ways to reduce or eliminate the necessity of user input to learning-based personal email spam filters. Personal spam filters have been shown in previous studies to yield superior effectiveness, at the cost of requiring extensive user training, which may be burdensome or impossible. This work describes new approaches to the problem of building a personal spam filter that requires minimal user feedback. An initial study investigates how well a personal filter can learn from sources of data other than the user's own messages. Our initial studies show that inter-user training yields substantially inferior results to intra-user training using the best known methods. Moreover, contrary to previous literature, we found that transfer learning degrades the performance of spam filters when the training and test sets belong to two different users or different times. We also adapt and modify a graph-based semi-supervised learning algorithm to build a filter that can classify an entire inbox trained on twenty or fewer user judgments. Our experiments show that this approach compares well with previous techniques when trained on as few as two training examples. We also present the toolkit we developed to perform privacy-preserving user studies on spam filters. This toolkit allows researchers to evaluate any spam filter that conforms to a standard interface defined by TREC on real users' email boxes; researchers have access only to the TREC-style result file, and not to any content of a user's email stream. To eliminate the necessity of user feedback, we build a personal autonomous filter that learns exclusively from the result of a global spam filter. Our laboratory experiments show that learning filters with no user input can substantially improve the results of open-source and industry-leading commercial filters that employ no user-specific training. We use our toolkit to validate the performance of the autonomous filter in a user study.
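Graph-based semi-supervised classification of the kind adapted in this thesis can be sketched with a simple label-propagation loop: messages are graph nodes, edges connect similar messages, and a handful of user judgments spread to the rest of the inbox. The graph, edge weights, and iteration count below are illustrative assumptions, not the thesis's actual algorithm.

```python
# Minimal label-propagation sketch: seed judgments stay clamped while
# every other node repeatedly takes the weighted average of its
# neighbours' scores; the sign of the final score gives the label.

def propagate(edges, seeds, n_nodes, rounds=20):
    """edges: {node: [(neighbour, weight), ...]} (symmetric)
    seeds: {node: +1.0 (spam) or -1.0 (ham)}
    Returns a score per node."""
    scores = [0.0] * n_nodes
    for node, val in seeds.items():
        scores[node] = val
    for _ in range(rounds):
        new = scores[:]
        for node in range(n_nodes):
            if node in seeds:
                continue                       # clamp labeled nodes
            nbrs = edges.get(node, [])
            total = sum(w for _, w in nbrs)
            if total:
                new[node] = sum(w * scores[nb] for nb, w in nbrs) / total
        scores = new
    return scores

# Toy inbox: messages 0 and 1 judged by the user, 2-4 unlabeled.
edges = {0: [(2, 1.0)], 2: [(0, 1.0), (3, 0.5)],
         3: [(2, 0.5)], 1: [(4, 1.0)], 4: [(1, 1.0)]}
seeds = {0: 1.0, 1: -1.0}                      # one spam, one ham judgment
scores = propagate(edges, seeds, n_nodes=5)
print(["spam" if s > 0 else "ham" for s in scores])
# → ['spam', 'ham', 'spam', 'spam', 'ham']
```

With only two seed judgments, every unlabeled message still receives a label through the graph, which mirrors the thesis's finding that an entire inbox can be classified from as few as two training examples.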

    A Socio-mathematical and Structure-Based Approach to Model Sentiment Dynamics in Event-Based Text

    Natural language texts are often meant to express or impact the emotions of individuals. Recognizing the underlying emotions expressed in or triggered by textual content is essential if one is to arrive at an understanding of the full meaning that textual content conveys. Sentiment analysis (SA) researchers are becoming increasingly interested in investigating natural language processing techniques as well as emotion theory in order to detect, extract, and classify the sentiments that natural language text expresses. Most SA research is focused on the analysis of subjective documents from the writer’s perspective and their classification into categorical labels or sentiment polarity, in which text is associated with a descriptive label or a point on a continuum between two polarities. Researchers often perform sentiment or polarity classification tasks using machine learning (ML) techniques, sentiment lexicons, or hybrid-based approaches. Most ML methods rely on count-based word representations that fail to take word order into account. Despite the successful use of these flat word representations in topic-modelling problems, SA problems require a deeper understanding of sentence structure, since the entire meaning of words can be reversed through negations or word modifiers. On the other hand, approaches based on semantic lexicons are limited by the relatively small number of words they contain, which do not begin to embody the extensive and growing vocabulary on the Internet. The research presented in this thesis represents an effort to tackle the problem of sentiment analysis from a different viewpoint than those underlying current mainstream studies in this research area. A cross-disciplinary approach is proposed that incorporates affect control theory (ACT) into a structured model for determining the sentiment polarity of event-based articles from the perspectives of readers and interactants. 
A socio-mathematical theory, ACT provides valuable resources for handling interactions between words (event entities) and for predicting situational sentiments triggered by social events. ACT models human emotions arising from social event terms through the use of multidimensional representations that have been verified both empirically and theoretically. To model human emotions regarding textual content, the first step was to develop a fine-grained event extraction algorithm that extracts events and their entities from event-based textual information using semantic and syntactic parsing techniques. The results of the event extraction method were compared against a supervised learning approach on two human-coded corpora (a grammatically correct and a grammatically incorrect structured corpus). For both corpora, the semantic-syntactic event extraction method yielded a higher degree of accuracy than the supervised learning approach. The three-dimensional ACT lexicon was also augmented in a semi-supervised fashion using graph-based label propagation built from semantic and neural network word embeddings. The word embeddings were obtained through the training of commonly used count-based and neural-network-based algorithms on a single corpus, and each method was evaluated with respect to the reconstruction of a sentiment lexicon. The results show that, relative to other word embeddings and state-of-the-art methods, combining both semantic and neural word embeddings yielded the highest correlation scores and lowest error rates. Using the augmented lexicon and ACT mathematical equations, human emotions were modelled according to different levels of granularity (i.e., at the sentence and document levels). The initial stage involved the development of a proposed entity-based SA approach that models reader emotions triggered by event-based sentences. 
The emotions are modelled in a three-dimensional space based on reader sentiment toward different entities (e.g., subject and object) in the sentence. The new approach was evaluated using a human-annotated news-headline corpus; the results revealed the proposed method to be competitive with benchmark ML techniques. The second phase entailed the creation of a proposed ACT-based model for predicting the temporal progression of the interactants' emotions and their optimal behaviour over a sequence of interactions. The model was evaluated using three different corpora: fairy tales, news articles, and a handcrafted corpus. The results produced by the proposed model demonstrate that, despite the challenging sentence structure, reasonable agreement was achieved between the estimated emotions and behaviours and the corresponding ground truth.
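The three-dimensional representation central to this work can be illustrated with a short worked example. In affect control theory, concepts carry Evaluation-Potency-Activity (EPA) profiles, and the deflection between fundamental (dictionary) sentiments and transient (event-produced) impressions is commonly computed as a sum of squared differences. The EPA values below are illustrative, not taken from the thesis's lexicon.

```python
# ACT-style deflection: how far an event has pushed impressions of a
# concept away from its culturally shared (fundamental) EPA sentiment.

def deflection(fundamental, transient):
    """Sum of squared element-wise differences across the EPA triples."""
    return sum((f - t) ** 2 for f, t in zip(fundamental, transient))

# Hypothetical EPA profiles for one entity in an event.
hero_fundamental = (2.5, 1.8, 1.2)   # "hero": good, powerful, lively
hero_transient = (0.5, 1.0, 1.0)     # impression after a troubling event

d = deflection(hero_fundamental, hero_transient)
print(round(d, 2))   # → 4.68
```

A large deflection signals that the event is affectively inconsistent with expectations, which is the kind of signal the thesis exploits to model reader and interactant emotions over sequences of events.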

    Text mining patient experiences from online health communities

    Social media has had an impact on how patients experience healthcare. Through online channels, patients share information and their experiences with potentially large audiences all over the world. While sharing in this way may offer immediate benefits to the patients themselves and their readership (e.g. other patients), these unprompted, self-authored accounts of illness are also an important resource for healthcare researchers, offering unprecedented insight into patients' experience of illness. Qualitative analysis has been undertaken to explore this source of data and to utilise the information expressed through these media. However, the manual nature of the analysis limits its scope to a small proportion of the hundreds of thousands of authors who are creating content. In our research, we aim to explore the use of text mining to support traditional qualitative analysis of this data. Text mining uses a number of processes to extract useful facts from text and analyse the patterns within; the ultimate aim is to generate new knowledge by analysing textual data en masse. We developed QuTiP, a text mining framework that enables large-scale qualitative analyses of patient narratives shared over social media. In this thesis, we describe QuTiP and our application of the framework to analyse the accounts of patients living with chronic lung disease. As well as a qualitative analysis, we describe our approaches to automated information extraction, term recognition, and text classification in order to automatically extract relevant information from blog post data. Within the QuTiP framework, these individual automated approaches can be brought together to support further analyses of large social media datasets.