29 research outputs found

    PRIVAFRAME: A Frame-Based Knowledge Graph for Sensitive Personal Data

    The pervasiveness of dialogue systems and virtual conversation applications raises an important theme: the potential of sharing sensitive information, and the consequent need for protection. To guarantee the subject’s right to privacy, and avoid the leakage of private content, it is important to treat sensitive information. However, any treatment first requires identifying sensitive text, along with appropriate techniques to do so automatically. The Sensitive Information Detection (SID) task has been explored in the literature in different domains and languages, but there is no common benchmark. Current approaches are mostly based on artificial neural networks (ANNs) or transformers based on them. Our research focuses on identifying categories of personal data in informal English sentences, by adopting a new logical-symbolic approach, and eventually hybridising it with ANN models. We present a frame-based knowledge graph built for personal data categories defined in the Data Privacy Vocabulary (DPV). The knowledge graph is designed through the logical composition of already existing frames, and has been evaluated as background knowledge for a SID system against a labeled sensitive information dataset. The accuracy of PRIVAFRAME reached 78%. By comparison, a transformer-based model achieved 12% lower performance on the same dataset. The top-down logical-symbolic frame-based model allows a granular analysis, and does not require a training dataset. These advantages lead us to use it as a layer in a hybrid model, where the logical SID is combined with an ANN-based SID tested in a previous study by the authors.

    A Supervised Approach for Enriching the Relational Structure of Frame Semantics in FrameNet

    Frame semantics is a theory of linguistic meaning, and is considered to be a useful framework for shallow semantic analysis of natural language. FrameNet, which is based on frame semantics, is a popular lexical semantic resource. In addition to providing a set of core semantic frames and their frame elements, FrameNet also provides relations between those frames (hence providing a network of frames, i.e., FrameNet). We address here the limited coverage of the network of conceptual relations between frames in FrameNet, which has previously been pointed out by others. We present a supervised model using rich features from three different sources: structural features from the existing FrameNet network, information from the WordNet relations between synsets projected into semantic frames, and corpus-collected lexical associations. We show large improvements over baselines consisting of each of the three groups of features in isolation. We then use this model to select frame pairs as candidate relations, and perform an evaluation on a sample, obtaining good precision.

    Automatic text summarisation using linguistic knowledge-based semantics

    Text summarisation is reducing a text document to a short substitute summary. Since the commencement of the field, almost all summarisation research works implemented to this date involve identification and extraction of the most important document/cluster segments, called extraction. This typically involves scoring each document sentence according to a composite scoring function consisting of surface level and semantic features. Enabling machines to analyse text features and understand their meaning potentially requires both text semantic analysis and equipping computers with an external semantic knowledge. This thesis addresses extractive text summarisation by proposing a number of semantic and knowledge-based approaches. The work combines the high-quality semantic information in WordNet, the crowdsourced encyclopaedic knowledge in Wikipedia, and the manually crafted categorial variation in CatVar, to improve the summary quality. Such improvements are accomplished through sentence level morphological analysis and the incorporation of Wikipedia-based named-entity semantic relatedness while using heuristic algorithms. The study also investigates how sentence-level semantic analysis based on semantic role labelling (SRL), leveraged with a background world knowledge, influences sentence textual similarity and text summarisation. The proposed sentence similarity and summarisation methods were evaluated on standard publicly available datasets such as the Microsoft Research Paraphrase Corpus (MSRPC), TREC-9 Question Variants, and the Document Understanding Conference 2002, 2005, 2006 (DUC 2002, DUC 2005, DUC 2006) Corpora. The project also uses Recall-Oriented Understudy for Gisting Evaluation (ROUGE) for the quantitative assessment of the proposed summarisers’ performances. Results of our systems showed their effectiveness as compared to related state-of-the-art summarisation methods and baselines. 
Of the proposed summarisers, the SRL Wikipedia-based system demonstrated the best performance.
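The abstract above names ROUGE as its evaluation measure. As a rough illustration only, ROUGE-N recall can be sketched as the fraction of reference n-grams that also appear in the candidate summary. The function names and whitespace tokenisation below are assumptions made for this sketch; the official ROUGE toolkit offers further options (stemming, stopword removal, multiple references), and this is not the evaluation setup used in the thesis.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram overlap divided by the number of reference n-grams.

    Tokenisation is a plain lowercase whitespace split (an assumption of
    this sketch, not part of any official ROUGE implementation).
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as it occurs
    # in the candidate (clipping).
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


candidate_summary = "the cat sat on the mat"
reference_summary = "the cat is on the mat"
print(rouge_n_recall(candidate_summary, reference_summary, n=1))
print(rouge_n_recall(candidate_summary, reference_summary, n=2))
```

Here five of the six reference unigrams and three of the five reference bigrams are matched, so the two calls give 5/6 and 3/5 respectively.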

    Analysis and implementation of methods for the text categorization

    Text Categorization (TC) is the automatic classification of text documents under pre-defined categories, or classes. Popular TC approaches map categories into symbolic labels and use a training set of documents, previously labeled by human experts, to build a classifier which enables the automatic TC of unlabeled documents. Suitable TC methods come from the fields of data mining and information retrieval; however, the following issues remain unsolved. First, the classifier performance depends heavily on hand-labeled documents that are the only source of knowledge for learning the classifier. Being a labor-intensive and time-consuming activity, the manual attribution of documents to categories is extremely costly. This creates a serious limitation when a set of manually labeled data is not available, as happens in most cases. Second, even a moderately sized text collection often has tens of thousands of terms, making the classification cost prohibitive for learning algorithms that do not scale well to large problem sizes. Most important, TC should be based on the text content rather than on a set of hand-labeled documents whose categorization depends on the subjective judgment of a human classifier. This thesis aims at facing the above issues by proposing innovative approaches which leverage techniques from data mining and information retrieval. To face problems about both the high dimensionality of the text collection and the large number of terms in a single text, the thesis proposes a hybrid model for term selection which combines and takes advantage of both filter and wrapper approaches. In detail, the proposed model uses a filter to rank the list of terms present in documents to ensure that useful terms are unlikely to be screened out. Next, to limit classification problems due to the correlation among terms, this ranked list is refined by a wrapper that uses a Genetic Algorithm (GA) to retain the most informative and discriminative terms. 
Experimental results compare well with some of the top-performing learning algorithms for TC and seem to confirm the effectiveness of the proposed model. To face the issues about the lack and the subjectivity of manually labeled datasets, the basic idea is to use an ontology-based approach which does not depend on the existence of a training set and relies solely on a set of concepts within a given domain and the relationships between concepts. In this regard, the thesis proposes a text categorization approach that applies WordNet for selecting the correct sense of words in a document, and utilizes domain names in WordNet Domains for classification purposes. Experiments show that the proposed approach performs well in classifying a large corpus of documents. This thesis contributes to the area of data mining and information retrieval. Specifically, it introduces and evaluates novel techniques to the field of text categorization. The primary objective of this thesis is to test the hypotheses that: (i) text categorization requires and benefits from techniques designed to exploit document content; (ii) hybrid methods from data mining and information retrieval can better support problems about high dimensionality, which is the main aspect of large document collections; (iii) in the absence of manually annotated documents, a WordNet domain abstraction can be used that is both useful and general enough to categorize any document collection. As a final remark, it is important to acknowledge that much of the inspiration and motivation for this work derived from the vision of the future of text categorization processes which are related to specific application domains such as the business area and the industrial sectors, just to cite a few. In the end, it is this vision that provided the guiding framework. However, it is equally important to understand that many of the results and techniques developed in this thesis are not limited to text categorization. 
For example, the evaluation of disambiguation methods is interesting in its own right and is likely to be relevant to other application fields.
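The filter-then-wrapper term selection described in this abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the filter score (difference in per-class document frequency), the vote-based classifier used as the wrapper's fitness, the GA settings, and the tiny two-class corpus are all hypothetical and are not taken from the thesis.

```python
import random
from collections import Counter

# Hypothetical toy corpus: label 1 = sports, label 0 = finance.
DOCS = [
    ["goal", "match", "team", "the"], ["team", "win", "match", "the"],
    ["goal", "win", "the"], ["bank", "loan", "rate", "the"],
    ["rate", "market", "bank", "the"], ["loan", "market", "the"],
]
LABELS = [1, 1, 1, 0, 0, 0]


def class_doc_freqs(docs, labels):
    """Count, per class, how many documents contain each term."""
    pos, neg = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        (pos if label == 1 else neg).update(set(tokens))
    return pos, neg


def filter_rank(docs, labels):
    """Filter step: rank terms by how unevenly they occur across classes."""
    pos, neg = class_doc_freqs(docs, labels)
    terms = set(pos) | set(neg)
    return sorted(terms, key=lambda t: (-abs(pos[t] - neg[t]), t))


def fitness(mask, terms, docs, labels):
    """Wrapper fitness: toy classifier accuracy minus a size penalty."""
    selected = {t for t, bit in zip(terms, mask) if bit}
    if not selected:
        return 0.0
    pos, neg = class_doc_freqs(docs, labels)
    correct = 0
    for tokens, label in zip(docs, labels):
        # Each selected term votes for the class it occurs in more often.
        vote = sum(pos[t] - neg[t] for t in set(tokens) & selected)
        correct += int((1 if vote > 0 else 0) == label)
    return correct / len(docs) - 0.01 * len(selected)


def select_terms(docs, labels, top_k=6, pop_size=8, generations=10, seed=0):
    """Refine the top_k filtered terms with a small genetic algorithm."""
    rng = random.Random(seed)
    candidates = filter_rank(docs, labels)[:top_k]

    def score(mask):
        return fitness(mask, candidates, docs, labels)

    # Start from the full candidate set plus random subsets (bit masks).
    population = [[1] * len(candidates)] + [
        [rng.randint(0, 1) for _ in candidates] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        next_gen = [population[0][:]]  # elitism: keep the best mask
        while len(next_gen) < pop_size:
            # Tournament selection of two parents, then one-point crossover.
            a, b = (max(rng.sample(population, 2), key=score) for _ in range(2))
            cut = rng.randint(1, len(candidates) - 1)
            child = a[:cut] + b[cut:]
            # Bit-flip mutation with probability 0.1 per term.
            child = [bit ^ (rng.random() < 0.1) for bit in child]
            next_gen.append(child)
        population = next_gen
    best = max(population, key=score)
    return [t for t, bit in zip(candidates, best) if bit]


print(select_terms(DOCS, LABELS))
```

In this sketch the filter keeps uninformative terms such as "the" out of the candidate list, and the wrapper then searches for a smaller subset that classifies the toy corpus at least as well as the full candidate set.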

    Data sensitivity detection in chat interactions for privacy protection

    In recent years, there has been exponential growth in the use of virtual spaces, including dialogue systems, that handle personal information. The concept of personal privacy is debated and controversial in the literature, whereas in the technological field it directly influences the degree of reliability perceived in the information system (privacy ‘as trust’). This work aims to protect the right to privacy on personal data (GDPR, 2018) and avoid the loss of sensitive content by exploring the sensitive information detection (SID) task. It is grounded on the following research questions: (RQ1) What does sensitive data mean? How to define a personal sensitive information domain? (RQ2) How to create a state-of-the-art model for SID? (RQ3) How to evaluate the model? RQ1 theoretically investigates the concepts of privacy and the ontological state-of-the-art representation of personal information. The Data Privacy Vocabulary (DPV) is the taxonomic resource taken as an authoritative reference for the definition of the knowledge domain. Concerning RQ2, we investigate two approaches to classify sensitive data: the first, bottom-up, explores automatic learning methods based on transformer networks; the second, top-down, proposes logical-symbolic methods with the construction of privaframe, a knowledge graph of compositional frames representing personal data categories. Both approaches are tested. For the evaluation (RQ3), we create SPeDaC, a sentence-level labeled resource. This can be used as a benchmark or as training data in the SID task, filling the gap of a shared resource in this field. While the approach based on artificial neural networks confirms the validity of the direction adopted in the most recent studies on SID, the logical-symbolic approach emerges as the preferred way for the classification of fine-grained personal data categories, thanks to the semantically grounded, tailored modeling it allows. 
At the same time, the results highlight the strong potential of hybrid architectures in solving automatic tasks.

    Information extraction of +/-effect events to support opinion inference

    Recently, work in NLP was initiated on a type of opinion inference that arises when opinions are expressed toward events which have positive or negative effects on entities, called +/-effect events. The ultimate goal is to develop a fully automatic system capable of recognizing inferred attitudes. To achieve its results, the inference system requires all instances of +/-effect events. Therefore, this dissertation focuses on +/-effect events to support opinion inference. To extract +/-effect events, we first need the list of +/-effect events. Due to significant sense ambiguity, our goal is to develop a sense-level rather than word-level lexicon. To handle sense-level information, WordNet is adopted. We adopt a graph-based method which is seeded by entries culled from FrameNet and then expanded by exploiting semantic relations in WordNet. We show that WordNet relations are useful for the polarity propagation in the graph model. In addition, to maximize the effectiveness of different types of information, we combine a graph-based method using WordNet relations and a standard classifier using gloss information. Further, we provide evidence that the model is an effective way to guide manual annotation to find +/-effect senses that are not in the seed set. To exploit the sense-level lexicons, we have to carry out word sense disambiguation. We present a knowledge-based +/-effect coarse-grained word sense disambiguation method based on selectional preferences via topic models. For more information, we first group senses, and then utilize topic models to model selectional preferences. Our experiments show that selectional preferences are helpful in our work. To support opinion inferences, we need to identify not only +/-effect events but also their affected entities automatically. Thus, we address both +/-effect event detection and affected entity identification. 
Since +/-effect events and their affected entities are closely related, instead of a pipeline system, we present a joint model to extract +/-effect events and their affected entities simultaneously. We demonstrate that our joint model shows promise for extracting +/-effect events and their affected entities jointly.