967 research outputs found

    Enriching very large ontologies using the WWW

    This paper explores the possibility of exploiting text on the World Wide Web to enrich the concepts in existing ontologies. First, a method for retrieving documents from the WWW that are related to a concept is described. These document collections are used 1) to construct topic signatures (lists of topically related words) for each concept in WordNet, and 2) to build hierarchical clusters of the concepts (the word senses) that lexicalize a given word. The overall goal is to overcome two shortcomings of WordNet: the lack of topical links among concepts, and the proliferation of senses. Topic signatures are validated on a word sense disambiguation task with good results, which improve further when the hierarchical clusters are used. Comment: 6 pages
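    As a rough illustration of the topic-signature idea in the abstract above, the sketch below ranks words per sense by how concentrated they are in that sense's documents and then disambiguates a context by signature overlap. The scoring formula, the sense labels, and the toy documents are all assumptions made for illustration; the paper's own weighting and retrieval method are not reproduced here.

```python
from collections import Counter

def topic_signatures(docs_by_sense, top_n=3):
    """Rank words per sense by how much more frequent they are in that
    sense's documents than in the documents of the other senses."""
    counts = {s: Counter(w for d in docs for w in d.lower().split())
              for s, docs in docs_by_sense.items()}
    total = Counter()
    for c in counts.values():
        total.update(c)
    signatures = {}
    for sense, c in counts.items():
        # Assumed score: count in this sense times the share of the word's
        # occurrences that fall in this sense (not the paper's weighting).
        scored = {w: n * n / total[w] for w, n in c.items()}
        ranked = sorted(scored, key=lambda w: (-scored[w], w))
        signatures[sense] = ranked[:top_n]
    return signatures

def disambiguate(context, signatures):
    """Pick the sense whose signature shares the most words with the context."""
    words = set(context.lower().split())
    return max(sorted(signatures), key=lambda s: len(words & set(signatures[s])))

# Hypothetical toy collections standing in for retrieved Web documents.
docs = {
    "bank#finance": ["money loan deposit money", "loan interest money"],
    "bank#river": ["river water shore", "river fishing water"],
}
sigs = topic_signatures(docs)
print(sigs)
print(disambiguate("she took a loan to deposit money", sigs))
```

    A real system would work on far larger collections and would need stop-word handling, which this sketch omits.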

    Little words can make a big difference for text classification


    Mining of Textual Data from the Web for Speech Recognition

    The initial goal of this project was to study language modeling for speech recognition and techniques for acquiring text data from the Web. The text introduces basic speech recognition techniques and describes statistical language models in detail, with particular attention to criteria for evaluating the quality of language models and speech recognition systems. It also covers data mining models and techniques, especially information retrieval. Problems specific to acquiring data from the Web are discussed and, in contrast, the Google search engine is introduced. The project included the design and implementation of a system for retrieving text from the Web, which is described in detail. The main goal of the work, however, was to determine whether data acquired from the Web can improve speech recognition. The described experiments look for the best way to use the retrieved Web data to improve sample language models as well as models deployed in real recognition systems.

    HeAT PATRL: Network-Agnostic Cyber Attack Campaign Triage With Pseudo-Active Transfer Learning

    SOC (Security Operation Center) analysts have historically struggled to keep up with the growing sophistication and daily prevalence of cyber attackers. To aid in the detection of cyber threats, tools such as IDSs (Intrusion Detection Systems) are used to monitor a network. A common problem with these tools, however, is that the volume of logs they generate is extreme and never stops, which increases the chance that an adversary goes unnoticed until it is too late. Typically, the initial evidence of an attack is not an isolated event but part of a larger attack campaign comprising the prior steps the attacker took to reach their final goal. If an analyst can quickly identify each step of an attack campaign, a timely response can be made to limit the impact of the attack or of future attacks. In this work, we ask the question “Given IDS alerts, can we extract the cyber-attack kill chain for an observed threat in a form that is meaningful to the analyst?” We present HeAT-PATRL, an IDS attack campaign extractor that leverages multiple deep machine learning techniques, network-agnostic feature engineering, and the analyst’s knowledge of potential threats to extract cyber-attack campaigns from IDS alert logs. HeAT-PATRL is the culmination of two works. Our first work, “PATRL” (Pseudo-Active Transfer Learning), translates complex alert signature descriptions to the Action-Intent Framework (AIF), a customized set of attack stages. PATRL employs a deep language model trained on cyber security texts (CVEs, C-Sec blogs, etc.) and then uses transfer learning to classify alert descriptions. To further leverage the cyber-context learned by the language model, we develop Pseudo-Active learning to self-label unlabeled alerts for use as additional training data. We show PATRL classifying the entire Suricata database (~70k signatures) with a top-1 accuracy of 87% and a top-3 accuracy of 99% using fewer than 1,200 manually labeled signatures.
    The final work, HeAT (Heated Alert Triage), captures the analyst’s domain knowledge and opinion of how much IDS events contribute to an attack campaign, given a critical IoC (indicator of compromise). We developed network-agnostic features to characterize and generalize attack campaign contributions so that prior triages can aid in identifying attack campaigns for other attack types, new attackers, or different network infrastructures. Using cyber-attack competition data (CPTC) and data from a real SOC operation, we demonstrate that the HeAT process can identify campaigns reflective of the analyst’s thinking while greatly reducing the number of actions the analyst must assess. HeAT has the unique ability to uncover attack campaigns meaningful to the analyst across drastically different network structures while maintaining the important attack campaign relationships defined by the analyst.
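    The self-labeling step mentioned in the abstract above can be pictured as a generic pseudo-labeling loop. The sketch below is only that: a token-count classifier stands in for the deep language model, and the confidence threshold, function names, attack-stage labels, and alert strings are all hypothetical rather than taken from PATRL.

```python
from collections import Counter, defaultdict

def train(examples):
    """Count tokens per label; a stand-in for a real text classifier."""
    model = defaultdict(Counter)
    for text, label in examples:
        model[label].update(text.lower().split())
    return model

def predict(model, text):
    """Return (label, confidence), where confidence is the winning label's
    share of the total token-overlap score across labels."""
    tokens = text.lower().split()
    scores = {label: sum(c[t] for t in tokens) for label, c in model.items()}
    total = sum(scores.values())
    if total == 0:
        return None, 0.0
    best = max(sorted(scores), key=scores.get)
    return best, scores[best] / total

def pseudo_label(labeled, unlabeled, threshold=0.8, rounds=3):
    """Repeatedly retrain, then move confidently predicted items from the
    unlabeled pool into the training set with their predicted labels."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        confident, rest = [], []
        for text in pool:
            label, conf = predict(model, text)
            (confident if conf >= threshold else rest).append((text, label))
        if not confident:
            break
        labeled += confident
        pool = [text for text, _ in rest]
    return train(labeled), labeled, pool

# Hypothetical alert descriptions and stage labels.
seed = [("scan port sweep", "recon"), ("exploit remote shell", "exploit")]
alerts = ["port scan detected", "remote exploit attempt",
          "detected alert", "unknown beacon"]
model, labeled_out, remaining = pseudo_label(seed, alerts)
print(labeled_out)
print(remaining)
```

    In this toy run, "detected alert" only becomes classifiable in the second round, after the first round's pseudo-labels have added the token "detected" to the training data, while "unknown beacon" stays unlabeled.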

    Automatic Hoax Detection System

    Hoaxes are non-malicious viruses. They thrive on deceiving human perception by presenting false claims as truth. Throughout history, hoaxes have been able to influence many people, to the extent of tarnishing the victim's image and credibility. Moreover, wrong and misleading information has always hampered human development. Some hoaxes are crafted so that they can even obtain personal data by convincing victims that the data are required for official purposes. Hoaxes differ from spam in that they masquerade behind the addresses of people related to us, either directly or indirectly. Most of the time they appear as forwarded messages, and sometimes as messages from legitimate companies such as PayPal. Given the threat that this non-malicious content poses, it is important to address the problem seriously by providing an automatic hoax detection system as a solution. Consciousness and awareness are definitely the first step to be taken in this matter.

    Extracting pragmatic content from Email.

    This research presents results concerning the large-scale automatic extraction of pragmatic content from Email by a system that combines a phrase-matching approach to Speech Act detection with the empirical detection of Speech Act patterns in corpora. The results show that most Speech Acts that occur in such a corpus can be recognized by the approach. The investigation is supported by the analysis of a corpus of 1000 Emails. We describe experimental work to sort a substantial sample of Emails by function, that is, by whether they contain a statement of fact, a request for the recipient to do something, or a question. This could be highly desirable functionality for the overburdened Email user, especially if combined with other, more traditional measures of content relevance and with filters based on desirable and undesirable mail sources. We have attempted to apply an IE engine to the extraction of message content, in part by using speech-act detection criteria (e.g. for what it is to be a request for action, under the many possible surface forms that can express that in English), so as to locate the action requested as well as the fact that it is a request. The work may have practical uses, but here we describe it as the challenge of adapting an IE engine to a somewhat different task: message function detection. The major contributions are: Defining Request Speech Act types. The Request Speech Act is one of the most important functions of an utterance to recognise in order to find the gist of a message. The present work has concentrated on three sub-types of Requests: Requests for Information, Action, and Permission. An algorithm that recognises Speech Act patterns found frequently in a domain, together with linguistic rules, makes it possible to recognise most of the examples of Requests in the corpus.
    The results of the evaluation of the system are encouraging and suggest that a fast, user-friendly system is the right approach for avoiding long response times.
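    A phrase-matching approach of the kind the abstract above describes can be sketched with a few regular expressions. The patterns, label names, and sample sentences below are hypothetical placeholders, since the source does not list its actual rules, and they cover only a handful of English surface forms.

```python
import re

# Hypothetical surface patterns, checked in order; not the system's real rules.
PATTERNS = [
    ("request_permission", re.compile(r"\b(may|can|could) i\b", re.I)),
    ("request_action", re.compile(r"\b(please|(could|can|would) you)\b", re.I)),
    ("request_information", re.compile(r"\b(what|when|where|who|how|why)\b.*\?", re.I)),
]

def classify(sentence):
    """Return the first matching request sub-type, falling back to a
    plain question or statement based on the final punctuation."""
    for label, pattern in PATTERNS:
        if pattern.search(sentence):
            return label
    return "question" if sentence.strip().endswith("?") else "statement"

for s in ["Could you send me the report by Friday?",
          "May I leave early tomorrow?",
          "When is the meeting?",
          "The meeting is at noon."]:
    print(classify(s), "<-", s)
```

    Ordering the patterns from most to least specific is a design choice of this sketch: a sentence such as "Could you tell me when it starts?" is treated as a request for action rather than for information.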

    Extracting product development intelligence from web reviews

    Product development managers are constantly challenged to learn what the consumer product experience really is, and to learn specifically how the product is performing in the field. Traditionally, they have used methods such as prototype testing, customer quality monitoring instruments, field testing with sample customers, and independent assessment companies. These methods are limited in that (i) the number of customer evaluations is small, and (ii) the methods are driven by a restrictive structured format. Today the web has created a new source of product intelligence: unsolicited reviews from actual product users, posted across hundreds of websites. The basic hypothesis of this research is that web reviews contain a significant amount of information that is of value to the product design community. This research developed the DFOC (Design - Feature - Opinion - Cause Relationship) method for integrating the evaluation of unstructured web reviews into the structured product design process. The key data element in this research is a Web review and its associated opinion polarity (positive, negative, or neutral). Hundreds of Web reviews are collected to form a review database representing a population of customers. The DFOC method (a) identifies a set of design features that are of interest to the product design community, (b) mines the Web review database to identify which features are of significance to customer evaluations, (c) extracts and estimates the sentiment or opinion of the set of significant features, and (d) identifies the likely cause of the customer opinion. To support the DFOC method we develop an association-rule-based opinion mining procedure for capturing and extracting noun-verb-adjective relationships in the Web review database. This procedure exploits existing opinion mining methods to deconstruct the Web reviews and capture feature-opinion pair polarity.
    A Design Level Information Quality (DLIQ) measure, which evaluates three components, (a) Content, (b) Complexity, and (c) Relevancy, is introduced. DLIQ is indicative of the content, complexity, and relevancy of the design contextual information that can be extracted from an analysis of Web reviews for a given product. Application of this measure confirms the hypothesis that significant levels of quality design information can be efficiently extracted from Web reviews for a wide variety of product types. Application of the DFOC method and the DLIQ measure to a wide variety of product classes (electronic, automobile, service domain) is demonstrated. Specifically, Web review databases for ten products/services are created from real data. Validation occurs by analyzing and presenting the extracted product design information. Examples of extracted features and feature-cause associations for negative-polarity opinions are shown along with the observed significance.
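    Steps (b) and (c) of the DFOC method described above can be pictured with a simple tally of review polarities per design feature. This is a minimal sketch under stated assumptions: plain word matching replaces the association-rule mining of noun-verb-adjective relationships, and the feature list, mention threshold, and reviews are invented for illustration.

```python
from collections import Counter, defaultdict

def feature_polarity(reviews, features):
    """Tally review polarities for each design feature mentioned in the text."""
    tally = defaultdict(Counter)
    for text, polarity in reviews:
        words = set(text.lower().split())
        for feature in features:
            if feature in words:
                tally[feature][polarity] += 1
    return tally

def significant(tally, min_mentions=2):
    """Keep features mentioned in at least min_mentions reviews
    (an assumed stand-in for the method's significance test)."""
    return {f: dict(c) for f, c in tally.items()
            if sum(c.values()) >= min_mentions}

# Hypothetical review database: (review text, opinion polarity).
reviews = [
    ("battery drains fast", "negative"),
    ("great screen but weak battery", "negative"),
    ("screen is bright", "positive"),
    ("love the camera", "positive"),
]
result = significant(feature_polarity(reviews, ["battery", "screen", "camera"]))
print(result)
```

    Here the review-level polarity is attributed to every feature a review mentions, which is cruder than extracting feature-opinion pairs and would mislabel mixed reviews.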

    An Approach for Automatic Generation of on-line Information Systems based on the Integration of Natural Language Processing and Adaptive Hypermedia Techniques

    Unpublished doctoral thesis defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Ingeniería Informática. Date of defense: 29-05-200