PACE: Pattern Accurate Computationally Efficient Bootstrapping for Timely Discovery of Cyber-Security Concepts
Public disclosure of important security information, such as knowledge of
vulnerabilities or exploits, often occurs in blogs, tweets, mailing lists, and
other online sources months before proper classification into structured
databases. In order to facilitate timely discovery of such knowledge, we
propose a novel semi-supervised learning algorithm, PACE, for identifying and
classifying relevant entities in text sources. The main contribution of this
paper is an enhancement of the traditional bootstrapping method for entity
extraction by employing a time-memory trade-off that simultaneously circumvents
a costly corpus search while strengthening pattern nomination, which should
increase accuracy. An implementation in the cyber-security domain is discussed
as well as challenges to Natural Language Processing imposed by the security
domain.
Comment: 6 pages, 3 figures, ieeeTran conference. International Conference on
Machine Learning and Applications 201
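To make the traditional bootstrapping loop mentioned in the abstract concrete, here is a minimal Python sketch. It is not the PACE algorithm itself: the precomputed index, the function names (`build_index`, `bootstrap`), the `min_precision` threshold, and the toy data are all assumptions made for illustration. The index only loosely mirrors the idea of spending memory once so that later rounds do not rescan the corpus.

```python
from collections import defaultdict


def build_index(contexts):
    """Index (pattern, entity) pairs once so later rounds never rescan the corpus."""
    by_pattern = defaultdict(set)
    by_entity = defaultdict(set)
    for pattern, entity in contexts:
        by_pattern[pattern].add(entity)
        by_entity[entity].add(pattern)
    return by_pattern, by_entity


def bootstrap(contexts, seeds, rounds=5, min_precision=0.5):
    """Grow a seed entity set by alternating pattern nomination and extraction.

    min_precision is a hypothetical acceptance rule: a pattern is kept only if
    at least that fraction of the entities it extracts are already known.
    """
    by_pattern, by_entity = build_index(contexts)
    known = set(seeds)
    for _ in range(rounds):
        # Nominate every pattern that co-occurs with a known entity.
        candidates = set()
        for entity in known:
            candidates |= by_entity.get(entity, set())
        # Keep only patterns whose extractions are mostly known entities.
        accepted = [
            p for p in candidates
            if len(by_pattern[p] & known) / len(by_pattern[p]) >= min_precision
        ]
        # Extract everything the accepted patterns match.
        extracted = set()
        for p in accepted:
            extracted |= by_pattern[p]
        if extracted <= known:
            break  # No new entities: the loop has converged.
        known |= extracted
    return known


# Toy (pattern, entity) pairs; invented for illustration only.
CONTEXTS = [
    ("vulnerability in _", "OpenSSL"),
    ("vulnerability in _", "Apache"),
    ("vulnerability in _", "nginx"),
    ("download _", "Apache"),
    ("download _", "music"),
    ("download _", "movies"),
    ("download _", "games"),
]

if __name__ == "__main__":
    print(sorted(bootstrap(CONTEXTS, {"OpenSSL", "Apache"})))
```

In this toy run the noisy pattern `download _` is rejected because only one of its four extractions is a known entity, which is the kind of pattern filtering that keeps a bootstrapping loop from drifting.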
Semi-Supervised Event Extraction with Paraphrase Clusters
Supervised event extraction systems are limited in their accuracy due to the
lack of available training data. We present a method for self-training event
extraction systems by bootstrapping additional training data. This is done by
taking advantage of the occurrence of multiple mentions of the same event
instances across newswire articles from multiple sources. If our system can
make a high-confidence extraction of some mentions in such a cluster, it can
then acquire diverse training examples by adding the other mentions as well.
Our experiments show significant performance improvements on multiple event
extractors over ACE 2005 and TAC-KBP 2015 datasets.
Comment: NAACL 201
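The self-training step described in this abstract can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' system: the `harvest` function, the `toy_extractor` stand-in, the `(label, confidence)` return shape, and the 0.9 threshold are all assumptions. The idea shown is only that a confident extraction on one mention lets the other mentions in the same cluster inherit its label as new training data.

```python
def harvest(clusters, extractor, threshold=0.9):
    """Collect (mention, label) pairs for mentions the extractor missed.

    clusters:  list of lists of sentences assumed to describe the same event.
    extractor: callable returning (label_or_None, confidence) for a sentence.
    """
    harvested = []
    for mentions in clusters:
        scored = [(m,) + tuple(extractor(m)) for m in mentions]
        confident = [s for s in scored if s[1] is not None and s[2] >= threshold]
        if not confident:
            continue  # Nothing trustworthy in this cluster; skip it.
        # Use the label of the single most confident extraction.
        label = max(confident, key=lambda s: s[2])[1]
        confident_texts = {s[0] for s in confident}
        # The remaining mentions become new, more diverse training examples.
        harvested.extend((m, label) for m in mentions if m not in confident_texts)
    return harvested


def toy_extractor(sentence):
    """Hypothetical stand-in for a trained extractor: keyword match only."""
    if "attacked" in sentence:
        return ("Attack", 0.95)
    return (None, 0.0)


# Invented example clusters, for illustration only.
CLUSTERS = [
    ["Rebels attacked the base.", "The base came under fire from rebels."],
    ["Shares fell sharply."],
]

if __name__ == "__main__":
    for text, label in harvest(CLUSTERS, toy_extractor):
        print(label, "<-", text)
```

The second sentence in the first cluster contains no trigger the toy extractor recognizes, so it is exactly the kind of paraphrase that this procedure would add to the training set.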
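Web Page Information Extraction Using a Bootstrapping Approach in Ontology-Based Information Extraction (Ekstraksi Informasi Halaman Web Menggunakan Pendekatan Bootstrapping pada Ontology-Based Information Extraction)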
Abstract: Information extraction is a field of natural language processing that converts unstructured text into structured information. Many kinds of information on the Internet are transmitted in unstructured form through websites, creating the need for technology that analyzes text and finds relevant knowledge in the form of structured information. An example of unstructured information is the main information in the content of web pages. Researchers have developed various approaches to information extraction, using manual or automatic methods, but their performance still needs improvement in terms of extraction accuracy and speed. This research proposes an information extraction approach that combines bootstrapping with Ontology-Based Information Extraction (OBIE). The bootstrapping approach, which uses a small seed of labelled data, minimizes human involvement in the extraction process, while an ontology guides the extraction of classes, properties and instances to provide semantic content for the semantic web. Combining the two approaches is expected to increase both the speed of the extraction process and the accuracy of the extraction results. The case study applies the information extraction system to the “LonelyPlanet” dataset.
Keywords: information extraction, ontology, bootstrapping, Ontology-Based Information Extraction, OBIE, performance
Open Domain Event Extraction Using Neural Latent Variable Models
We consider open domain event extraction, the task of extracting unconstrained
types of events from news clusters. A novel latent variable neural model is
constructed, which is scalable to very large corpora. A dataset is collected and
manually annotated, with task-specific evaluation metrics being designed.
Results show that the proposed unsupervised model gives better performance
compared to the state-of-the-art method for event schema induction.
Comment: accepted by ACL 201
GIANT: Scalable Creation of a Web-scale Ontology
Understanding what online users may pay attention to is key to content
recommendation and search services. These services will benefit from a highly
structured and web-scale ontology of entities, concepts, events, topics and
categories. While existing knowledge bases and taxonomies embody a large volume
of entities and categories, we argue that they fail to discover appropriately
grained concepts, events and topics in the language style of the online
population. Nor do they maintain a logically structured ontology among these notions. In
this paper, we present GIANT, a mechanism to construct a user-centered,
web-scale, structured ontology, containing a large number of natural language
phrases conforming to user attentions at various granularities, mined from a
vast volume of web documents and search click graphs. Various types of edges
are also constructed to maintain a hierarchy in the ontology. We present our
graph-neural-network-based techniques used in GIANT, and evaluate the proposed
methods as compared to a variety of baselines. GIANT has produced the Attention
Ontology, which has been deployed in various Tencent applications involving
over a billion users. Online A/B testing performed on Tencent QQ Browser shows
that Attention Ontology can significantly improve click-through rates in news
recommendation.
Comment: Accepted as full paper by SIGMOD 202
Understanding stories via event sequence modeling
Understanding stories, i.e. sequences of events, is a crucial yet challenging natural language understanding (NLU) problem, which requires dealing with multiple aspects of semantics, including actions, entities and emotions, as well as background knowledge. In this thesis, towards the goal of building an NLU system that can model what has happened in stories and predict what will happen in the future, we contribute on three fronts. First, we investigate the optimal way to model events in text. Second, we study how to model a sequence of events with a balance of generality and specificity. Third, we improve event sequence modeling by jointly modeling semantic information and incorporating background knowledge.
Each of the above three research problems poses both conceptual and computational challenges. For event extraction, we find that Semantic Role Labeling (SRL) signals can serve as good intermediate representations for events, giving us the ability to reliably identify events with minimal supervision. In addition, since it is important to resolve co-referring entities for extracted events, we make improvements to an existing co-reference resolution system. To model event sequences, we start by studying within-document event co-reference (the simplest flow of events), and then extend to two other, more natural event sequences along with discourse phenomena, while abstracting over the specific mentions of predicates and entities. We further identify problems with the basic event sequence models, which fail to capture multiple semantic aspects and background knowledge. We then improve our system by jointly modeling frames, entities and sentiments, yielding joint representations of all these semantic aspects, while at the same time incorporating explicit background knowledge acquired from other corpora as well as human experience. For all tasks, we evaluate the developed algorithms and models on benchmark datasets and achieve better performance compared to other highly competitive methods.