Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition
We describe the CoNLL-2003 shared task: language-independent named entity
recognition. We give background information on the data sets (English and
German) and the evaluation method, present a general overview of the systems
that have taken part in the task and discuss their performance.
Combination Strategies for Semantic Role Labeling
This paper introduces and analyzes a battery of inference models for the
problem of semantic role labeling: one based on constraint satisfaction, and
several strategies that model the inference as a meta-learning problem using
discriminative classifiers. These classifiers are developed with a rich set of
novel features that encode proposition and sentence-level information. To our
knowledge, this is the first work that: (a) performs a thorough analysis of
learning-based inference models for semantic role labeling, and (b) compares
several inference strategies in this context. We evaluate the proposed
inference strategies in the framework of the CoNLL-2005 shared task using only
automatically-generated syntactic information. The extensive experimental
evaluation and analysis indicate that all the proposed inference strategies
are successful (they all outperform the current best results reported in the
CoNLL-2005 evaluation exercise), but each of the proposed approaches has its
advantages and disadvantages. Several important traits of a state-of-the-art
SRL combination strategy emerge from this analysis: (i) individual models
should be combined at the granularity of candidate arguments rather than at the
granularity of complete solutions; (ii) the best combination strategy uses an
inference model based on learning; and (iii) the learning-based inference
benefits from max-margin classifiers and global feedback.
Automatic log parser to support forensic analysis
Event log parsing is the process of splitting and labeling each field in a log entry. Existing approaches commonly use regular expressions or parsing rules to extract the fields. However, such techniques are time-consuming because a forensic investigator needs to define a new rule for each log file type. In this paper, we present a tool, nerlogparser, that parses log entries automatically by modeling log parsing as a named entity recognition problem. We use a deep learning technique, specifically bidirectional long short-term memory networks, as the underlying architecture. Unlike existing tools, nerlogparser is fully automatic, as investigators do not need to define any parsing rules, and it is generic, as a single model parses various types of log files. Experimental results show that nerlogparser achieves superior performance compared with traditional machine learning methods.
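To illustrate what treating log parsing as named entity recognition could look like, here is a minimal sketch of the output side only. It assumes a tagger has already assigned one field label per token. The function name `group_fields`, the label names, and the sample log entry are hypothetical and are not taken from nerlogparser.

```python
from itertools import groupby


def group_fields(tokens, labels):
    """Merge consecutive tokens that share a label into one field."""
    fields = []
    pairs = zip(tokens, labels)
    for label, run in groupby(pairs, key=lambda pair: pair[1]):
        fields.append((label, " ".join(tok for tok, _ in run)))
    return fields


# Hypothetical log entry and per-token labels (a trained tagger would predict these).
tokens = ["Jan", "12", "06:25:01", "host1", "sshd[1234]:", "Accepted", "password"]
labels = ["timestamp", "timestamp", "timestamp", "hostname", "service", "message", "message"]

fields = group_fields(tokens, labels)
for label, text in fields:
    print(f"{label}: {text}")
```

Because the labels come from a sequence model rather than a hand-written pattern, the same grouping step would apply to any log type the model has learned.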
Unsupervised Syntactic Structure Induction in Natural Language Processing
This work addresses unsupervised chunking as a task for syntactic structure induction, which could help us understand the linguistic structures of human languages, especially low-resource languages. In chunking, the words of a sentence are grouped into phrases (also known as chunks) in a non-hierarchical fashion. Understanding text fundamentally requires finding noun and verb phrases, which makes unsupervised chunking an important step in several real-world applications.
In this thesis, we establish several baselines and discuss our three-step knowledge transfer approach for unsupervised chunking. In the first step, we take advantage of state-of-the-art unsupervised parsers, and in the second, we heuristically induce chunk labels from them. We propose a simple heuristic that does not require any supervision of annotated grammar and generates reasonable (albeit noisy) chunks. In the third step, we design a hierarchical recurrent neural network (HRNN) that learns from these pseudo ground-truth labels. The HRNN explicitly models the composition of words into chunks and smooths out the noise from heuristically induced labels. Our HRNN a) maintains both word-level and phrase-level representations and b) explicitly handles the chunking decisions by providing autoregressiveness at each step. Furthermore, we make a case for exploring the self-supervised learning objectives for unsupervised chunking. Finally, we discuss our attempt to transfer knowledge from chunking back to parsing in an unsupervised setting.
We conduct comprehensive experiments on three datasets: CoNLL-2000 (English), CoNLL-2003 (German), and the English Web Treebank. Results show that our HRNN improves upon the teacher model (Compound PCFG) in terms of both phrase F1 and tag accuracy. Our HRNN can smooth out the noise from induced chunk labels and accurately capture the chunking patterns. We evaluate different chunking heuristics and show that maximal left-branching performs the best, reinforcing the fact that left-branching structures indicate closely related words. We also present a rigorous analysis of the HRNN's architecture and discuss the performance of vanilla recurrent neural networks.
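The two metrics named above can be sketched as follows. This is a minimal illustration under an assumption not stated in the abstract: chunks are unlabeled and encoded with B (begin) and I (inside) tags, so every token belongs to some chunk. The helper names `extract_chunks`, `phrase_f1`, and `tag_accuracy` are hypothetical.

```python
def extract_chunks(tags):
    """Return the set of (start, end) spans encoded by B/I tags; end is exclusive."""
    spans, start = set(), None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.add((start, i))
            start = i
        elif tag == "I" and start is None:
            start = i  # tolerate a sequence that begins with I
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def phrase_f1(gold_tags, pred_tags):
    """F1 over exactly matching chunk spans."""
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    matched = len(gold & pred)
    if matched == 0:
        return 0.0
    precision, recall = matched / len(pred), matched / len(gold)
    return 2 * precision * recall / (precision + recall)


def tag_accuracy(gold_tags, pred_tags):
    """Fraction of positions where the predicted tag equals the gold tag."""
    hits = sum(g == p for g, p in zip(gold_tags, pred_tags))
    return hits / len(gold_tags)


gold = ["B", "I", "B", "B", "I", "I"]
pred = ["B", "I", "B", "I", "B", "I"]
f1 = phrase_f1(gold, pred)
acc = tag_accuracy(gold, pred)
print(f"phrase F1 = {f1:.3f}, tag accuracy = {acc:.3f}")
```

In this toy example only one of three chunks matches exactly, while four of six tags agree, which shows why phrase F1 is the stricter of the two measures.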
Named Entity Recognition Using the Web
Proceedings of the Second Workshop on Anaphora Resolution (WAR II).
Editor: Christer Johansson.
NEALT Proceedings Series, Vol. 2 (2008), 83-90.
© 2008 The editors and contributors.
Published by Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt
Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/7129
D4.1. Technologies and tools for corpus creation, normalization and annotation
The objectives of the Corpus Acquisition and Annotation (CAA) subsystem are the acquisition and processing of the monolingual and bilingual language resources (LRs) required in the PANACEA context. The CAA subsystem therefore includes: i) a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the web, ii) a component for cleanup and normalization (CNC) of these data, and iii) a text processing component (TPC), which consists of NLP tools including modules for sentence splitting, POS tagging, lemmatization, parsing and named entity recognition.
Effective Knowledge Graph Aggregation for Malware-Related Cybersecurity Text
Given the rate at which malware spreads in the modern age, it is extremely important that cyber security analysts are able to extract relevant information pertaining to new and active threats in a timely and effective manner. Manually reading through articles and blog posts on the internet is time-consuming and usually involves sifting through much repeated information. Knowledge graphs, a structured representation of relationship information, are an effective way to visually condense information presented in large amounts of unstructured text for human readers. Thus, they are useful for sifting through the abundance of cyber security information that is released through web-based security articles and blogs. This paper presents a pipeline for extracting these relationships using supervised deep learning with recent state-of-the-art transformer-based neural architectures for sequence processing tasks. To this end, a corpus of text from a range of prominent cybersecurity-focused media outlets was manually annotated. An algorithm is also presented that keeps potentially redundant relationships from being added to an existing knowledge graph, using a cosine-similarity metric on pre-trained word embeddings.
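The redundancy check described above might look roughly like the sketch below. It is not the paper's algorithm: the threshold of 0.9, the function names, the placeholder triples, and the toy three-dimensional vectors are all assumptions made for illustration. In practice each relationship vector would be derived from pre-trained word embeddings, for example by averaging the embeddings of its words.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def add_if_new(graph, triple, vector, threshold=0.9):
    """Append (triple, vector) unless it is too similar to a stored relationship."""
    if any(cosine(vector, stored) >= threshold for _, stored in graph):
        return False  # treated as redundant and skipped
    graph.append((triple, vector))
    return True


# Toy stand-ins for embedding-derived relationship vectors (hypothetical values).
graph = []
first = add_if_new(graph, ("MalwareX", "targets", "banks"), [1.0, 0.0, 0.2])
second = add_if_new(graph, ("MalwareX", "attacks", "banks"), [0.9, 0.1, 0.2])
third = add_if_new(graph, ("MalwareX", "written_in", "Go"), [0.0, 1.0, 0.0])
print(first, second, third, len(graph))
```

The threshold trades recall for compactness: a lower value merges more paraphrased relationships but risks discarding ones that are genuinely distinct.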