An attentive neural architecture for joint segmentation and parsing and its application to real estate ads
In processing human produced text using natural language processing (NLP)
techniques, two fundamental subtasks that arise are (i) segmentation of the
plain text into meaningful subunits (e.g., entities), and (ii) dependency
parsing, to establish relations between subunits. In this paper, we develop a
relatively simple and effective neural joint model that performs both
segmentation and dependency parsing together, instead of one after the other as
in most state-of-the-art works. We will focus in particular on the real estate
ad setting, aiming to convert an ad to a structured description, which we name
property tree, comprising the tasks of (1) identifying important entities of a
property (e.g., rooms) from classifieds and (2) structuring them into a tree
format. In this work, we propose a new joint model that is able to tackle the
two tasks simultaneously and construct the property tree by (i) avoiding the
error propagation that would arise from executing the subtasks one after the
other in a pipelined fashion, and (ii) exploiting the interactions between the subtasks.
For this purpose, we perform an extensive comparative study of the pipeline
methods and the new proposed joint model, reporting an improvement of over
three percentage points in the overall edge F1 score of the property tree.
We also propose attention methods to encourage our model to focus on salient
tokens during the construction of the property tree. Thus we experimentally
demonstrate the usefulness of attentive neural architectures for the proposed
joint model, showcasing a further improvement of two percentage points in edge
F1 score for our application.
Comment: Preprint - Accepted for publication in Expert Systems with
Application
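The edge F1 score mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a tree is represented as a set of (parent, child) edges and that an edge counts as correct only on an exact match; the function name `edge_f1` and the toy ad entities are hypothetical and not taken from the paper, whose exact scoring protocol may differ.

```python
def edge_f1(gold_edges, predicted_edges):
    """F1 over (parent, child) edges; an edge is correct only on exact match."""
    gold = set(gold_edges)
    pred = set(predicted_edges)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical property tree for a toy ad: the prediction attaches
# "wardrobe" to the wrong parent, so one of three edges is wrong.
gold = {("property", "bedroom"), ("property", "garden"), ("bedroom", "wardrobe")}
pred = {("property", "bedroom"), ("property", "garden"), ("property", "wardrobe")}
score = edge_f1(gold, pred)
print(f"edge F1 = {score:.3f}")
```

Because a segmentation mistake changes the node on one end of an edge, it also makes that edge wrong, which is one way to see why errors propagate in a pipelined setup.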
A Boundary Determined Neural Model for Relation Extraction
Existing models extract entity relations only after two entity spans have been precisely extracted, which affects the performance of relation extraction. Compared with entity spans, boundaries have a smaller granularity and less ambiguity, so they can be detected precisely and incorporated to learn better representations. Motivated by these strengths, we propose a boundary determined neural (BDN) model, which leverages boundaries as task-related cues to predict relation labels. Our model can predict high-quality relation instances via pairs of boundaries, which relieves the error propagation problem. Moreover, our model fuses boundary-relevant information encoding into the distributed representation to improve its ability to capture semantic and dependency information, which can increase the discriminability of the neural network. Experiments show that our model achieves state-of-the-art performance on the ACE05 corpus.
PDF-Malware Detection: A Survey and Taxonomy of Current Techniques
Portable Document Format, more commonly known as PDF, has become, in the last 20 years, a standard for document exchange and dissemination due to its portable nature and widespread adoption. The flexibility and power of this format are leveraged not only by benign users but also by hackers, who have been working to exploit various types of vulnerabilities, overcome security restrictions, and turn the PDF format into one of the leading vectors for spreading malicious code. Analyzing the content of malicious PDF files to extract the main features that characterize the malware's identity and behavior is a fundamental task for modern threat intelligence platforms that need to learn how to automatically identify new attacks. This paper surveys the existing state of the art in systems for the detection of malicious PDF files and organizes them in a taxonomy that separately considers the approaches used and the data analyzed to detect the presence of malicious code. © Springer International Publishing AG, part of Springer Nature 2018
A two-stage approach for table extraction in invoices
The automated analysis of administrative documents is an important field in
document recognition that has been studied for decades. Invoices are key documents
among the huge volumes of documents available in companies and public
services. Most of the time, invoices contain data presented in tables
that should be clearly identified to extract suitable information. In this
paper, we propose an approach that combines an image processing based
estimation of the shape of the tables with a graph-based representation of the
document, which is used to identify complex tables precisely. We propose an
experimental evaluation using a real-case application.
LLM Based Multi-Agent Generation of Semi-structured Documents from Semantic Templates in the Public Administration Domain
In the digitalization process of recent years, the creation and management of
documents in various domains, particularly in Public Administration (PA), have
become increasingly complex and diverse. This complexity arises from the need
to handle a wide range of document types, often characterized by
semi-structured forms. Semi-structured documents present a fixed set of data
without a fixed format. As a consequence, a template-based solution cannot be
used, as understanding a document requires the extraction of the data
structure. The recent introduction of Large Language Models (LLMs) has enabled
the creation of customized text output satisfying user requests. In this work,
we propose a novel approach that combines LLMs with prompt engineering and
multi-agent systems for generating new documents compliant with a desired
structure. The main contribution of this work concerns replacing the commonly
used manual prompting with a task description generated by semantic retrieval
from an LLM. The potential of this approach is demonstrated through a series of
experiments and case studies, showcasing its effectiveness in real-world PA
scenarios.
Comment: Accepted at HCI INTERNATIONAL 2024 - 26th International Conference on
Human-Computer Interaction. Washington Hilton Hotel, Washington DC, USA, 29
June - 4 July 202
CloudScan - A configuration-free invoice analysis system using recurrent neural networks
We present CloudScan, an invoice analysis system that requires zero
configuration or upfront annotation. In contrast to previous work, CloudScan
does not rely on templates of invoice layout; instead, it learns a single global
model of invoices that naturally generalizes to unseen invoice layouts. The
model is trained using data automatically extracted from end-user provided
feedback. This automatic training data extraction removes the requirement for
users to annotate the data precisely. We describe a recurrent neural network
model that can capture long range context and compare it to a baseline logistic
regression model corresponding to the current CloudScan production system. We
train and evaluate the system on 8 important fields using a dataset of 326,471
invoices. The recurrent neural network and baseline model achieve 0.891 and
0.887 average F1 scores respectively on seen invoice layouts. For the harder
task of unseen invoice layouts, the recurrent neural network model outperforms
the baseline with 0.840 average F1 compared to 0.788.
Comment: Presented at ICDAR 201
Recovering Grammar Relationships for the Java Language Specification
Grammar convergence is a method that helps discover relationships between
different grammars of the same language or different language versions. The key
element of the method is the operational, transformation-based representation
of those relationships. Given input grammars for convergence, they are
transformed until they are structurally equal. The transformations are composed
from primitive operators; properties of these operators and the composed chains
provide quantitative and qualitative insight into the relationships between the
grammars at hand. We describe a refined method for grammar convergence, and we
use it in a major study, where we recover the relationships between all the
grammars that occur in the different versions of the Java Language
Specification (JLS). The relationships are represented as grammar
transformation chains that capture all accidental or intended differences
between the JLS grammars. This method is mechanized and driven by nominal and
structural differences between pairs of grammars that are subject to
asymmetric, binary convergence steps. We present the underlying operator suite
for grammar transformation in detail, and we illustrate the suite with many
examples of transformations on the JLS grammars. We also describe the
extraction effort, which was needed to make the JLS grammars amenable to
automated processing. We include substantial metadata about the convergence
process for the JLS so that the effort becomes reproducible and transparent.
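The idea of transforming one grammar until it is structurally equal to another can be sketched roughly as follows. This is a toy illustration only: the grammar encoding (a dict from nonterminal to a list of alternatives) and the two operators `rename` and `add_alternative` are assumptions made for the example and are not the operator suite described in the paper.

```python
def rename(grammar, old, new):
    """Rename nonterminal `old` to `new` on both sides of every production."""
    def swap(symbol):
        return new if symbol == old else symbol
    return {
        swap(lhs): [tuple(swap(s) for s in alt) for alt in alts]
        for lhs, alts in grammar.items()
    }


def add_alternative(grammar, lhs, alt):
    """Add one more alternative to the productions of `lhs`."""
    result = {k: list(v) for k, v in grammar.items()}
    result.setdefault(lhs, []).append(tuple(alt))
    return result


def apply_chain(grammar, chain):
    """Apply a transformation chain, one primitive operator at a time."""
    for operator, *args in chain:
        grammar = operator(grammar, *args)
    return grammar


def structurally_equal(a, b):
    """Compare grammars while ignoring the order of alternatives."""
    return ({k: set(v) for k, v in a.items()}
            == {k: set(v) for k, v in b.items()})


# Two hypothetical grammar fragments that differ nominally (names)
# and structurally (one extra alternative).
source = {"Stmt": [("if", "Expr", "Stmt")], "Expr": [("id",)]}
target = {
    "Statement": [("if", "Expression", "Statement"),
                  ("while", "Expression", "Statement")],
    "Expression": [("id",)],
}
chain = [
    (rename, "Stmt", "Statement"),
    (rename, "Expr", "Expression"),
    (add_alternative, "Statement", ("while", "Expression", "Statement")),
]
converged = apply_chain(source, chain)
print(structurally_equal(converged, target), len(chain))
```

In this sketch the chain itself is the record of the relationship: its length and the kinds of operators it uses give a crude quantitative and qualitative measure of how far apart the two grammars are.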