An approach to graph-based analysis of textual documents
In this paper a new graph-based model is proposed for the representation of textual documents. Graph structures are obtained from textual documents by making use of the well-known Part-Of-Speech (POS) tagging technique. More specifically, a simple rule-based (re)classifier is used to map each tag onto graph vertices and edges. As a result, a decomposition of textual documents is obtained where tokens are automatically parsed and attached to either a vertex or an edge. It is shown how textual documents can be aggregated through their graph structures and, finally, how vertex-ranking methods can be used to find relevant tokens.
LexRank: Graph-based Lexical Centrality as Salience in Text Summarization
We introduce a stochastic graph-based method for computing relative
importance of textual units for Natural Language Processing. We test the
technique on the problem of Text Summarization (TS). Extractive TS relies on
the concept of sentence salience to identify the most important sentences in a
document or set of documents. Salience is typically defined in terms of the
presence of particular important words or in terms of similarity to a centroid
pseudo-sentence. We consider a new approach, LexRank, for computing sentence
importance based on the concept of eigenvector centrality in a graph
representation of sentences. In this model, a connectivity matrix based on
intra-sentence cosine similarity is used as the adjacency matrix of the graph
representation of sentences. Our system, based on LexRank, ranked in first place
in more than one task in the recent DUC 2004 evaluation. In this paper we
present a detailed analysis of our approach and apply it to a larger data set
including data from earlier DUC evaluations. We discuss several methods to
compute centrality using the similarity graph. The results show that
degree-based methods (including LexRank) outperform both centroid-based methods
and other systems participating in DUC in most of the cases. Furthermore, the
LexRank with threshold method outperforms the other degree-based techniques
including continuous LexRank. We also show that our approach is quite
insensitive to the noise in the data that may result from an imperfect topical
clustering of documents.
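The thresholded variant described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes raw term counts instead of the weighting used in the paper, and the function names, the `threshold` and `damping` defaults, and the whitespace tokenizer are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def lexrank(sentences, threshold=0.1, damping=0.85, iterations=100):
    """Score sentences by a damped random walk on a thresholded similarity graph.

    Assumption: raw term counts and whitespace tokens stand in for the
    weighting and preprocessing of the original method.
    """
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)
    if n == 0:
        return []
    # Binary adjacency: an edge wherever similarity reaches the threshold.
    adj = [
        [1.0 if cosine(bags[i], bags[j]) >= threshold else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # Row-normalise into a transition matrix; empty rows fall back to uniform.
    trans = []
    for row in adj:
        total = sum(row)
        trans.append([v / total for v in row] if total else [1.0 / n] * n)
    # Power iteration towards the stationary distribution.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * trans[j][i] for j in range(n))
            for i in range(n)
        ]
    return scores


sentences = ["cats chase mice", "cats sleep often", "dogs chase balls"]
scores = lexrank(sentences)
best = max(range(len(sentences)), key=scores.__getitem__)
print(sentences[best])
```

In this toy input the first sentence shares a word with each of the other two, so it has the most edges and receives the highest score.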
Neural Graph Matching for Modification Similarity Applied to Electronic Document Comparison
In this paper, we present a novel neural graph matching approach applied to
document comparison. Document comparison is a common task in the legal and
financial industries. In some cases, the most important differences may be the
addition or omission of words, sentences, clauses, or paragraphs. However,
comparison is challenging when the editing process has not been recorded or
traced. Under these uncertainties, we explore the potential of our approach to
approximate an accurate comparison and determine which blocks are edited
versions of which others. First, we apply a document layout analysis that
combines traditional and modern techniques to segment the layout into blocks
of appropriate types. We then recast the task as a problem of layout graph
matching with textual awareness. Graph matching is a long-studied problem with
a broad range of applications. However, unlike previous work that focuses on
visual images or structural layout, we also bring textual features into our
model to adapt it to this domain. Specifically, because the documents are
electronic, we introduce an encoder to handle the visual presentation decoded
from PDF. Additionally, because modifications can make the layout analysis
inconsistent between modified documents, and blocks can be merged and split,
Sinkhorn divergence is adopted in our graph neural approach, which tries to
overcome both issues with many-to-many block matching. We demonstrate this on
two categories of layouts, namely legal agreements and scientific articles,
collected from our real-case datasets.
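As background for the Sinkhorn-based matching mentioned above, the following is a minimal sketch of the classical Sinkhorn-Knopp normalisation, which turns a matrix of block-to-block scores into a soft assignment by alternately normalising rows and columns. It is a related building block, not the Sinkhorn divergence or the model used in the paper; the function name, the `temperature` parameter, and the example scores are hypothetical.

```python
import math


def sinkhorn_normalise(scores, iterations=50, temperature=1.0):
    """Alternately normalise rows and columns of exp(scores / temperature).

    For a square score matrix the result approaches a doubly stochastic
    matrix, which can be read as a soft matching between two sets of blocks.
    """
    k = [[math.exp(s / temperature) for s in row] for row in scores]
    n_cols = len(k[0])
    for _ in range(iterations):
        # Row step: each row sums to one.
        k = [[v / sum(row) for v in row] for row in k]
        # Column step: each column sums to one.
        col_sums = [sum(row[j] for row in k) for j in range(n_cols)]
        k = [[v / col_sums[j] for j, v in enumerate(row)] for row in k]
    return k


# Hypothetical similarity scores between two blocks in each document version.
soft_match = sinkhorn_normalise([[2.0, 0.0], [0.0, 2.0]])
for row in soft_match:
    print([round(v, 3) for v in row])
```

Because every entry stays positive, a block can carry non-trivial weight towards several counterparts, which is one way to picture merged or split blocks.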
Probing the topological properties of complex networks modeling short written texts
In recent years, graph theory has been widely employed to probe several
language properties. More specifically, the so-called word adjacency model has
been proven useful for tackling several practical problems, especially those
relying on textual stylistic analysis. The most common approach to treat texts
as networks has simply considered either large pieces of texts or entire books.
This approach has certainly worked well -- many informative discoveries have
been made this way -- but it raises an uncomfortable question: could there be
important topological patterns in small pieces of texts? To address this
problem, the topological properties of subtexts sampled from entire books were
probed. Statistical analyses performed on a dataset comprising 50 novels
revealed that most of the traditional topological measurements are stable for
short subtexts. When the performance of the authorship recognition task was
analyzed, it was found that a proper sampling yields a discriminability similar
to the one found with full texts. Surprisingly, the support vector machine
classification based on the characterization of short texts outperformed the
one performed with entire books. These findings suggest that a local
topological analysis of large documents might improve their global
characterization. Most importantly, it was verified, as a proof of principle,
that short texts can be analyzed with the methods and concepts of complex
networks. As a consequence, the techniques described here can be extended in a
straightforward fashion to analyze texts as time-varying complex networks.
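A word adjacency network of the kind referred to above can be sketched in a few lines. This is an illustrative simplification, not the study's pipeline: it assumes an undirected graph, whitespace tokenisation, and no stopword removal or lemmatisation, and the helper names are hypothetical.

```python
from collections import defaultdict


def word_adjacency_network(text):
    """Link every pair of consecutive words with an undirected edge."""
    tokens = text.lower().split()
    graph = defaultdict(set)
    for a, b in zip(tokens, tokens[1:]):
        if a != b:  # skip self-loops from immediate repetitions
            graph[a].add(b)
            graph[b].add(a)
    return graph


def average_degree(graph):
    """Mean number of distinct neighbours per word."""
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph) if graph else 0.0


def subtexts(text, size):
    """Split a text into consecutive subtexts of `size` tokens each."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), size)]


graph = word_adjacency_network("the cat saw the dog")
print(sorted(graph["the"]), average_degree(graph))
```

Applying `word_adjacency_network` to each piece returned by `subtexts` gives one small network per sample, whose measurements can then be compared with those of the full text.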
SelfDocSeg: A Self-Supervised vision-based Approach towards Document Segmentation
Document layout analysis is a well-known problem in the document research
community and has been extensively explored, yielding a multitude of solutions
ranging from text mining and recognition to graph-based representation, visual
feature extraction, etc. However, most existing works have ignored the
crucial fact of the scarcity of labeled data. With growing internet
connectivity in personal life, an enormous number of documents has become
available in the public domain, making data annotation a tedious task.
We address this challenge using self-supervision and, unlike the few existing
self-supervised document segmentation approaches which use text mining and
textual labels, we use a completely vision-based approach in pre-training without
any ground-truth label or its derivative. Instead, we generate pseudo-layouts
from the document images to pre-train an image encoder to learn the document
object representation and localization in a self-supervised framework before
fine-tuning it with an object detection model. We show that our pipeline sets a
new benchmark in this context and performs on par with, if not better than, the
existing methods and their supervised counterparts. The code is made publicly
available at: https://github.com/MaitySubhajit/SelfDocSeg
Comment: Accepted at The 17th International Conference on Document Analysis
and Recognition (ICDAR 2023)
Unsupervised Visual and Textual Information Fusion in Multimedia Retrieval - A Graph-based Point of View
Multimedia collections are growing in size and diversity more than ever.
Effective multimedia retrieval systems are thus critical to access these
datasets from the end-user perspective and in a scalable way. We are interested
in repositories of image/text multimedia objects and we study multimodal
information fusion techniques in the context of content based multimedia
information retrieval. We focus on graph based methods, which have been shown to
provide state-of-the-art performance. In particular, we examine two such
methods: cross-media similarities and random walk based scores. From a
theoretical viewpoint, we propose a unifying graph based framework which
encompasses the two aforementioned approaches. Our proposal allows us to
highlight the core features one should consider when using a graph based
technique for the combination of visual and textual information. We compare
cross-media and random walk based results using three different real-world
datasets. From a practical standpoint, our extended empirical analysis allows us
to provide insights and guidelines about the use of graph based methods for
multimodal information fusion in content based multimedia information
retrieval.
Comment: An extended version of the paper: Visual and Textual Information
Fusion in Multimedia Retrieval using Semantic Filtering and Graph based
Methods, by J. Ah-Pine, G. Csurka and S. Clinchant, submitted to ACM
Transactions on Information Systems
Thick 2D Relations for Document Understanding
We use a propositional language of qualitative rectangle relations to detect the reading order from document images. To this end, we define the notion of a document encoding rule and we analyze possible formalisms to express document encoding rules, such as LaTeX and SGML. Document encoding rules expressed in the propositional language of rectangles are used to build a reading order detector for document images. In order to achieve robustness and avoid brittleness when applying the system to real-life document images, the notion of a thick boundary interpretation for a qualitative relation is introduced. The framework is tested on a collection of heterogeneous document images, showing recall rates up to 89%.
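One way to picture a thick boundary is as a tolerance on coordinate comparisons, so that edges that nearly coincide are treated as coinciding. The sketch below is an assumption-laden illustration, not the paper's formalism: the `(x_min, y_min, x_max, y_max)` rectangle layout, the `thickness` parameter, the function names, and the coarse five-way set of horizontal relations are all hypothetical.

```python
def thick_cmp(a, b, thickness):
    """Compare two coordinates, treating values within `thickness` as equal."""
    if abs(a - b) <= thickness:
        return 0
    return -1 if a < b else 1


def horizontal_relation(r1, r2, thickness=0.0):
    """Coarse qualitative relation of r1 to r2 along the x axis.

    Rectangles are (x_min, y_min, x_max, y_max). Only a small subset of
    interval relations is distinguished here.
    """
    right_vs_left = thick_cmp(r1[2], r2[0], thickness)
    if right_vs_left == 0:
        return "meets"
    if right_vs_left < 0:
        return "before"
    left_vs_right = thick_cmp(r2[2], r1[0], thickness)
    if left_vs_right == 0:
        return "met-by"
    if left_vs_right < 0:
        return "after"
    return "overlapping"


column_a = (0, 0, 10, 5)
column_b = (9, 0, 20, 5)  # overlaps column_a by one unit, e.g. scanning noise
print(horizontal_relation(column_a, column_b))
print(horizontal_relation(column_a, column_b, thickness=2))
```

With zero thickness the one-unit overlap changes the relation to `overlapping`; with a thickness of 2 the two blocks are still read as adjacent (`meets`), which is the kind of robustness to imprecise bounding boxes the abstract describes.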