9,000 research outputs found
Indexing of Reading Paths for a Structured Information Retrieval on the Web
International audienceIn this paper, we present a hyperdocument model taking into account the essential aspects of information on the Web: content, composition (logical structure) and nonlinear reading (hypertext structure). We have developed a Structured Information Retrieval System (SIRS) based on this model. Its phases of indexing and querying are based on a “reading paths” point of view of the Web: a Web site is considered as a set of potential reading paths, instead of a set of atomic and flat pages. We have developed an specific algorithm to index the reading paths. We present some experiments aiming at evaluating the interest of our indexing process of reading paths
A Labeled Graph Kernel for Relationship Extraction
In this paper, we propose an approach for Relationship Extraction (RE) based
on labeled graph kernels. The kernel we propose is a particularization of a
random walk kernel that exploits two properties previously studied in the RE
literature: (i) the words between the candidate entities or connecting them in
a syntactic representation are particularly likely to carry information
regarding the relationship; and (ii) combining information from distinct
sources in a kernel may help the RE system make better decisions. We performed
experiments on a dataset of protein-protein interactions and the results show
that our approach obtains effectiveness values that are comparable with the
state-of-the art kernel methods. Moreover, our approach is able to outperform
the state-of-the-art kernels when combined with other kernel methods
Graph Homomorphism Revisited for Graph Matching
In a variety of emerging applications one needs to decide whether a graph
G matches
another
G
p
,
i.e.
, whether
G
has a topological structure similar to that of
G
p
. The traditional notions of graph homomorphism and isomorphism often fall short of capturing the structural similarity in these applications. This paper studies revisions of these notions, providing a full treatment from complexity to algorithms. (1) We propose
p-homomorphism (p
-hom) and 1-1
p
-hom, which extend graph homomorphism and subgraph isomorphism, respectively, by mapping
edges
from one graph to
paths
in another, and by measuring
the similarity of nodes
. (2) We introduce metrics to measure graph similarity, and several optimization problems for
p
-hom and 1-1
p
-hom. (3) We show that the decision problems for
p
-hom and 1-1
p
-hom are NP-complete even for DAGs, and that the optimization problems are approximation-hard. (4) Nevertheless, we provide approximation algorithms with
provable guarantees
on match quality. We experimentally verify the effectiveness of the revised notions and the efficiency of our algorithms in Web site matching, using real-life and synthetic data.
</jats:p
XML documents clustering using a tensor space model
The traditional Vector Space Model (VSM) is not able to represent both the structure and the content of XML documents. This paper introduces a novel method of representing XML documents in a Tensor Space Model (TSM) and then utilizing it for clustering. Empirical analysis shows that the proposed method is scalable for large-sized datasets; as well, the factorized matrices produced from the proposed method help to improve the quality of clusters through the enriched document representation of both structure and content information
Structure-semantics interplay in complex networks and its effects on the predictability of similarity in texts
There are different ways to define similarity for grouping similar texts into
clusters, as the concept of similarity may depend on the purpose of the task.
For instance, in topic extraction similar texts mean those within the same
semantic field, whereas in author recognition stylistic features should be
considered. In this study, we introduce ways to classify texts employing
concepts of complex networks, which may be able to capture syntactic, semantic
and even pragmatic features. The interplay between the various metrics of the
complex networks is analyzed with three applications, namely identification of
machine translation (MT) systems, evaluation of quality of machine translated
texts and authorship recognition. We shall show that topological features of
the networks representing texts can enhance the ability to identify MT systems
in particular cases. For evaluating the quality of MT texts, on the other hand,
high correlation was obtained with methods capable of capturing the semantics.
This was expected because the golden standards used are themselves based on
word co-occurrence. Notwithstanding, the Katz similarity, which involves
semantic and structure in the comparison of texts, achieved the highest
correlation with the NIST measurement, indicating that in some cases the
combination of both approaches can improve the ability to quantify quality in
MT. In authorship recognition, again the topological features were relevant in
some contexts, though for the books and authors analyzed good results were
obtained with semantic features as well. Because hybrid approaches encompassing
semantic and topological features have not been extensively used, we believe
that the methodology proposed here may be useful to enhance text classification
considerably, as it combines well-established strategies
- …