Towards Zero-shot Relation Extraction in Web Mining: A Multimodal Approach with Relative XML Path
The rapid growth of web pages and the increasing complexity of their
structure pose a challenge for web mining models. Web mining models are
required to understand the semi-structured web pages, particularly when little
is known about the subject or template of a new page. Current methods adapt
language models to web mining by embedding the XML source code into the
transformer or encoding the rendered layout with graph neural networks.
However, these approaches do not take into account the relationships between
text nodes within and across pages. In this paper, we propose a new approach,
ReXMiner, for zero-shot relation extraction in web mining. ReXMiner encodes the
shortest relative paths in the Document Object Model (DOM) tree, which provide a more
accurate and efficient signal for key-value pair extraction within a web page.
It also incorporates the popularity of each text node by counting the
occurrences of the same text node across different web pages. We use
contrastive learning to address the issue of sparsity in relation extraction.
Extensive experiments on public benchmarks show that our method, ReXMiner,
outperforms the state-of-the-art baselines in the task of zero-shot relation
extraction in web mining.
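
The two signals named in this abstract, a relative path between text nodes in the DOM tree and a per-node popularity count, can be illustrated with a minimal sketch. This is not the ReXMiner implementation: the function names `relative_path` and `popularity` are hypothetical, nodes are assumed to be identified by absolute XPath strings, and popularity is assumed to mean the number of pages in which a text string appears.

```python
from collections import Counter


def relative_path(xpath_a: str, xpath_b: str) -> tuple[list[str], list[str]]:
    """Return the shortest tree path from node A to node B as (up, down) steps.

    Assumption: both nodes are given as absolute XPaths in the same DOM tree,
    so the path runs through their lowest common ancestor.
    """
    steps_a = xpath_a.strip("/").split("/")
    steps_b = xpath_b.strip("/").split("/")

    # Length of the shared prefix, i.e. the depth of the lowest common ancestor.
    common = 0
    for step_a, step_b in zip(steps_a, steps_b):
        if step_a != step_b:
            break
        common += 1

    up = list(reversed(steps_a[common:]))  # steps left behind climbing from A
    down = steps_b[common:]                # steps taken descending to B
    return up, down


def popularity(pages: list[list[str]]) -> Counter:
    """Count, for each text string, the number of pages it appears in.

    Assumption: each page is given as a list of its text-node strings.
    """
    counts: Counter = Counter()
    for texts in pages:
        counts.update(set(texts))  # count each string at most once per page
    return counts


key_node = "/html/body/div[1]/table/tr[2]/td[1]"
value_node = "/html/body/div[1]/table/tr[2]/td[2]"
up_steps, down_steps = relative_path(key_node, value_node)

page_texts = [["Director", "Nolan"], ["Director", "Director", "Scott"]]
node_popularity = popularity(page_texts)
```

In this sketch a key and its value in the same table row are separated by one step up and one step down, while their absolute XPaths differ only in the final step. A string such as "Director" that recurs across pages receives a higher count than a page-specific value.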
History-based visual mining of semi-structured audio and text
Accessing specific or salient parts of multimedia recordings remains a challenge as there is no obvious way of structuring and representing a mix of space-based and time-based media. A number of approaches have been proposed which usually involve translating the continuous component of the multimedia recording into a space-based representation, such as text from audio through automatic speech recognition and images from video (keyframes). In this paper, we present a novel technique which defines retrieval units in terms of a log of actions performed on space-based artefacts, and exploits timing properties and extended concurrency to construct a visual presentation of text and speech data. This technique can be easily adapted to any mix of space-based artefacts and continuous media.
Experiments in Clustering Homogeneous XML Documents to Validate an Existing Typology
This paper presents some experiments in clustering homogeneous XML documents
to validate an existing classification or more generally an organisational
structure. Our approach integrates techniques for extracting knowledge from
documents with unsupervised classification (clustering) of documents. We focus
on the feature selection used for representing documents and its impact on the
emerging classification. We mix the selection of structured features with fine
textual selection based on syntactic characteristics. We illustrate and evaluate
this approach with a collection of Inria activity reports for the year 2003.
The objective is to cluster projects into larger groups (Themes), based on the
keywords or different chapters of these activity reports. We then compare the
results of clustering using different feature selections, with the official
theme structure used by Inria.
Comment: (postprint); This version corrects a couple of errors in authors'
names in the bibliography