A Reference Architecture to Devise Web Information Extractors
The Web is the largest repository of human-friendly information. Unfortunately, web information is embedded in formatting tags and is surrounded by irrelevant information. Researchers are working on information extractors that allow transforming this information into
structured data for its later integration into automated processes. Devising a new information extraction technique requires an array of tasks that are specific to this technique, as well as many tasks that are actually common across all techniques. The lack of a reference architectural proposal in the literature to guide software engineers in the design and implementation of information extractors leads to little reuse, and the focus is usually blurred by irrelevant details. In this paper, we present a reference architecture to design and implement rule learners for information extractors. We have implemented a software framework to support our architecture, and we have validated it by means of four case studies and a number of experiments that prove that our proposal helps reduce development costs significantly.
Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-
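The factoring this abstract describes, with common tasks provided by a framework and technique-specific tasks supplied by each extractor, can be sketched as a template-method skeleton (class and method names here are illustrative, not the framework's actual API):

```python
from abc import ABC, abstractmethod

class RuleLearner(ABC):
    """Skeleton of a rule-learner architecture: the common pipeline
    (tokenise, learn, extract) is fixed, and only the technique-specific
    learning step is supplied by each concrete extractor."""

    def run(self, labelled_pages, new_page):
        examples = [self.tokenise(p) for p in labelled_pages]
        rule = self.learn(examples)
        return self.extract(rule, self.tokenise(new_page))

    def tokenise(self, page):
        # Common task shared by all techniques: naive whitespace tokenisation.
        return page.split()

    @abstractmethod
    def learn(self, examples):
        # Technique-specific task: induce an extraction rule from examples.
        ...

    def extract(self, rule, tokens):
        # Common task: apply the learnt rule to the tokens of a new page.
        return [t for t in tokens if rule(t)]

class DigitLearner(RuleLearner):
    """Toy concrete learner: a real one would induce its rule from the
    labelled examples instead of returning a fixed predicate."""
    def learn(self, examples):
        return lambda token: token.isdigit()

print(DigitLearner().run(["a 1 b"], "year 1942 month 7"))  # ['1942', '7']
```

The point of the split is that a new technique only overrides `learn`, which is the kind of reuse the reference architecture argues for.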
From Data Fusion to Knowledge Fusion
The task of data fusion is to identify the true values of data items (e.g., the true date of birth for Tom Cruise) among multiple observed values drawn from different sources (e.g., Web sites) of varying (and unknown) reliability. A recent survey [LDL+12] has provided a detailed comparison of various fusion methods on Deep Web data. In this paper, we study the applicability and limitations of different fusion techniques on a more challenging problem: knowledge fusion. Knowledge fusion identifies true subject-predicate-object triples extracted by multiple information extractors from multiple information sources. These extractors perform the tasks of entity linkage and schema alignment, thus introducing an additional source of noise that is quite different from that traditionally considered in the data fusion literature, which only focuses on factual errors in the original sources. We adapt state-of-the-art data fusion techniques and apply them to a knowledge base with 1.6B unique knowledge triples extracted by 12 extractors from over 1B Web pages, which is three orders of magnitude larger than the data sets used in previous data fusion papers. We show great promise of the data fusion approaches in solving the knowledge fusion problem, and suggest interesting research directions through a detailed error analysis of the methods.
Comment: VLDB'201
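The simplest fusion strategy such surveys compare is majority voting over sources; a minimal, self-contained sketch (the data items and sources below are made up for illustration):

```python
from collections import Counter

def fuse(claims):
    """Majority-vote data fusion: pick the value asserted by most sources.

    `claims` maps a data item (here a subject-predicate pair) to a list of
    (source, value) observations; every source gets one equal-weight vote.
    More refined methods weight each vote by an estimate of the source's
    reliability.
    """
    fused = {}
    for item, observations in claims.items():
        votes = Counter(value for _source, value in observations)
        fused[item] = votes.most_common(1)[0][0]
    return fused

claims = {
    ("Tom Cruise", "date_of_birth"): [
        ("site_a", "1962-07-03"),
        ("site_b", "1962-07-03"),
        ("site_c", "1963-07-03"),  # a factual error in one source
    ],
}
print(fuse(claims))  # {('Tom Cruise', 'date_of_birth'): '1962-07-03'}
```

Knowledge fusion starts from the same idea but must additionally cope with extractor noise, since each "source" is now an (extractor, page) pair rather than the page alone.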
Integrating Deep-Web Information Sources
Deep-web information sources are difficult to integrate into automated
business processes if they only provide a search form. A wrapping agent is a piece
of software that allows a developer to query such information sources without
worrying about the details of interacting with such forms. Our goal is to help software engineers construct wrapping agents that interpret queries written in high-level structured languages. We believe this will help reduce integration costs because it relieves developers of the burden of transforming their queries into low-level interactions in an ad-hoc manner. In this paper, we report on our reference framework, delve into the related work, and highlight current research challenges. This is intended to help guide future research efforts in this area.
Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-
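The query translation this abstract argues for can be illustrated with a toy planner that maps a high-level structured query onto low-level form interactions (the field names and step representation are assumptions for illustration, not the framework's actual interface):

```python
def plan_form_interactions(query, form_fields):
    """Translate a high-level query into low-level form-filling steps.

    `query` maps attributes to the values the user is searching for;
    `form_fields` maps the source form's input names to the attributes
    they accept. Attributes the form does not expose are simply skipped.
    """
    steps = []
    for field_name, attribute in form_fields.items():
        if attribute in query:
            steps.append(("fill", field_name, query[attribute]))
    steps.append(("submit",))
    return steps

steps = plan_form_interactions(
    {"title": "Dune", "year": 1965},
    {"q_title": "title", "q_year": "year", "q_author": "author"},
)
print(steps)  # [('fill', 'q_title', 'Dune'), ('fill', 'q_year', 1965), ('submit',)]
```

A real wrapping agent would of course also handle pagination, session state, and result extraction; the sketch only shows the query-to-form mapping that spares developers the ad-hoc interaction code.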
An Unsupervised Technique to Extract Information from Semi-structured Web Pages
We propose a technique that takes two or more web pages generated by the same server-side template and tries to learn a regular expression that represents the template and helps extract relevant information from similar pages. Our experimental results on real-world web sites demonstrate that our technique outperforms others in terms of both effectiveness and efficiency and is not affected by HTML errors.
Funding: Ministerio de Ciencia y Tecnología TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-
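One way to illustrate the idea, though not the paper's actual algorithm, is to align the token sequences of two template-generated pages: tokens shared by both become literals of the expression, and each varying gap becomes a capture group:

```python
import re
from difflib import SequenceMatcher

def learn_template(page_a, page_b):
    """Derive a regular expression from two pages produced by the same
    server-side template. Illustrative sketch only: shared tokens become
    literals, varying regions become lazy capture groups."""
    a, b = page_a.split(), page_b.split()
    sm = SequenceMatcher(a=a, b=b, autojunk=False)
    parts, pos_a, pos_b = [], 0, 0
    for start_a, start_b, size in sm.get_matching_blocks():
        if start_a > pos_a or start_b > pos_b:
            parts.append(r"(.+?)")  # variable region -> capture group
        parts.extend(re.escape(tok) for tok in a[start_a:start_a + size])
        pos_a, pos_b = start_a + size, start_b + size
    return r"\s+".join(parts)

p1 = "<b> Title : </b> Casablanca <i> Year : </i> 1942"
p2 = "<b> Title : </b> Vertigo <i> Year : </i> 1958"
pattern = learn_template(p1, p2)
match = re.fullmatch(pattern, "<b> Title : </b> Rebecca <i> Year : </i> 1940")
print(match.groups())  # ('Rebecca', '1940')
```

The learnt expression then extracts the variable data (the record fields) from any further page built from the same template, which is the behaviour the technique above generalises and makes robust to HTML errors.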
Geometric Deep Learning for Autonomous Driving: Unlocking the Power of Graph Neural Networks With CommonRoad-Geometric
Heterogeneous graphs offer powerful data representations for traffic, given
their ability to model the complex interaction effects among a varying number
of traffic participants and the underlying road infrastructure. With the recent
advent of graph neural networks (GNNs) as the accompanying deep learning
framework, the graph structure can be efficiently leveraged for various machine
learning applications such as trajectory prediction. As a first of its kind,
our proposed Python framework offers an easy-to-use and fully customizable data
processing pipeline to extract standardized graph datasets from traffic
scenarios. Providing a platform for GNN-based autonomous driving research, it
improves comparability between approaches and allows researchers to focus on
model implementation instead of dataset curation.
Comment: Presented at IV 202
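A heterogeneous traffic graph of the kind described above can be sketched with plain data structures (the node types, features, and relation names are illustrative; the framework's real output is a standardized graph dataset consumable by GNN libraries):

```python
from dataclasses import dataclass, field

@dataclass
class HeteroGraph:
    """A minimal heterogeneous graph: feature vectors per node type, and
    edge lists keyed by (source type, relation, target type) triples."""
    nodes: dict = field(default_factory=dict)
    edges: dict = field(default_factory=dict)

g = HeteroGraph()
# Two vehicles (x, y, speed) and one lane segment (length, curvature).
g.nodes["vehicle"] = [[0.0, 0.0, 12.5], [8.0, 0.5, 11.0]]
g.nodes["lanelet"] = [[50.0, 0.01]]
# Typed relations: vehicle-vehicle interaction, and vehicles mapped onto
# the underlying road infrastructure.
g.edges[("vehicle", "follows", "vehicle")] = [(1, 0)]
g.edges[("vehicle", "on", "lanelet")] = [(0, 0), (1, 0)]

print(len(g.nodes["vehicle"]), len(g.edges))  # 2 2
```

Typed nodes and relations are exactly what lets a heterogeneous GNN learn different message functions for vehicle-vehicle interaction versus vehicle-road structure.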
Review-Based Domain Disentanglement without Duplicate Users or Contexts for Cross-Domain Recommendation
A cross-domain recommendation has shown promising results in solving
data-sparsity and cold-start problems. Despite such progress, existing methods
focus on domain-shareable information (overlapped users or same contexts) for a
knowledge transfer, and they fail to generalize well without such requirements.
To deal with these problems, we suggest utilizing review texts that are general
to most e-commerce systems. Our model (named SER) uses three text analysis
modules, guided by a single domain discriminator for disentangled
representation learning. Here, we suggest a novel optimization strategy that
can enhance the quality of domain disentanglement and also suppresses
detrimental information from the source domain. Also, we extend the encoding
network from a single domain to multiple domains, which has proven to be powerful for
review-based recommender systems. Extensive experiments and ablation studies
demonstrate that our method is efficient, robust, and scalable compared to
state-of-the-art single- and cross-domain recommendation methods.
Multimodal and Explainable Internet Meme Classification
Warning: this paper contains content that may be offensive or upsetting. In
the current context where online platforms have been effectively weaponized in
a variety of geo-political events and social issues, Internet memes make fair
content moderation at scale even more difficult. Existing work on meme
classification and tracking has focused on black-box methods that do not
explicitly consider the semantics of the memes or the context of their
creation. In this paper, we pursue a modular and explainable architecture for
Internet meme understanding. We design and implement multimodal classification
methods that perform example- and prototype-based reasoning over training
cases, while leveraging both textual and visual SOTA models to represent the
individual cases. We study the relevance of our modular and explainable models
in detecting harmful memes on two existing tasks: Hate Speech Detection and
Misogyny Classification. We compare the performance between example- and
prototype-based methods, and between text, vision, and multimodal models,
across different categories of harmfulness (e.g., stereotype and
objectification). We devise a user-friendly interface that facilitates the
comparative analysis of examples retrieved by all of our models for any given
meme, informing the community about the strengths and limitations of these
explainable methods.
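Example-based reasoning of the kind described can be sketched as nearest-neighbour retrieval over precomputed multimodal embeddings (the embeddings and labels below are toy values; real ones would come from the text and vision models):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify_by_example(query_emb, training_cases, k=3):
    """Label a meme by its k nearest training cases in embedding space.

    Returns the majority label together with the retrieved cases, so the
    prediction can be explained by pointing at concrete examples.
    """
    ranked = sorted(training_cases,
                    key=lambda c: cosine(query_emb, c["emb"]), reverse=True)
    top = ranked[:k]
    labels = [c["label"] for c in top]
    majority = max(set(labels), key=labels.count)
    return majority, top

cases = [
    {"emb": [1.0, 0.0], "label": "harmful"},
    {"emb": [0.9, 0.1], "label": "harmful"},
    {"emb": [0.0, 1.0], "label": "benign"},
]
label, support = classify_by_example([0.95, 0.05], cases, k=3)
print(label)  # harmful
```

Prototype-based methods follow the same retrieval pattern but compare against a few learnt prototype vectors per class instead of all training cases.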
MemeSequencer: Sparse Matching for Embedding Image Macros
[Proceeding of]: The Web Conference 2018 (WWW2018), April 23-27, 2018, Lyon, France
The analysis of the creation, mutation, and propagation of social media content on the Internet is an essential problem in computational social science, affecting areas ranging from marketing to political mobilization. A first step towards understanding the evolution of images online is the analysis of rapidly modifying and propagating memetic imagery or "memes". However, a pitfall in proceeding with such an investigation is the current inability to produce a robust semantic space for such imagery, one capable of understanding differences in Image Macros. In this study, we provide a first step in the systematic study of image evolution on the Internet by proposing an algorithm based on sparse representations and deep learning to decouple various types of content in such images and produce a rich semantic embedding. We demonstrate the benefits of our approach on a variety of tasks pertaining to memes and Image Macros, such as image clustering, image retrieval, topic prediction and virality prediction, surpassing existing methods on each. In addition to its utility on quantitative tasks, our method opens up the possibility of obtaining the first large-scale understanding of the evolution and propagation of memetic imagery.
Temporal Sentence Grounding in Videos: A Survey and Future Directions
Temporal sentence grounding in videos (TSGV), a.k.a. natural language video
localization (NLVL) or video moment retrieval (VMR), aims to retrieve a
temporal moment that semantically corresponds to a language query from an
untrimmed video. Connecting computer vision and natural language, TSGV has
drawn significant attention from researchers in both communities. This survey
attempts to provide a summary of fundamental concepts in TSGV and current
research status, as well as future research directions. As the background, we
present a common structure of functional components in TSGV, in a tutorial
style: from feature extraction from raw video and language query, to answer
prediction of the target moment. Then we review the techniques for multimodal
understanding and interaction, which is the key focus of TSGV for effective
alignment between the two modalities. We construct a taxonomy of TSGV
techniques and elaborate the methods in different categories with their
strengths and weaknesses. Lastly, we discuss issues with the current TSGV
research and share our insights about promising research directions.
Comment: 29 pages, 32 figures, 9 tables
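The answer prediction such surveys evaluate is conventionally scored with temporal intersection-over-union between the predicted and ground-truth moments, reported as recall at IoU thresholds such as 0.5 and 0.7:

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two moments, each a (start, end) pair in seconds.

    Intersection is the overlap of the two intervals (clamped at zero);
    union is the total time covered by either interval.
    """
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction of 5s-15s against a ground-truth moment of 10s-20s:
# 5s of overlap over 15s covered in total.
print(temporal_iou((5.0, 15.0), (10.0, 20.0)))  # 0.3333333333333333
```

Under a 0.5 threshold this prediction would count as a miss, which is why localisation quality rather than mere overlap drives TSGV benchmarks.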
Extracting keywords from tweets
In recent years, an enormous amount of information has become available on the Internet. Social networks are among the biggest contributors to this growth in data volume. Twitter, in particular, has paved the way, as a social platform, for people and organizations to interact with one another, generating large volumes of data from which useful information can be extracted. Such a quantity of data can prove important, for example, if and when several individuals report symptoms of an illness at the same time and in the same place. Automatically processing such a volume of information and deriving useful knowledge from it is, however, an impossible task for any human being. Keyword extractors emerge in this context as a valuable tool that aims to ease this work by quickly providing access to a set of terms that characterize a document.
In this work, we try to contribute to a better understanding of this problem by evaluating the effectiveness of YAKE! (an unsupervised keyword extraction algorithm) on a collection of tweets, a type of text characterized not only by its reduced length but also by its unstructured nature. Although keyword extractors have been widely applied to generic texts such as reports and articles, their applicability to tweets is scarce, and no dataset had been formally made available so far. In this work, to overcome that problem, we chose to develop and make available a new data collection, an important contribution for the scientific community to foster new solutions in this domain. KWTweet was annotated by 15 annotators, resulting in 7736 annotated tweets. Based on this information, we could then evaluate the effectiveness of YAKE! against 9 unsupervised keyword extraction baselines (TextRank, KP-Miner, SingleRank, PositionRank, TopicPageRank, MultipartiteRank, TopicRank, Rake and TF.IDF). The results obtained show that YAKE! performs better than its competitors, thus proving its effectiveness on this type of text. Finally, we provide a demo that showcases how YAKE! works: in this web platform, users can search by user or hashtag and obtain the most relevant keywords through a word cloud.
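A minimal unsupervised keyword extractor, an illustrative frequency-and-position baseline rather than YAKE!'s actual statistical features, can be sketched as follows:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "is", "rt"}

def extract_keywords(text, top_n=5):
    """Score terms by frequency, slightly boosting terms that appear
    earlier in the text, and return the top-scoring ones. The `[#\\w']+`
    pattern keeps hashtags, which matter in tweets."""
    words = [w for w in re.findall(r"[#\w']+", text.lower())
             if w not in STOPWORDS]
    freq = Counter(words)
    first_pos = {}
    for i, w in enumerate(words):
        first_pos.setdefault(w, i)
    score = {w: freq[w] * (1.0 + 1.0 / (1 + first_pos[w])) for w in freq}
    ranked = sorted(score.items(), key=lambda kv: kv[1], reverse=True)
    return [w for w, _ in ranked[:top_n]]

tweet = "flu flu outbreak reported reported reported in porto"
print(extract_keywords(tweet, top_n=3))  # ['flu', 'reported', 'outbreak']
```

The sketch shows why short, unstructured texts are hard: with so few tokens, frequency statistics are noisy, which is exactly the setting the evaluation above probes.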