AMBER: Automatic Supervision for Multi-Attribute Extraction
The extraction of multi-attribute objects from the deep web is the bridge
between the unstructured web and structured data. Existing approaches either
induce wrappers from a set of human-annotated pages or leverage repeated
structures on the page without supervision. What the former lack in automation,
the latter lack in accuracy. Thus accurate, automatic multi-attribute object
extraction has remained an open challenge.
AMBER overcomes both limitations through mutual supervision between the
repeated structure and automatically produced annotations. Previous approaches
based on automatic annotations have suffered from low quality due to the
inherent noise in the annotations and have attempted to compensate by exploring
multiple candidate wrappers. In contrast, AMBER compensates for this noise by
integrating repeated structure analysis with annotation-based induction: The
repeated structure limits the search space for wrapper induction, and
conversely, annotations allow the repeated structure analysis to distinguish
noise from relevant data. Both low recall and low precision in the annotations
are mitigated to achieve almost human-quality (more than 98 percent)
multi-attribute object extraction.
To achieve this accuracy, AMBER needs to be trained once for an entire
domain. AMBER bootstraps its training from a small, possibly noisy set of
attribute instances and a few unannotated sites of the domain.
Multimodal Attribute Extraction
The broad goal of information extraction is to derive structured information
from unstructured data. However, most existing methods focus solely on text,
ignoring other types of unstructured data such as images, video and audio which
comprise an increasing portion of the information on the web. To address this
shortcoming, we propose the task of multimodal attribute extraction. Given a
collection of unstructured and semi-structured contextual information about an
entity (such as a textual description or visual depictions), the task is to
extract the entity's underlying attributes. In this paper, we provide a dataset
containing mixed-media data for over 2 million product items along with 7
million attribute-value pairs describing the items which can be used to train
attribute extractors in a weakly supervised manner. We provide a variety of
baselines which demonstrate the relative effectiveness of the individual modes
of information towards solving the task, as well as study human performance.
Comment: AKBC 2017 Workshop Paper
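The baselines in the abstract score each modality separately; one common way to combine such per-modality evidence is late fusion. The sketch below is a hypothetical simplification, not one of the paper's models: the modality names, weights, and attribute values are all illustrative.

```python
# Hypothetical late-fusion sketch for multimodal attribute extraction:
# each modality independently scores candidate attribute values, and
# the scores are combined with fixed per-modality weights.
def fuse_scores(modality_scores, weights):
    """modality_scores: dict modality -> {value: score};
    weights: dict modality -> float. Returns the best-scoring value."""
    combined = {}
    for modality, scores in modality_scores.items():
        w = weights.get(modality, 0.0)
        for value, s in scores.items():
            combined[value] = combined.get(value, 0.0) + w * s
    return max(combined, key=combined.get)

# Toy example: text strongly suggests "cotton", the image weakly "wool".
scores = {
    "text":  {"cotton": 0.9, "wool": 0.1},
    "image": {"cotton": 0.4, "wool": 0.6},
}
print(fuse_scores(scores, {"text": 0.7, "image": 0.3}))  # cotton
```

In a weakly supervised setting, the per-modality weights themselves would typically be learned rather than fixed by hand.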
Subsurface structure analysis using computational interpretation and learning: A visual signal processing perspective
Understanding Earth's subsurface structures has been and continues to be an
essential component of various applications such as environmental monitoring,
carbon sequestration, and oil and gas exploration. By viewing the seismic
volumes that are generated through the processing of recorded seismic traces,
researchers have been able to apply advanced image processing and
computer vision algorithms to effectively analyze and understand Earth's
subsurface structures. In this paper, we first summarize the recent advances
in this direction that relied heavily on the fields of image processing and
computer vision. Second, we discuss the challenges in seismic interpretation
and provide insights and some directions to address such challenges using
emerging machine learning algorithms.
Morpho-syntactic Lexicon Generation Using Graph-based Semi-supervised Learning
Morpho-syntactic lexicons provide information about the morphological and
syntactic roles of words in a language. Such lexicons are not available for all
languages and even when available, their coverage can be limited. We present a
graph-based semi-supervised learning method that uses the morphological,
syntactic and semantic relations between words to automatically construct wide
coverage lexicons from small seed sets. Our method is language-independent, and
we show that we can expand a 1000 word seed lexicon to more than 100 times its
size with high quality for 11 languages. In addition, the automatically created
lexicons provide features that improve performance in two downstream tasks:
morphological tagging and dependency parsing.
Comment: Transactions of the Association for Computational Linguistics (TACL) 201
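The graph-propagation idea behind lexicon expansion can be sketched as follows. This is a deliberately naive illustration, not the paper's algorithm: real morpho-syntactic propagation transforms labels according to the relation type on each edge, whereas this toy version simply copies neighbor labels from a fixed seed set.

```python
# Naive sketch of graph-based semi-supervised lexicon expansion:
# attribute labels are propagated from a small seed lexicon to
# unlabeled words over a graph of inter-word relations.
from collections import defaultdict

def propagate(edges, seed, n_iters=10):
    """edges: list of (word, word) relation pairs;
    seed: dict word -> set of attribute labels (kept fixed)."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    labels = {w: set(attrs) for w, attrs in seed.items()}
    for _ in range(n_iters):
        updated = {}
        for w in neighbors:
            if w in seed:  # never overwrite seed entries
                continue
            agg = set()
            for n in neighbors[w]:
                agg |= labels.get(n, set())
            if agg:
                updated[w] = agg
        labels.update(updated)
    return labels

# Toy example: expand a two-word seed over suffix-sharing edges.
edges = [("walks", "walked"), ("walked", "talked"), ("talks", "talked")]
seed = {"walks": {"POS:VERB", "TENSE:PRES"},
        "walked": {"POS:VERB", "TENSE:PAST"}}
expanded = propagate(edges, seed)
print(expanded["talked"])  # inherits labels from its neighbors
```

The toy version happily propagates `TENSE:PAST` to "talks"; the edge-aware propagation described in the abstract exists precisely to avoid this kind of error.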
BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
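One way RSS feeds can supervise HTML extraction, in the spirit of the approach above, is to use the feed summary as a weak label for locating the post-content region. The sketch below is an assumption-laden simplification (token-overlap matching over pre-segmented text blocks), not the report's method.

```python
# Illustrative sketch: use RSS summary text as weak supervision to pick
# the blog-post content region among candidate HTML text blocks, by
# maximum Jaccard token overlap with the feed summary.
def best_region(candidate_texts, rss_summary):
    """candidate_texts: text of each candidate HTML block;
    rss_summary: summary string from the blog's RSS feed."""
    summary = set(rss_summary.lower().split())
    def score(text):
        tokens = set(text.lower().split())
        return len(tokens & summary) / max(len(tokens | summary), 1)
    return max(candidate_texts, key=score)

# Toy example: a navigation menu vs. the actual post body.
blocks = ["Home About Contact",
          "Today I baked sourdough bread at home"]
summary = "I baked sourdough bread today"
print(best_region(blocks, summary))
```

Once the content region is identified this way across a site's pages, its position in the DOM can be generalized into an extraction rule without any manual annotation.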
Towards Better Summarizing Bug Reports with Crowdsourcing Elicited Attributes
Recent years have witnessed the growing demands for resolving numerous bug
reports in software maintenance. Aiming to reduce the time testers/developers
take in perusing bug reports, the task of bug report summarization has
attracted a lot of research efforts in the literature. However, no systematic
analysis has been conducted on attribute construction which heavily impacts the
performance of supervised algorithms for bug report summarization. In this
study, we first conduct a survey to reveal the existing methods for attribute
construction in mining software repositories. Then, we propose a new method
named Crowd-Attribute to infer new effective attributes from the crowd-generated
data in crowdsourcing and develop a new tool named Crowdsourcing Software
Engineering Platform to facilitate this method. With Crowd-Attribute, we
successfully construct 11 new attributes and propose a new supervised algorithm
named Logistic Regression with Crowdsourced Attributes (LRCA). To evaluate the
effectiveness of LRCA, we build a series of large scale data sets with 105,177
bug reports. Experiments over both the public data set SDS with 36 manually
annotated bug reports and new large-scale data sets demonstrate that LRCA can
consistently outperform the state-of-the-art algorithms for bug report
summarization.
Comment: Accepted by IEEE Transactions on Reliability
Visual Graph Mining
In this study, we formulate the concept of "mining maximal-size frequent
subgraphs" in the challenging domain of visual data (images and videos). In
general, visual knowledge can usually be modeled as attributed relational
graphs (ARGs) with local attributes representing local parts and pairwise
attributes describing the spatial relationship between parts. Thus, from a
practical perspective, such mining of maximal-size subgraphs can be regarded as
a general platform for discovering and modeling the common objects within
cluttered and unlabeled visual data. Then, from a theoretical perspective,
visual graph mining should encode and overcome the great fuzziness of messy
data collected from complex real-world situations, which conflicts with the
conventional theoretical basis of graph mining designed for tabular data.
Common subgraphs hidden in these ARGs usually have soft attributes, with
considerable inter-graph variation. More importantly, we should also discover
the latent pattern space, including similarity metrics for the pattern and
hidden node relations, during the mining process. In this study, we redefine
the visual subgraph pattern that encodes all of these challenges in a general
way, and propose an approximate but efficient solution to graph mining. We
conduct five experiments to evaluate our method with different kinds of visual
data, including videos and RGB/RGB-D images. These experiments demonstrate the
generality of the proposed method.
A traffic classification method using machine learning algorithm
Applying concepts of attack investigation from the IT industry, this work designs
a traffic classification method using data mining techniques in combination with
machine learning algorithms, which will classify normal and malicious traffic. This
classification will help in learning about the unknown attacks faced by the IT
industry. The notion of traffic classification is not a new concept; plenty of work
has been done to classify network traffic for heterogeneous applications. Existing
techniques (payload-based, port-based and statistical) have their own pros and
cons, which will be discussed later in this literature, but classification using
machine learning techniques is still an open field to explore and has provided very
promising results up till now.
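The statistical approach mentioned above reduces each flow to a few summary features and classifies on those. The sketch below is purely illustrative (the feature choice, labels, and hand-rolled k-nearest-neighbors rule are all assumptions, not the paper's method).

```python
# Illustrative statistical-feature traffic classifier: each flow is
# reduced to simple statistics and labeled by a hand-rolled
# k-nearest-neighbors majority vote.
import math

def features(flow):
    """flow: list of packet sizes in bytes -> (mean size, total bytes)."""
    return (sum(flow) / len(flow), float(sum(flow)))

def knn_classify(train, flow, k=3):
    """train: list of (feature_tuple, label); returns majority label
    among the k training flows nearest to this flow."""
    x = features(flow)
    nearest = sorted(train, key=lambda t: math.dist(t[0], x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

# Toy training data: small mixed packets ~ normal browsing,
# uniform MTU-sized packets ~ bulk exfiltration (purely illustrative).
train = [
    (features([60, 80, 120]), "normal"),
    (features([70, 90, 100]), "normal"),
    (features([1500, 1500, 1500]), "malicious"),
    (features([1400, 1500, 1480]), "malicious"),
]
print(knn_classify(train, [65, 85, 110]))  # normal
```

Because only flow statistics are used, this style of classifier works even when payloads are encrypted, which is the usual argument for it over payload- and port-based techniques.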
GAN-based Pose-aware Regulation for Video-based Person Re-identification
Video-based person re-identification deals with the inherent difficulty of
matching unregulated sequences with different length and with incomplete target
pose/viewpoint structure. Common approaches operate either by reducing the
problem to the still images case, facing a significant information loss, or by
exploiting inter-sequence temporal dependencies as in Siamese Recurrent Neural
Networks or in gait analysis. However, in all cases, the inter-sequences
pose/viewpoint misalignment is not considered, and the existing spatial
approaches are mostly limited to the still images context. To this end, we
propose a novel approach that can exploit more effectively the rich video
information, by accounting for the role that the changing pose/viewpoint factor
plays in the sequences matching process. Specifically, our approach consists of
two components. The first one attempts to complement the original
pose-incomplete information carried by the sequences with synthetic
GAN-generated images, and fuse their feature vectors into a more discriminative
viewpoint-insensitive embedding, namely Weighted Fusion (WF). Another one
performs an explicit pose-based alignment of sequence pairs to promote coherent
feature matching, namely Weighted-Pose Regulation (WPR). Extensive experiments
on two large video-based benchmark datasets show that our approach considerably
outperforms existing methods.
Securing Your Transactions: Detecting Anomalous Patterns In XML Documents
XML transactions are used in many information systems to store data and
interact with other systems. Abnormal transactions, the result of either an
on-going cyber attack or the actions of a benign user, can potentially harm the
interacting systems and therefore they are regarded as a threat. In this paper
we address the problem of anomaly detection and localization in XML
transactions using machine learning techniques. We present a new XML anomaly
detection framework, XML-AD. Within this framework, an automatic method for
extracting features from XML transactions was developed as well as a practical
method for transforming XML features into vectors of fixed dimensionality. With
these two methods in place, the XML-AD framework makes it possible to utilize
general learning algorithms for anomaly detection. Central to the functioning
of the framework is a novel multi-univariate anomaly detection algorithm,
ADIFA. The framework was evaluated on four XML transactions datasets, captured
from real information systems, in which it achieved over 89% true positive
detection rate with less than a 0.2% false positive rate.
Comment: Journal version (14 pages)
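The step of transforming XML into fixed-dimensionality vectors can be sketched as below. This is a minimal assumption-based illustration, not the XML-AD feature extractor: it simply counts occurrences and text lengths over a fixed tag vocabulary chosen from training data, so every transaction maps to the same vector length.

```python
# Minimal sketch: map one XML transaction to a fixed-length feature
# vector over a fixed tag vocabulary, regardless of the input's shape.
import xml.etree.ElementTree as ET

def xml_to_vector(xml_text, tag_vocab):
    """For each tag in tag_vocab, emit (occurrence count, total text
    length); the vocabulary fixes the vector's dimensionality."""
    root = ET.fromstring(xml_text)
    vec = []
    for tag in tag_vocab:
        count, text_len = 0, 0
        for node in root.iter(tag):
            count += 1
            text_len += len(node.text or "")
        vec.extend([count, text_len])
    return vec

# Toy transaction with a hypothetical banking-style schema.
vocab = ["amount", "account", "note"]
tx = "<tx><amount>100</amount><account>A42</account></tx>"
print(xml_to_vector(tx, vocab))  # [1, 3, 1, 3, 0, 0]
```

With every transaction embedded in the same vector space, off-the-shelf anomaly detectors can be applied directly, which is the framework property the abstract highlights.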