AspEm: Embedding Learning by Aspects in Heterogeneous Information Networks
Heterogeneous information networks (HINs) are ubiquitous in real-world
applications. Due to the heterogeneity in HINs, the typed edges may not fully
align with each other. To capture this semantic subtlety, we propose
the concept of aspects, where each aspect is a unit representing one
underlying semantic facet. Meanwhile, network embedding has emerged as a
powerful method for learning network representation, where the learned
embedding can be used as features in various downstream applications.
Therefore, we are motivated to propose a novel embedding learning
framework---AspEm---to preserve the semantic information in HINs based on
multiple aspects. Instead of preserving information of the network in one
semantic space, AspEm encapsulates information regarding each aspect
individually. To select aspects for embedding purposes, we further
devise a solution for AspEm based on dataset-wide statistics. To corroborate
the efficacy of AspEm, we conducted experiments on two real-world datasets with
two types of applications---classification and link prediction. Experiment
results demonstrate that AspEm can outperform baseline network embedding
learning methods by considering multiple aspects, where the aspects can be
selected from the given HIN in an unsupervised manner.
Comment: 11 pages including additional supplementary materials. In Proceedings of the 2018 SIAM International Conference on Data Mining, San Diego, California, USA, SIAM, 201
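The aspect idea above can be pictured as learning one representation per aspect and concatenating them per node. A minimal toy sketch (not the authors' AspEm code; the degree-based per-aspect "embedding" is our stand-in for the learned embeddings the paper actually trains):

```python
from collections import defaultdict

def aspect_embeddings(edges_by_aspect, nodes):
    """edges_by_aspect: {aspect: [(u, v), ...]} -> {node: concatenated vector}.
    Each aspect's subnetwork is embedded separately; vectors are concatenated."""
    emb = {n: [] for n in nodes}
    for aspect in sorted(edges_by_aspect):
        adj = defaultdict(set)
        for u, v in edges_by_aspect[aspect]:
            adj[u].add(v)
            adj[v].add(u)
        # One-dimensional per-aspect "embedding": normalized degree within
        # this aspect's subnetwork (a placeholder for a learned embedding).
        max_deg = max((len(adj[n]) for n in nodes), default=1) or 1
        for n in nodes:
            emb[n].append(len(adj[n]) / max_deg)
    return emb
```

The point of the sketch is structural: information from each aspect is encapsulated individually rather than forced into one semantic space.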
Clinical Relationships Extraction Techniques from Patient Narratives
The Clinical E-Science Framework (CLEF) project was used to extract important
information from medical texts by building a system for the purposes of clinical
research, evidence-based healthcare, and genotype-meets-phenotype informatics.
The system is divided into two parts. The first part concerns the identification
of relationships between clinically important entities in the text; full
parses and domain-specific grammars were used to apply several approaches to
relationship extraction. In the second part of the system, statistical machine
learning (ML) approaches are applied to extract relationships. A corpus of
oncology narratives, hand-annotated with clinical relationships, is used
to train and test a system designed and implemented with supervised ML
approaches. Many features are extracted from these texts and used by the
classifier to build a model. Multiple supervised learning algorithms are
applied for relationship extraction, and the effects of adding features,
changing the size of the corpus, and changing the type of algorithm on
relationship extraction are examined.
Keywords: text mining; information extraction; NLP; entities; relations.
Comment: 15 pages, 13 figures, 7 table
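The feature-based pipeline described above can be illustrated with a small extractor for a candidate entity pair; the feature names and choices here are ours, not the paper's:

```python
def pair_features(tokens, e1, e2):
    """Lexical features for a candidate entity pair in a tokenized
    narrative, of the kind fed to a supervised relation classifier
    (illustrative feature set, not the system's actual features)."""
    i, j = sorted((e1, e2))
    return {
        "distance": j - i,                     # token distance between entities
        "between": " ".join(tokens[i + 1:j]),  # words between the entities
        "e1_word": tokens[i],                  # surface form of first entity
        "e2_word": tokens[j],                  # surface form of second entity
    }
```

Each candidate pair becomes one feature dictionary; a classifier is then trained on pairs labeled with their clinical relationship (or none).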
Literature Review Of Attribute Level And Structure Level Data Linkage Techniques
Data Linkage is an important step that can provide valuable insights for
evidence-based decision making, especially for crucial events. Performing
sensible queries across heterogeneous databases containing millions of records
is a complex task that requires a complete understanding of each contributing
database's schema to define the structure of its information. The key aim is to
approximate the structure and content of the induced data into a concise
synopsis in order to extract and link meaningful data-driven facts. We identify
such problems as four major research issues in Data Linkage: associated costs
in pair-wise matching, record matching overheads, semantic flow of information
restrictions, and single order classification limitations. In this paper, we
give a literature review of research in Data Linkage. The purpose of this
review is to establish a basic understanding of Data Linkage and to discuss
the background in the Data Linkage research domain. Particularly, we focus on
the literature related to the recent advancements in Approximate Matching
algorithms at Attribute Level and Structure Level. Their efficiency,
functionality and limitations are critically analysed and open problems are
exposed.
Comment: 20 page
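Attribute-level approximate matching, one of the two levels surveyed above, can be illustrated with a character-bigram Jaccard measure; the naive quadratic loop below also makes visible the pair-wise matching cost the review identifies as a research issue (both functions are our sketch, not any surveyed system):

```python
def bigrams(s):
    """Character bigrams of a lowercased attribute value."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def attribute_similarity(a, b):
    """Jaccard similarity over character bigrams -- one common
    attribute-level approximate-matching measure."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

def link_records(rec_a, rec_b, threshold=0.5):
    """Naive pair-wise matching over two record lists: O(|A|*|B|)
    comparisons, the overhead that blocking techniques aim to cut."""
    return [(x, y) for x in rec_a for y in rec_b
            if attribute_similarity(x, y) >= threshold]
```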
A Tensor Based Data Model for Polystore: An Application to Social Networks Data
In this article, we show how tensors, as mathematical objects, can be used to
build a multi-paradigm model for storing social data in data warehouses.
From an architectural point of view, our approach makes it possible to link
different storage systems (a polystore) and limits the impact of the ETL tools
that perform the model transformations required to feed different analysis
algorithms. Systems can therefore take advantage of multiple data models, both
in terms of query execution performance and the semantic expressiveness of the
data representation. The proposed model achieves logical independence between
the data and the programs implementing the analysis algorithms. With a concrete
case study on message virality on Twitter during the 2017 French presidential
election, we highlight some of the contributions of our model.
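One way to picture the tensor data model is a sparse (user, user, time) tensor whose mode slices feed different analysis algorithms without an ETL rewrite; this is an illustrative stand-in, not the article's implementation:

```python
class SparseTensor:
    """Minimal sparse 3-way tensor keyed by (user, user, time)
    coordinates -- an illustrative stand-in for a tensor-based
    model of social interaction data."""
    def __init__(self):
        self.data = {}

    def add(self, u, v, t, w=1.0):
        """Accumulate an interaction (e.g. a retweet) at time step t."""
        self.data[(u, v, t)] = self.data.get((u, v, t), 0.0) + w

    def slice_time(self, t):
        """Mode-3 slice: the interaction graph at one time step,
        ready to hand to a graph-analysis algorithm."""
        return {(u, v): w for (u, v, tt), w in self.data.items() if tt == t}
```

Because each analysis consumes a slice or unfolding of the same tensor, the logical representation stays independent of any one storage system.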
Dataset2Vec: Learning Dataset Meta-Features
Meta-learning, or learning to learn, is a machine learning approach that
utilizes prior learning experiences to expedite the learning process on unseen
tasks. As a data-driven approach, meta-learning requires meta-features that
represent the primary learning tasks or datasets; traditionally, these are
estimated as engineered dataset statistics that require expert domain
knowledge tailored to every meta-task. In this paper, first, we propose a
meta-feature extractor called Dataset2Vec that combines the versatility of
engineered dataset meta-features with the expressivity of meta-features learned
by deep neural networks. Primary learning tasks or datasets are represented as
hierarchical sets, i.e., as a set of sets, specifically as a set of
predictor/target pairs, and then a DeepSet architecture is employed to regress
meta-features on
them. Second, we propose a novel auxiliary meta-learning task with abundant
data called dataset similarity learning that aims to predict if two batches
stem from the same dataset or different ones. In an experiment on a large-scale
hyperparameter optimization task for 120 UCI datasets with varying schemas as a
meta-learning task, we show that the meta-features of Dataset2Vec outperform
the expert-engineered meta-features and thus demonstrate, for the first time,
the usefulness of learned meta-features for datasets with varying schemas.
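The hierarchical-set encoding behind Dataset2Vec can be sketched with a permutation-invariant DeepSet-style function: map each set element, mean-pool, then transform. Here `phi` and `rho` are trivial placeholders where the paper uses learned neural networks:

```python
def deepset(elements, phi, rho):
    """Permutation-invariant set function: rho(mean(phi(x) for x in set)).
    The mean pooling makes the output independent of element order."""
    mapped = [phi(x) for x in elements]
    n = len(mapped)
    pooled = [sum(col) / n for col in zip(*mapped)]
    return rho(pooled)

def dataset2vec_sketch(pairs):
    """Sketch of the hierarchical encoding: a dataset is a set of
    (predictor, target) pairs. phi/rho are hand-written placeholders;
    in Dataset2Vec both are learned networks trained on the dataset
    similarity task."""
    phi = lambda xy: [xy[0], xy[1], xy[0] * xy[1]]
    rho = lambda v: v  # identity in place of a learned network
    return deepset(pairs, phi, rho)
```

Because pooling is over sets, the same extractor applies to datasets with varying numbers of rows, predictors, and targets.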
Machine Learning with World Knowledge: The Position and Survey
Machine learning has become pervasive in multiple domains, impacting a wide
variety of applications, such as knowledge discovery and data mining, natural
language processing, information retrieval, computer vision, social and health
informatics, ubiquitous computing, etc. Two essential problems of machine
learning are how to generate features and how to acquire labels for machines to
learn. In particular, labeling large amounts of data for each domain-specific
problem can be very time-consuming and costly. It has become a key obstacle in
making learning protocols realistic in applications. In this paper, we will
discuss how to use the existing general-purpose world knowledge to enhance
machine learning processes, by enriching the features or reducing the labeling
work. We start from the comparison of world knowledge with domain-specific
knowledge, and then introduce three key problems in using world knowledge in
learning processes, i.e., explicit and implicit feature representation,
inference for knowledge linking and disambiguation, and learning with direct or
indirect supervision. Finally, we discuss the future directions of this
research topic.
Deep Recurrent Neural Networks for mapping winter vegetation quality coverage via multi-temporal SAR Sentinel-1
Mapping winter vegetation quality coverage is a challenging problem in remote
sensing, due to cloud coverage in the winter period, which leads to the use of
radar rather than optical images. The objective of this paper is to provide a
better understanding of the capabilities of Sentinel-1 radar data and deep
learning for mapping winter vegetation quality coverage. The analysis
presented in this paper is carried out on multi-temporal Sentinel-1 data over
the site of La Rochelle, France, during the campaign in December 2016. This
dataset was processed to produce an intensity radar data stack from
October 2016 to February 2017. Two deep Recurrent Neural Network (RNN) based
classifier methods were employed. We found that the results of RNNs clearly
outperformed the classical machine learning approaches (Support Vector Machine
and Random Forest). This study confirms that the time series radar Sentinel-1
and RNNs could be exploited for winter vegetation quality cover mapping.
Comment: In submission to IEEE Geoscience and Remote Sensing Letter
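The basic shape of such a classifier is a many-to-one recurrent pass over a pixel's radar time series, with the final hidden state deciding the class. A toy sketch with scalar weights (not the paper's architecture, which uses learned vector-valued RNN cells):

```python
import math

def rnn_classify(sequence, Wx, Wh, Wo):
    """Many-to-one recurrent forward pass over a time series of radar
    intensities (scalar input, scalar hidden state): the hidden state
    carries information across acquisition dates, which is what lets
    RNNs exploit the temporal structure of the Sentinel-1 stack."""
    h = 0.0
    for x in sequence:
        h = math.tanh(Wx * x + Wh * h)  # recurrent state update
    score = Wo * h                       # linear read-out
    return 1 if score > 0 else 0         # binary class decision
```

Per-date classifiers such as SVM or Random Forest see each acquisition independently; the recurrence above is the structural difference that the paper's results attribute the gain to.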
Predicting Anchor Links between Heterogeneous Social Networks
People usually get involved in multiple social networks to enjoy new services
or to fulfill their needs. Many new social networks try to attract users of
other existing networks to increase the number of their users. Once a user
(called source user) of a social network (called source network) joins a new
social network (called target network), a new inter-network link (called anchor
link) is formed between the source and target networks. In this paper, we
concentrate on predicting the formation of such anchor links between
heterogeneous social networks. Unlike conventional link prediction problems in
which the formation of a link between two existing users within a single
network is predicted, in anchor link prediction, the target user is missing and
will be added to the target network once the anchor link is created. To solve
this problem, we use meta-paths as a powerful tool for utilizing heterogeneous
information in both the source and target networks. To this end, we propose an
effective general meta-path-based approach called Connector and Recursive
Meta-Paths (CRMP). By using those two different categories of meta-paths, we
model different aspects of social factors that may affect a source user to join
the target network, resulting in the formation of a new anchor link. Extensive
experiments on real-world heterogeneous social networks demonstrate the
effectiveness of the proposed method compared with recent methods.
Comment: To be published in "Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM)
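One connector meta-path feature of the kind CRMP builds on can be sketched as counting paths of the form u -friend- a ~anchor~ a' -friend- v between a source user u and a candidate target user v. This single feature is our illustration, not the paper's full feature set:

```python
def connector_path_count(u_friends, v_friends, anchor):
    """Count instances of the connector meta-path
        u -friend- a  ~anchor~  a' -friend- v
    i.e. friends of the source user u whose existing anchor
    counterparts are already friends of the target user v.
    anchor: dict mapping source-network users to their target-network
    counterparts (existing anchor links)."""
    return sum(1 for a in u_friends
               if a in anchor and anchor[a] in v_friends)
```

Counts of such paths over several meta-path categories become features for a classifier that predicts whether the anchor link (u, v) will form.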
Data Management and Mining in Astrophysical Databases
We analyse the issues involved in the management and mining of astrophysical
data. The traditional approach to data management in the astrophysical field is
not able to keep up with the increasing size of the data gathered by modern
detectors. An essential role in astrophysical research will be assumed by
automatic tools for information extraction from large datasets, i.e. data
mining techniques such as clustering and classification algorithms. This calls
for an approach to data management based on data warehousing, emphasizing the
efficiency and simplicity of data access; efficiency is obtained using
multidimensional access methods and simplicity is achieved by properly handling
metadata. Clustering and classification techniques, on large datasets, pose
additional requirements: computational and memory scalability with respect to
the data size, interpretability and objectivity of clustering or classification
results. In this study we address some possible solutions.
Comment: 10 pages, Late
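The scalability requirement above can be illustrated with a single-pass nearest-centre assignment step, the kind of memory-bounded operation that clustering over large survey datasets demands (a sketch under our own assumptions, not the study's algorithms):

```python
def assign_stream(points, centers):
    """Assign each point to its nearest centre in one streaming pass:
    memory use is bounded by the number of centres, not the dataset
    size, so arbitrarily large catalogues can be processed."""
    labels = []
    for p in points:
        # squared Euclidean distance to each centre
        d = [sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centers]
        labels.append(d.index(min(d)))
    return labels
```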
Securing Your Transactions: Detecting Anomalous Patterns In XML Documents
XML transactions are used in many information systems to store data and
interact with other systems. Abnormal transactions, the result of either an
on-going cyber attack or the actions of a benign user, can potentially harm the
interacting systems and therefore they are regarded as a threat. In this paper
we address the problem of anomaly detection and localization in XML
transactions using machine learning techniques. We present a new XML anomaly
detection framework, XML-AD. Within this framework, an automatic method for
extracting features from XML transactions was developed as well as a practical
method for transforming XML features into vectors of fixed dimensionality. With
these two methods in place, the XML-AD framework makes it possible to utilize
general learning algorithms for anomaly detection. Central to the functioning
of the framework is a novel multi-univariate anomaly detection algorithm,
ADIFA. The framework was evaluated on four XML transaction datasets, captured
from real information systems, on which it achieved a true-positive detection
rate of over 89% with a false-positive rate of less than 0.2%.
Comment: Journal version (14 pages
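The two ingredients of XML-AD, fixed-dimensional feature vectors extracted from XML transactions and univariate detectors over those features, can be sketched as follows. The specific features and the threshold rule are our illustrative stand-ins for the paper's automatic extractor and the ADIFA algorithm:

```python
import xml.etree.ElementTree as ET

def xml_features(doc):
    """Fixed-dimensional feature vector from one XML transaction:
    element count, maximum nesting depth, total text length
    (illustrative features; XML-AD derives its features automatically)."""
    root = ET.fromstring(doc)

    def depth(e):
        return 1 + max((depth(c) for c in e), default=0)

    count = sum(1 for _ in root.iter())
    text_len = sum(len(e.text or "") for e in root.iter())
    return [count, depth(root), text_len]

def univariate_anomaly(train_col, x, k=3.0):
    """Flag x as anomalous if it lies more than k standard deviations
    from the training mean of one feature -- a simple stand-in for one
    of ADIFA's per-feature detectors; combining many such detectors
    gives the multi-univariate scheme."""
    n = len(train_col)
    mu = sum(train_col) / n
    var = sum((v - mu) ** 2 for v in train_col) / n
    return abs(x - mu) > k * (var ** 0.5)
```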