16,145 research outputs found
On Regularization Parameter Estimation under Covariate Shift
This paper identifies a problem with the usual procedure for
L2-regularization parameter estimation in a domain adaptation setting. In such
a setting, there are differences between the distributions generating the
training data (source domain) and the test data (target domain). The usual
cross-validation procedure requires validation data, which can not be obtained
from the unlabeled target data. The problem is that if one decides to use
source validation data, the regularization parameter is underestimated. One
possible solution is to scale the source validation data through importance
weighting, but we show that this correction is not sufficient. We conclude the
paper with an empirical analysis of the effect of several importance weight
estimators on the estimation of the regularization parameter.Comment: 6 pages, 2 figures, 2 tables. Accepted to ICPR 201
Knowledge Base Population using Semantic Label Propagation
A crucial aspect of a knowledge base population system that extracts new
facts from text corpora, is the generation of training data for its relation
extractors. In this paper, we present a method that maximizes the effectiveness
of newly trained relation extractors at a minimal annotation cost. Manual
labeling can be significantly reduced by Distant Supervision, which is a method
to construct training data automatically by aligning a large text corpus with
an existing knowledge base of known facts. For example, all sentences
mentioning both 'Barack Obama' and 'US' may serve as positive training
instances for the relation born_in(subject,object). However, distant
supervision typically results in a highly noisy training set: many training
sentences do not really express the intended relation. We propose to combine
distant supervision with minimal manual supervision in a technique called
feature labeling, to eliminate noise from the large and noisy initial training
set, resulting in a significant increase of precision. We further improve on
this approach by introducing the Semantic Label Propagation method, which uses
the similarity between low-dimensional representations of candidate training
instances, to extend the training set in order to increase recall while
maintaining high precision. Our proposed strategy for generating training data
is studied and evaluated on an established test collection designed for
knowledge base population tasks. The experimental results show that the
Semantic Label Propagation strategy leads to substantial performance gains when
compared to existing approaches, while requiring an almost negligible manual
annotation effort.Comment: Submitted to Knowledge Based Systems, special issue on Knowledge
Bases for Natural Language Processin
Neural overlap of L1 and L2 semantic representations across visual and auditory modalities : a decoding approach/
This study investigated whether brain activity in Dutch-French bilinguals during semantic access to concepts from one language could be used to predict neural activation during access to the same concepts from another language, in different language modalities/tasks. This was tested using multi-voxel pattern analysis (MVPA), within and across language comprehension (word listening and word reading) and production (picture naming). It was possible to identify the picture or word named, read or heard in one language (e.g. maan, meaning moon) based on the brain activity in a distributed bilateral brain network while, respectively, naming, reading or listening to the picture or word in the other language (e.g. lune). The brain regions identified differed across tasks. During picture naming, brain activation in the occipital and temporal regions allowed concepts to be predicted across languages. During word listening and word reading, across-language predictions were observed in the rolandic operculum and several motor-related areas (pre- and postcentral, the cerebellum). In addition, across-language predictions during reading were identified in regions typically associated with semantic processing (left inferior frontal, middle temporal cortex, right cerebellum and precuneus) and visual processing (inferior and middle occipital regions and calcarine sulcus). Furthermore, across modalities and languages, the left lingual gyrus showed semantic overlap across production and word reading. These findings support the idea of at least partially language- and modality-independent semantic neural representations
Synthetic sequence generator for recommender systems - memory biased random walk on sequence multilayer network
Personalized recommender systems rely on each user's personal usage data in
the system, in order to assist in decision making. However, privacy policies
protecting users' rights prevent these highly personal data from being publicly
available to a wider researcher audience. In this work, we propose a memory
biased random walk model on multilayer sequence network, as a generator of
synthetic sequential data for recommender systems. We demonstrate the
applicability of the synthetic data in training recommender system models for
cases when privacy policies restrict clickstream publishing.Comment: The new updated version of the pape
Molecular Infectious Disease Epidemiology: Survival Analysis and Algorithms Linking Phylogenies to Transmission Trees
Recent work has attempted to use whole-genome sequence data from pathogens to
reconstruct the transmission trees linking infectors and infectees in
outbreaks. However, transmission trees from one outbreak do not generalize to
future outbreaks. Reconstruction of transmission trees is most useful to public
health if it leads to generalizable scientific insights about disease
transmission. In a survival analysis framework, estimation of transmission
parameters is based on sums or averages over the possible transmission trees. A
phylogeny can increase the precision of these estimates by providing partial
information about who infected whom. The leaves of the phylogeny represent
sampled pathogens, which have known hosts. The interior nodes represent common
ancestors of sampled pathogens, which have unknown hosts. Starting from
assumptions about disease biology and epidemiologic study design, we prove that
there is a one-to-one correspondence between the possible assignments of
interior node hosts and the transmission trees simultaneously consistent with
the phylogeny and the epidemiologic data on person, place, and time. We develop
algorithms to enumerate these transmission trees and show these can be used to
calculate likelihoods that incorporate both epidemiologic data and a phylogeny.
A simulation study confirms that this leads to more efficient estimates of
hazard ratios for infectiousness and baseline hazards of infectious contact,
and we use these methods to analyze data from a foot-and-mouth disease virus
outbreak in the United Kingdom in 2001. These results demonstrate the
importance of data on individuals who escape infection, which is often
overlooked. The combination of survival analysis and algorithms linking
phylogenies to transmission trees is a rigorous but flexible statistical
foundation for molecular infectious disease epidemiology.Comment: 28 pages, 11 figures, 3 table
- …