Adaptive Tag Selection for Image Annotation
Not all tags are relevant to an image, and the number of relevant tags is
image-dependent. Although many methods have been proposed for image
auto-annotation, the question of how to determine the number of tags to be
selected per image remains open. The main challenge is that for a large tag
vocabulary, there is often a lack of ground truth data for acquiring optimal
cutoff thresholds per tag. In contrast to previous works that pre-specify the
number of tags to be selected, we propose in this paper adaptive tag selection.
The key insight is to divide the vocabulary into two disjoint subsets, namely a
seen set consisting of tags having ground truth available for optimizing their
thresholds and a novel set consisting of tags without any ground truth. Such a
division allows us to estimate how many tags shall be selected from the novel
set according to the tags that have been selected from the seen set. The
effectiveness of the proposed method is justified by our participation in the
ImageCLEF 2014 image annotation task. On a set of 2,065 test images with ground
truth available for 207 tags, the benchmark evaluation shows that compared to
the popular top-k strategy, which obtains an F-score of 0.122, adaptive tag
selection achieves a higher F-score of 0.223. Moreover, by treating the
underlying image annotation system as a black box, the new method can be used
as an easy plug-in to boost the performance of existing systems.
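The seen/novel division described in this abstract can be illustrated with a minimal sketch. The abstract does not give the estimation rule, so the proportional rule below (select the same fraction of novel tags as was selected from the seen set) is an assumption made purely for illustration, and the function name `adaptive_select` and all tag names are hypothetical.

```python
from typing import Dict, List


def adaptive_select(
    scores: Dict[str, float],
    thresholds: Dict[str, float],
    seen: List[str],
    novel: List[str],
) -> List[str]:
    """Select tags for one image from black-box relevance scores.

    ASSUMPTION: the number of novel tags is taken to be proportional to
    the fraction of seen tags that passed their thresholds. The abstract
    does not specify the actual estimator.
    """
    # Seen tags have ground truth, so each has its own tuned threshold.
    picked_seen = [t for t in seen if scores[t] >= thresholds[t]]

    # Estimate how many novel tags to keep from the seen-set outcome.
    ratio = len(picked_seen) / len(seen) if seen else 0.0
    n_novel = round(ratio * len(novel))

    # Novel tags have no thresholds, so rank them by score instead.
    ranked_novel = sorted(novel, key=lambda t: scores[t], reverse=True)
    return picked_seen + ranked_novel[:n_novel]


scores = {"dog": 0.9, "cat": 0.2, "car": 0.1, "tree": 0.7,
          "puppy": 0.8, "kitten": 0.3, "truck": 0.05, "leaf": 0.6}
thresholds = {"dog": 0.5, "cat": 0.4, "car": 0.3, "tree": 0.6}
selected = adaptive_select(
    scores, thresholds,
    seen=["dog", "cat", "car", "tree"],
    novel=["puppy", "kitten", "truck", "leaf"],
)
print(selected)
```

Because the function only consumes scores, it can sit behind any annotation system, which matches the black-box plug-in use the abstract describes.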
Constructing Hierarchical Image-tags Bimodal Representations for Word Tags Alternative Choice
This paper describes our solution to the multi-modal learning challenge of
ICML. This solution comprises constructing three-level representations in three
consecutive stages and choosing correct tag words with a data-specific
strategy. Firstly, we use typical methods to obtain level-1 representations.
Each image is represented using MPEG-7 and gist descriptors with additional
features released by the contest organizers. The corresponding word tags
are represented by a bag-of-words model with a dictionary of 4000 words.
Secondly, we learn the level-2 representations using two stacked RBMs for each
modality. Thirdly, we propose a bimodal auto-encoder to learn the
similarities/dissimilarities between the pairwise image-tags as level-3
representations. Finally, during the test phase, based on one observation of
the dataset, we devise a data-specific strategy for choosing the correct tag
words, which leads to a leap in overall performance. Our final average
accuracy on the private test set is 100%, which ranks first in this
challenge.
Comment: 6 pages, 1 figure, Presented at the Workshop on Representation
Learning, ICML 201
A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data
Deep shotgun sequencing and analysis of genomes, transcriptomes, amplified
single-cell genomes, and metagenomes has enabled investigation of a wide range
of organisms and ecosystems. However, sampling variation in short-read data
sets and high sequencing error rates of modern sequencers present many new
computational challenges in data interpretation. These challenges have led to
the development of new classes of mapping tools and de novo assemblers.
These algorithms are challenged by the continued improvement in sequencing
throughput. We here describe digital normalization, a single-pass computational
algorithm that systematizes coverage in shotgun sequencing data sets, thereby
decreasing sampling variation, discarding redundant data, and removing the
majority of errors. Digital normalization substantially reduces the size of
shotgun data sets and decreases the memory and time requirements for de
novo sequence assembly, all without significantly impacting content of the
generated contigs. We apply digital normalization to the assembly of microbial
genomic data, amplified single-cell genomic data, and transcriptomic data. Our
implementation is freely available for use and modification.
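The single-pass idea in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the commonly described rule of keeping a read only when the median abundance of its k-mers, counted over the reads kept so far, is below a coverage cutoff. The function names and the parameter values (`k`, `cutoff`) are illustrative assumptions, and an exact in-memory counter is used here for clarity rather than memory efficiency.

```python
from collections import Counter
from statistics import median
from typing import Iterable, List


def kmers(read: str, k: int) -> List[str]:
    """Return all overlapping substrings of length k in the read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalize(reads: Iterable[str], k: int = 20, cutoff: int = 20) -> List[str]:
    """Single pass over reads, discarding those from well-covered regions.

    ASSUMPTION: coverage of a read is estimated as the median count of
    its k-mers among the reads kept so far.
    """
    counts: Counter = Counter()
    kept: List[str] = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue  # read is shorter than k, nothing to estimate
        if median(counts[km] for km in ks) < cutoff:
            kept.append(read)
            counts.update(ks)  # only kept reads contribute to coverage
    return kept


# Five copies of one read plus one distinct read, with small k and cutoff.
reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
kept = digital_normalize(reads, k=4, cutoff=2)
print(kept)
```

In this toy run the redundant copies are dropped once their estimated coverage reaches the cutoff, while the low-coverage read is retained, which is the behaviour the abstract attributes to coverage normalization.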
Interpreting 16S metagenomic data without clustering to achieve sub-OTU resolution
The standard approach to analyzing 16S tag sequence data, which relies on
clustering reads by sequence similarity into Operational Taxonomic Units
(OTUs), underexploits the accuracy of modern sequencing technology. We present
a clustering-free approach to multi-sample Illumina datasets that can identify
independent bacterial subpopulations regardless of the similarity of their 16S
tag sequences. Using published data from a longitudinal time-series study of
human tongue microbiota, we are able to resolve within standard 97% similarity
OTUs up to 20 distinct subpopulations, all ecologically distinct but with 16S
tags differing by as little as 1 nucleotide (99.2% similarity). A comparative
analysis of oral communities of two cohabiting individuals reveals that most
such subpopulations are shared between the two communities at 100% sequence
identity, and that dynamical similarity between subpopulations in one host is
strongly predictive of dynamical similarity between the same subpopulations in
the other host. Our method can also be applied to samples collected in
cross-sectional studies and can be used with the 454 sequencing platform. We
discuss how the sub-OTU resolution of our approach can provide new insight into
factors shaping community assembly.
Comment: Updated to match the published version. 12 pages, 5 figures +
supplement. Significantly revised for clarity, references added, results not
change
Using GWAS Data to Identify Copy Number Variants Contributing to Common Complex Diseases
Copy number variants (CNVs) account for more polymorphic base pairs in the
human genome than do single nucleotide polymorphisms (SNPs). CNVs encompass
genes as well as noncoding DNA, making these polymorphisms good candidates for
functional variation. Consequently, most modern genome-wide association studies
test CNVs along with SNPs, after inferring copy number status from the data
generated by high-throughput genotyping platforms. Here we give an overview of
CNV genomics in humans, highlighting patterns that inform methods for
identifying CNVs. We describe how genotyping signals are used to identify CNVs
and provide an overview of existing statistical models and methods used to
infer location and carrier status from such data, especially the most commonly
used methods exploring hybridization intensity. We compare the power of such
methods with the alternative method of using tag SNPs to identify CNV carriers.
As such methods are only powerful when applied to common CNVs, we describe two
alternative approaches that can be informative for identifying rare CNVs
contributing to disease risk. We focus particularly on methods identifying de
novo CNVs and show that such methods can be more powerful than case-control
designs. Finally we present some recommendations for identifying CNVs
contributing to common complex disorders.
Comment: Published at http://dx.doi.org/10.1214/09-STS304 in Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org
Jet Substructure by Accident
We propose a new search strategy for high-multiplicity hadronic final states.
When new particles are produced at threshold, the distribution of their decay
products is approximately isotropic. If there are many partons in the final
state, it is likely that several will be clustered into the same large-radius
jet. The resulting jet exhibits substructure, even though the parent states are
not boosted. This "accidental" substructure is a powerful discriminant against
background because it is more pronounced for high-multiplicity signals than for
QCD multijets. We demonstrate how to take advantage of accidental substructure
to reduce backgrounds without relying on the presence of missing energy. As an
example, we present the expected limits for several R-parity violating gluino
decay topologies. This approach allows for the determination of QCD backgrounds
using data-driven methods, which is crucial for the feasibility of any search
that targets signatures with many jets and suppressed missing energy.
Comment: 20 + 7 pages, 8 figures; v2: references added, minor changes, journal
versio