Unsupervised learning with contrastive latent variable models
In unsupervised learning, dimensionality reduction is an important tool for
data exploration and visualization. Because these aims are typically
open-ended, it can be useful to frame the problem as looking for patterns that
are enriched in one dataset relative to another. These pairs of datasets occur
commonly, for instance a population of interest vs. a control, or signal vs. signal-free recordings. However, there are few methods that work on sets of data
as opposed to data points or sequences. Here, we present a probabilistic model
for dimensionality reduction to discover signal that is enriched in the target
dataset relative to the background dataset. The data in these sets do not need
to be paired or grouped beyond set membership. By using a probabilistic model
where some structure is shared between the two datasets and some is unique to
the target dataset, we are able to recover interesting structure in the latent
space of the target dataset. The method also has the advantages of a
probabilistic model, namely that it allows for the incorporation of prior
information, handles missing data, and can be generalized to different
distributional assumptions. We describe several possible variations of the
model and demonstrate the application of the technique to de-noising, feature
selection, and subgroup discovery settings.
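
As a rough illustration of the contrastive idea described in this abstract, the following Python sketch uses the closely related contrastive PCA formulation (top eigenvectors of the difference of covariance matrices) rather than the paper's probabilistic model; the function name, the alpha parameter, and the synthetic data are assumptions made for the example.

    import numpy as np

    def contrastive_directions(target, background, alpha=1.0, n_components=2):
        # Top eigenvectors of C_target - alpha * C_background: directions whose
        # variance is enriched in the target data relative to the background.
        c_target = np.cov(target, rowvar=False)
        c_background = np.cov(background, rowvar=False)
        evals, evecs = np.linalg.eigh(c_target - alpha * c_background)
        order = np.argsort(evals)[::-1][:n_components]   # largest eigenvalues first
        return evecs[:, order]

    # Hypothetical usage with synthetic data: rows are samples, columns are features.
    rng = np.random.default_rng(0)
    background = rng.normal(size=(500, 20))
    target = rng.normal(size=(300, 20))
    target[:, :2] += rng.normal(scale=3.0, size=(300, 2))   # structure unique to the target set
    W = contrastive_directions(target, background, alpha=1.0)
    target_latent = target @ W    # 2-D view emphasizing target-specific structure

A probabilistic counterpart would replace the eigendecomposition with inference over shared and target-specific latent factors, which is what allows priors, missing data, and alternative distributional assumptions to be handled as the abstract describes.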
Accuracy Assessment of the 2006 National Land Cover Database Percent Impervious Dataset
An impervious surface is any surface that prevents water from infiltrating the ground. As impervious surface area increases within watersheds, stream networks and water quality are negatively impacted. The Multi-Resolution Land Characteristics Consortium developed a percent impervious dataset using Landsat imagery as part of the 2006 National Land Cover Database. This percent impervious dataset estimates imperviousness for each 30-meter cell in the land cover database. The percent impervious dataset permits study of impervious surfaces, can be used to identify impacted or critical areas, and allows for development of impact mitigation plans; however, the accuracy of this dataset is unknown. To determine the accuracy of the 2006 percent impervious dataset, reference data were digitized from one-foot digital aerial imagery for three study areas in Arkansas, USA. Digitized reference data were compared to percent impervious dataset estimates of imperviousness at 900 m², 8,100 m², and 22,500 m² sample grids to determine if accuracy varied by ground area. Analyses showed percent impervious estimates and digitized reference data differ modestly; however, as ground area increases, percent impervious estimates and reference data match more closely. These findings suggest that the percent impervious dataset is useful for planning purposes for ground areas of at least 2.25 ha.
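
The grid-size comparison can be pictured with a small sketch: aggregate 30-meter percent-impervious cells into larger windows and compare against a reference raster. The arrays, the error model, and the function below are stand-ins, not the study's actual data or protocol.

    import numpy as np

    def block_percent(percent_impervious, cells):
        # Aggregate a grid of 30 m percent-impervious cells into square windows of
        # `cells` x `cells` pixels and return the mean percent impervious per window.
        rows, cols = percent_impervious.shape
        rows -= rows % cells
        cols -= cols % cells
        trimmed = percent_impervious[:rows, :cols]
        return trimmed.reshape(rows // cells, cells, cols // cells, cells).mean(axis=(1, 3))

    # Stand-in rasters (values in percent); 1x1, 3x3, and 5x5 windows of 30 m cells
    # correspond to the 900 m^2, 8,100 m^2, and 22,500 m^2 sample grids in the study.
    rng = np.random.default_rng(0)
    nlcd_estimate = rng.uniform(0, 100, size=(300, 300))
    reference = np.clip(nlcd_estimate + rng.normal(0, 10, size=(300, 300)), 0, 100)
    for cells, label in [(1, "900 m^2"), (3, "8,100 m^2"), (5, "22,500 m^2")]:
        diff = block_percent(nlcd_estimate, cells) - block_percent(reference, cells)
        print(label, "mean absolute difference:", round(float(np.abs(diff).mean()), 2))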
Automated data pre-processing via meta-learning
A data mining algorithm may perform differently on datasets with different characteristics, e.g., it might perform better on a dataset with continuous attributes rather than with categorical attributes, or the other way around.
In fact, a dataset usually needs to be pre-processed. Taking into account all the possible pre-processing operators, there is a staggeringly large number of alternatives, and inexperienced users can easily become overwhelmed.
We show that this problem can be addressed by an automated approach, leveraging ideas from meta-learning.
Specifically, we consider a wide range of data pre-processing techniques and a set of data mining algorithms. For each data mining algorithm and selected dataset, we are able to predict the transformations that improve the result
of the algorithm on the respective dataset. Our approach will help non-expert users to more effectively identify the transformations appropriate to their applications, and hence to achieve improved results.
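
A minimal sketch of the meta-learning idea, assuming hand-picked meta-features and stand-in labels; the paper's actual meta-features, pre-processing operators, and data mining algorithms are not reproduced here.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def meta_features(X):
        # Illustrative dataset-level meta-features: log sample size, dimensionality,
        # missing-value rate, and the fraction of low-cardinality (categorical-like) columns.
        n, d = X.shape
        missing_rate = float(np.isnan(X).mean())
        low_cardinality = float(np.mean(
            [np.unique(X[~np.isnan(X[:, j]), j]).size <= 10 for j in range(d)]))
        return np.array([np.log(n), d, missing_rate, low_cardinality])

    # Hypothetical meta-dataset: one row per (dataset, algorithm) pair, labelled with
    # the pre-processing transformation that most improved that algorithm's result
    # (e.g. 0 = none, 1 = standardize, 2 = discretize, 3 = impute then standardize).
    rng = np.random.default_rng(0)
    meta_X = np.array([meta_features(rng.normal(size=(100 + 5 * i, 8))) for i in range(40)])
    meta_y = rng.integers(0, 4, size=40)               # stand-in labels for the sketch

    recommender = RandomForestClassifier(n_estimators=100, random_state=0).fit(meta_X, meta_y)
    new_dataset = rng.normal(size=(250, 8))
    print("recommended transformation id:",
          recommender.predict(meta_features(new_dataset).reshape(1, -1))[0])

In practice the labels would come from actually running each algorithm on each pre-processed variant of the training datasets and recording which transformation helped most.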
Multimedia Semantic Integrity Assessment Using Joint Embedding Of Images And Text
Real world multimedia data is often composed of multiple modalities such as
an image or a video with associated text (e.g. captions, user comments, etc.)
and metadata. Such multimodal data packages are prone to manipulations, where a
subset of these modalities can be altered to misrepresent or repurpose data
packages, with possible malicious intent. It is, therefore, important to
develop methods to assess or verify the integrity of these multimedia packages.
Using computer vision and natural language processing methods to directly
compare the image (or video) and the associated caption to verify the integrity
of a media package is only possible for a limited set of objects and scenes. In
this paper, we present a novel deep learning-based approach for assessing the
semantic integrity of multimedia packages containing images and captions, using
a reference set of multimedia packages. We construct a joint embedding of
images and captions with deep multimodal representation learning on the
reference dataset in a framework that also provides image-caption consistency
scores (ICCSs). The integrity of query media packages is assessed as the
inlierness of the query ICCSs with respect to the reference dataset. We present
the MultimodAl Information Manipulation dataset (MAIM), a new dataset of media
packages from Flickr, which we make available to the research community. We use
both the newly created dataset as well as Flickr30K and MS COCO datasets to
quantitatively evaluate our proposed approach. The reference dataset does not
contain unmanipulated versions of tampered query packages. Our method is able
to achieve F1 scores of 0.75, 0.89 and 0.94 on MAIM, Flickr30K and MS COCO,
respectively, for detecting semantically incoherent media packages. Comment: Ayush Jaiswal and Ekraam Sabir contributed equally to the work in this paper.
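
A simplified sketch of the scoring pipeline described above, using stand-in embedding vectors in place of the learned deep multimodal representation; the cosine-similarity ICCS and the percentile-style inlierness measure are illustrative choices, not necessarily the paper's exact formulation.

    import numpy as np

    def iccs(image_emb, caption_emb):
        # Image-caption consistency score: cosine similarity in a joint embedding
        # space (the paper learns this space with deep multimodal representation
        # learning; here the embeddings are synthetic stand-ins).
        image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
        caption_emb = caption_emb / np.linalg.norm(caption_emb, axis=-1, keepdims=True)
        return np.sum(image_emb * caption_emb, axis=-1)

    def integrity_score(query_iccs, reference_iccs):
        # Inlierness of a query ICCS with respect to the reference distribution,
        # expressed here as the fraction of reference scores it exceeds.
        return float(np.mean(reference_iccs <= query_iccs))

    # Hypothetical reference set of consistent image-caption pairs.
    rng = np.random.default_rng(0)
    ref_img = rng.normal(size=(1000, 128))
    ref_cap = ref_img + rng.normal(scale=0.3, size=(1000, 128))    # consistent captions
    reference_iccs = iccs(ref_img, ref_cap)

    query_img = rng.normal(size=(128,))
    tampered_cap = rng.normal(size=(128,))                         # unrelated caption
    print("integrity of tampered package:",
          integrity_score(iccs(query_img, tampered_cap), reference_iccs))

A low integrity score flags the query package as a likely manipulation, without requiring an unmanipulated version of that package in the reference set.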
Improved Imputation of Common and Uncommon Single Nucleotide Polymorphisms (SNPs) with a New Reference Set
Statistical imputation of genotype data is an important technique for analysis of genome-wide association studies (GWAS). We have built a reference dataset to improve imputation accuracy for studies of individuals of primarily European descent using genotype data from the Hap1, Omni1, and Omni2.5 human SNP arrays (Illumina). Our dataset contains 2.5-3.1 million variants for 930 European, 157 Asian, and 162 African/African-American individuals. Imputation accuracy of European data from Hap660 or OmniExpress array content, measured by the proportion of variants imputed with R² > 0.8, improved by 34%, 23% and 12% for variants with MAF of 3%, 5% and 10%, respectively, compared to imputation using publicly available data from the 1,000 Genomes and International HapMap projects. The improved accuracy with the use of the new dataset could increase the power for GWAS by as much as 8% relative to genotyping all variants. This reference dataset is available to the scientific community through the NCBI dbGaP portal. Future versions will include additional genotype data as well as non-European populations.
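
The reported quality metric, the proportion of variants imputed with R² > 0.8, can be computed as sketched below; the dosage matrices are synthetic stand-ins, and the squared-correlation definition of imputation R² is one common convention rather than necessarily the one used here.

    import numpy as np

    def imputation_r2(true_dosages, imputed_dosages):
        # Per-variant imputation quality: squared Pearson correlation between
        # true and imputed allele dosages (columns = variants, rows = individuals).
        t = true_dosages - true_dosages.mean(axis=0)
        i = imputed_dosages - imputed_dosages.mean(axis=0)
        cov = (t * i).mean(axis=0)
        return cov**2 / (t.var(axis=0) * i.var(axis=0))

    # Hypothetical genotype matrices with allele dosages in [0, 2].
    rng = np.random.default_rng(0)
    true = rng.binomial(2, 0.1, size=(1000, 5000)).astype(float)
    imputed = np.clip(true + rng.normal(scale=0.4, size=true.shape), 0, 2)
    r2 = imputation_r2(true, imputed)
    print("proportion of variants with R^2 > 0.8:", float(np.mean(r2 > 0.8)))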
Removing the influence of a group variable in high-dimensional predictive modelling
In many application areas, predictive models are used to support or make
important decisions. There is increasing awareness that these models may
contain spurious or otherwise undesirable correlations. Such correlations may
arise from a variety of sources, including batch effects, systematic
measurement errors, or sampling bias. Without explicit adjustment, machine
learning algorithms trained using these data can produce poor out-of-sample
predictions which propagate these undesirable correlations. We propose a method
to pre-process the training data, producing an adjusted dataset that is
statistically independent of the nuisance variables with minimum information
loss. We develop a conceptually simple approach for creating an adjusted
dataset in high-dimensional settings based on a constrained form of matrix
decomposition. The resulting dataset can then be used in any predictive
algorithm with the guarantee that predictions will be statistically independent
of the group variable. We develop a scalable algorithm for implementing the
method, along with theoretical support in the form of independence guarantees and optimality results. The method is illustrated on simulation examples and applied
to two case studies: removing machine-specific correlations from brain scan
data, and removing race and ethnicity information from a dataset used to
predict recidivism. That the motivation for removing undesirable correlations
is quite different in the two applications illustrates the broad applicability
of our approach. Comment: Update. 18 pages, 3 figures.
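
A minimal sketch of the adjust-then-predict idea, using a plain linear projection that removes per-group means; this only guarantees zero linear correlation with the group variable, whereas the paper's constrained matrix decomposition targets statistical independence with minimum information loss. The function and data below are illustrative assumptions.

    import numpy as np

    def remove_group(X, groups):
        # Project out one-hot group indicators so each column of the adjusted
        # matrix has equal means across groups (a simple linear adjustment).
        labels = np.unique(groups)
        G = (groups[:, None] == labels[None, :]).astype(float)   # one-hot indicators
        coef, *_ = np.linalg.lstsq(G, X, rcond=None)             # per-group column means
        return X - G @ coef

    # Hypothetical data with a batch effect: group 1 has shifted features.
    rng = np.random.default_rng(0)
    groups = rng.integers(0, 2, size=500)
    X = rng.normal(size=(500, 10)) + 2.0 * groups[:, None]
    X_adj = remove_group(X, groups)
    print("max |correlation with group| before:",
          float(np.abs(np.corrcoef(X.T, groups)[-1, :-1]).max()))
    print("max |correlation with group| after: ",
          float(np.abs(np.corrcoef(X_adj.T, groups)[-1, :-1]).max()))

Any downstream predictive model trained on X_adj then cannot exploit linear information about group membership, which is the spirit of the pre-processing guarantee described in the abstract.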
