Scatteract: Automated extraction of data from scatter plots
Charts are an excellent way to convey patterns and trends in data, but they
do not facilitate further modeling of the data or close inspection of
individual data points. We present a fully automated system for extracting the
numerical values of data points from images of scatter plots. We use deep
learning techniques to identify the key components of the chart, and optical
character recognition together with robust regression to map from pixels to the
coordinate system of the chart. We focus on scatter plots with linear scales,
which already have several interesting challenges. Previous work has done fully
automatic extraction for other types of charts, but to our knowledge this is
the first approach that is fully automatic for scatter plots. Our method
performs well, achieving successful data extraction on 89% of the plots in our
test set.
Comment: Submitted to ECML PKDD 2017 proceedings, 16 pages
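The pixel-to-coordinate step mentioned in the abstract above can be illustrated with a minimal sketch. The abstract says only "robust regression"; the Theil-Sen estimator used here is a stand-in chosen for brevity, not necessarily the authors' method. The function names and tick data are hypothetical.

```python
from itertools import combinations
from statistics import median


def theil_sen(pixels, values):
    """Fit value ~ slope * pixel + intercept, tolerating a minority of outliers."""
    # Slope between every pair of ticks; the median ignores a few bad pairs.
    slopes = [
        (v2 - v1) / (p2 - p1)
        for (p1, v1), (p2, v2) in combinations(zip(pixels, values), 2)
        if p2 != p1
    ]
    slope = median(slopes)
    # Median residual gives an intercept that is also outlier-resistant.
    intercept = median(v - slope * p for p, v in zip(pixels, values))
    return slope, intercept


def pixel_to_value(pixel, slope, intercept):
    """Map a pixel position on one axis to chart coordinates."""
    return slope * pixel + intercept


# Hypothetical x-axis ticks: pixel columns and the labels read by OCR.
tick_pixels = [100, 200, 300, 400, 500]
tick_values = [0.0, 10.0, 20.0, 80.0, 40.0]  # 80.0 simulates an OCR misread of 30

slope, intercept = theil_sen(tick_pixels, tick_values)
point_x = pixel_to_value(350, slope, intercept)
```

With one misread label out of five, the median-based fit still follows the four consistent ticks. An ordinary least-squares fit would be pulled toward the outlier. The same fit would be run separately for the y-axis, where image rows usually increase downward and the slope is therefore negative.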
Genome-wide inference of ancestral recombination graphs
The complex correlation structure of a collection of orthologous DNA
sequences is uniquely captured by the "ancestral recombination graph" (ARG), a
complete record of coalescence and recombination events in the history of the
sample. However, existing methods for ARG inference are computationally
intensive, highly approximate, or limited to small numbers of sequences, and,
as a consequence, explicit ARG inference is rarely used in applied population
genomics. Here, we introduce a new algorithm for ARG inference that is
efficient enough to apply to dozens of complete mammalian genomes. The key idea
of our approach is to sample an ARG of n chromosomes conditional on an ARG of
n-1 chromosomes, an operation we call "threading." Using techniques based on
hidden Markov models, we can perform this threading operation exactly, up to
the assumptions of the sequentially Markov coalescent and a discretization of
time. An extension allows for threading of subtrees instead of individual
sequences. Repeated application of these threading operations results in highly
efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these
methods in a computer program called ARGweaver. Experiments with simulated data
indicate that ARGweaver converges rapidly to the true posterior distribution
and is effective in recovering various features of the ARG for dozens of
sequences generated under realistic parameters for human populations. In
applications of ARGweaver to 54 human genome sequences from Complete Genomics,
we find clear signatures of natural selection, including regions of unusually
ancient ancestry associated with balancing selection and reductions in allele
age in sites under directional selection. Preliminary results also indicate
that our methods can be used to gain insight into complex features of human
population structure, even with a noninformative prior distribution.
Comment: 88 pages, 7 main figures, 22 supplementary figures. This version
contains a substantially expanded genomic data analysis
AccelPrint: Accelerometers are Different by Birth
This paper submits a hypothesis that smartphone accelerometers possess unique fingerprints. We believe that the fingerprints arise from hardware imperfections during the sensor manufacturing process, causing every sensor chip to respond differently to the same motion stimulus. The differences in responses are subtle enough that they do not affect most of the higher-level functions computed on them. Nonetheless, upon close inspection, these fingerprints emerge with consistency, and can even be somewhat independent of the stimulus that generates them. Measurements and classification on 80 standalone accelerometer chips, 25 Android phones, and 2 tablets show precision and recall upward of 96%, along with good robustness to real-world conditions. Unsurprisingly, such sensor fingerprints invite new threats in smartphone applications. A crowd-sourcing app running in the cloud could segregate sensor data for each device, making it easy to track a user over space and time. This paper makes the case that such attacks are almost trivial to launch, while simple solutions may not be adequate to counteract them.
Recognizing Handwriting Styles in a Historical Scanned Document Using Unsupervised Fuzzy Clustering
The forensic attribution of the handwriting in a digitized document to
multiple scribes is a challenging problem of high dimensionality. Unique
handwriting styles may be dissimilar in a blend of several factors including
character size, stroke width, loops, ductus, slant angles, and cursive
ligatures. Previous work on labeled data with hidden Markov models, support
vector machines, and semi-supervised recurrent neural networks has achieved
moderate to high success. In this study, we successfully detect hand shifts in
a historical manuscript through fuzzy soft clustering in combination with
linear principal component analysis. This advance demonstrates the successful
deployment of unsupervised methods for writer attribution of historical
documents and forensic document analysis.
Comment: 26 pages in total, 5 figures and 2 tables
Mapping of Genomic Regions Underlying the Early Flowering Trait in ‘RE2’, a Mutant Derived from Flax (Linum usitatissimum L.) Cultivar ‘Royal’
Canada is a world leader in flax production, and expanding the crop into the northern region of the prairies requires early-flowering, and consequently early-maturing, cultivars that avoid frost damage. New sources of variation for flowering time are therefore of great interest. Flax genomic resources, including a chromosome-level assembly, are now sufficiently developed to examine traits with complex inheritance. An early-flowering mutant, ‘RE2’, was selected from cultivar ‘Royal’ after treatment with 5-Azacytidine (5-AzaC). The mutant line flowered about 7 to 13 days earlier than the progenitor ‘Royal’. A large recombinant inbred line (RIL) population of 656 lines, derived from ‘Royal’ x ‘RE2’, was used to identify the potential genomic region underlying the trait. First, the RIL population was phenotyped for early vigour, days to start of flowering, full flowering and maturity, and height in three field seasons (2015, 2016 and 2017) using a modified augmented design type 2, and once in a growth cabinet. Second, the distributional extremes for flowering time identified in the RIL population were subjected to sequencing-based bulked segregant analysis. Third, the QTL-seq bioinformatics pipeline (Takagi et al. 2013) was used to identify SNPs, which were annotated using SnpEff. The QTL-seq pipeline identified a SNP upstream of the flax gene homologous to Arabidopsis LUMINIDEPENDENS. Later, the sequencing data were reanalysed with customized variant-calling steps followed by statistical analysis using QTLseqr (Mansfeld and Grumet 2018), a more recent, improved pipeline. QTLseqr detected two genomic regions, on chromosomes 9 and 12, significantly associated with the early flowering trait. Based on homology analysis, the variants in these regions were associated with genes encoding LATE EMBRYOGENESIS ABUNDANT (LEA) HYDROXYPROLINE-RICH GLYCOPROTEIN FAMILY, MAINTENANCE OF MERISTEMS-LIKE, CYTOCHROME P450 87A3 and PHLOEM PROTEIN 2-A12.
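As a rough illustration of the statistic behind the bulked segregant analysis described above: QTL-seq compares the SNP-index (the fraction of reads carrying the non-reference allele) between the two bulks. The sketch below is not the QTL-seq or QTLseqr implementation. The function names, the read counts, and the "early minus late" sign convention are assumptions made for the example.

```python
def snp_index(alt_reads, total_reads):
    """Fraction of reads carrying the non-reference allele at one SNP."""
    if total_reads == 0:
        raise ValueError("no read coverage at this position")
    return alt_reads / total_reads


def delta_snp_index(early_bulk, late_bulk):
    """SNP-index difference between bulks; each bulk is (alt_reads, total_reads).

    The subtraction order (early minus late) is an assumed convention.
    """
    return snp_index(*early_bulk) - snp_index(*late_bulk)


# Hypothetical read counts at two SNPs.
linked = delta_snp_index((45, 50), (5, 50))     # allele enriched in the early bulk
unlinked = delta_snp_index((26, 50), (24, 50))  # allele balanced across bulks
```

A SNP linked to the trait shows a difference far from zero, while an unlinked SNP stays near zero. Real pipelines additionally smooth these values over sliding windows and compare them with simulated confidence intervals.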
As ‘RE2’ was derived from the population resulting from the treatment of ‘Royal’ with the demethylating agent 5-AzaC, whole-genome bisulfite sequencing data were generated to identify variation in methylation patterns and its association with early flowering. A total of 260,193 cytosines changed from the methylated state in the late-flowering bulk to the unmethylated state in the early-flowering bulk, potentially owing to the hypomethylating action of 5-AzaC. Of the 127 significant differentially methylated regions (DMRs) detected, 59 overlapped genes, 35 lay within upstream regions (5 kb interval), and 33 lay within intergenic regions. Interestingly, a cluster of significant DMRs was also present on chromosome 12. Three DMRs (on chromosomes 1, 6 and 7) overlapped genes whose homologues encode the FASCICLIN-LIKE ARABINOGALACTAN group of proteins, and two DMRs (on chromosome 12) were located upstream of SUPPRESSOR OF FRI 4 and FRIGIDA-ESSENTIAL 1. This study is the first of its kind in flax and provides a basis for identifying novel epialleles underlying the early flowering phenotype.