
    Scatteract: Automated extraction of data from scatter plots

    Charts are an excellent way to convey patterns and trends in data, but they do not facilitate further modeling of the data or close inspection of individual data points. We present a fully automated system for extracting the numerical values of data points from images of scatter plots. We use deep learning techniques to identify the key components of the chart, and optical character recognition together with robust regression to map from pixels to the coordinate system of the chart. We focus on scatter plots with linear scales, which already pose several interesting challenges. Previous work has done fully automatic extraction for other types of charts, but to our knowledge this is the first approach that is fully automatic for scatter plots. Our method performs well, achieving successful data extraction on 89% of the plots in our test set. Comment: Submitted to ECML PKDD 2017 proceedings, 16 pages

    Genome-wide inference of ancestral recombination graphs

    The complex correlation structure of a collection of orthologous DNA sequences is uniquely captured by the "ancestral recombination graph" (ARG), a complete record of coalescence and recombination events in the history of the sample. However, existing methods for ARG inference are computationally intensive, highly approximate, or limited to small numbers of sequences, and, as a consequence, explicit ARG inference is rarely used in applied population genomics. Here, we introduce a new algorithm for ARG inference that is efficient enough to apply to dozens of complete mammalian genomes. The key idea of our approach is to sample an ARG of n chromosomes conditional on an ARG of n-1 chromosomes, an operation we call "threading." Using techniques based on hidden Markov models, we can perform this threading operation exactly, up to the assumptions of the sequentially Markov coalescent and a discretization of time. An extension allows for threading of subtrees instead of individual sequences. Repeated application of these threading operations results in highly efficient Markov chain Monte Carlo samplers for ARGs. We have implemented these methods in a computer program called ARGweaver. Experiments with simulated data indicate that ARGweaver converges rapidly to the true posterior distribution and is effective in recovering various features of the ARG for dozens of sequences generated under realistic parameters for human populations. In applications of ARGweaver to 54 human genome sequences from Complete Genomics, we find clear signatures of natural selection, including regions of unusually ancient ancestry associated with balancing selection and reductions in allele age in sites under directional selection. Preliminary results also indicate that our methods can be used to gain insight into complex features of human population structure, even with a noninformative prior distribution. Comment: 88 pages, 7 main figures, 22 supplementary figures. This version contains a substantially expanded genomic data analysis

    AccelPrint: Accelerometers are Different by Birth

    This paper submits a hypothesis that smartphone accelerometers possess unique fingerprints. We believe that the fingerprints arise from hardware imperfections during the sensor manufacturing process, causing every sensor chip to respond differently to the same motion stimulus. The differences in responses are subtle enough that they do not affect most of the higher level functions computed on them. Nonetheless, upon close inspection, these fingerprints emerge with consistency, and can even be somewhat independent of the stimulus that generates them. Measurements and classification on 80 standalone accelerometer chips, 25 Android phones, and 2 tablets show precision and recall upward of 96%, along with good robustness to real-world conditions. Unsurprisingly, such sensor fingerprints invite new threats in smartphone applications. A crowd-sourcing app running in the cloud could segregate sensor data for each device, making it easy to track a user over space and time. This paper makes the case that such attacks are almost trivial to launch, while simple solutions may not be adequate to counteract them.

    Recognizing Handwriting Styles in a Historical Scanned Document Using Unsupervised Fuzzy Clustering

    The forensic attribution of the handwriting in a digitized document to multiple scribes is a challenging problem of high dimensionality. Unique handwriting styles may differ in a blend of several factors including character size, stroke width, loops, ductus, slant angles, and cursive ligatures. Previous work on labeled data with hidden Markov models, support vector machines, and semi-supervised recurrent neural networks has achieved moderate to high success. In this study, we successfully detect hand shifts in a historical manuscript through fuzzy soft clustering in combination with linear principal component analysis. This advance demonstrates the successful deployment of unsupervised methods for writer attribution of historical documents and forensic document analysis. Comment: 26 pages in total, 5 figures and 2 tables
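
To make "fuzzy soft clustering" concrete, here is a minimal sketch of the standard fuzzy c-means iteration, in which every sample receives a graded membership in each cluster instead of a single hard label. The abstract does not name the exact algorithm or its settings, so the choice of fuzzy c-means, the fuzzifier `m = 2`, the starting centers, and the two-dimensional feature vectors (standing in for principal-component scores of handwriting features) are all assumptions made for illustration.

```python
import math


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Standard fuzzy c-means; returns (centers, memberships)."""
    centers = [list(c) for c in centers]
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iterations):
        # Membership update: u_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1)).
        memberships = []
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            if any(d == 0.0 for d in dists):
                # The point sits exactly on a center: assign it fully there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                row = [h / sum(hits) for h in hits]
            else:
                row = [
                    1.0 / sum((dj / dk) ** exponent for dk in dists)
                    for dj in dists
                ]
            memberships.append(row)
        # Center update: mean of all points weighted by u_ij ** m.
        for j in range(len(centers)):
            weights = [row[j] ** m for row in memberships]
            total = sum(weights)
            if total > 0.0:
                centers[j] = [
                    sum(w * p[d] for w, p in zip(weights, points)) / total
                    for d in range(len(points[0]))
                ]
    return centers, memberships


# Hypothetical 2-D scores: two tight groups plus one in-between sample.
samples = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
    (3.0, 3.1), (3.2, 2.9), (2.9, 3.0),
    (1.5, 1.6),
]
centers, memberships = fuzzy_c_means(samples, centers=[(0.0, 0.0), (3.0, 3.0)])
```

In this toy data the six grouped samples end up with memberships close to 1 in their own cluster, while the in-between sample is split roughly evenly. A graded membership of that kind is one way a transition between writing hands could show up without labeled training data.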

    Mapping of Genomic Regions Underlying the Early Flowering Trait in ‘RE2’, a Mutant Derived from Flax (Linum usitatissimum L.) Cultivar ‘Royal’

    Canada is a world leader in flax production, and the expansion of the crop into the northern region of the prairies requires early-flowering, and consequently early-maturing, cultivars to avoid frost damage. New sources of variation for flowering time thus hold great interest. Flax genomics resources, including a chromosome-level assembly, are now sufficiently developed to examine traits with complex inheritance. An early flowering mutant ‘RE2’ was selected from cultivar ‘Royal’ after treatment with 5-Azacytidine (5-AzaC). The mutant line flowered approximately 7 to 13 days earlier than the progenitor ‘Royal’. A large recombinant inbred line (RIL) population encompassing 656 lines, derived from ‘Royal’ x ‘RE2’, was used to identify the potential genomic region underlying the trait. Firstly, the RIL population was phenotyped for early vigour, days to start of flowering, full flowering, maturity and height in three field seasons (2015, 2016 and 2017) using a modified augmented design type 2, and once in a growth cabinet. Secondly, the distributional extremes for flowering time identified from the RIL population were subjected to sequencing-based bulked segregant analysis. Thirdly, the QTL-seq bioinformatics pipeline (Takagi et al. 2013) was used for the identification of SNPs, which were annotated using SnpEff. The QTL-seq pipeline identified a SNP upstream of the flax gene homologous to Arabidopsis LUMINIDEPENDENS. Later, the sequencing data were reanalysed with customized variant calling steps followed by statistical analysis using QTLseqr (Mansfeld and Grumet 2018), a more recent, improved pipeline. QTLseqr detected two genomic regions having significant association with the early flowering trait on chromosomes 9 and 12. The variants in these regions were found to be associated with genes encoding LATE EMBRYOGENESIS ABUNDANT (LEA) HYDROXYPROLINE-RICH GLYCOPROTEIN FAMILY, MAINTENANCE OF MERISTEMS-LIKE, CYTOCHROME P450 87A3 and PHLOEM PROTEIN 2-A12, based on homology analysis.
As ‘RE2’ was derived from the population resulting from the treatment of ‘Royal’ with the demethylating agent 5-AzaC, whole genome bisulfite sequencing data were generated to identify variation in methylation patterns and its association with early flowering. A total of 260,193 cytosines were transformed from the methylated state in the late flowering bulk to the unmethylated state in the early flowering bulk, potentially owing to the hypomethylating action of 5-AzaC. Out of the 127 significant differentially methylated regions (DMRs) detected, 59 overlapped with genes, and 35 and 33 DMRs were within upstream (5 kb interval) and intergenic regions, respectively. Interestingly, a cluster of significant DMRs was also present on chromosome 12. Three DMRs (on chromosomes 1, 6 and 7) overlapped genes whose homologues encode the FASCICLIN-LIKE ARABINOGALACTAN group of proteins, and two DMRs (on chromosome 12) were present upstream of SUPPRESSOR OF FRI 4 and FRIGIDA-ESSENTIAL 1. This study is the first of its kind in flax, providing the basis for identifying novel epialleles underlying the early flowering phenotype.
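
The statistic at the heart of QTL-seq style bulked segregant analysis can be sketched in a few lines. The SNP-index is the fraction of reads at a SNP that carry the non-reference allele within one bulk, and the delta SNP-index is the difference between the two bulks. The read counts, positions, window bounds, and the early-minus-late direction of the subtraction below are hypothetical choices for illustration. This is not the QTL-seq or QTLseqr code, and it omits the significance testing those pipelines perform.

```python
def snp_index(alt_reads, depth):
    """Fraction of reads at a SNP that carry the non-reference allele."""
    if depth <= 0:
        raise ValueError("depth must be positive")
    return alt_reads / depth


def delta_snp_index(early_bulk, late_bulk):
    """SNP-index difference between bulks, each given as (alt_reads, depth)."""
    return snp_index(*early_bulk) - snp_index(*late_bulk)


def window_mean(positions, values, start, end):
    """Average the values of SNPs whose position falls in [start, end)."""
    inside = [v for p, v in zip(positions, values) if start <= p < end]
    return sum(inside) / len(inside) if inside else None


# Hypothetical SNPs: position, (alt, depth) in early bulk, (alt, depth) in late bulk.
snps = [
    (1_000, (10, 20), (9, 20)),
    (2_000, (18, 20), (4, 20)),
    (3_000, (19, 20), (3, 20)),
    (9_000, (11, 20), (10, 20)),
]
positions = [pos for pos, _, _ in snps]
deltas = [delta_snp_index(early, late) for _, early, late in snps]
peak = window_mean(positions, deltas, 1_500, 3_500)
```

SNPs unlinked to the trait should have a delta near zero because both bulks sample the parental alleles about equally, whereas a run of SNPs with a large delta, like the two made-up sites in the window here, points to a candidate region.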