Construction and evaluation of classifiers for forensic document analysis
In this study we illustrate a statistical approach to questioned document
examination. Specifically, we consider the construction of three classifiers
that predict the writer of a sample document based on categorical data. To
evaluate these classifiers, we use a data set with a large number of writers
and a small number of writing samples per writer. Since the resulting
classifiers were found to have near perfect accuracy using leave-one-out
cross-validation, we propose a novel Bayesian-based cross-validation method for
evaluating the classifiers.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS379 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Query by String word spotting based on character bi-gram indexing
In this paper we propose a segmentation-free query by string word spotting
method. Both the documents and query strings are encoded using a recently
proposed word representation that projects images and strings into a common
attribute space based on a pyramidal histogram of characters (PHOC). These
attribute models are learned using linear SVMs over the Fisher Vector
representation of the images along with the PHOC labels of the corresponding
strings. In order to search through the whole page, document regions are
indexed per character bi-gram using a similar attribute representation. On top
of that, we propose an integral image representation of the document using a
simplified version of the attribute model for efficient computation. Finally we
introduce a re-ranking step in order to boost retrieval performance. We show
state-of-the-art results for segmentation-free query by string word spotting on
standard single-writer and multi-writer datasets.
Comment: To be published in ICDAR201
A Comprehensive Study of ImageNet Pre-Training for Historical Document Image Analysis
Automatic analysis of scanned historical documents comprises a wide range of
image analysis tasks, which are often challenging for machine learning due to a
lack of human-annotated learning samples. With the advent of deep neural
networks, a promising way to cope with the lack of training data is to
pre-train models on images from a different domain and then fine-tune them on
historical documents. In current research, a typical example of such
cross-domain transfer learning is the use of neural networks that have been
pre-trained on the ImageNet database for object recognition. It remains a
mostly open question whether or not this pre-training helps to analyse
historical documents, which have fundamentally different image properties when
compared with ImageNet. In this paper, we present a comprehensive empirical
survey on the effect of ImageNet pre-training for diverse historical document
analysis tasks, including character recognition, style classification,
manuscript dating, semantic segmentation, and content-based retrieval. While we
obtain mixed results for semantic segmentation at pixel-level, we observe a
clear trend across different network architectures that ImageNet pre-training
has a positive effect on classification as well as content-based retrieval.
For Geometric Inference from Images, What Kind of Statistical Model Is Necessary?
In order to facilitate smooth communication with researchers in other fields, including statistics, this paper investigates the meaning of "statistical methods" for geometric inference based on image feature points. We point out that statistical analysis does not make sense unless the underlying "statistical ensemble" is clearly defined. We trace back the origin of feature uncertainty to image processing operations for computer vision in general and discuss the implications of asymptotic analysis for performance evaluation in reference to "geometric fitting", "geometric model selection", the "geometric AIC", and the "geometric MDL". Referring to such statistical concepts as "nuisance parameters", the "Neyman-Scott problem", and "semiparametric models", we point out that simulation experiments for performance evaluation will lose meaning without carefully considering the assumptions involved and intended applications.
Associative and repetition priming with the repeated masked prime technique: No priming found
Wentura and Frings (2005) reported evidence of subliminal categorical priming on a lexical decision task, using a new method of visual masking in which the prime string consisted of the prime word flanked by random consonants and random letter masks alternated with the prime string on successive refresh cycles. We investigated associative and repetition priming on lexical decision, using the same method of visual masking. Three experiments failed to show any evidence of associative priming, (1) when the prime string was fixed at 10 characters (three to six flanking letters) and (2) when the flanking letters were reduced in number or absent. In all cases, prime detection was at chance level. Strong associative priming was observed with visible unmasked primes, but the addition of flanking letters restricted priming even though prime detection was still high. With repetition priming, no priming effects were found with the repeated masked technique, and prime detection was poor but just above chance levels. We conclude that with repeated masked primes, there is effective visual masking but that associative priming and repetition priming do not occur with experiment-unique prime-target pairs. Explanations for this apparent discrepancy across priming paradigms are discussed. The priming stimuli and prime-target pairs used in this study may be downloaded as supplemental materials from mc.psychonomic-journals.org/content/supplemental. © 2009 The Psychonomic Society, Inc.
On the Feasibility of Malware Authorship Attribution
There are many occasions in which the security community is interested in
discovering the authorship of malware binaries, either for digital forensics
analysis of malware corpora or for thwarting live threats of malware invasion.
Such a discovery of authorship might be possible due to stylistic features
inherent to software code written by human programmers. Existing studies of
authorship attribution of general-purpose software mainly focus on source code,
and typically rely on features of programming style and environment. However,
those features critically depend on the availability of the program source
code, which is usually not the case when dealing with malware binaries. Such
program binaries often do not retain many semantic or stylistic features due to
the compilation process. Therefore, authorship attribution in the domain of
malware binaries based on features and styles that will survive the compilation
process is challenging. This paper reviews the state of the art in this
literature. Further, we analyze the features involved in those techniques. By
using a case study, we identify features that can survive the compilation
process. Finally, we analyze existing works on binary authorship attribution
and study their applicability to real malware binaries.
Comment: FPS 201