An Intelligent System For Arabic Text Categorization
Text categorization (classification) is the process of classifying documents into a predefined set of categories based on their content. In this paper, an intelligent Arabic text categorization system is presented. Machine learning algorithms are used in this system. Several stemming and feature-selection algorithms are evaluated. Moreover, the documents are represented using several term-weighting schemes, and finally the k-nearest neighbor and Rocchio classifiers are used for the classification process. Experiments are performed over a self-collected data corpus, and the results show that the suggested hybrid method of statistical and light stemmers is the most suitable stemming algorithm for the Arabic language. The results also show that a hybrid approach of document frequency and information gain is the preferable feature-selection criterion and that normalized tf-idf is the best weighting scheme. Finally, the Rocchio classifier has the advantage over the k-nearest neighbor classifier in the classification process. The experimental results illustrate that the proposed model is efficient and achieves a generalization accuracy of about 98%.
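The pipeline the abstract describes (tf-idf term weighting followed by a Rocchio, i.e. centroid-based, classifier) can be sketched in a few lines. This is a minimal pure-Python illustration; the toy English documents and the `sport`/`finance` labels are assumptions for demonstration, not the paper's Arabic corpus, and no stemming or feature selection is applied.

```python
import math
from collections import Counter

def build_df(tokenized_docs):
    """Document frequency of each term across the corpus."""
    df = Counter()
    for toks in tokenized_docs:
        df.update(set(toks))
    return df

def tfidf(tokens, df, n_docs):
    """Length-normalized tf-idf vector as a {term: weight} dict."""
    tf = Counter(t for t in tokens if t in df)
    vec = {t: (1 + math.log(c)) * math.log(n_docs / df[t]) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def train_rocchio(vectors, labels):
    """One centroid per class (Rocchio with positive examples only)."""
    centroids = {}
    for vec, label in zip(vectors, labels):
        cent = centroids.setdefault(label, Counter())
        for t, w in vec.items():
            cent[t] += w
    return centroids

def predict_rocchio(centroids, vec):
    """Assign the class whose centroid scores highest (dot product)."""
    score = lambda label: sum(w * centroids[label].get(t, 0.0) for t, w in vec.items())
    return max(centroids, key=score)

docs = ["the team won the match", "players scored a goal",
        "the bank raised rates", "stocks and bank markets fell"]
labels = ["sport", "sport", "finance", "finance"]
tokenized = [d.split() for d in docs]
df, n = build_df(tokenized), len(docs)
vectors = [tfidf(toks, df, n) for toks in tokenized]
centroids = train_rocchio(vectors, labels)
query = tfidf("team scored a goal".split(), df, n)
print(predict_rocchio(centroids, query))  # sport
```

A k-NN baseline would instead rank individual training vectors against the query and vote over the k nearest; the centroid comparison is what gives Rocchio its speed advantage at prediction time.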
Toward Open-Set Face Recognition
Much research has been conducted on both face identification and face
verification, with greater focus on the latter. Research on face identification
has mostly focused on using closed-set protocols, which assume that all probe
images used in evaluation contain identities of subjects that are enrolled in
the gallery. Real systems, however, where only a fraction of probe sample
identities are enrolled in the gallery, cannot make this closed-set assumption.
Instead, they must assume an open set of probe samples and be able to
reject/ignore those that correspond to unknown identities. In this paper, we
address the widespread misconception that thresholding verification-like scores
is a good way to solve the open-set face identification problem, by formulating
an open-set face identification protocol and evaluating different strategies
for assessing similarity. Our open-set identification protocol is based on the
canonical Labeled Faces in the Wild (LFW) dataset. In addition to the known
identities, we introduce the concepts of known unknowns (known but
uninteresting persons) and unknown unknowns (people never seen before) to the
biometric community. We compare three algorithms for assessing similarity in a
deep feature space under an open-set protocol: thresholded verification-like
scores, linear discriminant analysis (LDA) scores, and extreme value machine
(EVM) probabilities. Our findings suggest that thresholding EVM probabilities,
which are open-set by design, outperforms thresholding verification-like
scores.
Comment: Accepted for publication in the CVPR 2017 Biometrics Workshop
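The simplest of the compared strategies, thresholding a verification-like score, can be sketched as follows: a probe is assigned the most similar gallery identity only if the score clears a threshold, and is otherwise rejected as unknown. The toy 2-D "embeddings", the gallery names, and the threshold value are illustrative assumptions, not the paper's deep features or its EVM model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def identify(probe, gallery, threshold=0.9):
    """Return the best-matching identity, or None for an open-set rejection."""
    best_id, best_score = None, -1.0
    for identity, emb in gallery.items():
        s = cosine(probe, emb)
        if s > best_score:
            best_id, best_score = identity, s
    return best_id if best_score >= threshold else None

gallery = {"alice": (1.0, 0.1), "bob": (0.1, 1.0)}
print(identify((0.9, 0.15), gallery))  # close to alice -> alice
print(identify((0.7, 0.7), gallery))   # ambiguous -> None (rejected)
```

The paper's point is that this kind of raw-score threshold handles unknown unknowns poorly; EVM instead fits a per-identity probability of inclusion, so its threshold is calibrated rather than an arbitrary similarity cutoff.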
A high performance k-NN approach using binary neural networks
This paper evaluates a novel k-nearest neighbour (k-NN) classifier built from binary neural networks. The binary neural approach uses robust encoding to map standard ordinal, categorical and numeric data sets onto a binary neural network. The binary neural network uses high-speed pattern matching to recall a candidate set of matching records, which are then processed by a conventional k-NN approach to determine the k best matches. We compare various configurations of the binary approach to a conventional approach in terms of memory overhead, training speed, retrieval speed and retrieval accuracy. We demonstrate that the binary approach is superior to the standard approach in speed and memory requirements, and we pinpoint the optimal configurations. (C) 2003 Elsevier Ltd. All rights reserved.
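The two-stage idea (a fast, coarse binary match recalls a small candidate set, and a conventional k-NN then ranks only those candidates) can be sketched as below. The simple thresholded bit-encoding and Hamming-distance screen stand in for the paper's binary neural network; the record values and thresholds are illustrative assumptions.

```python
def encode(record, thresholds):
    """Binary signature: one bit per attribute (is the value above its threshold?)."""
    return tuple(int(v > t) for v, t in zip(record, thresholds))

def knn_with_prefilter(query, records, thresholds, k=2, max_bit_diff=1):
    """Stage 1: cheap Hamming screen on signatures; stage 2: exact k-NN on survivors."""
    sig_q = encode(query, thresholds)
    candidates = [
        r for r in records
        if sum(a != b for a, b in zip(encode(r, thresholds), sig_q)) <= max_bit_diff
    ]
    # Exact (squared-Euclidean) ranking over the much smaller candidate set only.
    candidates.sort(key=lambda r: sum((a - b) ** 2 for a, b in zip(r, query)))
    return candidates[:k]

records = [(1, 1), (2, 2), (9, 9)]
thresholds = (5, 5)
print(knn_with_prefilter((1.2, 1.2), records, thresholds))  # [(1, 1), (2, 2)]
```

The speed gain comes from the first stage: bit comparisons are cheap and discard most records, so the expensive distance computation runs on a handful of candidates rather than the full data set.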
The Use of Regression Models for Detecting Digital Fingerprints in Synthetic Audio
Modern advances in text-to-speech and voice-conversion techniques make it increasingly difficult to distinguish an authentic voice from a synthetically generated one. These techniques, though complex, are relatively easy to use, even for non-technical users. It is therefore important to develop mechanisms for detecting false content that scale easily to the size of the monitoring requirement. Current approaches for detecting spoofed audio are difficult to scale because of their processing requirements: individually analyzing spectrograms for aberrations at higher frequencies relies too heavily on independent verification and is resource-intensive. Our method addresses the resource consideration by looking only at the residual differences between an audio file's smoothed signal and its actual signal. We conjecture that natural audio has greater variance than spoofed audio, because spoofed audio is generated by conditioning on an existing pattern it tries to mimic. To test this, we develop a classifier that distinguishes between spoofed and real audio by analyzing the differences in residual patterns between audio files.
Outstanding Thesis
Major, United States Army
Approved for public release. Distribution is unlimited.
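The residual idea in the abstract can be sketched directly: smooth the waveform, subtract the smoothed version from the original, and use the variance of the residuals as a feature. The moving-average smoother, the threshold, and the toy signals below are illustrative assumptions, not the thesis's actual regression models.

```python
def moving_average(signal, window=3):
    """Centered moving average; the window is truncated at the edges."""
    half = window // 2
    return [
        sum(signal[max(0, i - half):i + half + 1]) /
        len(signal[max(0, i - half):i + half + 1])
        for i in range(len(signal))
    ]

def residual_variance(signal, window=3):
    """Variance of (actual - smoothed): the conjectured natural-vs-spoofed cue."""
    smoothed = moving_average(signal, window)
    residuals = [s - m for s, m in zip(signal, smoothed)]
    mean = sum(residuals) / len(residuals)
    return sum((r - mean) ** 2 for r in residuals) / len(residuals)

def classify(signal, threshold=0.05, window=3):
    """Label a clip 'natural' if its residuals vary more than the threshold."""
    return "natural" if residual_variance(signal, window) > threshold else "spoofed"

noisy = [0, 1, 0, 1, 0, 1, 0, 1]   # high-variance residuals, "natural"-like
flat = [0.5] * 8                    # overly regular, "spoofed"-like
print(classify(noisy), classify(flat))  # natural spoofed
```

The appeal for monitoring at scale is that this runs in a single linear pass over the samples, with no spectrogram computation or manual inspection.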
A random forest system combination approach for error detection in digital dictionaries
When digitizing a print bilingual dictionary, whether via optical character
recognition or manual entry, it is inevitable that errors are introduced into
the electronic version that is created. We investigate automating the process
of detecting errors in an XML representation of a digitized print dictionary
using a hybrid approach that combines rule-based, feature-based, and language
model-based methods. We investigate combining methods and show that using
random forests is a promising approach. We find that in isolation, unsupervised
methods rival the performance of supervised methods. Random forests typically
require training data so we investigate how we can apply random forests to
combine individual base methods that are themselves unsupervised without
requiring large amounts of training data. Experiments reveal empirically that a
relatively small amount of data is sufficient and can potentially be further
reduced through specific selection criteria.
Comment: 9 pages, 7 figures, 10 tables; appeared in Proceedings of the
Workshop on Innovative Hybrid Approaches to the Processing of Textual Data,
April 201
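The system-combination setup can be sketched as follows: each dictionary entry gets one error flag per base detector, and a combiner maps those flags to a final decision. A simple majority vote stands in here for the trained random forest, and the detector names (`rule`, `feature`, `lm`) are illustrative assumptions; in the paper, a random forest would be trained on a small labeled sample of such flag vectors instead.

```python
def combine(flags_per_entry):
    """flags_per_entry: list of {detector_name: 0/1 error flag} dicts.
    Returns one 0/1 decision per entry via strict majority vote."""
    decisions = []
    for flags in flags_per_entry:
        votes = sum(flags.values())
        decisions.append(int(votes * 2 > len(flags)))
    return decisions

entries = [
    {"rule": 1, "feature": 1, "lm": 0},  # two of three detectors flag an error
    {"rule": 0, "feature": 0, "lm": 1},  # only one does
]
print(combine(entries))  # [1, 0]
```

A learned combiner improves on this vote by weighting detectors unequally and capturing interactions between them, which is what makes the small amount of training data the paper mentions worthwhile.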