KACST Arabic Text Classification Project: Overview and Preliminary Results
Electronically formatted Arabic free text is abundant on the World Wide Web today, often linked to commercial enterprises and government organizations. A great deal of knowledge and many relations lie hidden within these texts, and they can be exploited once suitable intelligent tools have been identified and applied. For example, text mining can support text classification and categorization. Text classification aims to automatically assign a text to a predefined category based on identifiable linguistic features. The process has many useful applications, including email spam detection, web page content filtering, and automatic message routing. This paper presents an overview of the King Abdulaziz City for Science and Technology (KACST) Arabic Text Classification Project, along with some preliminary results. The project will contribute to a better understanding and elaboration of Arabic text classification techniques.
Chi-square-based scoring function for categorization of MEDLINE citations
Objectives: Text categorization has been used in biomedical informatics for
identifying documents containing relevant topics of interest. We developed a
simple method that uses a chi-square-based scoring function to determine the
likelihood that MEDLINE citations contain genetically relevant topics. Methods: Our
procedure requires construction of a genetic and a nongenetic domain document
corpus. We used MeSH descriptors assigned to MEDLINE citations for this
categorization task. We compared frequencies of MeSH descriptors between the two
corpora using the chi-square test. A MeSH descriptor was considered to be a
positive indicator if its relative observed frequency in the genetic domain
corpus was greater than its relative observed frequency in the nongenetic
domain corpus. The output of the proposed method is a list of scores for all
the citations, with the highest score given to those citations containing MeSH
descriptors typical for the genetic domain. Results: Validation was done on a
set of 734 manually annotated MEDLINE citations. It achieved predictive
accuracy of 0.87 with 0.69 recall and 0.64 precision. We evaluated the method
by comparing it to three machine learning algorithms (support vector machines,
decision trees, naïve Bayes). Although the differences were not statistically
significant, the results showed that our chi-square scoring performs as
well as the compared machine learning algorithms. Conclusions: We suggest that the
chi-square scoring is an effective solution to help categorize MEDLINE
citations. The algorithm is implemented in the BITOLA literature-based
discovery support system as a preprocessor for the gene symbol disambiguation
process.
Comment: 34 pages, 2 figure
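The scoring idea described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how descriptor-level statistics are combined into a citation score, so summing signed chi-square values is an assumption made here, and the function names (`chi_square_2x2`, `descriptor_weights`, `score_citation`) and the toy descriptor lists are hypothetical.

```python
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A degenerate table (an empty row or column) carries no signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def descriptor_weights(genetic_docs, nongenetic_docs):
    """Signed chi-square weight per descriptor.

    Each document is a list of MeSH descriptors; both corpora must be
    non-empty. A descriptor is treated as a positive indicator when its
    relative frequency is higher in the genetic corpus, as in the abstract.
    Negating the statistic for other descriptors is an assumption.
    """
    n_gen, n_non = len(genetic_docs), len(nongenetic_docs)
    # Count documents containing each descriptor (not raw occurrences).
    gen_counts = Counter(d for doc in genetic_docs for d in set(doc))
    non_counts = Counter(d for doc in nongenetic_docs for d in set(doc))
    weights = {}
    for desc in set(gen_counts) | set(non_counts):
        a = gen_counts[desc]   # genetic docs with the descriptor
        c = non_counts[desc]   # nongenetic docs with the descriptor
        chi2 = chi_square_2x2(a, n_gen - a, c, n_non - c)
        positive = a / n_gen > c / n_non
        weights[desc] = chi2 if positive else -chi2
    return weights


def score_citation(descriptors, weights):
    """Assumed aggregation: sum the signed weights of known descriptors."""
    return sum(weights.get(d, 0.0) for d in set(descriptors))


if __name__ == "__main__":
    # Toy corpora, invented purely for illustration.
    genetic = [["Genes", "Mutation"], ["Genes", "Humans"], ["Mutation", "Humans"]]
    nongenetic = [["Humans", "Surgery"], ["Surgery"], ["Humans", "Diet"]]
    w = descriptor_weights(genetic, nongenetic)
    for citation in (["Genes", "Mutation", "Humans"], ["Surgery", "Diet"]):
        print(citation, score_citation(citation, w))
```

Under these assumptions, citations whose descriptors are typical of the genetic corpus receive the highest scores, and a threshold on the score would turn the ranking into a binary classifier.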
Spatially Aware Dictionary Learning and Coding for Fossil Pollen Identification
We propose a robust approach for performing automatic species-level
recognition of fossil pollen grains in microscopy images that exploits both
global shape and local texture characteristics in a patch-based matching
methodology. We introduce a novel criterion for selecting meaningful and
discriminative exemplar patches. We optimize this function during training
using a greedy submodular function optimization framework that gives a
near-optimal solution with bounded approximation error. We use these selected
exemplars as a dictionary basis and propose a spatially-aware sparse coding
method to match testing images for identification while maintaining global
shape correspondence. To accelerate the coding process for fast matching, we
introduce a relaxed form that uses spatially-aware soft-thresholding during
coding. Finally, we carry out an experimental study that demonstrates the
effectiveness and efficiency of our exemplar selection and classification
mechanisms, achieving accuracy on a difficult fine-grained species
classification task distinguishing three types of fossil spruce pollen.
Comment: CVMI 201