Final report of Task #5: Current document index system for document retrieval investigation
In Part I of this report, we describe the work completed during the last fiscal year (October 1, 2002 thru September 30, 2003). The single biggest challenge this past year has been to develop and deliver a new software technology to classify Homeland Security Sensitive documents with high precision. Not only was a satisfactory system developed, an operational version was delivered to CACI in April 2003. The delivered system is called the Homeland Security Classifier (HSC).
In Part II we give an overview of the projects ISRI has completed during the first four years of this cooperative agreement (October 1, 1998 thru September 30, 2002). Each of the deliverables associated with these projects has been thoroughly described in previous reports.
Profiles of Coronary Artery Disease Risk in Cardiac Patients: Actual versus Perceived
PURPOSE: To describe interrelations and differences between actual vs. perceived cardiac risk in a cohort of coronary artery disease (CAD) patients. METHODS: 33 females (HT: 164 cm, WT: 80 kg) and 67 males (HT: 179 cm, WT: 93 kg) with documented CAD completed a questionnaire designed to assess CAD risk perception. They also underwent assessments for all ACSM risk factors. Five-point Likert scale responses to the question “Compared to other persons of your own age and sex, how would you rate your risk of ever having a heart attack?” were used to quantify CAD risk perception. To quantify actual risk, the number of ACSM risk markers for each subject was tabulated. It should be noted that, since all of the subjects had active CAD, they were all at high risk. Tabulations and Likert scale responses were compared using Chi-square analysis or Fisher’s Exact test with significance accepted at p<0.05. To assess risk perception accuracy, Chi-square analysis with pre-determined expected cell count percentages was used. RESULTS: When compared to diagnosis-driven expected frequencies of risk perception being higher or much higher (75% and 25% respectively), patients' responses were only 30% and 11% respectively (Chi-square=19696.9, p<.0001). Also, as the number of actual ACSM risk markers increased for each patient, no increase in patient risk perception was found (Chi-square=40.2, p=0.29). Factors associated with accurate perception include age, resting ECG status, and number of bypass grafts. Factors that were not accurately included in risk perception include family history, waist circumference, number and type of angioplasties, smoking, having had a heart attack, number of additional structural cardiac abnormalities present, the presence of arrhythmias, elevated blood lipids and blood glucose, and elevated systolic and diastolic blood pressures.
CONCLUSION: Although substantial differences in number and type of actual cardiac risk exist in a cohort of cardiac patients, individual perception of these risks is not accurate in the majority of cases.
Post-editing through approximation and global correction
This paper describes a new automatic spelling correction program to deal with OCR-generated errors. The method is based on three principles: (1) approximate string matching between the misspellings and the terms occurring in the database, as opposed to the entire dictionary; (2) local information obtained from the individual documents; and (3) the use of a confusion matrix, which contains information specific to the nature of the errors caused by the particular OCR device. This system is then used to process approximately 10,000 pages of OCR-generated documents. Among the misspellings discovered by this algorithm, about 87% were corrected.
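The first and third principles can be illustrated with a minimal sketch. This is not the program described in the paper: the confusion costs, the function names (`weighted_distance`, `correct`), and the distance threshold are all assumptions chosen for illustration, and the second principle (local document information) is not modelled here.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical confusion costs: (OCR output char, intended char) -> cost.
# A real confusion matrix would be estimated from the specific OCR device.
CONFUSION_COST: Dict[Tuple[str, str], float] = {
    ("1", "l"): 0.2,
    ("0", "o"): 0.2,
    ("c", "e"): 0.4,
}


def weighted_distance(
    observed: str,
    candidate: str,
    confusion: Dict[Tuple[str, str], float] = CONFUSION_COST,
) -> float:
    """Edit distance where known OCR confusions are cheaper substitutions."""
    # prev[j] holds the cost of turning observed[:i-1] into candidate[:j].
    prev = [float(j) for j in range(len(candidate) + 1)]
    for i, oc in enumerate(observed, start=1):
        curr = [float(i)]
        for j, cc in enumerate(candidate, start=1):
            sub = 0.0 if oc == cc else confusion.get((oc, cc), 1.0)
            curr.append(min(
                prev[j] + 1.0,      # delete a character from observed
                curr[j - 1] + 1.0,  # insert a character from candidate
                prev[j - 1] + sub,  # match or (weighted) substitution
            ))
        prev = curr
    return prev[-1]


def correct(
    token: str,
    vocabulary: Iterable[str],
    max_distance: float = 1.0,
) -> Optional[str]:
    """Return the closest collection term, or None if nothing is close enough."""
    scored = [(weighted_distance(token, term), term) for term in vocabulary]
    if not scored:
        return None
    distance, term = min(scored)
    return term if distance <= max_distance else None


# The vocabulary stands in for the terms occurring in the database,
# rather than a full dictionary.
collection_terms = ["retrieval", "recall", "precision"]
suggestion = correct("retrieva1", collection_terms)
```

Matching against the collection's own vocabulary keeps the candidate set small, and the cheaper substitution costs let device-specific confusions such as `1` for `l` outrank arbitrary edits.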
Evaluation of Model-Based Retrieval Effectiveness with OCR Text
We give a comprehensive report on our experiments with retrieval from OCR-generated text using systems based on standard models of retrieval. More specifically, we show that average precision and recall are not affected by OCR errors across systems for several collections. The collections used in these experiments include both actual OCR-generated text and standard information retrieval collections corrupted through the simulation of OCR errors. Both the actual and simulation experiments include full-text and abstract-length documents. We also demonstrate that the ranking and feedback methods associated with these models are generally not robust enough to deal with OCR errors. It is further shown that the OCR errors and garbage strings generated from the mistranslation of graphic objects increase the size of the index by a wide margin. We not only point out problems that can arise from applying OCR text within an information retrieval environment, but also suggest solutions to overcome some of these problems.
Information Retrieval and OCR
Optical character recognition and document image analysis have become important areas with a fast-growing number of researchers in the field. This handbook, with contributions by eminent experts, presents both the theoretical and practical aspects at an introductory level wherever possible.
The effectiveness of thesauri-aided retrieval
In this report, we describe the results of an experiment designed to measure the effects of automatic query expansion on retrieval effectiveness. In particular, we used a collection-specific thesaurus to expand the query by adding synonyms of the searched terms. Our preliminary results show no significant gain in average precision and recall.
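Query expansion of this kind can be sketched as follows. This is an illustrative example only: the thesaurus entries and the function name `expand_query` are hypothetical and are not taken from the report, which does not describe its implementation.

```python
from typing import Dict, List

# Hypothetical collection-specific thesaurus: term -> list of synonyms.
THESAURUS: Dict[str, List[str]] = {
    "car": ["automobile", "vehicle"],
    "fast": ["quick"],
}


def expand_query(terms: List[str], thesaurus: Dict[str, List[str]]) -> List[str]:
    """Add each term's synonyms to the query, keeping order and dropping repeats."""
    expanded: List[str] = []
    seen = set()
    for term in terms:
        # The original term comes first, followed by its synonyms (if any).
        for candidate in [term, *thesaurus.get(term, [])]:
            if candidate not in seen:
                seen.add(candidate)
                expanded.append(candidate)
    return expanded


expanded_terms = expand_query(["fast", "car"], THESAURUS)
```

The expanded term list would then be submitted to the retrieval system in place of the original query; terms without a thesaurus entry pass through unchanged.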