
    A software processing chain for evaluating thesaurus quality

    Thesauri are knowledge models commonly used for information classification and retrieval whose structure is defined by standards that describe the main features the concepts and relations must have. However, following these standards requires deep knowledge of the field the thesaurus is going to cover and experience in thesaurus creation. To help in this task, this paper describes a software processing chain that provides different validation components that evaluate the quality of the main thesaurus features.
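
    The abstract does not detail the individual validation components, so the sketch below only illustrates the general idea with two elementary label checks, assuming the thesaurus has already been parsed into a plain Python mapping of concept identifiers to preferred labels (an assumed data model, not the paper's).

```python
# Minimal sketch of one kind of validation component; the paper's actual
# processing chain and data model are not described in the abstract.
def missing_labels(pref_labels: dict[str, str]) -> list[str]:
    """Concepts whose preferred label is absent or blank."""
    return [c for c, label in pref_labels.items() if not label or not label.strip()]

def duplicate_labels(pref_labels: dict[str, str]) -> dict[str, list[str]]:
    """Preferred labels shared by more than one concept (usually a modelling error)."""
    by_label: dict[str, list[str]] = {}
    for concept, label in pref_labels.items():
        by_label.setdefault(label.strip().lower(), []).append(concept)
    return {label: cs for label, cs in by_label.items() if len(cs) > 1}

if __name__ == "__main__":
    labels = {"c1": "water", "c2": "", "c3": "Water"}
    print(missing_labels(labels))     # ['c2']
    print(duplicate_labels(labels))   # {'water': ['c1', 'c3']}
```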

    Case-based fracture image retrieval

    Purpose: Case-based fracture image retrieval can assist surgeons in decisions regarding new cases by supplying visually similar past cases. This tool may guide fracture fixation and management through comparison of long-term outcomes in similar cases. Methods: A fracture image database collected over 10 years at the orthopedic service of the University Hospitals of Geneva was used. This database contains 2,690 fracture cases associated with 43 classes (based on the AO/OTA classification). A case-based retrieval engine was developed and evaluated using retrieval precision as a performance metric. Only cases in the same class as the query case are considered relevant. The scale-invariant feature transform (SIFT) is used for image analysis. Performance was evaluated in terms of mean average precision (MAP) and early precision (P10, P30). Retrieval results produced with the GNU image finding tool (GIFT) were used as a baseline. Two sampling strategies were evaluated. One used a dense 40×40 pixel grid sampling, and the second one used the standard SIFT features. Based on dense pixel grid sampling, three unsupervised feature selection strategies were introduced to further improve retrieval performance. With dense pixel grid sampling, the image is divided into 1,600 (40×40) square blocks. The goal is to emphasize the salient regions (blocks) and ignore irrelevant regions. Regions are considered important when a high variance of the visual features is found. The first strategy is to calculate the variance of all descriptors on the global database. The second strategy is to calculate the variance of all descriptors for each case. A third strategy is to perform a thumbnail image clustering in a first step and then to calculate the variance for each cluster. Finally, a fusion between a SIFT-based system and GIFT is performed. Results: A first comparison on the selection of sampling strategies using SIFT features shows that dense sampling using a pixel grid (MAP = 0.18) outperformed the SIFT detector-based sampling approach (MAP = 0.10). In a second step, three unsupervised feature selection strategies were evaluated. A grid parameter search was applied to optimize parameters for feature selection and clustering. Results show that using about half of the regions (700 or 800) yields the best performance for all three strategies. Increasing the number of clusters can also improve the retrieval performance. The SIFT descriptor variance in each case gave the best indication of saliency for the regions (MAP = 0.23), better than the other two strategies (MAP = 0.20 and 0.21). Combining GIFT (MAP = 0.23) and the best SIFT strategy (MAP = 0.23) produced significantly better results (MAP = 0.27) than each system alone. Conclusions: A case-based fracture retrieval engine was developed and is available for online demonstration. SIFT is used to extract local features, and three feature selection strategies were introduced and evaluated. A baseline using the GIFT system was used to evaluate the salient point-based approaches. Without supervised learning, SIFT-based systems with optimized parameters slightly outperformed the GIFT system. A fusion of the two approaches shows that the information contained in the two approaches is complementary. Supervised learning on the feature space is foreseen as the next step of this study.
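
    The variance-based region selection can be illustrated with a short sketch. The code below is only a rough illustration under stated assumptions: images are grayscale NumPy arrays, a per-block intensity histogram stands in for the SIFT descriptor, and the 40×40 grid and 800-region cut-off follow the abstract; it is not the authors' implementation.

```python
# Illustrative sketch of dense-grid sampling with variance-based region selection
# (the "global database" strategy from the abstract); not the paper's code.
import numpy as np

GRID = 40  # 40 x 40 blocks = 1,600 regions per image

def block_descriptors(image: np.ndarray, bins: int = 16) -> np.ndarray:
    """Split an image into GRID x GRID blocks and describe each block by a
    normalized intensity histogram (stand-in for a SIFT descriptor)."""
    h, w = image.shape
    bh, bw = h // GRID, w // GRID
    descs = []
    for i in range(GRID):
        for j in range(GRID):
            block = image[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            hist, _ = np.histogram(block, bins=bins, range=(0, 255), density=True)
            descs.append(hist)
    return np.asarray(descs)              # shape: (1600, bins)

def select_salient_regions(all_descs: np.ndarray, keep: int = 800) -> np.ndarray:
    """Rank the 1,600 regions by descriptor variance over the whole database
    and keep the `keep` most variable (salient) blocks."""
    # all_descs: (n_images, 1600, bins); variance per region, summed over dims
    region_variance = all_descs.var(axis=0).sum(axis=1)
    return np.argsort(region_variance)[::-1][:keep]   # indices of salient blocks

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = [rng.integers(0, 256, size=(400, 400)).astype(np.uint8) for _ in range(5)]
    descs = np.stack([block_descriptors(img) for img in images])
    salient = select_salient_regions(descs, keep=800)
    print(salient.shape)  # (800,)
```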

    Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval

    Although more and more language pairs are covered by machine translation services, there are still many pairs that lack translation resources. Cross-language information retrieval (CLIR) is an application that needs translation functionality of a relatively low level of sophistication, since current models for information retrieval (IR) are still based on a bag-of-words representation. The Web provides a vast resource for the automatic construction of parallel corpora that can be used to train statistical translation models automatically. The resulting translation models can be embedded in several ways in a retrieval model. In this paper, we investigate the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process. Our experiments on standard test collections for CLIR show that the Web-based translation models can surpass commercial MT systems in CLIR tasks. These results open the perspective of constructing a fully automatic query translation device for CLIR at a very low cost.
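
    As a rough illustration of embedding a translation model in bag-of-words retrieval (one of several integration strategies such work compares), the sketch below weights each candidate translation of a query term by its translation probability; the translation table and documents are toy values, not the web-mined models described above.

```python
# Hedged sketch: probability-weighted query translation for bag-of-words retrieval.
from collections import Counter
from math import log

# p(target_term | source_term), as would be estimated from a parallel corpus (toy values)
translation_table = {
    "maison": {"house": 0.7, "home": 0.3},
    "prix":   {"price": 0.6, "prize": 0.4},
}

documents = {
    "d1": "house price index rises",
    "d2": "nobel prize winners announced",
}

def score(query_terms, doc_text):
    """Each source term contributes all its translations, weighted by probability."""
    tf = Counter(doc_text.split())
    s = 0.0
    for src in query_terms:
        for tgt, prob in translation_table.get(src, {}).items():
            s += prob * log(1.0 + tf[tgt])   # probability-weighted term match
    return s

if __name__ == "__main__":
    query = ["maison", "prix"]
    for doc_id, text in documents.items():
        print(doc_id, round(score(query, text), 3))   # d1 ranks above d2
```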

    Normalizing Spontaneous Reports into MedDRA: some Experiments with MagiCoder

    Text normalization into medical dictionaries is useful to support clinical tasks. A typical setting is pharmacovigilance (PV). The manual detection of suspected adverse drug reactions (ADRs) in narrative reports is time consuming, and Natural Language Processing (NLP) provides concrete help to PV experts. In this paper we carry out experiments to test the performance of MagiCoder, an NLP application designed to extract MedDRA terms from narrative clinical text. Given a narrative description, MagiCoder proposes an automatic encoding; the pharmacologist reviews, (possibly) corrects, and then validates the solution. This drastically reduces the time needed to validate reports with respect to a completely manual encoding. In previous work we mainly tested MagiCoder's performance on spontaneous reports written in Italian. In this paper, we include some new features, change the experiment design, and carry out further tests of MagiCoder. Moreover, we change language, moving to English documents. In particular, we tested MagiCoder on the CADEC dataset, a corpus of manually annotated posts about ADRs collected from social media.
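
    For readers unfamiliar with the task, the sketch below shows a much simplified dictionary-based spotting of MedDRA-like terms in a narrative report; it is not MagiCoder's actual matching algorithm, and the three terms and codes are invented placeholders.

```python
# Simplified illustration of term spotting against a MedDRA-like vocabulary;
# the term list and codes are placeholders, not real MedDRA content.
import re

meddra_terms = {               # term -> placeholder code
    "headache": "T001",
    "nausea": "T002",
    "skin rash": "T003",
}

def encode(narrative: str) -> list[tuple[str, str]]:
    """Return (term, code) pairs found in a free-text report (longest terms first)."""
    text = narrative.lower()
    hits = []
    for term in sorted(meddra_terms, key=len, reverse=True):
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            hits.append((term, meddra_terms[term]))
    return hits

if __name__ == "__main__":
    report = "Patient developed a skin rash and mild nausea after the second dose."
    print(encode(report))   # [('skin rash', 'T003'), ('nausea', 'T002')]
```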

    Comparative evaluation of term selection functions for authorship attribution

    Different computational models have been proposed to automatically determine the most probable author of a disputed text (authorship attribution). These models can be viewed as special approaches in the text categorization domain. In this perspective, in a first step we need to determine the most effective features (words, punctuation symbols, parts of speech, word bigrams, etc.) to discriminate between different authors. To achieve this, we can consider different independent feature-scoring selection functions (information gain, gain ratio, pointwise mutual information, odds ratio, chi-square, bi-normal separation, GSS, Darmstadt Indexing Approach (DIA), and correlation coefficient). Other term selection strategies have also been suggested in specific authorship attribution studies. To compare these two families of selection procedures, we extracted articles belonging to two categories (sports and politics) from two newspapers. To enlarge the basis of our evaluations, we chose one newspaper written in English ('Glasgow Herald') and a second one in Italian ('La Stampa'). The resulting collections contain from 987 to 2,036 articles written by four to ten columnists. Using the Kullback-Leibler divergence, the chi-square measure and the Delta rule as attribution schemes, this study found that some simple selection strategies (based on occurrence frequency or document frequency) may produce similar, and sometimes better, results compared with more complex ones.
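
    The contrast between a simple criterion (document frequency) and a more complex scoring function (chi-square) can be made concrete with a small sketch; the counts below are invented, not taken from the Glasgow Herald or La Stampa collections.

```python
# Sketch comparing document frequency with chi-square scoring for one candidate
# feature; counts are toy values for illustration only.
def chi_square(n11: int, n10: int, n01: int, n00: int) -> float:
    """Chi-square statistic for a 2x2 table of term presence vs. author identity.
    n11: the author's docs containing the term, n10: the author's docs without it,
    n01: other authors' docs containing it, n00: other authors' docs without it."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den if den else 0.0

if __name__ == "__main__":
    # toy counts for the word "moreover" across 1,000 articles
    print("chi-square:", round(chi_square(n11=60, n10=190, n01=40, n00=710), 2))
    print("document frequency:", 60 + 40)   # the simpler criterion highlighted above
```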

    Combination approaches for multilingual text retrieval


    The Robot Vision Track at ImageCLEF 2010

    This paper describes the robot vision track that was proposed to the ImageCLEF 2010 participants. The track addressed the problem of visual place classification, with a special focus on generalization. Participants were asked to classify rooms and areas of an office environment on the basis of image sequences captured by a stereo camera mounted on a mobile robot, under varying illumination conditions. The algorithms proposed by the participants had to answer the question "where are you?" (I am in the kitchen, in the corridor, etc.) when presented with a test sequence acquired within the same building but on a different floor than the training sequence. The test data contained images of rooms seen during training, as well as additional rooms that were not imaged in the training sequence. The participants were asked to solve the problem separately for each test image (obligatory task). Additionally, results could also be reported for algorithms exploiting the temporal continuity of the image sequences (optional task). A total of seven groups participated in the challenge, with 42 runs submitted to the obligatory task and 13 submitted to the optional task. The best result in the obligatory task was obtained by the Computer Vision and Geometry Laboratory, ETHZ, Switzerland, with an overall score of 677. The best result in the optional task was obtained by the Idiap Research Institute, Martigny, Switzerland, with an overall score of 2052.

    An automatic method for reporting the quality of thesauri

    Thesauri are knowledge models commonly used for information classification and retrieval whose structure is defined by standards such as ISO 25964. However, when creators do not correctly follow the specifications, they construct models with inadequate concepts or relations that provide limited usability. This paper describes a process that automatically analyzes the thesaurus properties and relations with respect to the ISO 25964 specification and suggests corrections of potential problems. It performs a lexical and syntactic analysis of the concept labels, and a structural and semantic analysis of the relations. The process has been tested with the Urbamet and Gemet thesauri, and the results have been analyzed to determine how well the proposed process works.
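
    One structural check such a process can apply is the detection of cycles in the broader-term hierarchy, which ISO 25964 does not allow. The sketch below is only an illustration under an assumed encoding of the hierarchy as a Python dictionary; the paper's actual analysis components are not reproduced here.

```python
# Rough sketch of a structural check: find a cycle in the broader-term hierarchy.
def find_cycle(broader: dict[str, list[str]]) -> list[str] | None:
    """Depth-first search over concept -> broader links; returns one cycle if found."""
    WHITE, GREY, BLACK = 0, 1, 2
    state = {c: WHITE for c in broader}
    stack: list[str] = []

    def dfs(c: str) -> list[str] | None:
        state[c] = GREY
        stack.append(c)
        for parent in broader.get(c, []):
            if state.get(parent, WHITE) == GREY:
                return stack[stack.index(parent):] + [parent]
            if state.get(parent, WHITE) == WHITE:
                state.setdefault(parent, WHITE)
                cycle = dfs(parent)
                if cycle:
                    return cycle
        stack.pop()
        state[c] = BLACK
        return None

    for concept in list(broader):
        if state[concept] == WHITE:
            cycle = dfs(concept)
            if cycle:
                return cycle
    return None

if __name__ == "__main__":
    print(find_cycle({"vehicle": ["transport"], "transport": ["vehicle"]}))
    # ['vehicle', 'transport', 'vehicle']  -- broader loop flagged for correction
```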

    Analysis of biomedical and health queries: Lessons learned from TREC and CLEF evaluation benchmarks

    BACKGROUND: Inherited ichthyoses represent a group of rare skin disorders characterized by scaling, hyperkeratosis and inconstant erythema, involving most of the tegument. Their epidemiology remains poorly described. This study aims to evaluate the prevalence of inherited ichthyosis (excluding very mild forms) and its different clinical forms in France. METHODS: The capture-recapture method was used for this study. According to statistical requirements, 3 different lists (reference/competence centres, the French association of patients with ichthyosis, and an internet network) were used to record such patients. The study was conducted in 5 areas during a closed period. RESULTS: The prevalence was estimated at 13.3 per million people (/M) (95% CI [10.9 - 17.6]). With regard to autosomal recessive congenital ichthyosis, the prevalence was estimated at 7/M (95% CI [5.7 - 9.2]), with a prevalence of lamellar ichthyosis and congenital ichthyosiform erythroderma of 4.5/M (95% CI [3.7 - 5.9]) and 1.9/M (95% CI [1.6 - 2.6]), respectively. The prevalence of keratinopathic forms was estimated at 1.1/M (95% CI [0.9 - 1.5]). The prevalence of syndromic forms (all clinical forms together) was estimated at 1.9/M (95% CI [1.6 - 2.6]). CONCLUSIONS: Our results constitute a crucial basis to properly size the health measures required to improve patient care and to design further clinical studies.
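
    As a simplified illustration of the capture-recapture idea, the sketch below applies the two-source Lincoln-Petersen estimator; the study itself combined three lists, which in practice calls for log-linear modelling, and all counts in the example are invented.

```python
# Two-source capture-recapture (Lincoln-Petersen); toy counts, not study data.
def lincoln_petersen(n1: int, n2: int, m: int) -> float:
    """Estimate total case count from two overlapping case lists.
    n1, n2: cases on each list; m: cases appearing on both lists."""
    if m == 0:
        raise ValueError("no overlap between lists; estimator undefined")
    return n1 * n2 / m

if __name__ == "__main__":
    # hypothetical counts: 60 cases from expert centres, 45 from the patient
    # association, 20 recorded by both
    print(round(lincoln_petersen(60, 45, 20)))   # 135 estimated cases in the area
```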

    Enriquecimiento automático de portales culturales mediante modelos de organización del conocimiento

    During the last decades, numerous web portals have been launched to disseminate cultural heritage. Most of these portals were developed with technologies from the syntactic web era, i.e. they contain HTML pages with plain text that can be indexed by search engines, but without the additional metadata and annotations of concepts belonging to knowledge organization systems that would facilitate the task of specialized thematic search engines. This paper proposes a method for recommending the knowledge organization systems that best fit the contents of a web portal, and for using these systems for the semantic annotation of the contents. To check the feasibility of the proposed method, we have applied it to the enrichment of a web portal created in the mid-nineties that hosts a virtual catalogue of the works of the painter Goya. Thanks to the proposed method, we were able to recommend the knowledge organization system titled List of Subject Headings for Public Libraries because of its closeness to the portal content. In addition, two thirds of the web pages in Spanish were annotated with concepts belonging to this model. Although the accuracy of the mapping between the entities recognized in the text and the concepts of the model is not perfect, it constitutes a good base from which web portal administrators can later refine the annotation.
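
    As a rough illustration of the annotation step, the sketch below links entities recognized in a page to concepts of a knowledge organization system by normalized label matching; the mini subject-heading list and identifiers are invented, and the paper's recommendation and annotation pipeline is considerably richer than this.

```python
# Rough sketch: map recognized entities to KOS concepts by normalized label matching.
import unicodedata

def normalize(text: str) -> str:
    """Lowercase and strip accents so 'Grabado' and 'grabado' compare equal."""
    nfkd = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in nfkd if not unicodedata.combining(ch))

subject_headings = {           # label -> concept id (toy subset, not the real list)
    "grabado": "sh-001",       # engraving
    "pintura": "sh-002",       # painting
    "retratos": "sh-003",      # portraits
}

def annotate(entities: list[str]) -> dict[str, str]:
    """Map each recognized entity to a concept id when its normalized label matches."""
    index = {normalize(label): cid for label, cid in subject_headings.items()}
    return {e: index[normalize(e)] for e in entities if normalize(e) in index}

if __name__ == "__main__":
    recognized = ["Grabado", "Retratos", "tapiz"]
    print(annotate(recognized))   # {'Grabado': 'sh-001', 'Retratos': 'sh-003'}
```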