552 research outputs found

    NetiNeti: Discovery of Scientific Names from Text Using Machine Learning Methods Figure 1

    Figure 1 shows a series of training experiments with the Naïve Bayes classifier, using different neighborhoods for contextual features and different numbers of positive and negative training examples; the resulting classifiers were evaluated against our annotated gold standard corpus. The data sets are the results of running NetiNeti on a subset of 136 PubMed Central tagged open access articles, with no stop list. A scientific name for an organism can be associated with almost all biological data. Name identification is an important step in many text mining tasks that aim to extract useful information from biological, biomedical and biodiversity text sources. A scientific name acts as an important metadata element for linking biological information. We present NetiNeti, a machine learning based approach for identification and discovery of scientific names. The system implementing the approach can be accessed at http://namefinding.ubio.org. We present the comparison results of various machine learning algorithms on our annotated corpus. Naïve Bayes and Maximum Entropy with Generalized Iterative Scaling (GIS) parameter estimation are the top two performing algorithms.
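The idea of a Naïve Bayes classifier over a candidate name and its neighborhood can be sketched as follows. This is a minimal illustration, not NetiNeti's implementation: the feature set (`context_features`), the `NaiveBayes` class, the add-one smoothing, and the toy training sentences are all assumptions invented for the example.

```python
import math
from collections import Counter, defaultdict


def context_features(tokens, i, span=2, window=1):
    """Hypothetical features for a `span`-token candidate starting at index i,
    plus `window` neighboring tokens on each side (the "neighborhood")."""
    cand = tokens[i:i + span]
    feats = [f"cand_{j}_suffix={t[-2:].lower()}" for j, t in enumerate(cand)]
    feats.append(f"first_capitalized={cand[0][:1].isupper()}")
    feats.append(f"second_lowercase={cand[-1].islower()}")
    for k in range(1, window + 1):
        left_idx, right_idx = i - k, i + span - 1 + k
        left = tokens[left_idx].lower() if left_idx >= 0 else "<s>"
        right = tokens[right_idx].lower() if right_idx < len(tokens) else "</s>"
        feats.append(f"left_{k}={left}")
        feats.append(f"right_{k}={right}")
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes over string features with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, examples):
        for feats, label in examples:
            self.label_counts[label] += 1
            for f in feats:
                self.feat_counts[label][f] += 1
                self.vocab.add(f)
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            denom = sum(self.feat_counts[label].values()) + self.alpha * len(self.vocab)
            score = math.log(n / total)  # log prior
            for f in feats:              # add smoothed log likelihoods
                score += math.log((self.feat_counts[label][f] + self.alpha) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


# Toy training data (invented): (sentence, candidate start index, label).
train = [
    ("We collected Homo sapiens samples from", 2, "name"),
    ("The species Escherichia coli was cultured", 2, "name"),
    ("specimens of Drosophila melanogaster were reared", 2, "name"),
    ("In Figure 1 we show", 1, "other"),
    ("The United States has funded", 1, "other"),
    ("results of Table 2 were", 2, "other"),
]
clf = NaiveBayes().fit(
    [(context_features(s.split(), i), y) for s, i, y in train]
)

test_tokens = "isolates of Bacillus subtilis were grown".split()
prediction = clf.predict(context_features(test_tokens, 2))
print(prediction)
```

Widening `window` corresponds to using a larger neighborhood for contextual features; a real system would need a far larger annotated training set than this toy one.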

    Information retrieval and text mining technologies for chemistry

    Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly the CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation, together with text mining applications for linking chemistry with biological information, are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
    A.V. and M.K. acknowledge funding from the European Community’s Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.

    Management of Scientific Images: An approach to the extraction, annotation and retrieval of figures in the field of High Energy Physics

    The information environment of the first decade of the 21st century is unprecedented. The physical barriers that have limited access to knowledge are disappearing as traditional methods of accessing information are replaced or enhanced by computer-based systems. Digital systems can manage much larger collections of documents, confronting information users with an avalanche of documents related to their topic of interest. This new situation has created an incentive to develop data mining techniques and to build more efficient search engines capable of limiting search results to a small subset of the most relevant ones. However, most search engines today work with textual descriptions. These descriptions can be extracted either from the content or from external sources. Retrieval based on the non-textual content of documents is a subject of ongoing research. In particular, image retrieval and the unraveling of the information contained in images are attracting great interest in the scientific community. Digital libraries occupy a special position among the systems that facilitate access to knowledge. They act as repositories of documents that share some common characteristics (for example, belonging to the same area of knowledge or being published by the same institution) and as such contain documents considered of interest to a particular group of users. In addition, they provide retrieval functionality over the collections they manage. Normally, scientific publications are the smallest units managed by scientific digital libraries. However, the process of scientific creation involves other kinds of artifacts, among them figures and data sets.
    Figures play a particularly important role in the scientific publication process. They represent data in a graphical form that makes it possible to show patterns in large data sets and to convey complex ideas in an easily understandable way. Existing digital library systems provide access to figures, but only as part of the files into which the whole publication is serialized. The aim of this thesis is to propose a set of methods and techniques that turn figures into first-class products of the scientific publication process, so that researchers can obtain the maximum benefit when searching and reviewing the existing literature. The proposed methods and techniques are oriented towards the acquisition, semantic annotation and search of figures contained in scientific publications. To demonstrate the completeness of the research, the proposed theories are illustrated with examples from the field of Particle Physics (also known as High Energy Physics). Where more detailed proposals were needed, the work concentrates on the figures that appear most frequently in Particle Physics publications: the scientific graphs known as plots. The prototypes developed for this thesis have been partially integrated into the Invenio (1) digital library software, as well as into INSPIRE, one of the largest digital libraries in Particle Physics, maintained through the collaboration of major laboratories and research centers such as CERN, SLAC, DESY and Fermilab. (1) http://invenio-software.org

    Statistical assessment on Non-cooperative Target Recognition using the Neyman-Pearson statistical test

    Electromagnetic simulations of an X-target were performed to obtain its Radar Cross Section (RCS) for several positions and frequencies. The software used is CST MWS©. A 1:5 scale model of the proposed aircraft was created in CATIA© V5 R19 and imported directly into the CST MWS© environment. Simulations in the X-band were run with a variable mesh size because of the considerable wavelength variation. The aim is to evaluate the performance of the Neyman-Pearson (NP) simple hypothesis test by analyzing its Receiver Operating Characteristics (ROCs) for two different radar detection scenarios - a Radar Absorbent Material (RAM) coated model, and a Perfect Electric Conductor (PEC) model for recognition purposes. In parallel, the radar range equation is used to estimate the maximum detection range for the simulated RAM coated cases, to compare their shielding effectiveness (SE) and its consequent impact on recognition. The specifications of the AN/APG-68(V)9 airborne radar were used to compute these ranges and to simulate an airborne hostile interception in a Non-Cooperative Target Recognition (NCTR) environment. Statistical results showed weak recognition performance using the Neyman-Pearson (NP) statistical test. Nevertheless, good RCS reductions were obtained for most of the simulated positions, resulting in a 50.9% maximum detection range gain for the PAniCo RAM coating, in agreement with experimental results from the reviewed literature. The best SE was verified for the PAniCo and CFC-Fe RAMs.
    Portuguese-language abstract (translated): Electromagnetic simulations of the target were performed to obtain its radar signature (RCS) for several positions and frequencies. The software used is CST MWS©. The proposed 1:5 scale model was built in CATIA© V5 R19 and imported directly into the CST MWS© working environment. Simulations in the X-band were carried out with a variable mesh size because of the considerable variation in wavelength. The aim is to statistically evaluate the Neyman-Pearson (NP) simple decision test by analyzing the Receiver Operating Characteristics (ROCs) for two distinct detection scenarios - a model coated with absorbent material (RAM), and a perfect conductor (PEC) model, for detection purposes. In parallel, the radar range equation was used to estimate the maximum detection range for both cases in order to compare the electromagnetic shielding effectiveness (SE) of the different coatings. The specifications of the F-16's AN/APG-68(V)9 radar were used to compute the ranges for each material, simulating a hostile interception in a Non-Cooperative Target Recognition (NCTR) environment. The results show weak detection performance using the Neyman-Pearson simple decision test as a detector, and a good RCS reduction for all positions in the selected frequency range. A maximum detection range gain of 50.9% was obtained for the PAniCo RAM, in agreement with the experimental results from the literature studied. The best SE was verified for the CFC-Fe and PAniCo RAMs.
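The two tools named in this abstract, the radar range equation and the Neyman-Pearson ROC, can be illustrated with a short sketch. It assumes the textbook monostatic form of the range equation and the textbook case of a known mean shift in Gaussian noise; the function names and parameters are hypothetical, and nothing here reproduces the thesis's simulation setup or its numbers.

```python
import math
from statistics import NormalDist


def max_range(p_t, gain, wavelength, rcs, p_min):
    """Monostatic radar range equation:
    R_max = [P_t * G^2 * lambda^2 * sigma / ((4*pi)^3 * P_min)]^(1/4)."""
    numerator = p_t * gain**2 * wavelength**2 * rcs
    return (numerator / ((4 * math.pi) ** 3 * p_min)) ** 0.25


def range_ratio(rcs_change_db):
    """R_new / R_old when the RCS changes by `rcs_change_db` decibels
    (negative means a reduction). Range scales with the fourth root of RCS."""
    return 10 ** (rcs_change_db / 40.0)


def np_detection_probability(p_fa, deflection):
    """One ROC point of the Neyman-Pearson detector for a known mean shift
    in unit-variance Gaussian noise: P_d = Q(Q^-1(P_fa) - d).
    `deflection` is the mean shift divided by the noise standard deviation."""
    z = NormalDist()
    threshold = z.inv_cdf(1.0 - p_fa)  # threshold fixed by the false-alarm rate
    return 1.0 - z.cdf(threshold - deflection)


# A 10 dB RCS reduction shrinks the maximum detection range to about 56%.
print(round(range_ratio(-10.0), 4))

# Sweep the false-alarm rate to trace an ROC curve for a fixed deflection.
roc = [(p_fa, np_detection_probability(p_fa, 1.0)) for p_fa in (0.01, 0.1, 0.5)]
```

With zero deflection the ROC collapses onto the diagonal (P_d equals P_fa), which is what a weak recognition result looks like in this simplified setting.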

    Clinical Data Reuse or Secondary Use: Current Status and Potential Future Progress

    Objective: To perform a review of recent research in clinical data reuse or secondary use, and envision future advances in this field. Methods: The review is based on a large literature search in MEDLINE (through PubMed), conference proceedings, and the ACM Digital Library, focusing only on research published between 2005 and early 2016. Each selected publication was reviewed by the authors, and a structured analysis and summarization of its content was developed. Results: The initial search produced 359 publications, reduced after a manual examination of abstracts and full publications. The following aspects of clinical data reuse are discussed: motivations and challenges, privacy and ethical concerns, data integration and interoperability, data models and terminologies, unstructured data reuse, structured data mining, clinical practice and research integration, and examples of clinical data reuse (quality measurement and learning healthcare systems). Conclusion: Reuse of clinical data is a fast-growing field recognized as essential to realize the potential for high quality healthcare, improved healthcare management, reduced healthcare costs, population health management, and effective clinical research.