6 research outputs found

    Extracting Conceptual Terms from Medical Documents

    Automated biomedical concept recognition is important for biomedical document retrieval and text mining research. In this paper, we describe a two-step concept extraction technique for documents in the biomedical domain. Step one is noun phrase extraction, which automatically extracts noun phrases from medical documents. The extracted noun phrases serve as candidate concept terms and become the input to the next step. Step two is keyphrase extraction, which automatically identifies the important topical terms among the candidates. Experiments were conducted to evaluate the results of both steps. The results show that our noun phrase extractor is effective at identifying noun phrases in medical documents, and that the keyphrase extractor is effective at identifying a document's conceptual terms.
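
The two-step flow described in this abstract can be sketched roughly as follows. This is not the authors' extractor: the candidate step below is a crude stand-in for noun phrase extraction (maximal runs of non-stopword words), the ranking step simply counts occurrences, and the names `STOPWORDS`, `candidate_phrases`, and `top_keyphrases` are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"a", "an", "and", "are", "as", "can", "for", "from",
             "in", "is", "of", "on", "the", "to", "which", "with"}

def candidate_phrases(text):
    """Step 1 stand-in: split on punctuation and stopwords, keep the word runs."""
    phrases = []
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        run = []
        for word in fragment.split():
            if word in STOPWORDS:
                # A stopword ends the current candidate phrase.
                if run:
                    phrases.append(" ".join(run))
                run = []
            else:
                run.append(word)
        if run:
            phrases.append(" ".join(run))
    return phrases

def top_keyphrases(text, k=3):
    """Step 2 stand-in: rank the candidates by how often they occur."""
    counts = Counter(candidate_phrases(text))
    return [phrase for phrase, _ in counts.most_common(k)]
```

A real system would replace step 1 with a part-of-speech-based chunker and step 2 with a trained keyphrase model; the sketch only shows how the output of one step feeds the next.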

    Using Noun Phrases for Navigating Biomedical Literature on Pubmed: How Many Updates Are We Losing Track of?

    Author-supplied citations are a fraction of the related literature for a paper. The “related citations” list on PubMed is typically dozens or hundreds of results long and offers no hints as to why these results are related. Using noun phrases derived from the sentences of the paper, we show it is possible to navigate to PubMed updates more transparently through search terms that can associate a paper with its citations. The algorithm that generates these search terms automatically extracts noun phrases from the paper using natural language processing tools and ranks them by the number of occurrences in the paper compared to the number of occurrences on the web. We define search queries having at least one instance of overlap between the author-supplied citations of the paper and the top 20 search results as citation validated (CV). When the overlapping citations were written by the same authors as the paper itself, we label the query CV-S; when they were written by different authors, we label it CV-D. For a systematic sample of 883 papers on PubMed Central, at least one of the search terms for 86% of the papers is CV-D, versus 65% for the top 20 PubMed “related citations.” We hypothesize that these quantities, computed for the 20 million papers on PubMed, would fall within 5% of these percentages. Averaged across all 883 papers, 5 search terms are CV-D, 10 search terms are CV-S, and 6 unique citations validate these searches. Potentially related literature uncovered by citation-validated searches (either CV-S or CV-D) is on the order of ten items per paper, and many more if the remaining searches that are not citation-validated are taken into account. The significance and relationship of each search result to the paper can only be vetted and explained by a researcher with knowledge of or interest in that paper.
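
A minimal sketch of the ranking and the citation-validated check described above. The abstract does not give the exact scoring formula, so the ratio used here (paper count divided by web count plus one) is an assumption, and the function names and example identifiers are hypothetical.

```python
def rank_terms(paper_counts, web_counts):
    """Order terms by paper frequency relative to web frequency.

    The ratio below is an assumed scoring rule; the abstract only says the
    two counts are compared.
    """
    def score(term):
        # +1 avoids division by zero for terms never seen on the web.
        return paper_counts[term] / (web_counts.get(term, 0) + 1)
    return sorted(paper_counts, key=score, reverse=True)

def is_citation_validated(search_results, cited_ids, top_n=20):
    """True if any of the top_n search results is among the paper's citations."""
    return any(result in cited_ids for result in search_results[:top_n])

# Made-up counts: a generic word is frequent in the paper but far more so on the web.
paper_counts = {"protein folding": 5, "results": 8}
web_counts = {"protein folding": 99, "results": 9999}
ranked = rank_terms(paper_counts, web_counts)
```

With these made-up counts the specific phrase outranks the generic one, which is the behavior the paper-versus-web comparison is meant to produce.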

    A Pattern-Based Voting Approach for Concept Discovery on the Web

    Automatically discovering concepts is not only a fundamental task in knowledge capturing and ontology engineering, but also a key step in many information retrieval applications. Pattern-based and statistics-based approaches are both widely used for this task, and the former has proved more precise. However, the effective patterns in such approaches are usually defined manually, which takes considerable time and human labor and covers only a limited set of effective patterns. In our research, we obtain patterns automatically through frequent sequence mining. A voting approach is then presented that can determine whether a sentence contains a concept and accurately identify it. Our algorithm includes three steps: pattern mining, pattern refining, and concept discovery. In our experimental study, we use several traditional measures (precision, recall, and F1 value) to evaluate the performance of our approach. The experimental results not only verify the validity of the approach, but also illustrate the relationship between performance and the parameters of the algorithm.
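
For reference, the three evaluation measures named in this abstract can be computed as below. This is the standard set-based definition, not code from the paper; the function name and the example sets are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Standard set-based precision, recall, and F1."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up example: 4 concepts predicted, 3 in the gold standard, 2 in common.
p, r, f1 = precision_recall_f1({"a", "b", "c", "d"}, {"a", "b", "e"})
```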

    ChemTok: A New Rule Based Tokenizer for Chemical Named Entity Recognition


    Metodologia Computacional para Identificação de Sintagmas Nominais da Língua Portuguesa [Computational Methodology for Identifying Noun Phrases in Portuguese]

    Phrases (sintagmas) are units of meaning that have a syntactic function within a sentence [Nicola 2008]. In general, the sentences that make up any utterance express content through the elements, and the combinations of those elements, that the language provides. In this way, sets and subsets form that function as syntactic units within the larger unit that is the sentence. These units are the phrases, which can be divided into noun phrases and verb phrases. Of these, noun phrases are of greater interest because they carry more semantic value. Noun phrases are used in Natural Language Processing (NLP) tasks such as coreference (anaphora) resolution, automatic ontology construction, and parsers applied to medical texts for summary generation and vocabulary creation, or as an initial stage in syntactic analysis. In Information Retrieval (IR), phrases can be used to create terms for document indexing and search systems, yielding better results. This dissertation proposes a computational methodology for identifying Portuguese noun phrases in digital documents written in natural language. It sets out the methodology adopted to identify and extract noun phrases through the development of SISNOP (Sistema Identificador de Sintagmas Nominais do Português, a system for identifying Portuguese noun phrases). SISNOP is a system composed of a set of modules and programs, capable of interpreting unrestricted natural-language text through morphological and syntactic analysis in order to retrieve noun phrases. In addition, it obtains syntactic information such as the gender, number, and degree of the words contained in the extracted phrases.
    SISNOP was tested on, among other corpora, CETENFolha, composed of more than 24 million words, and CETEMPúblico, which has approximately 180 million words in European Portuguese and is widely used in work in the area. The system recognized 98.12% and 94.59% of the sentences, identifying more than 24 million noun phrases. The SISNOP modules, namely EM (Etiquetador Morfológico, the morphological tagger), ISN (Identificador de Sintagmas Nominais, the noun phrase identifier), and IGNG (Identificador de Gênero, Número e Grau, the gender, number, and degree identifier), were tested individually on a smaller data set than the one above, since the results were analyzed manually. The noun phrase identification module achieved a precision of 82.45% and a recall of 69.20%.