5 research outputs found

    Improving information retrieval-based concept location using contextual relationships

    For software engineers who need to find all the relevant program elements implementing a business concept, existing techniques based on information retrieval (IR) fall short of providing adequate solutions. Such techniques usually consider only the conceptual relations based on lexical similarities during concept mapping. However, it is also fundamental to consider the contextual relationships existing within an application's business domain to aid in concept location. As an example, this paper proposes to use domain-specific ontological relations during concept mapping and location activities when implementing business requirements.

    EnzyMiner: automatic identification of protein level mutations and their impact on target enzymes from PubMed abstracts

    BACKGROUND: A better understanding of the mechanisms of an enzyme's functionality and stability, as well as knowledge of mutations and their impact, is crucial for researchers working with enzymes. Although several enzyme databases are currently available, the scientific literature still remains, by and large, the up-to-date source for learning the effects of a mutation on an enzyme. However, going through vast amounts of scientific documents to extract information on a desired mutation has always been a time-consuming process. In this paper, therefore, we describe a unique method, termed EnzyMiner, which automatically identifies the PubMed abstracts that contain information on the impact of a protein-level mutation on the stability and/or the activity of a given enzyme. RESULTS: We present an automated system which identifies the abstracts that contain an amino-acid-level mutation and then classifies them according to the mutation's effect on the enzyme. In the case of mutation identification, MuGeX, an automated mutation-gene extraction system, has an accuracy of 93.1% with a 91.5 F-measure. For impact analysis, document classification is performed to identify the abstracts that contain a change in the enzyme's stability or activity resulting from the mutation. The system was trained on lipases and tested on amylases with an accuracy of 85%. CONCLUSION: EnzyMiner identifies the abstracts that contain a protein mutation for a given enzyme and checks whether the abstract is related to a disease with the help of information extraction and machine learning techniques. For disease-related abstracts, the mutation list and direct links to the abstracts are retrieved from the system and displayed on the Web. Abstracts that are not disease-related are, in addition to having the mutation list, categorized into two groups according to whether the mutation has an effect on the enzyme's stability or on its functionality, and these groups are likewise displayed on the Web.

    Automatic Extraction of Protein Point Mutations Using a Graph Bigram Association

    Protein point mutations are an essential component of the evolutionary and experimental analysis of protein structure and function. While many manually curated databases attempt to index point mutations, most experimentally generated point mutations and the biological impacts of the changes are described in the peer-reviewed published literature. We describe an application, Mutation GraB (Graph Bigram), that identifies, extracts, and verifies point mutations from biomedical literature. The principal problem of point mutation extraction is to link the point mutation with its associated protein and organism of origin. Our algorithm uses a graph-based bigram traversal to identify these relevant associations and exploits the Swiss-Prot protein database to verify this information. The graph bigram method is different from other models for point mutation extraction in that it incorporates frequency and positional data of all terms in an article to drive the point mutation–protein association. Our method was tested on 589 articles describing point mutations from the G protein–coupled receptor (GPCR), tyrosine kinase, and ion channel protein families. We evaluated our graph bigram metric against a word-proximity metric for term association on datasets of full-text literature in these three different protein families. Our testing shows that the graph bigram metric achieves a higher F-measure for the GPCRs (0.79 versus 0.76), protein tyrosine kinases (0.72 versus 0.69), and ion channel transporters (0.76 versus 0.74). Importantly, in situations where more than one protein can be assigned to a point mutation and disambiguation is required, the graph bigram metric achieves a precision of 0.84 compared with the word distance metric precision of 0.73. We believe the graph bigram search metric to be a significant improvement over previous search metrics for point mutation extraction and to be applicable to text-mining applications requiring the association of words.

    Application of automatic mutation-gene pair extraction to diseases

    Nowadays, it is known that several inherited genetic diseases, such as sickle cell anemia, are caused by mutations in genes. In order to find ways to prevent, or even better to circumvent, the occurrence of these diseases, knowledge of mutations and the genes on which the mutations occur is of crucial importance. Information on disease-related mutations and genes can be accessed through publicly available databases or biomedical literature sources. However, acquiring relevant information from such resources can be problematic for two reasons. First, manually created databases are usually incomplete and not up to date. Second, reading through the vast amount of publicly available biomedical documents is very time consuming. Therefore, there is a need for systems that are capable of extracting relevant information from publicly available resources in an automated fashion. This thesis presents the design and implementation of a system, MuGeX, that automatically extracts mutation-gene pairs from MEDLINE abstracts for a given disease. MuGeX performs three main tasks. The first task is the identification of mutations, applying pattern matching in conjunction with a machine learning algorithm. The second task is the identification of gene names utilizing a dictionary-based method. The final task is building relations between genes and mutations based on proximity measures. Results of experiments indicate that MuGeX identifies 85.9% of the mutations in the experiment corpus at 95.9% precision. For mutation-gene pair extraction, we focused on Alzheimer's disease. We observed that 88.9% of the mutation-gene pairs retrieved by MuGeX for Alzheimer's disease are correct.
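
    The three tasks described in the MuGeX abstract (pattern matching for mutations, dictionary lookup for genes, proximity-based pairing) can be illustrated with a minimal sketch. This is not the MuGeX implementation: the regular expression, the tiny gene dictionary, the character-offset distance, and all function names are assumptions made for illustration, and the machine learning step that the abstract mentions for mutation identification is omitted.

```python
import re

# Hypothetical pattern for point mutations written as "V717I" or "Ala123Thr".
# The real system combines pattern matching with a machine learning filter,
# which this sketch leaves out.
AA1 = "ACDEFGHIKLMNPQRSTVWY"
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
       "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val")
MUTATION_RE = re.compile(
    rf"\b(?:[{AA1}]\d+[{AA1}]|(?:{AA3})\d+(?:{AA3}))\b"
)

# Illustrative dictionary only; a real dictionary would be far larger.
GENE_DICTIONARY = {"APP", "PSEN1", "PSEN2", "APOE"}


def find_mutations(text):
    """Task 1: return (mutation, character offset) for each pattern match."""
    return [(m.group(0), m.start()) for m in MUTATION_RE.finditer(text)]


def find_genes(text, dictionary=GENE_DICTIONARY):
    """Task 2: return (gene, character offset) for each dictionary hit."""
    return [
        (m.group(0), m.start())
        for m in re.finditer(r"\b[A-Za-z0-9]+\b", text)
        if m.group(0) in dictionary
    ]


def pair_by_proximity(text):
    """Task 3: pair each mutation with the gene whose offset is closest."""
    genes = find_genes(text)
    if not genes:
        return []
    pairs = []
    for mutation, position in find_mutations(text):
        nearest_gene, _ = min(genes, key=lambda g: abs(g[1] - position))
        pairs.append((mutation, nearest_gene))
    return pairs


if __name__ == "__main__":
    sentence = "APP V717I and PSEN1 M146L were studied."
    print(pair_by_proximity(sentence))
```

    A purely distance-based pairing like this one can pick the wrong gene when several genes appear near a mutation, which is the disambiguation case the Mutation GraB abstract reports handling with its graph bigram metric.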