40 research outputs found

    C-structures and f-structures for the British National Corpus

    We describe how the British National Corpus (BNC), a one hundred million word balanced corpus of British English, was parsed into Lexical Functional Grammar (LFG) c-structures and f-structures, using a treebank-based parsing architecture. The parsing architecture uses a state-of-the-art statistical parser and reranker trained on the Penn Treebank to produce context-free phrase structure trees, and an annotation algorithm to automatically annotate these trees into LFG f-structures. We describe the pre-processing steps which were taken to accommodate the differences between the Penn Treebank and the BNC. Some of the issues encountered in applying the parsing architecture on such a large scale are discussed. The process of annotating a gold standard set of 1,000 parse trees is described. We present evaluation results obtained by evaluating the c-structures produced by the statistical parser against the c-structure gold standard. We also present the results obtained by evaluating the f-structures produced by the annotation algorithm against an automatically constructed f-structure gold standard. The c-structures achieve an f-score of 83.7% and the f-structures an f-score of 91.2%.
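The f-scores reported above combine precision and recall over matched structures. The abstract does not say which scoring tool was used, so the following is only a minimal sketch of a labelled-bracket f-score for c-structures; the function name `bracket_scores` and the toy spans are assumptions made for illustration.

```python
from collections import Counter

def bracket_scores(gold, predicted):
    """Return (precision, recall, f_score) for two lists of labelled spans.

    Each span is a (label, start, end) tuple; repeated spans are counted
    as a multiset so a duplicate can only be matched once.
    """
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    matched = sum((gold_counts & pred_counts).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # The f-score is the harmonic mean of precision and recall.
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical three-word sentence in which the parser gets one NP span wrong.
gold = [("S", 0, 3), ("NP", 0, 2), ("VP", 2, 3)]
predicted = [("S", 0, 3), ("NP", 0, 1), ("VP", 2, 3)]
precision, recall, f_score = bracket_scores(gold, predicted)
```

Two of the three predicted spans match the gold standard here, so precision, recall, and f-score all come out at two thirds. An f-structure evaluation could be scored the same way over dependency triples instead of spans.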

    Concept-based query expansion for retrieving gene related publications from MEDLINE

    Background: Advances in biotechnology and in high-throughput methods for gene analysis have contributed to an exponential increase in the number of scientific publications in these fields of study. While much of the data and results described in these articles are entered and annotated in the various existing biomedical databases, the scientific literature is still the major source of information. There is, therefore, a growing need for text mining and information retrieval tools to help researchers find the relevant articles for their study. To tackle this, several tools have been proposed to provide alternative solutions for specific user requests. Results: This paper presents QuExT, a new PubMed-based document retrieval and prioritization tool that, from a given list of genes, searches for the most relevant results from the literature. QuExT follows a concept-oriented query expansion methodology to find documents containing concepts related to the genes in the user input, such as protein and pathway names. The retrieved documents are ranked according to user-definable weights assigned to each concept class. By changing these weights, users can modify the ranking of the results in order to focus on documents dealing with a specific concept. The method's performance was evaluated using data from the 2004 TREC genomics track, producing a mean average precision of 0.425, with an average of 4.8 and 31.3 relevant documents within the top 10 and 100 retrieved abstracts, respectively. Conclusions: QuExT implements a concept-based query expansion scheme that leverages gene-related information available on a variety of biological resources. The main advantage of the system is to give the user control over the ranking of the results by means of a simple weighting scheme. Using this approach, researchers can effortlessly explore the literature regarding a group of genes and focus on the different aspects relating to these genes.
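The weighting scheme described in this abstract can be pictured as a weighted sum over concept classes. The sketch below is not QuExT's actual scoring function, which the abstract does not specify; the function `rank_documents`, the class names, and the hit counts are all hypothetical.

```python
def rank_documents(concept_hits, weights):
    """Rank document ids by the weighted sum of their per-class concept hits.

    concept_hits: {doc_id: {concept_class: hit_count}}
    weights: {concept_class: weight}; classes missing from weights score 0.
    """
    def score(doc_id):
        hits = concept_hits[doc_id]
        return sum(weights.get(cls, 0.0) * count for cls, count in hits.items())

    # Sort by descending score, breaking ties by document id for determinism.
    return sorted(concept_hits, key=lambda doc_id: (-score(doc_id), doc_id))

# Hypothetical hit counts for two retrieved abstracts.
hits = {
    "doc_a": {"gene": 3, "protein": 0, "pathway": 1},
    "doc_b": {"gene": 1, "protein": 2, "pathway": 4},
}
gene_focused = rank_documents(hits, {"gene": 1.0, "protein": 0.2, "pathway": 0.2})
pathway_focused = rank_documents(hits, {"gene": 0.2, "protein": 0.2, "pathway": 1.0})
```

With the gene-heavy weights `doc_a` ranks first, and with the pathway-heavy weights `doc_b` does, which illustrates how changing the weights alone can shift the focus of the result list.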

    An information retrieval system based on frame semantics (Frame-szemantikára alapozott információ-visszakereső rendszer)

    We present an information retrieval system that takes search expressions written in a controlled natural language and finds natural-language documents containing passages with a similar meaning. Using frame-semantic analysis, the system produces a semantic representation of the search expression and returns as hits those documents that contain a passage to which the representation can be matched. In this paper we describe how the system works and how the semantic representations and resources it uses are structured, focusing primarily on the application of frame semantics. We also briefly cover the tasks that remain and possible directions for further research.
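The matching step described in this abstract (a query representation matched against passages in documents) can be illustrated with a toy example. The abstract does not give the system's actual representation or matching rules, so the dictionary layout and the functions `frame_matches` and `search` below are assumptions; the frame and role names are used purely for illustration.

```python
def frame_matches(query_frame, passage_frame):
    """True if the passage frame has the query's frame type and every
    role the query specifies is filled with the same value."""
    if query_frame["frame"] != passage_frame["frame"]:
        return False
    return all(
        passage_frame["roles"].get(role) == filler
        for role, filler in query_frame["roles"].items()
    )

def search(query_frame, documents):
    """Return ids of documents containing at least one matching passage frame."""
    return [
        doc_id
        for doc_id, frames in documents.items()
        if any(frame_matches(query_frame, frame) for frame in frames)
    ]

# Hypothetical pre-analysed query and documents.
query = {"frame": "Commerce_buy", "roles": {"Goods": "car"}}
documents = {
    "doc1": [{"frame": "Commerce_buy", "roles": {"Buyer": "Anna", "Goods": "car"}}],
    "doc2": [{"frame": "Commerce_sell", "roles": {"Seller": "Anna", "Goods": "car"}}],
}
results = search(query, documents)
```

Exact equality of role fillers is the simplest possible matching criterion; a real system would presumably allow synonyms or related frames, which this sketch does not attempt.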

    Named Entity Recognition for Bacterial Type IV Secretion Systems

    Research on specialized biological systems is often hampered by a lack of consistent terminology, especially across species. In bacterial Type IV secretion systems, genes within one set of orthologs may have over a dozen different names. Classifying research publications based on biological processes, cellular components, molecular functions, and microorganism species should improve the precision and recall of literature searches, allowing researchers to keep up with the exponentially growing literature through resources such as the Pathosystems Resource Integration Center (PATRIC, patricbrc.org). We developed named entity recognition (NER) tools for four entities related to Type IV secretion systems: 1) bacteria names, 2) biological processes, 3) molecular functions, and 4) cellular components. These four entities are important to pathogenesis and virulence research but have received less attention than other entities, e.g., genes and proteins. Based on an annotated corpus, large domain terminological resources, and machine learning techniques, we developed recognizers for these entities. High accuracy rates (>80%) are achieved for bacteria, biological processes, and molecular functions. Contrastive experiments highlighted the effectiveness of alternate recognition strategies; results of term extraction on contrasting document sets demonstrated the utility of these classes for identifying T4SS-related documents.
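One of the ingredients this abstract names is a set of terminological resources. As a simplified stand-in for the recognizers described (which also rely on an annotated corpus and machine learning), the sketch below tags entities by greedy longest-match lookup in a small lexicon. The function `tag_entities`, the type labels, and the lexicon entries are assumptions for illustration only.

```python
def tag_entities(tokens, lexicon, max_len=4):
    """Greedy longest-match lookup of token n-grams in a term lexicon.

    lexicon maps a lower-cased term (tokens joined by spaces) to an entity type.
    Returns a list of (start, end, entity_type) tuples with end exclusive.
    """
    spans = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so multi-word terms win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in lexicon:
                spans.append((i, i + n, lexicon[candidate]))
                i += n
                break
        else:
            i += 1  # no lexicon term starts at this token
    return spans

# Hypothetical lexicon covering the four entity types named in the abstract.
lexicon = {
    "agrobacterium tumefaciens": "BACTERIA",
    "protein secretion": "BIOLOGICAL_PROCESS",
    "atp binding": "MOLECULAR_FUNCTION",
    "outer membrane": "CELLULAR_COMPONENT",
}
tokens = "Agrobacterium tumefaciens uses protein secretion across the outer membrane".split()
spans = tag_entities(tokens, lexicon)
```

A pure lookup like this misses spelling variants and unseen names, which is one reason a lexicon is usually combined with a trained model rather than used alone.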

    MASZEKER: a project for developing semantic search technology (MASZEKER : projekt szemantikus keresőtechnológia kidolgozására)

    We report on an ambitious linguistics project, MASZEKER, which targets semantic search and is being carried out jointly by the Applied Logic Laboratory (Alkalmazott Logikai Laboratórium) and the University of Szeged. The goal is to develop a technology that matches the meaning representation of a well-formed search expression against texts, looking for a match that can express the meaning of the search expression. We intend to develop prototype systems for two application areas: patent search and ethnographic search. The technology is designed to be language-independent, although some of its components naturally have to be language-dependent. We will develop English and Hungarian versions. The search system itself is complemented by modules that process the archive (thematic clustering, topic-dependent synonym generation).

    Computing Network of Diseases and Pharmacological Entities through the Integration of Distributed Literature Mining and Ontology Mapping

    The proliferation of -omics (such as genomics and proteomics) and -ology (such as systems biology, cell biology, and pharmacology) fields has spawned new frontiers of research in drug discovery and personalized medicine. A vast number (21 million) of published research results are archived in PubMed, and the collection is continually growing. To improve the accessibility and utility of such a large body of literature, it is critical to develop a suite of semantics-sensitive technologies capable of discovering knowledge and inferring possible new relationships based on statistical co-occurrences of meaningful terms or concepts. In this context, this thesis presents a unified framework to mine a large body of literature through the integration of latent semantic analysis (LSA) and ontology mapping. In particular, a parameter-optimized, robust, scalable, and distributed LSA (DiLSA) technique was designed and implemented on a carefully selected set of 7.4 million PubMed records related to pharmacology. The DiLSA model was integrated with MeSH to make the model effective and efficient for a specific domain. An optimized multi-gram dictionary was customized by mapping the MeSH to build the DiLSA model. A fully integrated web-based application, called PharmNet, was developed to bridge the gap between biological knowledge and clinical practices. Preliminary analysis using PharmNet shows improved performance over a global LSA model. A limited expert evaluation was performed to validate the retrieved results and network against the biological literature. A thorough performance evaluation and validation of the results is in progress.
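Latent semantic analysis, as named in this abstract, factorizes a term-document matrix with a truncated singular value decomposition so that terms appearing in similar contexts end up close together. The sketch below shows plain single-machine LSA with NumPy on four made-up sentences; it is not the distributed DiLSA implementation, and the function names, the toy corpus, and the choice of `k=2` are assumptions.

```python
import numpy as np

def lsa_term_vectors(documents, k):
    """Build a term-document count matrix and return (term index, rank-k term vectors)."""
    vocab = sorted({word for doc in documents for word in doc.split()})
    index = {word: row for row, word in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(documents)))
    for col, doc in enumerate(documents):
        for word in doc.split():
            matrix[index[word], col] += 1.0
    # Truncated SVD: keep the k largest singular values and their left vectors.
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return index, u[:, :k] * s[:k]

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector has zero length."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical mini-corpus with two topics.
docs = [
    "aspirin inhibits cyclooxygenase",
    "ibuprofen inhibits cyclooxygenase",
    "insulin regulates glucose",
    "metformin lowers glucose",
]
index, vectors = lsa_term_vectors(docs, k=2)
sim_drugs = cosine(vectors[index["aspirin"]], vectors[index["ibuprofen"]])
sim_cross = cosine(vectors[index["aspirin"]], vectors[index["insulin"]])
```

In this toy corpus "aspirin" and "ibuprofen" never occur in the same document, yet their reduced vectors are nearly parallel because they share context words, while "aspirin" and "insulin" stay close to orthogonal. That is the kind of indirect relationship the abstract refers to when it mentions inferring new relationships from co-occurrence statistics.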