
    Doublet method for very fast autocoding

    BACKGROUND: Autocoding (or automatic concept indexing) occurs when a software program extracts terms contained within text and maps them to a standard list of concepts contained in a nomenclature. The purpose of autocoding is to provide a way of organizing large documents by the concepts represented in the text. Because textual data accumulates rapidly in biomedical institutions, the computational methods used to autocode text must be very fast. The purpose of this paper is to describe the doublet method, a new algorithm for very fast autocoding. METHODS: An autocoder was written that transforms plain text into intercalated word doublets (e.g. "The ciliary body produces aqueous humor" becomes "The ciliary, ciliary body, body produces, produces aqueous, aqueous humor"). Each doublet is checked against an index of doublets extracted from a standard nomenclature. Matching doublets are assigned a numeric code specific for each doublet found in the nomenclature. Text doublets that do not match the index of doublets extracted from the nomenclature are not part of valid nomenclature terms. Runs of matching doublets from the text are concatenated and matched against nomenclature terms (also represented as runs of doublets). RESULTS: The doublet autocoder was compared for speed and performance against a previously published phrase autocoder. Both autocoders are Perl scripts, and both were run against an identical text (a 170+ Megabyte collection of abstracts collected through a PubMed search) and the same nomenclature (neocl.xml, containing 102,271 unique names of neoplasms). In a side-by-side comparison on the same computer, the doublet autocoder was 8.4 times faster than the phrase autocoder (211 seconds versus 1,776 seconds), coding 0.8 Megabytes of text per second on a desktop computer with a 1.6 GHz processor. In addition, the doublet autocoder successfully matched terms that were missed by the phrase autocoder, while the phrase autocoder found no terms that were missed by the doublet autocoder. CONCLUSIONS: The doublet method is a novel algorithm for rapid text autocoding. The method will work with any nomenclature and will parse any ASCII plain text. An implementation of the algorithm in Perl is provided with this article. The algorithm, the Perl implementation, the neoplasm nomenclature, and Perl itself are all open source materials.
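
    To make the mechanics concrete, the following is a minimal sketch of the doublet idea in Python (the published implementation is in Perl; the two-term nomenclature, the codes, and the sentence below are invented stand-ins, not the neocl.xml data):

        import re

        def doublets(words):
            # Overlapping word pairs: "ciliary body produces" -> (ciliary, body), (body, produces)
            return [(words[i], words[i + 1]) for i in range(len(words) - 1)]

        # Toy term -> code table; the paper uses neocl.xml with 102,271 neoplasm
        # names. These codes are invented for illustration.
        nomenclature = {
            "ciliary body": "C001",
            "aqueous humor": "C002",
        }

        # Index every doublet occurring anywhere in any nomenclature term.
        doublet_index = {d for term in nomenclature for d in doublets(term.split())}

        def lookup_runs(run):
            # Compare every contiguous sub-run of matching doublets with whole terms.
            found = []
            for i in range(len(run)):
                for j in range(i, len(run)):
                    term = " ".join([run[i][0]] + [d[1] for d in run[i:j + 1]])
                    if term in nomenclature:
                        found.append((term, nomenclature[term]))
            return found

        def autocode(text):
            words = re.findall(r"[a-z]+", text.lower())
            hits, run = [], []
            for pair in doublets(words):
                if pair in doublet_index:
                    run.append(pair)          # extend the current run of matching doublets
                else:
                    hits += lookup_runs(run)  # a non-matching doublet ends the run
                    run = []
            return hits + lookup_runs(run)

        print(autocode("The ciliary body produces aqueous humor"))
        # -> [('ciliary body', 'C001'), ('aqueous humor', 'C002')]

    The speed of the method comes from the single set-membership test per doublet: most of the text is discarded before any whole-term comparison is attempted.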

    Automatic extraction of candidate nomenclature terms using the doublet method

    BACKGROUND: New terminology continuously enters the biomedical literature. How can curators identify new terms that can be added to existing nomenclatures? The most direct method, and one that has served well, involves reading the current literature: the scholarly curator adds new terms as they are encountered. Present-day scholars, however, are severely challenged by the enormous volume of biomedical literature, and curators of medical nomenclatures need computational assistance if they hope to keep their terminologies current. The purpose of this paper is to describe a method of rapidly extracting new candidate terms from huge volumes of biomedical text. The resulting lists of terms can be quickly reviewed by curators and, if appropriate, added to nomenclatures. The candidate term extractor uses a variation of the previously described doublet coding method. The algorithm, which operates on virtually any nomenclature, derives from the observation that most terms within a knowledge domain are composed entirely of word combinations found in other terms from the same knowledge domain. Terms can be expressed as sequences of overlapping word doublets that have more specific meaning than the individual words composing the term. The algorithm parses through text, finding contiguous sequences of word doublets that are known to occur somewhere in the reference nomenclature. When a sequence of matching word doublets is encountered, it is compared with whole terms already included in the nomenclature. If the doublet sequence is not already in the nomenclature, it is extracted as a candidate new term, which a curator can review to determine whether it should be added to the nomenclature. An implementation of the algorithm is demonstrated, using a corpus of published abstracts obtained through the National Library of Medicine's PubMed query service and "The developmental lineage classification and taxonomy of neoplasms" as the reference nomenclature. RESULTS: A 31+ Megabyte corpus of pathology journal abstracts was parsed using the doublet extraction method. The corpus consisted of 4,289 records, each containing an abstract title; the titles together comprised 50,547 words. New candidate terms for the nomenclature were automatically extracted from the titles of abstracts in the corpus. Total execution time on a desktop computer with a 2.79 GHz CPU was 2 seconds. The resulting output consisted of 313 new candidate terms, each consisting of concatenated doublets found in the reference nomenclature. Human review of the 313 candidate terms yielded a list of 285 terms approved by a curator, and automatic removal of duplicate terms yielded a final list of 222 new terms (71% of the original 313 extracted candidates) that could be added to the reference nomenclature. CONCLUSION: The doublet method for automatically extracting candidate nomenclature terms can be used to quickly find new terms in vast amounts of text. The method can be immediately adapted to virtually any text and any nomenclature. An implementation of the algorithm, in the Perl programming language, is provided with this article.
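
    A minimal Python sketch of the extraction step described above, assuming a two-term toy reference nomenclature (the published implementation is in Perl, and the terms below are invented examples):

        # Toy reference nomenclature; the paper uses "The developmental lineage
        # classification and taxonomy of neoplasms". Terms here are invented.
        nomenclature = {"squamous cell carcinoma", "cell carcinoma of the skin"}

        def doublets(words):
            return list(zip(words, words[1:]))

        doublet_index = {d for term in nomenclature for d in doublets(term.split())}

        def candidates(text):
            # Collect maximal runs of known doublets that are not already whole terms.
            words, found, run = text.lower().split(), [], []
            for a, b in doublets(words):
                if (a, b) in doublet_index:
                    run = run + [b] if run else [a, b]
                else:
                    if run and " ".join(run) not in nomenclature:
                        found.append(" ".join(run))
                    run = []
            if run and " ".join(run) not in nomenclature:
                found.append(" ".join(run))
            return found

        # Every doublet of the phrase occurs somewhere in the nomenclature, but
        # the whole phrase does not, so it is emitted as a candidate new term.
        print(candidates("biopsy showed squamous cell carcinoma of the skin today"))
        # -> ['squamous cell carcinoma of the skin']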

    Building software factories in the aerospace industry

    Thesis (M.S.), Massachusetts Institute of Technology, Dept. of Aeronautics and Astronautics, 1997, by Jose K. Menendez. Includes bibliographical references (p. 107-110).

    Deep Learning for genomic data analysis

    Since the Human Genome Project, the availability of genomic data has increased dramatically. In recent years, genome sequencing technologies and techniques have improved at a fast rate, resulting in cheaper and faster genome sequencing. Such amounts of data enable both more complex analyses and advances in research. However, the sequencing process often produces a huge volume of highly complex data, so considerable computational power and efficient algorithms are needed to extract useful information in reasonable time; this can be a constraint on the extraction and comprehension of such information. In this work, we focus on the biological aspects of RNA-Seq and its analysis using traditional Machine Learning and Deep Learning methods. We divided our study into two branches. First, we built and compared the accuracy of classifiers that were able to distinguish the RNA-Seq samples of thyroid cancer patients from samples of healthy persons. Second, we investigated the possibility of building comprehensible descriptions of the differences in the RNA-Seq data by using Denoising Autoencoders and Stacked Denoising Autoencoders as base classifiers and then devising post-processing techniques to extract comprehensible and biologically meaningful descriptions from the constructed models.
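
    As an illustration of the base model mentioned above, here is a minimal one-hidden-layer denoising autoencoder in Python/NumPy, trained on synthetic data standing in for an expression matrix; it is a sketch of the general technique, not the thesis code:

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.random((200, 50))   # synthetic stand-in: 200 samples x 50 expression values

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        # One-hidden-layer denoising autoencoder trained by batch gradient descent.
        n_in, n_hid, lr = X.shape[1], 16, 0.5
        W1 = rng.normal(0, 0.1, (n_in, n_hid)); b1 = np.zeros(n_hid)
        W2 = rng.normal(0, 0.1, (n_hid, n_in)); b2 = np.zeros(n_in)

        for epoch in range(500):
            noisy = X * (rng.random(X.shape) > 0.2)   # masking noise: zero out ~20% of inputs
            H = sigmoid(noisy @ W1 + b1)              # encode the corrupted input
            R = sigmoid(H @ W2 + b2)                  # reconstruct the clean input
            dZ2 = (R - X) * R * (1 - R)               # squared-error gradient through sigmoid
            dZ1 = (dZ2 @ W2.T) * H * (1 - H)
            W2 -= lr * (H.T @ dZ2) / len(X); b2 -= lr * dZ2.mean(axis=0)
            W1 -= lr * (noisy.T @ dZ1) / len(X); b1 -= lr * dZ1.mean(axis=0)

        # The hidden activations of clean inputs serve as learned features that
        # a downstream classifier (healthy vs. cancer) can consume.
        features = sigmoid(X @ W1 + b1)
        print(features.shape)   # (200, 16)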

    BIG DATA and High-Level Analysis: Conference Proceedings

    This collection publishes the results of scientific research and development in the field of BIG DATA and Advanced Analytics for the optimization of IT and business solutions, together with case studies in medicine, education, and ecology.

    Space station systems: A bibliography with indexes (supplement 6)

    This bibliography lists 1,133 reports, articles, and other documents introduced into the NASA scientific and technical information system between July 1, 1987 and December 31, 1987. Its purpose is to provide helpful information to the researcher, manager, and designer in technology development and mission design according to system, interactive analysis and design, structural and thermal analysis and design, structural concepts and control systems, electronics, advanced materials, assembly concepts, propulsion, and solar power satellite systems. The coverage includes documents that define major systems and subsystems, servicing and support requirements, procedures and operations, and missions for the current and future Space Station

    Cumulative index to NASA Tech Briefs, 1970-1975

    Tech briefs of technology derived from the research and development activities of the National Aeronautics and Space Administration are presented. Abstracts and indexes of subject, personal author, originating center, and tech brief number for the 1970-1975 tech briefs are presented

    Resources for comparing the speed and performance of medical autocoders

    BACKGROUND: Concept indexing is a popular method for characterizing medical text and is one of the most important early steps in many data mining efforts. Concept indexing differs from simple word or phrase indexing because concepts are typically represented by a nomenclature code that binds a medical concept to all of its equivalent representations: a concept search on the term renal cell carcinoma would be expected to find occurrences of hypernephroma and renal carcinoma (concept equivalents). The purpose of this study is to provide freely available resources for comparing speed and performance among different autocoders. These tools consist of: 1) a public domain autocoder written in Perl (a free and open source programming language that installs on any operating system); 2) a nomenclature database derived from the unencumbered subset of the publicly available Unified Medical Language System; 3) a large corpus of autocoded output derived from a publicly available medical text. METHODS: A simple lexical autocoder was written that parses plain text into a listing of all 1-, 2-, 3-, and 4-word strings contained in the text, assigning a nomenclature code to text strings that match terms in the nomenclature. The nomenclature used is the unencumbered subset of the 2003 Unified Medical Language System (UMLS). The unencumbered subset of UMLS was reduced to exclude homonymous one-word terms and proper names, resulting in a term/code data dictionary containing about half a million medical terms. The Online Mendelian Inheritance in Man (OMIM), a 92+ Megabyte publicly available medical opus, was used as the sample medical text for the autocoder. RESULTS: The autocoding Perl script is remarkably short, consisting of just 38 command lines. The 92+ Megabyte OMIM file was completely autocoded in 869 seconds on a 2.4 GHz processor (less than 10 seconds per Megabyte of text). The autocoded output file (9,540,442 bytes) contains 367,963 coded terms from OMIM and is distributed with this manuscript. CONCLUSIONS: A public domain Perl script is provided that can parse through plain-text files of any length, matching concepts against an external nomenclature. The script and associated files can be used freely to compare the speed and performance of autocoding software.
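
    A minimal Python sketch of the lexical matching step described above (the distributed autocoder is a 38-line Perl script; the two terms and the shared code below are illustrative stand-ins for the UMLS-derived dictionary of about half a million terms):

        import re

        # Toy term/code dictionary; the study derives its dictionary from the
        # unencumbered subset of the 2003 UMLS. Entries here are illustrative.
        nomenclature = {
            "renal cell carcinoma": "C0007134",
            "hypernephroma": "C0007134",   # synonym bound to the same concept code
        }

        def autocode(text):
            # Emit every 1-, 2-, 3-, and 4-word string that matches a nomenclature term.
            words = re.findall(r"[a-z0-9]+", text.lower())
            hits = []
            for i in range(len(words)):
                for n in (1, 2, 3, 4):
                    phrase = " ".join(words[i:i + n])
                    if phrase in nomenclature:
                        hits.append((phrase, nomenclature[phrase]))
            return hits

        print(autocode("Patients with hypernephroma (renal cell carcinoma) were studied."))
        # -> [('hypernephroma', 'C0007134'), ('renal cell carcinoma', 'C0007134')]

    Because both synonyms map to the same code, a concept search on either term retrieves both occurrences, which is the behavior the abstract describes.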