
    SBRML: A markup language for associating systems biology data with models

    MOTIVATION: Research in systems biology is carried out through a combination of experiments and models. Several data standards have been adopted for representing models (Systems Biology Markup Language) and various types of relevant experimental data (such as FuGE and those of the Proteomics Standards Initiative). However, until now, there has been no standard way to associate a model and its entities with the corresponding datasets, or vice versa. Such a standard would provide a means to represent computational simulation results as well as to frame experimental data in the context of a particular model. Target applications include model-driven data analysis, parameter estimation, and sharing and archiving model simulations. RESULTS: We propose the Systems Biology Results Markup Language (SBRML), an XML-based language that associates a model with several datasets. Each dataset is represented as a series of values associated with model variables and their corresponding parameter values. SBRML provides a flexible way of indexing the results to model parameter values, which supports both spreadsheet-like data and multidimensional data cubes. We present and discuss several examples of SBRML usage in applications such as enzyme kinetics, microarray gene expression and various types of simulation results.
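
    A hedged sketch of the core idea: each dataset is a series of values keyed to model variables. The snippet below builds an illustrative XML association in Python; the element and attribute names are invented for illustration and do not reproduce the actual SBRML schema.

```python
# Illustrative only: element and attribute names below are hypothetical
# and do not reproduce the actual SBRML schema.
import xml.etree.ElementTree as ET

# Associate a simulated time course with a model variable.
root = ET.Element("resultCollection", model="mymodel.xml")
dataset = ET.SubElement(root, "dataSet", description="simulated time course")
series = ET.SubElement(dataset, "series", variable="S1")  # model entity id
for t, value in [(0.0, 1.00), (1.0, 0.61), (2.0, 0.37)]:
    point = ET.SubElement(series, "value", t=str(t))
    point.text = str(value)

print(ET.tostring(root, encoding="unicode"))
```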

    Fine-tuning coreference resolution for different styles of clinical narratives

    Objective: Coreference resolution (CR) is a natural language processing (NLP) task concerned with finding all expressions within a single document that refer to the same entity. This makes it crucial in supporting downstream NLP tasks such as summarization, question answering and information extraction. Despite great progress in CR, our experiments have highlighted the substandard performance of existing open-source CR tools in the clinical domain. We set out to explore practical solutions for fine-tuning their performance on clinical data. Methods: We first explored the possibility of automatically producing silver standards, following the success of such an approach in other clinical NLP tasks. We designed an ensemble approach that leverages multiple models to automatically annotate co-referring mentions. Subsequently, we looked into other ways of incorporating human feedback to improve the performance of an existing neural network approach. We proposed a semi-automatic annotation process to facilitate manual annotation. We also compared the effectiveness of active learning relative to random sampling in an effort to further reduce the cost of manual annotation. Results: Our experiments demonstrated that the silver standard approach was ineffective in fine-tuning the CR models. Our results indicated that active learning should also be applied with caution. The semi-automatic annotation approach combined with continued training was found to be well suited for the rapid transfer of CR models under low-resource conditions. The ensemble approach demonstrated a potential to further improve accuracy by leveraging multiple fine-tuned models. Conclusion: Overall, we have effectively transferred a general CR model to the clinical domain. Our findings, based on extensive experimentation, have been summarized into practical suggestions for the rapid transfer of CR models across different styles of clinical narratives. Keywords: natural language processing, coreference resolution, transfer learning, active learning, ensemble algorithms.
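
    The ensemble idea above lends itself to a simple majority vote over proposed coreference links. The sketch below is a hypothetical illustration: the mention-pair representation and the three model outputs are placeholder assumptions, not the paper's actual models.

```python
# Hypothetical sketch of majority-vote ensemble annotation over
# coreference links; the three model outputs below are placeholders.
from collections import Counter

def ensemble_links(model_outputs, min_votes=2):
    """Keep a coreference link (mention pair) if at least `min_votes`
    of the models proposed it."""
    votes = Counter(link for output in model_outputs for link in output)
    return {link for link, n in votes.items() if n >= min_votes}

# Each model returns a set of (antecedent_span, anaphor_span) pairs.
m1 = {((0, 2), (10, 11)), ((4, 5), (20, 21))}
m2 = {((0, 2), (10, 11))}
m3 = {((0, 2), (10, 11)), ((4, 5), (20, 21))}
print(ensemble_links([m1, m2, m3]))  # both links survive with >= 2 votes
```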

    Word sense disambiguation of acronyms in clinical narratives

    Clinical narratives commonly use acronyms without explicitly defining their long forms. This makes it difficult to automatically interpret their sense, as acronyms tend to be highly ambiguous. Supervised learning approaches to their disambiguation in the clinical domain are hindered by issues associated with patient privacy and manual annotation, which limit the size and diversity of training data. In this study, we demonstrate how scientific abstracts can be utilised to overcome these issues by creating a large automatically annotated dataset of artificially simulated global acronyms. A neural network trained on such a dataset achieved an F1-score of 95% on disambiguation of acronym mentions in scientific abstracts. This network was integrated with multi-word term recognition to extract a sense inventory of acronyms from a corpus of clinical narratives on the fly. Acronym sense extraction achieved an F1-score of 74% on a corpus of radiology reports. In clinical practice, the suggested approach can be used to facilitate the development of institution-specific inventories.
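
    A minimal sketch of how global acronyms can be simulated from scientific abstracts, assuming the common "long form (ACRONYM)" definition pattern; the regex and substitution heuristics are simplified assumptions rather than the study's actual pipeline.

```python
# Sketch of the data-simulation idea: detect "long form (ACRONYM)"
# definitions in an abstract, then replace the long form with the
# acronym, so the known long form becomes the gold sense label for each
# simulated mention. The regex and heuristics are simplified.
import re

DEF_PATTERN = re.compile(r"([A-Za-z][\w\s-]{3,60}?)\s*\(([A-Z]{2,6})\)")

def simulate_acronyms(abstract):
    examples = []
    for long_form, acronym in DEF_PATTERN.findall(abstract):
        long_form = long_form.strip()
        simulated = abstract.replace(long_form, acronym)
        examples.append((simulated, acronym, long_form))
    return examples

text = ("Magnetic resonance imaging (MRI) was used. "
        "Magnetic resonance imaging is considered safe.")
for simulated, acronym, sense in simulate_acronyms(text):
    print(acronym, "->", sense)  # MRI -> Magnetic resonance imaging
```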

    MeMo: a hybrid SQL/XML approach to metabolomic data management for functional genomics

    Background: Genome sequencing projects have revealed our limited knowledge of gene function; S. cerevisiae, for example, has 5,000-6,000 genes, of which nearly 1,000 have an uncertain function. Their gross influence on the behaviour of the cell can be observed using large-scale metabolomic studies. The metabolomic data produced need to be structured and annotated in a machine-usable form to facilitate the exploration of the hidden links between the genes and their functions. Description: MeMo is a formal model for representing metabolomic data and the associated metadata. Two predominant platforms (SQL and XML) are used to encode the model. MeMo has been implemented as a relational database using a hybrid approach combining the advantages of the two technologies. It represents a practical solution for handling the sheer volume and complexity of metabolomic data effectively and efficiently. The MeMo model and the associated software are available at http://dbkgroup.org/memo/. Conclusions: The maturity of relational database technology is used to support efficient data processing. The scalability and self-descriptiveness of XML are used to simplify the relational schema and facilitate the extensibility of the model necessitated by the creation of new experimental techniques. Special consideration is given to data integration issues as part of the systems biology agenda. MeMo has been physically integrated and cross-linked to related metabolomic and genomic databases. Semantic integration with other relevant databases has been supported through ontological annotation. Compatibility with other data formats is supported by automatic conversion.
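
    The hybrid SQL/XML pattern described above can be sketched in a few lines: stable, frequently queried fields sit in relational columns, while extensible metadata travels as an XML document in a text column. The schema below is hypothetical and does not reproduce the actual MeMo model.

```python
# Illustrative hybrid SQL/XML pattern; the schema is hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE metabolomic_sample (
        sample_id   INTEGER PRIMARY KEY,
        strain      TEXT NOT NULL,   -- relational: indexed, joinable
        measured_at TEXT NOT NULL,
        metadata    TEXT             -- XML: schema-free extensibility
    )
""")
conn.execute(
    "INSERT INTO metabolomic_sample (strain, measured_at, metadata) "
    "VALUES (?, ?, ?)",
    ("BY4741", "2006-01-15",
     "<metadata><instrument>GC-MS</instrument>"
     "<replicate>2</replicate></metadata>"),
)
for row in conn.execute("SELECT strain, metadata FROM metabolomic_sample"):
    print(row)
```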

    In vitro/in silico investigation of telmisartan drug substance and tablets

    Telmisartan acts as an antagonist of the angiotensin II type-1 (AT1) receptor and is indicated in the treatment of essential hypertension. In order to rationalize the pharmacokinetic characteristics, pharmacological activity and optimal method of administration of this drug, knowledge of its physico-chemical properties is needed. Estimating the drug's physico-chemical parameters from its chemical structure at the pH values characteristic of physiological conditions enables prediction of its behaviour in the body before the drug is synthesized. Such assessment of physico-chemical parameters during the preformulation phase is important for the development of a safe, efficient and stable dosage form. Based on the calculated pKa values, this paper focuses on predicting the distribution of the ionized and non-ionized drug species over the pH range of 1 to 8 and on calculating physico-chemical parameters such as telmisartan lipophilicity (log P) and intrinsic solubility (log S0). On the basis of the calculated parameters, the pH-dependent solubility and lipophilicity curves of this medicinal substance have been constructed. The intrinsic dissolution rate and the dissolution rate of telmisartan from tablets were assessed to investigate the influence of the pH of the medium on the behaviour of the model substance. The results obtained from predicting the physico-chemical properties, together with the experimental evaluation of the intrinsic dissolution rate and the dissolution rate of telmisartan from tablets, indicate the importance of physico-chemical characterization of the active substance during preformulation investigations for predicting the drug's behaviour in the body (absorption, bioavailability, tissue penetration, elimination).
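
    As a worked illustration of the pH-dependent solubility calculation described above, the sketch below applies the Henderson-Hasselbalch relation for a monoprotic base, log10 S(pH) = log10 S0 + log10(1 + 10^(pKa - pH)). The pKa and intrinsic solubility values are illustrative placeholders, not measured values for telmisartan, which in reality has several ionizable groups.

```python
# pH-dependent solubility of a monoprotic base via Henderson-Hasselbalch:
#   log10 S(pH) = log10 S0 + log10(1 + 10**(pKa - pH))
# pKa and log_S0 below are placeholders, not telmisartan's real values.
import math

def solubility(pH, pKa=4.5, log_S0=-6.0):
    """Total solubility (mol/L) of a monoprotic base at a given pH."""
    return 10 ** (log_S0 + math.log10(1 + 10 ** (pKa - pH)))

for pH in range(1, 9):  # the pH 1-8 range studied above
    print(f"pH {pH}: S = {solubility(pH):.2e} mol/L")
```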

    The role of idioms in sentiment analysis

    In this paper we investigate the role of idioms in automated approaches to sentiment analysis. To estimate the degree to which the inclusion of idioms as features may improve the results of traditional sentiment analysis, we compared our results to two such baseline methods. First, to support idioms as features we collected a set of 580 idioms that are relevant to sentiment analysis, i.e. those that can be mapped to an emotion. These mappings were obtained using a web-based crowdsourcing approach. The quality of the crowdsourced information is demonstrated by high agreement among five independent annotators, calculated using Krippendorff's alpha coefficient (Ī± = 0.662). Second, to evaluate the results of sentiment analysis, we assembled a corpus of sentences in which idioms are used in context. Each sentence was annotated with an emotion, which formed the basis for the gold standard used for the comparison against the two baseline methods. The performance was evaluated in terms of three measures: precision, recall and F-measure. Overall, our approach achieved 64% and 61% across all three measures in the two experiments, improving on the baseline results by 20 and 15 percentage points respectively. F-measure was significantly improved over all three sentiment polarity classes: Positive, Negative and Other. The most notable improvement was recorded in the classification of positive sentiments, where recall improved by 45 percentage points in both experiments without compromising precision. The statistical significance of these improvements was confirmed by McNemar's test.
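
    A minimal sketch of using an idiom-to-emotion lexicon as classifier features, as described above; the lexicon entries here are illustrative and not drawn from the actual 580-idiom set.

```python
# Illustrative idiom-to-emotion lexicon lookup used as features for
# sentiment classification; entries are examples, not the paper's set.
IDIOM_LEXICON = {
    "over the moon": "joy",
    "down in the dumps": "sadness",
    "at the end of my tether": "anger",
}

def idiom_features(sentence):
    """Return emotion features fired by idioms found in the sentence."""
    lowered = sentence.lower()
    return {f"idiom_emotion={emotion}": True
            for idiom, emotion in IDIOM_LEXICON.items() if idiom in lowered}

print(idiom_features("She was over the moon about the results."))
# {'idiom_emotion=joy': True}
```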

    Facilitating the development of controlled vocabularies for metabolomics technologies with text mining

    BACKGROUND: Many bioinformatics applications rely on controlled vocabularies or ontologies to consistently interpret and seamlessly integrate information scattered across public resources. Experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology, hence the pressing need for vocabularies and ontologies in metabolomics. However, it is time-consuming and non-trivial to construct these resources manually. RESULTS: We describe a methodology for the rapid development of controlled vocabularies, a study originally motivated by the need for vocabularies describing metabolomics technologies. We present case studies involving two controlled vocabularies (for nuclear magnetic resonance spectroscopy and gas chromatography) whose development is currently underway as part of the Metabolomics Standards Initiative. The initial vocabularies were compiled manually, comprising 243 and 152 terms respectively. A total of 5,699 and 2,612 new terms were then acquired automatically from the literature. The analysis of the results showed that full-text articles (especially the Materials and Methods sections), as opposed to paper abstracts, are the major source of technology-specific terms. CONCLUSIONS: We suggest a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a set of controlled vocabularies with the terms used in the scientific literature. We adopted an integrative approach, combining relatively generic software and data resources, for time- and cost-effective development of a text mining tool that expands controlled vocabularies across various domains, offering a practical alternative to both manual term collection and tailor-made named entity recognition methods.
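
    A greatly simplified stand-in for the corpus-based term acquisition described above: rank candidate multi-word terms from Methods text by frequency. Real term recognition methods (e.g. C-value) also weigh term length and nestedness; this sketch only conveys the general idea.

```python
# Simplified candidate-term extraction: frequency-ranked n-grams whose
# edge words are not stopwords. Sentence boundaries are ignored for
# brevity, so some noisy candidates will survive.
import re
from collections import Counter

STOPWORDS = {"the", "of", "and", "a", "an", "in", "to",
             "was", "were", "with", "by", "before"}

def candidate_terms(text, n=2):
    tokens = re.findall(r"[a-z][a-z-]+", text.lower())
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(
        " ".join(g) for g in grams
        if g[0] not in STOPWORDS and g[-1] not in STOPWORDS
    )

methods = ("Samples were analysed by gas chromatography. "
           "Gas chromatography peaks were aligned before analysis.")
print(candidate_terms(methods).most_common(3))
# [('gas chromatography', 2), ...]
```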

    A genetic programming approach to development of clinical prediction models: A case study in symptomatic cardiovascular disease

    BACKGROUND: Genetic programming (GP) is an evolutionary computing methodology capable of identifying complex, non-linear patterns in large data sets. Despite the potential advantages of GP over more typical frequentist statistical methods, its applications to survival analysis are rare at best. The aim of this study was to determine the utility of GP for the automatic development of clinical prediction models. METHODS: We compared GP against the commonly used Cox regression technique in terms of the development and performance of a cardiovascular risk score, using data from the SMART study, a prospective cohort study of patients with symptomatic cardiovascular disease. The composite endpoint was cardiovascular death, non-fatal stroke and myocardial infarction. A total of 3,873 patients aged 19-82 years were enrolled in the study between 1996 and 2006. The cohort was split 70:30 into derivation and validation sets. The derivation set was used for the development of both the GP and Cox regression models. These models were then used to predict the discrete hazards at t = 1, 3, and 5 years. The predictive ability of both models was evaluated in terms of risk discrimination and calibration using the validation set. RESULTS: The discrimination of both models was comparable. At time points t = 1, 3, and 5 years, the C-index was 0.59, 0.69, 0.64 for the GP model and 0.66, 0.70, 0.70 for the Cox regression model, respectively. At the same time points, the calibration of both models, assessed using calibration plots and a generalization of the Hosmer-Lemeshow test statistic, was also comparable, although the Cox model was better calibrated to the validation data. CONCLUSION: Using empirical data, we demonstrated that a prediction model developed automatically by GP has predictive ability comparable to that of a manually tuned Cox regression. The GP model was more complex, but it was developed in a fully automated way and comprised fewer covariates. Furthermore, it did not require the expertise normally needed for its derivation, thereby alleviating the knowledge elicitation bottleneck. Overall, GP demonstrated considerable potential as a method for the automated development of clinical prediction models for diagnostic and prognostic purposes.
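
    For reference, the C-index reported above measures how often a model assigns a higher predicted risk to the patient who actually experienced the event. The sketch below assumes binary outcomes at a fixed time horizon; a full survival C-index would also account for censoring.

```python
# Minimal C-index: the fraction of usable pairs in which the subject
# with the event received the higher predicted risk (ties count half).
# Simplified to a fixed horizon with binary outcomes (no censoring).
from itertools import product

def c_index(risks, events):
    pairs = [(i, j) for i, j in product(range(len(risks)), repeat=2)
             if events[i] == 1 and events[j] == 0]
    if not pairs:
        return float("nan")
    concordant = sum(risks[i] > risks[j] for i, j in pairs)
    ties = sum(risks[i] == risks[j] for i, j in pairs)
    return (concordant + 0.5 * ties) / len(pairs)

risks  = [0.9, 0.4, 0.7, 0.2]   # predicted risk at t = 5 years
events = [1,   0,   1,   0]     # 1 = event occurred by t
print(c_index(risks, events))   # 1.0: every event case ranked higher
```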