Normalizing biomedical terms by minimizing ambiguity and variability
Background: One of the difficulties in mapping biomedical named entities, e.g. genes, proteins, chemicals and diseases, to their concept identifiers stems from the potential variability of the terms. Soft string matching is a possible solution to the problem, but its inherent heavy computational cost discourages its use when the dictionaries are large or when real-time processing is required. A less computationally demanding approach is to normalize the terms by using heuristic rules, which enables us to look up a dictionary in constant time regardless of its size. The development of good heuristic rules, however, requires extensive knowledge of the terminology in question and thus is the bottleneck of the normalization approach.
Results: We present a novel framework for discovering a list of normalization rules from a dictionary in a fully automated manner. The rules are discovered in such a way that they minimize the ambiguity and variability of the terms in the dictionary. We evaluated our algorithm using two large dictionaries: a human gene/protein name dictionary built from BioThesaurus and a disease name dictionary built from UMLS.
Conclusions: The experimental results showed that automatically discovered rules can perform comparably to carefully crafted heuristic rules in term mapping tasks, and the computational overhead of rule application is small enough that a very fast implementation is possible. This work will help improve the performance of term-concept mapping tasks in biomedical information extraction, especially when good normalization heuristics for the target terminology are not fully known.
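The normalize-then-look-up idea described in this abstract can be illustrated with a minimal Python sketch. The two rules, the function names (`normalize`, `build_index`, `lookup`) and the toy identifiers are hypothetical and written by hand for illustration; the paper discovers its rules automatically, and that discovery step is not shown here.

```python
import re

# Hypothetical hand-written rules, for illustration only.
# The paper discovers its rules automatically from the dictionary.
RULES = [
    (re.compile(r"[-_/]"), " "),  # treat common separators as spaces
    (re.compile(r"\s+"), " "),    # collapse runs of whitespace
]


def normalize(term):
    """Apply each rule in order, then lowercase and trim."""
    for pattern, replacement in RULES:
        term = pattern.sub(replacement, term)
    return term.lower().strip()


def build_index(dictionary):
    """Map each normalized name to the set of concept IDs that share it."""
    index = {}
    for concept_id, names in dictionary.items():
        for name in names:
            index.setdefault(normalize(name), set()).add(concept_id)
    return index


def lookup(index, mention):
    """Hash lookup on the normalized mention: average constant time."""
    return index.get(normalize(mention), set())


# Toy dictionary with made-up concept identifiers.
toy = {
    "C001": ["NF-kappa B", "NF kappaB"],
    "C002": ["IL-2", "interleukin 2"],
}
index = build_index(toy)
print(lookup(index, "il_2"))
```

A normalized key that maps to more than one concept ID signals ambiguity, while several distinct keys for one concept signal remaining variability; these are the two quantities the abstract says the discovered rules should minimize.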
Natural Language Query in the Biochemistry and Molecular Biology Domains Based on Cognition Search™
Motivation: With the tremendous growth in scientific literature, it is necessary to improve upon the standard pattern matching style of the available search engines. Semantic NLP may be the solution to this problem. Cognition Search (CSIR) is a natural language technology. It is best used by asking a simple question that might be answered in textual data being queried, such as MEDLINE. CSIR has a large English dictionary and semantic database. Cognition’s semantic map enables the search process to be based on meaning rather than statistical word pattern matching and, therefore, returns more complete and relevant results. The Cognition Search engine uses downward reasoning and synonymy which also improves recall. It improves precision through phrase parsing and word sense disambiguation.
Result: Here we have carried out several projects to "teach" the CSIR lexicon medical, biochemical and molecular biological language and acronyms from curated, free web-based sources. Vocabulary from the Alliance for Cell Signaling (AfCS), the HUGO Gene Nomenclature Committee (HGNC), the Unified Medical Language System (UMLS) Metathesaurus, and the International Union of Pure and Applied Chemistry (IUPAC) was introduced into the CSIR dictionary and curated. The resulting system was used to interpret MEDLINE abstracts. Meaning-based search of MEDLINE abstracts yields high precision (estimated at >90%) and high recall (estimated at >90%) where synonym information has been encoded. The present implementation can be found at http://MEDLINE.cognition.com.

ProNormz – An integrated approach for human proteins and protein kinases normalization
Recognizing and normalizing protein name mentions in biomedical literature is a challenging task and is important for text mining applications such as protein–protein interactions, pathway reconstruction and many more. In this paper, we present ProNormz, an integrated approach for human proteins (HPs) tagging and normalization. In Homo sapiens, a large number of biological processes are regulated through post-translational phosphorylation by a large human gene family called protein kinases. Recognition and normalization of human protein kinases (HPKs) is considered important for extracting the underlying information on their regulatory mechanisms from biomedical literature. ProNormz distinguishes HPKs from other HPs besides tagging and normalization. To our knowledge, ProNormz is the first normalization system available that distinguishes HPKs from other HPs in addition to performing the gene normalization task. ProNormz incorporates a specialized synonyms dictionary for human proteins and protein kinases, a set of 15 string matching rules and a disambiguation module to achieve the normalization. Experimental results on the benchmark BioCreative II training and test datasets show that our integrated approach achieves fairly good performance and outperforms more sophisticated semantic similarity and disambiguation systems presented in the BioCreative II GN task. As a freely available web tool, ProNormz is useful to developers as an extensible gene normalization implementation, to researchers as a standard for comparing their innovative techniques, and to biologists for normalization and categorization of HP and HPK mentions in biomedical literature. URL: http://www.biominingbu.org/pronormz
Uncertainty quantification in medical image segmentation with normalizing flows
Medical image segmentation is inherently an ambiguous task due to factors
such as partial volumes and variations in anatomical definitions. While in most
cases the segmentation uncertainty is around the border of structures of
interest, there can also be considerable inter-rater differences. The class of
conditional variational autoencoders (cVAE) offers a principled approach to
inferring distributions over plausible segmentations that are conditioned on
input images. Segmentation uncertainty estimated from samples of such
distributions can be more informative than using pixel level probability
scores. In this work, we propose a novel conditional generative model that is
based on conditional Normalizing Flow (cFlow). The basic idea is to increase
the expressivity of the cVAE by introducing a cFlow transformation step after
the encoder. This yields improved approximations of the latent posterior
distribution, allowing the model to capture richer segmentation variations.
With this we show that the quality and diversity of samples obtained from our
conditional generative model is enhanced. Performance of our model, which we
call cFlow Net, is evaluated on two medical imaging datasets demonstrating
substantial improvements in both qualitative and quantitative measures when
compared to a recent cVAE based model.
Comment: 12 pages. Accepted to be presented at 11th International Workshop on
Machine Learning in Medical Imaging. Source code will be updated at
https://github.com/raghavian/cFlo
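To make the "flow step after the encoder" idea concrete, here is a minimal Python sketch of a single planar normalizing flow applied to a latent sample. The planar flow is chosen purely for illustration, since the abstract does not say which flow family cFlow Net uses; the function names and all numeric values are made up, and in a conditional flow the parameters `u`, `w` and `b` would be predicted from the input image rather than fixed.

```python
import math


def planar_flow(z, u, w, b):
    """One planar flow step: z' = z + u * tanh(w.z + b).

    Returns the transformed vector and log|det(dz'/dz)|.
    """
    a = sum(wi * zi for wi, zi in zip(w, z)) + b
    h = math.tanh(a)
    z_new = [zi + ui * h for zi, ui in zip(z, u)]
    # Matrix determinant lemma: det(I + u psi^T) = 1 + u.psi,
    # with psi = (1 - tanh(a)^2) * w.
    u_dot_w = sum(ui * wi for ui, wi in zip(u, w))
    log_det = math.log(abs(1.0 + (1.0 - h * h) * u_dot_w))
    return z_new, log_det


def gaussian_log_density(z, mu, sigma):
    """Log-density of a diagonal Gaussian evaluated at z."""
    return sum(
        -0.5 * math.log(2 * math.pi) - math.log(s) - 0.5 * ((zi - m) / s) ** 2
        for zi, m, s in zip(z, mu, sigma)
    )


# Made-up encoder output and latent sample (assumptions for illustration).
mu, sigma = [0.0, 0.0], [1.0, 1.0]
z0 = [0.5, -0.3]

# Fixed flow parameters; u.w = 0.2 > -1 keeps this planar flow invertible.
z1, log_det = planar_flow(z0, u=[0.4, 0.1], w=[1.0, -2.0], b=0.0)

# Change of variables: log q1(z1) = log q0(z0) - log|det J|.
log_q1 = gaussian_log_density(z0, mu, sigma) - log_det
print(z1, log_q1)
```

Stacking several such steps lets the approximate posterior depart from a simple Gaussian, which is the added expressivity the abstract refers to.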
Effect of latent space distribution on the segmentation of images with multiple annotations
We propose the Generalized Probabilistic U-Net, which extends the
Probabilistic U-Net by allowing more general forms of the Gaussian distribution
as the latent space distribution that can better approximate the uncertainty in
the reference segmentations. We study the effect the choice of latent space
distribution has on capturing the variation in the reference segmentations for
lung tumors and white matter hyperintensities in the brain. We show that the
choice of distribution affects the sample diversity of the predictions and
their overlap with respect to the reference segmentations. We have made our
implementation available at
https://github.com/ishaanb92/GeneralizedProbabilisticUNet
Comment: Accepted for publication at the Journal of Machine Learning for
Biomedical Imaging (MELBA) https://melba-journal.org/2023:005. arXiv admin
note: text overlap with arXiv:2207.1287
Investigating and Improving Latent Density Segmentation Models for Aleatoric Uncertainty Quantification in Medical Imaging
Data uncertainties, such as sensor noise or occlusions, can introduce
irreducible ambiguities in images, which result in varying, yet plausible,
semantic hypotheses. In Machine Learning, this ambiguity is commonly referred
to as aleatoric uncertainty. Latent density models can be utilized to address
this problem in image segmentation. The most popular approach is the
Probabilistic U-Net (PU-Net), which uses latent Normal densities to optimize
the conditional data log-likelihood Evidence Lower Bound. In this work, we
demonstrate that the PU-Net latent space is severely inhomogeneous. As a
result, the effectiveness of gradient descent is inhibited and the model
becomes extremely sensitive to the localization of the latent space samples,
resulting in defective predictions. To address this, we present the Sinkhorn
PU-Net (SPU-Net), which uses the Sinkhorn Divergence to promote homogeneity
across all latent dimensions, effectively improving gradient-descent updates
and model robustness. Our results show that, on public datasets covering
various clinical segmentation problems, the SPU-Net achieves performance gains
of up to 11% over preceding latent variable models for probabilistic
segmentation on the Hungarian-Matched metric. The results
indicate that by encouraging a homogeneous latent space, one can significantly
improve latent density modeling for medical image segmentation.
Comment: 12 pages incl. references, 11 figures
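The Hungarian-Matched metric mentioned in this abstract scores a set of sampled segmentations against a set of reference annotations under the best one-to-one matching. The Python sketch below is an assumption-laden illustration, not the authors' evaluation code: it swaps the Hungarian algorithm for an exhaustive search over permutations (which finds the same optimal matching but is only practical for small sets), assumes both sets have the same size, and scores an empty-versus-empty pair as 1.0 by convention.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two binary masks given as flat 0/1 lists."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    # Convention assumed here: two empty masks agree perfectly.
    return 1.0 if union == 0 else inter / union


def matched_iou(samples, references):
    """Mean IoU under the best one-to-one matching of samples to references.

    Exhaustive search over permutations stands in for the Hungarian
    algorithm; both return the optimal assignment.
    """
    n = len(samples)
    if n == 0 or n != len(references):
        raise ValueError("this sketch assumes non-empty, equally sized sets")
    best = 0.0
    for perm in permutations(range(n)):
        score = sum(iou(samples[i], references[j]) for i, j in enumerate(perm)) / n
        best = max(best, score)
    return best


# Toy masks: the samples reproduce the references in a different order.
references = [[1, 1, 0, 0], [1, 1, 1, 0]]
samples = [[1, 1, 1, 0], [1, 1, 0, 0]]
print(matched_iou(samples, references))
```

Because the score is taken over the best matching, a model is not penalized for producing the plausible segmentations in a different order than the annotators did.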
That Label's Got Style: Handling Label Style Bias for Uncertain Image Segmentation
Segmentation uncertainty models predict a distribution over plausible
segmentations for a given input, which they learn from the annotator variation
in the training set. However, in practice these annotations can differ
systematically in the way they are generated, for example through the use of
different labeling tools. This results in datasets that contain both data
variability and differing label styles. In this paper, we demonstrate that
applying state-of-the-art segmentation uncertainty models on such datasets can
lead to model bias caused by the different label styles. We present an updated
modelling objective conditioning on labeling style for aleatoric uncertainty
estimation, and modify two state-of-the-art architectures for segmentation
uncertainty accordingly. We show with extensive experiments that this method
reduces label style bias, while improving segmentation performance, increasing
the applicability of segmentation uncertainty models in the wild. We curate two
datasets, with annotations in different label styles, which we will make
publicly available along with our code upon publication.
Focused Proofreading: Efficiently Extracting Connectomes from Segmented EM Images
Identifying complex neural circuitry from electron microscopic (EM) images
may help unlock the mysteries of the brain. However, identifying this circuitry
requires time-consuming, manual tracing (proofreading) due to the size and
intricacy of these image datasets, thus limiting state-of-the-art analysis to
very small brain regions. Potential avenues to improve scalability include
automatic image segmentation and crowd sourcing, but current efforts have had
limited success. In this paper, we propose a new strategy, focused
proofreading, that works with automatic segmentation and aims to limit
proofreading to the regions of a dataset that are most impactful to the
resulting circuit. We then introduce a novel workflow, which exploits
biological information such as synapses, and apply it to a large dataset in the
fly optic lobe. With our techniques, we achieve significant tracing speedups of
3-5x without sacrificing the quality of the resulting circuit. Furthermore, our
methodology makes the task of proofreading much more accessible and hence
potentially enhances the effectiveness of crowd sourcing.
Evaluation and cross-comparison of lexical entities of biological interest (LexEBI)
MOTIVATION:
Biomedical entities, their identifiers and names, are essential in the representation of biomedical facts and knowledge. In the same way, the complete set of biomedical and chemical terms, i.e. the biomedical "term space" (the "Lexeome"), forms a key resource to achieve the full integration of the scientific literature with biomedical data resources: any identified named entity can immediately be normalized to the correct database entry. This goal does not only require that we are aware of all existing terms, but would also profit from knowing all their senses and their semantic interpretation (ambiguities, nestedness).
RESULT:
This study compiles a resource for lexical terms of biomedical interest in a standard format (called "LexEBI"), and determines the overall number of terms, their reuse in different resources and the nestedness of terms. LexEBI comprises references for protein and gene entries and their term variants, and chemical entities, amongst other terms. In addition, disease terms have been identified from Medline and PubmedCentral and added to LexEBI. Our analysis demonstrates that the baseforms of terms from the different semantic types show little polysemous use. Nonetheless, the term variants of protein and gene names (PGNs) frequently contain species mentions, which should have been avoided according to protein annotation guidelines. Furthermore, both the protein and gene entities and the chemical entities comprise enzymes, leading to hierarchical polysemy, and a large portion of PGNs make reference to a chemical entity. Altogether, according to our analysis based on the Medline distribution, 401,869 unique PGNs in the documents contain a reference to 25,022 chemical entities, 3,125 disease terms or 1,576 species mentions.
CONCLUSION:
LexEBI delivers the complete biomedical and chemical Lexeome in a standardized representation (http://www.ebi.ac.uk/Rebholz-srv/LexEBI/). The resource provides the disease terms as open source content, and fully interlinks terms across resources.