Building a semantically annotated corpus of clinical texts
In this paper, we describe the construction of a semantically annotated corpus of clinical texts for use in the development and evaluation of systems for automatically extracting clinically significant information from the textual component of patient records. The paper details the sampling of textual material from a collection of 20,000 cancer patient records, the development of a semantic annotation scheme, the annotation methodology, the distribution of annotations in the final corpus, and the use of the corpus for development of an adaptive information extraction system. The resulting corpus is the most richly semantically annotated resource for clinical text processing built to date, whose value has been demonstrated through its use in developing an effective information extraction system. The detailed presentation of our corpus construction and annotation methodology will be of value to others seeking to build high-quality semantically annotated corpora in biomedical domains.
Development and validation of blood-based proteomic biomarker-sociodemographic diagnostic prediction models to identify major depressive disorder among symptomatic individuals
Major depressive disorder (MDD) is a highly prevalent and disabling condition with a complex pathophysiology that has not been fully elucidated to date. While the socioeconomic burden of the disease is significant, many individuals remain undiagnosed or misdiagnosed. This is largely because the current diagnostic approach that relies on clinical evaluations of signs and symptoms can be subjective, and time and resources tend to be rather limited in primary care where the majority seek help for depression. Therefore, there is a significant and pressing need for an objective, reliable and readily accessible diagnostic test to enable earlier and more accurate diagnosis of MDD. In particular, as individuals experiencing subthreshold levels of depressive symptoms have an increased risk of developing MDD, it would be clinically relevant for such a diagnostic test to be able to identify depressed patients and/or individuals with high risks of incident MDD among symptomatic individuals.
This thesis sought to develop risk prediction models that could potentially be utilised within a clinical setting to facilitate earlier and more accurate diagnosis of MDD. Such models were used to obtain probability estimates of the investigated individuals having or developing MDD based on their blood-based proteomic profiles and other characteristics, including sociodemographic and lifestyle factors. A targeted mass spectrometry approach was used to measure the abundances of a panel of peptides representing proteins, many of which have been previously associated with psychiatric disorders. Biomarkers were investigated in serum samples, which are widely used for blood-based biomarker discovery, as well as in dried blood spot samples, which are relatively novel in the field and carry several advantages. Importantly, this thesis focused on adopting appropriate statistical methods to ensure that the diagnostic predictions made by the models were accurate and reproducible, by addressing problems of model overfitting and model selection uncertainty. A particularly significant aspect of this was the development and application of a multimodel-based approach combining feature extraction and model averaging, which resulted in improved model predictive performance and generalisability.
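The model-averaging idea mentioned above can be illustrated with a minimal sketch. The abstract does not state how the candidate models were weighted, so the use of Akaike weights here is an assumption, and the function names, AIC scores and probabilities are hypothetical values chosen only for illustration.

```python
import math


def akaike_weights(aic_scores):
    """Convert AIC scores into normalised weights (lower AIC gives a larger weight).

    Assumption: AIC-based weighting; the thesis may use a different scheme.
    """
    best = min(aic_scores)
    raw = [math.exp(-0.5 * (score - best)) for score in aic_scores]
    total = sum(raw)
    return [value / total for value in raw]


def averaged_probability(probabilities, weights):
    """Weighted average of the per-model predicted probabilities for one individual."""
    return sum(p * w for p, w in zip(probabilities, weights))


# Hypothetical example: three candidate models scoring the same individual.
aic = [100.0, 102.0, 110.0]   # made-up AIC values, one per model
probs = [0.80, 0.60, 0.30]    # made-up predicted probabilities of MDD
weights = akaike_weights(aic)
p_avg = averaged_probability(probs, weights)
```

Averaging over several plausible models, rather than committing to the single best-scoring one, is one common way to address the model selection uncertainty the abstract refers to.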
Diagnostic prediction models based on serum proteomic, sociodemographic/lifestyle and clinical data were shown to be able to differentiate between subthreshold symptomatic individuals who developed and did not develop MDD. Additionally, diagnostic prediction models based on dried blood spot proteomic and digital mental health assessment data were shown to be able to identify currently depressed patients without an existing MDD diagnosis as well as currently not depressed patients with an existing MDD diagnosis among subthreshold symptomatic individuals. These results clearly demonstrate the potential of such prediction models to be used as an aid to the diagnosis of MDD in clinical practice, especially within the primary care setting. Moreover, MDD was found to be associated with several blood-based proteomic biomarkers, which mainly represented an immune/inflammatory profile, as well as with various other patient features, most notably body mass index and childhood trauma. Although further investigations are needed, these associations reveal disturbances in the stress response pathways involving the hypothalamic-pituitary-adrenal axis in the pathophysiology of depression.
Integrated mining of feature spaces for bioinformatics domain discovery
One of the major challenges in the field of bioinformatics is the elucidation of protein folding for the functional annotation of proteins. The factors that govern protein folding include the chemical, physical, and environmental conditions of the protein's surroundings, which can be measured and exploited for computational discovery purposes. These conditions enable the protein to transform from a sequence of amino acids to a globular three-dimensional structure. Information concerning the folded state of a protein has significant potential to explain biochemical pathways and their involvement in disorders and diseases. This information impacts the ways in which genetic diseases are characterized and cured and in which designer drugs are created. With the exponential growth of protein databases and the limitations of experimental protein structure determination, sophisticated computational methods have been developed and applied to search for, detect, and compare protein homology. Most computational tools developed for protein structure prediction are primarily based on sequence similarity searches. These approaches have improved the prediction accuracy of high sequence similarity proteins but have failed to perform well with proteins of low sequence similarity. Data mining offers unique algorithmic computational approaches that have been used widely in the development of automatic protein structure classification and prediction.
In this dissertation, we present a novel approach for the integration of physico-chemical properties and effective feature extraction techniques for the classification of proteins. Our approaches overcome one of the major obstacles of data mining in protein databases: the encapsulation of different hydrophobicity residue properties into a much reduced feature space that possesses high degrees of specificity and sensitivity in protein structure classification. We have developed three unique computational algorithms for coherent feature extraction on selected scale properties of the protein sequence. When faced with the problem of the unequal cardinality of proteins, our proposed integration scheme effectively handles the varied sizes of proteins and scales well with increasing dimensionality of these sequences. We also detail a two-fold methodology for protein functional annotation. First, we exhibit our success in creating an algorithm that provides a means to integrate multiple physico-chemical properties in the form of a multi-layered abstract feature space, with each layer corresponding to a physico-chemical property. Second, we discuss a wavelet-based segmentation approach that efficiently detects regions of property conservation across all layers of the created feature space.
Finally, we present a unique graph-theory based algorithmic framework for the identification of conserved hydrophobic residue interaction patterns using identified scales of hydrophobicity. We report that these discriminatory features are specific to a family of proteins, which consist of conserved hydrophobic residues that are then used for structural classification. We also present our rigorously tested validation schemes, which report significant degrees of accuracy to show that homologous proteins exhibit the conservation of physico-chemical properties along the protein backbone. We conclude our discussion by summarizing our results and contributions and by listing our goals for future research.
Microarray Data Mining and Gene Regulatory Network Analysis
The novel molecular biological technology, microarray, makes it feasible to obtain quantitative measurements of expression of thousands of genes present in a biological sample simultaneously. Genome-wide expression data generated from this technology promise to uncover implicit, previously unknown biological knowledge. In this study, several problems in microarray data mining were investigated, including feature (gene) selection, classifier gene identification, generation of reference genetic interaction networks for non-model organisms, and gene regulatory network (GRN) reconstruction using time-series gene expression data. Most existing computational models employed to infer gene regulatory networks suffer from either low accuracy or high computational complexity. To overcome these limitations, the following strategies were proposed to integrate bioinformatics data mining techniques with existing GRN inference algorithms, enabling the discovery of novel biological knowledge. An integrated statistical and machine learning (ISML) pipeline was developed for feature selection and classifier gene identification to address the curse of dimensionality and the huge search space. Using the selected classifier genes as seeds, a scale-up technique was applied to search through major databases of genetic interaction networks, metabolic pathways, etc.
By curating relevant genes and blasting genomic sequences of non-model organisms against well-studied genetic model organisms, a reference gene regulatory network for less-studied organisms was built and used both as prior knowledge and model validation for GRN reconstructions. Networks of gene interactions were inferred using a Dynamic Bayesian Network (DBN) approach and were analyzed for elucidating the dynamics caused by perturbations. Our proposed pipelines were applied to investigate molecular mechanisms for chemical-induced reversible neurotoxicity.
Sparse Proteomics Analysis - A compressed sensing-based approach for feature selection and classification of high-dimensional proteomics mass spectrometry data
Background: High-throughput proteomics techniques, such as mass spectrometry
(MS)-based approaches, produce very high-dimensional data-sets. In a clinical
setting one is often interested in how mass spectra differ between patients of
different classes, for example spectra from healthy patients vs. spectra from
patients having a particular disease. Machine learning algorithms are needed to
(a) identify these discriminating features and (b) classify unknown spectra
based on this feature set. Since the acquired data is usually noisy, the
algorithms should be robust against noise and outliers, while the identified
feature set should be as small as possible.
Results: We present a new algorithm, Sparse Proteomics Analysis (SPA), based
on the theory of compressed sensing that allows us to identify a minimal
discriminating set of features from mass spectrometry data-sets. We show (1)
how our method performs on artificial and real-world data-sets, (2) that its
performance is competitive with standard (and widely used) algorithms for
analyzing proteomics data, and (3) that it is robust against random and
systematic noise. We further demonstrate the applicability of our algorithm to
two previously published clinical data-sets.
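
To make the idea of a small discriminating feature set concrete, the following sketch selects features by soft-thresholding the label-weighted mean of each feature and then classifies by the sign of a linear score. This is a generic sparsity-inducing heuristic, not the SPA algorithm itself, whose details the abstract does not give. The function names, the threshold `lam` and the toy spectra are all hypothetical.

```python
import math


def soft_threshold(value, lam):
    """Shrink value towards zero by lam; values within [-lam, lam] become exactly 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def sparse_feature_weights(spectra, labels, lam):
    """Return a sparse, unit-norm weight vector from labelled spectra.

    spectra: list of equal-length intensity lists (assumed roughly centred).
    labels:  +1 / -1 class label for each spectrum.
    lam:     hypothetical sparsity threshold; larger values keep fewer features.
    """
    n = len(spectra)
    n_features = len(spectra[0])
    # Label-weighted mean of each feature: large magnitude means discriminative.
    corr = [sum(y * row[j] for row, y in zip(spectra, labels)) / n
            for j in range(n_features)]
    shrunk = [soft_threshold(c, lam) for c in corr]
    norm = math.sqrt(sum(w * w for w in shrunk))
    return [w / norm for w in shrunk] if norm > 0 else shrunk


def classify(spectrum, weights):
    """Assign class +1 or -1 from the sign of the linear score."""
    score = sum(x * w for x, w in zip(spectrum, weights))
    return 1 if score >= 0 else -1


# Toy data (made up): only the feature at index 2 separates the two classes.
spectra = [
    [0.1, -0.2, 2.0, 0.0],
    [-0.1, 0.1, 1.8, 0.1],
    [0.0, 0.2, -2.1, -0.1],
    [0.1, -0.1, -1.9, 0.0],
]
labels = [1, 1, -1, -1]
weights = sparse_feature_weights(spectra, labels, lam=0.5)
selected = [j for j, w in enumerate(weights) if w != 0.0]
predictions = [classify(s, weights) for s in spectra]
```

Because the threshold zeroes out weakly discriminating features, small noise in those features cannot change the classification, which mirrors the robustness and minimality goals stated in the abstract.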