
    Large-scale protein-protein post-translational modification extraction with distant supervision and confidence calibrated BioBERT

    Protein-protein interactions (PPIs) are critical to normal cellular function and are related to many disease pathways. A range of protein functions are mediated and regulated by protein interactions through post-translational modifications (PTMs). However, only 4% of PPIs are annotated with PTMs in biological knowledge databases such as IntAct, mainly through manual curation, which is neither time- nor cost-effective. Here we aim to facilitate annotation by extracting PPIs along with their pairwise PTM from the literature, using distantly supervised training data and deep learning to aid human curation. Method: We use the IntAct PPI database to create a distantly supervised dataset annotated with interacting protein pairs, their corresponding PTM type, and associated abstracts from the PubMed database. We train an ensemble of BioBERT models, dubbed PPI-BioBERT-x10, to improve confidence calibration. We extend the ensemble-average-confidence approach with confidence variation to counteract the effects of class imbalance and extract high-confidence predictions. Results and conclusion: The PPI-BioBERT-x10 model evaluated on the test set achieved a modest micro-F1 of 41.3 (P = 58.1, R = 32.1). However, by combining high confidence with low variation to identify high-quality predictions and tuning the predictions for precision, we retained 19% of the test predictions with 100% precision. We evaluated PPI-BioBERT-x10 on 18 million PubMed abstracts, extracted 1.6 million PTM-PPI predictions (546,507 unique PTM-PPI triplets), and filtered 5,700 (4,584 unique) high-confidence predictions. Of these 5,700, human evaluation of a small randomly sampled subset shows that precision drops to 33.7% despite confidence calibration, highlighting the challenges of generalisability beyond the test set. We circumvent the problem by including only predictions associated with multiple papers, improving the precision to 58.8%.
In this work, we highlight the benefits and challenges of deep learning-based text mining in practice, and the need for increased emphasis on confidence calibration to facilitate human curation efforts.
Aparna Elangovan, Yuan Li, Douglas E. V. Pires, Melissa J. Davis, and Karin Verspoor
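The filtering idea described above — keep only predictions whose ensemble-average confidence is high and whose across-model variation is low — can be sketched as follows. This is a minimal illustration, not the paper's code: the thresholds, function name, and data layout are assumptions.

```python
from statistics import mean, stdev

def filter_predictions(ensemble_probs, conf_min=0.9, var_max=0.05):
    """Keep prediction ids whose ensemble-average confidence is at
    least conf_min and whose across-model standard deviation is at
    most var_max (illustrative thresholds, not the paper's values)."""
    kept = []
    for pred_id, probs in ensemble_probs.items():
        if mean(probs) >= conf_min and stdev(probs) <= var_max:
            kept.append(pred_id)
    return kept

# Each key stands for a candidate PTM-PPI triplet; the list holds the
# confidence each model in the ensemble assigned to it (toy numbers).
probs = {"tripletA": [0.95, 0.94, 0.96],
         "tripletB": [0.99, 0.40, 0.80]}
print(filter_predictions(probs))  # only tripletA survives both filters
```

tripletB is rejected even though one model is very confident: disagreement across the ensemble is treated as a warning sign, which is the point of combining average confidence with variation.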

    Comprehensive high performance thin layer chromatography (HPTLC) fingerprinting in quality control of herbal drugs, preparations and products

    Quality control of herbals has its roots in the study of morphoanatomic and organoleptic characters. Nevertheless, in the last century, with the evolution of analytical chemistry, quality control rapidly evolved from elementary tests to the use of sophisticated instruments combined with software for data management. Today, many authorities and organizations recommend a suite of tests, featuring many of these instruments, to evaluate the quality of herbal products. HPTLC offers a comprehensive set of data that can be used not only for identification but also to evaluate the purity and content of herbal drugs, herbal preparations, and herbal products. The objective of this doctoral thesis was to explore in depth the capacities of HPTLC and to develop applications for quality control of herbals, far beyond simple identification of the herbal drugs, preparations, and products. To that end, five studies were carried out. In the first study, the quality of herbal drugs, preparations, and products from milk thistle fruit, coneflower root and aerial parts, and black cohosh root, regulated as food supplements or medicines, was evaluated with existing HPTLC methods. The suitability of these methods, using the entire fingerprint and several detection modes, as a tool for detecting quality problems, mainly adulterations, was confirmed. In the second study, the comprehensive HPTLC fingerprinting concept was developed with the goal of simplifying the quality control process. This concept combines the qualitative and quantitative information of HPTLC images, obtained in a single analysis, to evaluate the identity, purity, and content of herbals. The possibilities of applying it to identify an herbal drug, detect mixtures with related species (purity), and develop a minimum content test of an analytical marker were demonstrated in Angelica gigas root.
In the third study, the application of comprehensive HPTLC fingerprinting aimed to go a step further in the test for adulterants and to evaluate the use of HPTLC for purity limit tests. This approach was evaluated with samples of ginkgo leaf and extracts, commercialized as food supplements in different countries. The study demonstrated that the information contained in the HPTLC fingerprints was suitable for verifying levels of rutin and quercetin, providing results similar to those of the HPLC limit test. It was also useful for detecting mixtures of ginkgo products not only with rutin and quercetin but also with buckwheat herb and sophora (flower bud or fruit). In the fourth study, the use of comprehensive HPTLC fingerprinting was evaluated as an alternative to the current HPLC assay of markers of TCM drugs in the Ph. Eur. The goal of this project was to simplify the determination of content and thus reduce the number of tests to be performed during quality control. For this evaluation, two TCM herbal drugs were chosen by the experts of the TCM working party of the Ph. Eur.: Fritillaria thunbergii bulbs and corydalis rhizome. In both cases, comprehensive HPTLC fingerprinting proved useful for identification and minimum content testing in a single analysis. The fifth study goes a step further in content determination. While the previous studies focused on the quantification of single markers, this study aimed to apply comprehensive HPTLC fingerprinting to quantify a group of constituents in an herbal drug, as an example of a more holistic assessment of quality. This determination was combined with the tests for purity and identity. To illustrate the concept, Ganoderma lucidum fruiting body was chosen. In this work, HPTLC proved to be a useful technique for routine quality control of herbal drugs, preparations, and products. As demonstrated, it can simplify this process by applying the concept of comprehensive HPTLC fingerprinting.
A detailed guideline on how to develop, validate, and apply comprehensive HPTLC fingerprinting methods for routine quality control of herbals has been elaborated and is also included in the thesis.
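Underlying any minimum-content test derived from a fingerprint is a calibration step relating the measured band intensity of a marker to its concentration. An ordinary least-squares line is the simplest form of that step; the sketch below is illustrative only (toy data and invented names, not a method taken from the thesis).

```python
def fit_calibration(concs, intensities):
    """Fit intensity = slope * concentration + intercept by ordinary
    least squares; used to read a sample's marker content off its
    HPTLC band intensity (illustrative sketch)."""
    n = len(concs)
    mx = sum(concs) / n
    my = sum(intensities) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(concs, intensities))
             / sum((x - mx) ** 2 for x in concs))
    intercept = my - slope * mx
    return slope, intercept

# Toy calibration standards: concentration (mg/mL) vs. band intensity.
slope, intercept = fit_calibration([1.0, 2.0, 3.0], [2.1, 4.1, 6.1])
print(slope, intercept)
```

In a minimum-content test, the fitted line is inverted: a sample passes if its band intensity implies a concentration at or above the declared minimum.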

    An Integrative Multi-Network and Multi-Classifier Approach to Predict Genetic Interactions

    Genetic interactions occur when a combination of mutations results in a surprising phenotype. These interactions capture functional redundancy, and thus are important for predicting function, dissecting protein complexes into functional pathways, and exploring the mechanistic underpinnings of common human diseases. Synthetic sickness and lethality are the most studied types of genetic interactions in yeast. However, even in yeast, only a small proportion of gene pairs have been tested for genetic interactions due to the large number of possible combinations of gene pairs. To expand the set of known synthetic lethal (SL) interactions, we have devised an integrative, multi-network approach for predicting these interactions that significantly improves upon existing approaches. First, we defined a large number of features for characterizing the relationships between pairs of genes from various data sources. In particular, these features are independent of the known SL interactions, in contrast to some previous approaches. Using these features, we developed a non-parametric multi-classifier system for predicting SL interactions that enabled the simultaneous use of multiple classification procedures. Several comprehensive experiments demonstrated that the SL-independent features in conjunction with the advanced classification scheme led to improved performance when compared to the current state-of-the-art method. Using this approach, we derived the first yeast transcription factor genetic interaction network, part of which was well supported by literature. We also used this approach to predict SL interactions between all non-essential gene pairs in yeast (http://sage.fhcrc.org/downloads/downloads/predicted_yeast_genetic_interactions.zip).
This integrative approach is expected to be more effective and robust in uncovering new genetic interactions from the tens of millions of unknown gene pairs in yeast and from the hundreds of millions of gene pairs in higher organisms like mouse and human, in which very few genetic interactions have been identified to date.
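The multi-classifier idea — combining the verdicts of several independent classification procedures over the same SL-independent feature vector — can be sketched as a majority vote. This is a deliberately simplified stand-in for the non-parametric system described above; the feature names and toy classifiers are invented for illustration.

```python
from collections import Counter

def ensemble_predict(pair_features, classifiers):
    """Majority vote across heterogeneous classifiers; each classifier
    maps a gene-pair feature dict to True (predicted SL) or False."""
    votes = [clf(pair_features) for clf in classifiers]
    label, _count = Counter(votes).most_common(1)[0]
    return label

# Toy classifiers over invented gene-pair features (coexpression
# correlation and number of shared GO annotations).
classifiers = [
    lambda f: f["coexpr"] > 0.5,
    lambda f: f["shared_go"] >= 2,
    lambda f: f["coexpr"] + 0.1 * f["shared_go"] > 0.6,
]
print(ensemble_predict({"coexpr": 0.8, "shared_go": 1}, classifiers))
```

Two of the three toy classifiers vote True for this pair, so the ensemble predicts an SL interaction; a single disagreeing classifier cannot flip the call, which is the robustness argument for combining procedures.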

    Timely and reliable evaluation of the effects of interventions: a framework for adaptive meta-analysis (FAME)

    Most systematic reviews are retrospective and use aggregate data (AD) from publications, meaning they can be unreliable, lag behind therapeutic developments, and fail to influence ongoing or new trials. Commonly, the potential influence of unpublished or ongoing trials is overlooked when interpreting results, or when determining the value of updating the meta-analysis or the need to collect individual participant data (IPD). Therefore, we developed a Framework for Adaptive Meta-analysis (FAME) to determine prospectively the earliest opportunity for reliable AD meta-analysis. We illustrate FAME using two systematic reviews in men with metastatic (M1) and non-metastatic (M0) hormone-sensitive prostate cancer (HSPC).
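FAME itself is a planning framework, but the AD meta-analysis it decides when to run is typically an inverse-variance pooling of published effect estimates. A fixed-effect version can be sketched as follows (illustrative only; the function name and the numbers are made up, and real reviews would usually also consider random-effects models):

```python
import math

def fixed_effect_meta(effects, standard_errors):
    """Inverse-variance fixed-effect pooling: each trial's effect
    estimate is weighted by 1/SE^2, so more precise trials count more."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

# Two toy trials reporting log hazard ratios with their standard errors.
pooled, pooled_se = fixed_effect_meta([0.2, 0.4], [0.1, 0.1])
print(pooled, pooled_se)
```

With equal standard errors the pooled estimate is the simple average, and the pooled standard error shrinks below either trial's — the gain in precision that makes a timely AD meta-analysis worthwhile.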

    Biomedical Information Extraction: Mining Disease Associated Genes from Literature

    Disease-associated gene discovery is a critical step toward realizing the future of personalized medicine. However, empirical and clinical validation of disease-associated genes is time-consuming and expensive. In silico discovery of disease-associated genes from the literature is therefore becoming the first essential step of biomarker discovery, supporting hypothesis formulation and decision making. Completion of the Human Genome Project and the advent of high-throughput technology have produced a tremendous amount of data, resulting in exponential growth of the biomedical knowledge deposited in literature databases. The sheer quantity of unexplored information causes information overload for biomedical researchers and poses a big challenge for informatics researchers in addressing users' information-extraction needs. This thesis focused on mining disease-associated genes from the PubMed literature database using machine learning and graph-theory-based information extraction (IE) methods. Mining disease-associated genes is not trivial and requires a pipeline of information extraction steps and methods. Beginning with named entity recognition (NER), the author introduced semantic concept types into the feature space for conditional random fields machine learning and demonstrated the effectiveness of the concept feature for disease NER. The effects of domain-specific POS tagging, domain-specific dictionaries, and the named entity encoding scheme on NER performance were also explored. Experimental results show that combining a knowledge base with the concept feature space significantly improves overall disease NER performance. It was also shown that shallow linguistic features of global and local word-sequence context can be used with a string-kernel-based support vector machine (SVM) for efficient disease-gene relation extraction.
Lastly, the disease-associated gene network was constructed by utilizing a concept co-occurrence matrix computed from a disease-focused document collection, and subjected to systematic topology analysis. The gene network was then merged with a seed-gene-expanded network to form a heterogeneous disease-gene network. The author identified and prioritized disease-associated genes by graph centrality measurements. This novel approach provides a new means of extracting disease-associated genes from large corpora.
Ph.D., Information Studies -- Drexel University, 201
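The final pipeline step — building a concept co-occurrence network and ranking genes by centrality — can be sketched with weighted degree as one simple centrality measure. This is illustrative: the thesis uses several centrality measurements, and the gene symbols below are toy data.

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence(doc_entities):
    """Edge weights = number of documents in which two concepts co-occur."""
    weights = defaultdict(int)
    for ents in doc_entities:
        for a, b in combinations(sorted(set(ents)), 2):
            weights[(a, b)] += 1
    return weights

def rank_by_degree(weights):
    """Prioritize nodes by weighted degree, one simple graph centrality."""
    degree = defaultdict(int)
    for (a, b), w in weights.items():
        degree[a] += w
        degree[b] += w
    return sorted(degree, key=degree.get, reverse=True)

# Toy "documents", each reduced to the gene entities recognized in it.
docs = [["BRCA1", "TP53"], ["BRCA1", "TP53", "EGFR"], ["BRCA1", "EGFR"]]
ranking = rank_by_degree(build_cooccurrence(docs))
print(ranking)  # BRCA1 ranks first: it co-occurs in every document
```

In the thesis this network is merged with a seed-gene-expanded network before ranking; the sketch shows only the co-occurrence-and-centrality core.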

    Generation and Applications of Knowledge Graphs in Systems and Networks Biology

    The acceleration in the generation of data in the biomedical domain has necessitated the use of computational approaches to assist in its interpretation. However, these approaches rely on the availability of high-quality, structured, formalized biomedical knowledge. This thesis has two goals: to improve methods for curation and semantic data integration in order to generate high-granularity biological knowledge graphs, and to develop novel methods for using prior biological knowledge to propose new biological hypotheses. The first two publications describe an ecosystem for handling biological knowledge graphs encoded in the Biological Expression Language throughout the stages of curation, visualization, and analysis. The next two publications describe the reproducible acquisition and integration of high-granularity knowledge with low contextual specificity from structured biological data sources on a massive scale, and support the semi-automated curation of new content at high speed and precision. After building the ecosystem and acquiring content, the last three publications in this thesis demonstrate three different applications of biological knowledge graphs in modeling and simulation. The first demonstrates the use of agent-based modeling for simulation of neurodegenerative disease biomarker trajectories using biological knowledge graphs as priors. The second applies network representation learning to prioritize nodes in biological knowledge graphs based on corresponding experimental measurements to identify novel targets. Finally, the third uses biological knowledge graphs and develops algorithms to deconvolute the mechanism of action of drugs, which could also serve to identify drug repositioning candidates. Ultimately, this thesis lays the groundwork for production-level applications of drug repositioning algorithms and other knowledge-driven approaches to analyzing biomedical experiments.
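As a much simpler stand-in for the network representation learning used for node prioritization, plain PageRank over a knowledge-graph edge list already illustrates the idea of ranking nodes by graph structure. This is a sketch under that substitution, not the thesis's method, and the three-node graph is a toy.

```python
def pagerank(edges, damping=0.85, iters=50):
    """Power-iteration PageRank over a directed edge list; every node
    here has at least one outgoing edge, so no dangling-node handling."""
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [t for s, t in edges if s == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        rank = {
            n: (1 - damping) / len(nodes)
               + damping * sum(rank[s] / len(out[s])
                               for s, t in edges if t == n)
            for n in nodes
        }
    return rank

# Toy knowledge graph: a three-node cycle, so ranks converge to uniform.
ranks = pagerank([("A", "B"), ("B", "C"), ("C", "A")])
print(ranks)
```

Prioritization then reads off the highest-ranked nodes; in the thesis, experimental measurements additionally bias the ranking toward disease-relevant targets.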

    Natural Language Processing: Emerging Neural Approaches and Applications

    This Special Issue highlights the most recent research being carried out in the NLP field and discusses related open issues, with a particular focus both on emerging approaches for language learning, understanding, production, and grounding, interactively or autonomously from data, in cognitive and neural systems, and on their potential or real applications in different domains.

    A Knowledge-based Integrative Modeling Approach for In-Silico Identification of Mechanistic Targets in Neurodegeneration with Focus on Alzheimer’s Disease

    Dementia is the progressive decline in cognitive function due to damage or disease in the body beyond what might be expected from normal aging. Based on neuropathological and clinical criteria, dementia includes a spectrum of diseases, namely Alzheimer's dementia, Parkinson's dementia, Lewy body disease, Alzheimer's dementia with Parkinson's, Pick's disease, semantic dementia, and large and small vessel disease. It is thought that these disorders result from a combination of genetic and environmental risk factors. Despite the knowledge that has accumulated about the pathophysiological and clinical characteristics of the disease, no coherent and integrative picture of the molecular mechanisms underlying neurodegeneration in Alzheimer's disease is available. Existing drugs offer only symptomatic relief to patients and lack any efficient disease-modifying effects. The present research proposes a knowledge-based rationale for integrative modeling of the disease mechanism to identify potential candidate targets and biomarkers in Alzheimer's disease. Integrative disease modeling is an emerging knowledge-based paradigm in translational research that exploits the power of computational methods to collect, store, integrate, model, and interpret accumulated disease information across different biological scales, from molecules to phenotypes. It prepares the ground for transitioning from a 'descriptive' to a 'mechanistic' representation of disease processes. The proposed approach was used to introduce an integrative framework that integrates, on one hand, knowledge extracted from the literature using semantically supported text-mining technologies and, on the other hand, primary experimental data such as gene/protein expression or imaging readouts.
The aim of such a hybrid integrative modeling approach was not only to provide a consolidated systems view of the disease mechanism as a whole but also to increase the specificity and sensitivity of the mechanistic model by providing disease-specific context. This approach was successfully used to correlate clinical manifestations of the disease with their corresponding molecular events, and led to the identification and modeling of three important mechanistic components underlying Alzheimer's dementia, namely the CNS, immune system, and endocrine components. These models were validated using a novel in-silico validation method, namely biomarker-guided pathway analysis, and a pathway-based target identification approach was introduced, which resulted in the identification of the MAPK signaling pathway as a potential candidate target at the crossroads of the triad of components underlying the disease mechanism in Alzheimer's dementia.