11 research outputs found

    SARSCOVIDB : a new platform for the analysis of the molecular impact of SARS-CoV-2 viral infection

    The COVID-19 pandemic caused by the new coronavirus (SARS-CoV-2) has become a global public health emergency. This threat has accelerated related research and, consequently, produced an unprecedented volume of clinical and experimental data, including changes in gene expression resulting from infection. The SARS-CoV-2 infection database (SARSCOVIDB: https://sarscovidb.org/) was created to mitigate the difficulties related to this scenario. SARSCOVIDB is an online platform that aims to integrate all differential gene expression data, at the messenger RNA and protein levels, helping to speed up analysis and research on the molecular impact of COVID-19. The database can be searched from different experimental perspectives and presents all related information from published data, such as viral strains, hosts, methodological approaches (proteomics or transcriptomics), genes/proteins, and samples (clinical or experimental). All information was taken from 24 articles reporting analyses of differential gene expression, out of 5,554 COVID-19/SARS-CoV-2-related articles published so far. The database features 12,535 genes whose expression has been identified as altered due to SARS-CoV-2 infection. Thus, SARSCOVIDB is a new resource to support health workers and the scientific community in understanding the pathogenesis and molecular impact of SARS-CoV-2.

    D3.6 Interim study on the state of harmonisation of the rights of reproduction and adaptation and connected exceptions

    There is global attention on new data analytic methods. Machine learning (essentially pattern recognition dressed as Artificial Intelligence or AI) is seen as a critical technology. Data scraping, the acquiring and structuring of information from online sources, is a typical first step for many advanced data analytic methods. The technologies of scraping, mining and learning are often conflated, as are the legal regimes under which they are regulated. One regulatory lever under one legal regime will not deliver policy aims, such as innovation, personal dignity, Open Science, or the currently popular ‘data sovereignty’. The legal issues involved in the governance of data range from proprietary approaches (copyright, database rights) to privacy and data protection. In addition, there is a wide range of public law instruments, for example relating to public sector data governance or the right to non-discrimination. Competition law, in turn (which may be both privately and publicly enforceable), increasingly prescribes conduct in relation to data, such as in merger or acquisition cases, or in transparency provisions (Art. 17 CDSM; and centrally in the proposed DMA and AI Regulation). The scope of our enquiry in this report is within private law, specifically the attempt to assert quasi-proprietary control of information and data or, conversely, to limit such attempts, for example by exempting desired activities via copyright exceptions, such as the exception for text and data mining in Arts. 3 and 4 CDSM. We focus on case studies of three technological processes to explore in detail possible descriptions that would allow legal analysis, and an assessment of the need for a harmonisation of rights and connected exceptions under copyright law. The three case studies are: (1) Data scraping for scientific purposes. (2) Machine learning, in the context of Natural Language Processing (NLP). (3) Computer vision, in the context of content moderation of images.
In parallel, we offer a thorough analysis of the policy rationale and legal context for the introduction of the two exceptions for text and data mining in the CDSM Directive (Art. 3 Text and data mining for the purposes of scientific research; Art. 4 Exception or limitation for text and data mining), which includes an analysis of how the right of reproduction (Art. 2 ISD) and its limitations (mainly Art. 5(1) ISD) interface with the overall. The deliverable is under acceptance by the European Commission.

    SARSCOVIDB - SARS-CoV-2 Infection Database: a new platform for analyzing the molecular impact of infection by the COVID-19 virus

    The COVID-19 pandemic caused by the new coronavirus (SARS-CoV-2) has become a global public health emergency due to its rapid spread and high number of deaths. This global emergency led to an acceleration in related research and, consequently, to an unprecedented volume of clinical and experimental data, including changes in gene expression resulting from infection. The SARS-CoV-2 Infection Database (SARSCOVIDB, https://sarscovidb.org/) was created to mitigate the difficulties related to this scenario. SARSCOVIDB is an online platform that aims to integrate all differential gene expression data, at the level of messenger RNA and protein, accelerating research on the molecular impact of COVID-19. The database can be queried from different experimental perspectives, such as the viral strains used, hosts, methodological approaches (proteomics or transcriptomics), genes/proteins, and type of sample (clinical or experimental). All of this information was taken from the 30 articles analyzing differential gene expression, out of about 7,000 related articles published so far and available on the two main search platforms, PubMed and Web of Science. The database features 10,534 identified genes whose expression has been altered due to SARS-CoV-2 infection. Thus, SARSCOVIDB is a new tool to support the scientific community in understanding the pathogenesis and molecular impact of SARS-CoV-2.

    Genomic comparison of DBA/2J and C57Bl/6J strains of Mus musculus and best practice of genome alignment for bioinformatics analyses

    Alcohol use disorder is known to have significant genetic components that contribute to an individual’s susceptibility to the disease. Mouse models are commonly used to study the mechanisms underlying alcohol use disorder, with C57BL/6J (B6) and DBA/2J (D2) being two of the more prominently used inbred strains. Research in the Miles Laboratory has used these two strains, and genetic panels of mice derived from them, to identify potential genes associated with variance in ethanol-related behaviors using quantitative trait loci (QTL) analysis. For example, Ninein (Nin) was identified as a potential candidate gene for the anxiolytic effects of ethanol, discovered because it resides in the confidence interval for a QTL and shows mRNA expression differences between B6 and D2 mice. This differential expression was identified using counts of RNA-Seq reads that have been aligned to a reference genome, specifically the B6 reference genome. Due to the known genetic differences between the two strains, it is possible that the D2 samples could benefit from being aligned to a D2 genome instead of the B6. This would lead to better results overall due to improved read alignment and identification of novel splicing events that might be seen in D2 mice. To test this hypothesis, a dataset consisting of deep (150 million reads) sequencing of RNA from nucleus accumbens of both B6 and D2 mice was used for multiple bioinformatics analyses (differential expression, gene ontology, semantic similarity, differential exon utilization, splice site location, and alternative splicing) with both B6 aligned D2 counts and D2 aligned D2 counts. End results of each analysis were then compared for significant differences in outcomes. 
The results of this analysis show that when D2 samples are aligned to the D2 genome, a majority of differentially expressed genes and differentially utilized exons are retained from the B6-aligned analysis, while many new genes and exons are identified that are unique to the D2-aligned analysis.

    Exploring Strategies to Integrate Disparate Bioinformatics Datasets

    Distinct bioinformatics datasets make it challenging for bioinformatics specialists to locate the required datasets and unify their format for result extraction. The purpose of this single case study was to explore strategies to integrate distinct bioinformatics datasets. The technology acceptance model was used as the conceptual framework to understand the perceived usefulness and ease of use of integrating bioinformatics datasets. The population of this study included bioinformatics specialists of a research institution in Lebanon that has strategies to integrate distinct bioinformatics datasets. The data collection process included interviews with 6 bioinformatics specialists and a review of 27 organizational documents relating to integrating bioinformatics datasets. Thematic analysis was used to identify codes and themes related to integrating distinct bioinformatics datasets. Key themes resulting from data analysis included a focus on integrating bioinformatics datasets, adding metadata to submitted bioinformatics datasets, a centralized bioinformatics database, resources, and bioinformatics tools. The analysis of the findings showed that specialists who promote standardizing techniques, adding metadata, and centralization may increase efficiency in integrating distinct bioinformatics datasets. Bioinformaticians, bioinformatics providers, the health care field, and society might benefit from this research. Improvements in bioinformatics positively affect the health care field, which in turn contributes to positive social change. The results of this study might also lead to positive social change in research institutions, such as reduced workload, less frustration, reduced costs, and increased efficiency while integrating distinct bioinformatics datasets.

    Copyright law and the lifecycle of machine learning models

    Machine learning, a subfield of artificial intelligence (AI), relies on large corpora of data as input for learning algorithms, resulting in trained models that can perform a variety of tasks. While data or information are not subject matter within copyright law, almost all materials used to construct corpora for machine learning are protected by copyright law: texts, images, videos, and so on. There are global policy moves to address the copyright implications of machine learning, in particular in the context of so-called “foundation models” that underpin generative AI. This paper takes a step back, exploring empirically three technological settings through detailed case studies. We set out the established industry methodology of a lifecycle of AI (collecting data, organising data, model training, model operation) to arrive at descriptions suitable for legal analysis. This will allow an assessment of the challenges for a harmonisation of rights, exceptions and disclosure under EU copyright law. The three case studies are: 1. Machine learning for scientific purposes, in the context of a study of regional short-term letting markets; 2. Natural Language Processing (NLP), in the context of large language models; 3. Computer vision, in the context of content moderation of images. We find that the nature and quality of data corpora at the input stage is central to the lifecycle of machine learning. Because of the uncertain legal status of data collection and processing, combined with the competitive advantage gained by firms not disclosing technological advances, the inputs of the models deployed are often unknown. Moreover, the “lawful access” requirement of the EU exception for text and data mining may turn the exception into a decision by rightholders to allow machine learning in the context of their decision to allow access. 
We assess policy interventions at EU level, seeking to clarify the legal status of input data via copyright exceptions, opt-outs or the forced disclosure of copyright materials. We find that the likely result is a fully copyright-licensed environment of machine learning that may have problematic effects for the structure of industry, innovation and scientific research.

    Investigation of the molecular mechanisms of comorbidity in celiac disease and hematological malignancies

    Celiac disease (CD) is considered a chronic autoimmune disease of the small intestine and is the most common enteropathy of the Western world. The disease has a multifactorial etiology, with the presence of HLA-DQ2 and/or HLA-DQ8 being necessary but not sufficient for disease development. GWAS and eQTL studies have identified many non-HLA genes involved in celiac disease pathogenesis. Environmental factors, including viral infections, seem to trigger celiac disease, but the underlying molecular mechanism is still unknown. Because the disease cannot be cured, and its symptoms can only be eased with a gluten-free diet, and because celiac disease is under- or misdiagnosed, comorbidities arise, not only with other autoimmune disorders but also with hematological malignancies. CD is considered a Th1-mediated inflammatory disease, and studies have shown that viruses can promote Th1 immunity against dietary antigens, leading to gut permeability. The aim of this study is to explore the molecular mechanisms of celiac disease/hematological malignancy comorbidity in the pediatric population of Greek origin. Applying a series of state-of-the-art computational and laboratory approaches, we explored the role of Th1 cellular immunity and viral infections. Our data and text mining methods revealed n=75 genes associated with CD and Th1 immunity and/or hematological malignancies and/or viral triggers/molecules: n=33 genes were associated with CD and hematological malignancies, n=27 with CD, hematological malignancies, and viral triggers/molecules, and n=15 with CD and viral triggers/molecules. Our findings suggest three genomic variants as candidate comorbidity biomarkers for assessing the risk of developing hematological malignancies in pediatric patients with CD: rs10806425 in BACH2, rs1050976 in IRF4, and rs10936599 in MYNN. These variants were further confirmed in whole-genome sequencing (WGS) data from pediatric celiac disease patients with a severe clinical phenotype, followed by PCR and Sanger sequencing. Focusing on the molecular mechanisms in question, we employed several computational tools to assess protein-protein interactions among BACH2, IRF4, and MYNN and to predict the effect of these variants on protein function. This study further confirms that viruses and viral triggers disrupt the homeostasis of the immune system and activate Th1-mediated immune responses.

    PREDICTIVE CHEMINFORMATICS ANALYSIS OF DIVERSE CHEMOGENOMICS DATA SOURCES: APPLICATIONS TO DRUG DISCOVERY, ASSAY INTERFERENCE, AND TEXT MINING

    In this dissertation, we describe the cheminformatics analysis of diverse chemogenomics data sources as well as the application of these data to several drug discovery efforts. In Chapter 1, we describe the discovery and characterization of novel Ebola virus inhibitors through QSAR-based virtual screening. In Chapter 2, we report the discovery and analysis of a series of potent and selective doublecortin-like kinase 1 (DCLK1) inhibitors using QSAR modeling, virtual screening, Matched Molecular Pair Analysis (MMPA), and molecular docking. In Chapter 3, we perform a large-scale analysis of publicly available data in PubChem to probe the reliability and applicability of Pan-Assay INterference compoundS (PAINS) alerts, a popular computational drug screening tool. In Chapter 4, we explore the PubMed database as a novel source of biomedical data and describe the development of Chemotext, a publicly available web server capable of text-mining the published literature. Doctor of Philosophy

    Text Mining Resources for the Life Sciences

    Text mining is a powerful technology for quickly distilling key information from vast quantities of biomedical literature. However, to harness this power the researcher must be well versed in the availability, suitability, adaptability, interoperability and comparative accuracy of current text mining resources. In this survey, we give an overview of the text mining resources that exist in the life sciences to help researchers, especially those employed in biocuration, to engage with text mining in their own work. We categorize the various resources under three sections: Content Discovery looks at where and how to find biomedical publications for text mining; Knowledge Encoding describes the formats used to represent the different levels of information associated with content that enable text mining, including those formats used to carry such information between processes; Tools and Services gives an overview of workflow management systems that can be used to rapidly configure and compare domain- and task-specific processes, via access to a wide range of pre-built tools. We also provide links to relevant repositories in each section to enable the reader to find resources relevant to their own area of interest. Throughout this work we give a special focus to resources that are interoperable, that is, those that have the crucial ability to share information, enabling smooth integration and reusability.