17 research outputs found

    EuPathDomains: The Divergent Domain Database for Eukaryotic Pathogens

    Eukaryotic pathogens (e.g. Plasmodium, Leishmania, and trypanosomes) are a major source of morbidity and mortality worldwide. In Africa, one of the most affected continents, they cause millions of deaths and constitute an immense economic burden. Although the genome sequences of several of these organisms are now available, the biological functions of more than half of their proteins are still unknown. This is a serious obstacle to identifying the new therapeutic targets that are expected. In this context, the identification of protein domains is a key step toward improving the functional annotation of the proteins. However, several domains are missed in eukaryotic pathogens because of the high phylogenetic distance of these organisms from the classical eukaryotic model organisms. We recently proposed a method, co-occurrence domain detection (CODD), that improves the sensitivity of Pfam domain detection by exploiting the tendency of domains to appear preferentially with a few other favorite domains in a protein. In this paper, we present EuPathDomains (http://www.atgc-montpellier.fr/EuPathDomains/), an extended database of protein domains belonging to ten major eukaryotic human pathogens. EuPathDomains gathers known domains and new domains detected by CODD, along with the associated confidence measurements and the GO annotations that can be deduced from the new domains. This database significantly extends the Pfam domain coverage of all selected genomes by proposing new occurrences of domains as well as new domain families that have never been reported before. For example, with a false discovery rate lower than 20%, EuPathDomains increases the number of detected domains by 13% in the Toxoplasma gondii genome and by up to 28% in Cryptosporidium parvum, and the total number of domain families by 10% in Plasmodium falciparum and by up to 16% in the C. parvum genome. The database can be queried by protein names, domain identifiers, Pfam or InterPro identifiers, or organisms, and should become a valuable resource for deciphering the protein functions of eukaryotic pathogens.
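The co-occurrence idea described in the abstract above can be illustrated with a minimal sketch. This is not the published CODD implementation: the function name `certify_candidates`, the domain labels, and the table of co-occurring pairs are hypothetical, and the sketch assumes that a weak candidate domain is accepted only when it is known to co-occur with a domain already confidently detected in the same protein.

```python
def certify_candidates(known, candidates, cooccurring_pairs):
    """Keep the candidate domains supported by a co-occurring known domain.

    known: domains already confidently detected in the protein.
    candidates: weak hits that did not pass the usual detection threshold.
    cooccurring_pairs: set of frozensets, each holding two domains that are
        assumed (hypothetically) to appear together preferentially.
    """
    certified = []
    for cand in candidates:
        # A candidate is certified if it pairs with at least one known domain.
        if any(frozenset((cand, k)) in cooccurring_pairs for k in known):
            certified.append(cand)
    return certified


# Hypothetical toy data: DomA and DomB tend to co-occur, as do DomC and DomD.
pairs = {frozenset(("DomA", "DomB")), frozenset(("DomC", "DomD"))}
result = certify_candidates(["DomA"], ["DomB", "DomC"], pairs)
print(result)
```

In this toy protein, only the candidate that has a co-occurring partner among the known domains is retained; the other candidate is left uncertified because its partner is absent.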

    Ten simple rules for organizing a bioinformatics training course in low- and middle-income countries

    © 2021 Moore et al. Bioinformatics training is required at every stage of a scientist's research career. Continual bioinformatics training allows exposure to an ever-changing and growing repertoire of techniques and databases, and so biologists, computational scientists, and healthcare practitioners are all seeking learning opportunities in the use of computational resources and tools designed for data storage, retrieval, and analysis. Scientists in high-income countries (HICs) have abundant opportunities to access bioinformatics training: facilities are well equipped, participants and trainers incur minimal travel and financial costs, and a range of general advice exists for developing short bioinformatics training courses [1–3]. However, regionally targeted bioinformatics training in low- and middle-income countries (LMICs) often requires more extensive local and external support, organization, and travel. Because bioinformatics expertise is generally limited in LMICs, most bioinformatics training requires a fair amount of collaboration with experts beyond the local community, country, or region. A common model of training, used as the basis of this article, involves a local host collaborating with local, regional, and international experts who gather to train local or regional participants. Recently, there has been a growth of capacity strengthening initiatives in LMICs, such as the Pan African Bioinformatics Network for Human Heredity and Health in Africa (H3ABioNet) Initiative [4–6], the Capacity Building for Bioinformatics in Latin America (CABANA) Project [7], the Asia Pacific BioInformatics Network (APBioNet) [8], and the Wellcome Connecting Science Courses and Conferences program [9]. One of the important strands of these initiatives is a drive to organize and deliver valuable bioinformatics training, but organizing and delivering short bioinformatics training workshops in an LMIC presents a unique set of challenges.

    This paper attempts to build upon the sage advice for organizing bioinformatics workshops with specific guidance for organizing and delivering them in LMICs. It describes the processes to follow in organizing courses, taking into consideration the low-resource setting. We should also note that LMICs are not a monolithic group and that setting, context, temporality, and specific location matter. LMICs are a complex regional grouping [10] and should be treated as such; however, we will present some common lessons that we hope will help organizers and trainers of bioinformatics training events in LMICs to navigate the often different, challenging, and rewarding experience.

    The authors who contributed to this manuscript are funded as follows: BM receives salary support from Wellcome Trust grants [WT108749/Z/15/Z, WT108749/Z/15/A]; the salaries of PC, VR, NM, and AG are funded in whole, or in part, by the NIH Common Fund H3ABioNet grant [U24HG006941]; the salaries of MC, SLFV, AR, PG, and PCL were partly funded by the UKRI-BBSRC 'Capacity building for bioinformatics in Latin America' (CABANA) grant, on behalf of the Global Challenges Research Fund [BB/P027849/1]; JDLR is funded by ISCiii AES [ref. PI18/00591] at the CSIC/USAL (Spain) and by CYTED, RIABIO (Red Iberoamericana 521RT0118); AM's salary is funded by [WT206194/Z/17/Z]; GO is funded by the CABANA grant; and SM is funded by the EMBL-EBI.

    Designing a course model for distance-based online bioinformatics training in Africa: the H3ABioNet experience

    Africa is not unique in its need for basic bioinformatics training for individuals from a diverse range of academic backgrounds. However, particular logistical challenges in Africa, most notably access to bioinformatics expertise and internet stability, must be addressed in order to meet this need on the continent. H3ABioNet (www.h3abionet.org), the Pan African Bioinformatics Network for H3Africa, has therefore developed an innovative, free-of-charge "Introduction to Bioinformatics" course, taking these challenges into account as part of its educational efforts to provide on-site training and develop local expertise inside its network. A multiple-delivery-mode learning model was selected for this 3-month course in order to increase access to (mostly) African, expert bioinformatics trainers. The content of the course was developed to include a range of fundamental bioinformatics topics at the introductory level. For the first iteration of the course (2016), classrooms with a total of 364 enrolled participants were hosted at 20 institutions across 10 African countries. To ensure that classroom success did not depend on stable internet, trainers pre-recorded their lectures, and classrooms downloaded and watched these locally during biweekly contact sessions. The trainers were available via video conferencing to take questions during contact sessions, as well as via online "question and discussion" forums outside of contact session time. This learning model, developed for a resource-limited setting, could easily be adapted to other settings.

    Development of Bioinformatics Infrastructure for Genomics Research

    Although pockets of bioinformatics excellence have developed in Africa, large-scale genomic data analysis has generally been limited by the availability of expertise and infrastructure. H3ABioNet, a pan-African bioinformatics network, was established to build capacity specifically to enable H3Africa (Human Heredity and Health in Africa) researchers to analyze their data in Africa. Since the inception of the H3Africa initiative, H3ABioNet's role has evolved in response to changing needs from the consortium and the African bioinformatics community.

    Functional annotation of divergent genomes: application to Leishmania parasite

    Determining the domain composition of a protein provides strong clues for predicting its function. One of the most widely used domain schemes is the Pfam database, in which each family is represented by a multiple sequence alignment and a profile Hidden Markov Model (profile HMM). When analyzing a new sequence, each Pfam HMM is used to compute a score measuring the similarity between the sequence and the domain. However, applied to divergent proteins, this strategy may miss several domains. This is the case for all eukaryotic pathogens, where no Pfam domains are detected in half or even more of their proteins.

    The main objective of this thesis is to develop methods to improve the sensitivity of Pfam domain detection in divergent proteins. We first adapted the recently proposed CODD method and applied it to the whole set of pathogens in EuPathDB. A public database named EuPathDomains (http://www.atgc-montpellier.fr/EuPathDomains/) gathers known domains and new domains detected by CODD, along with the associated confidence measurements and the GO annotations.

    We then proposed other methods to further improve domain detection in these organisms. We proposed the "CODD_exclusive" algorithm, which integrates domain exclusion information to prune false positive domains that are in conflict with other domains of the protein. We also suggested the use of association rules to determine the correlations between domains and used this information in the certification process.

    In the last part of this thesis, we focused on the use of profile/profile methods to predict protein domains in a whole genome. Combined with the co-occurrence information, this approach notably increased the number of certified domains and improved the accuracy of the predictions.
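The pruning step attributed to "CODD_exclusive" in the abstract above can be sketched as follows. This is only an illustration of the stated idea (discarding candidate domains that conflict with other domains of the protein), not the thesis algorithm: the function name `prune_conflicting`, the domain labels, and the table of mutually exclusive pairs are hypothetical.

```python
def prune_conflicting(known, candidates, exclusive_pairs):
    """Drop candidate domains that conflict with a known domain.

    known: domains already confidently detected in the protein.
    candidates: weak hits under consideration.
    exclusive_pairs: set of frozensets, each holding two domains that are
        assumed (hypothetically) never to appear in the same protein.
    """
    kept = []
    for cand in candidates:
        # A single conflict with a known domain is enough to reject the candidate.
        conflict = any(frozenset((cand, k)) in exclusive_pairs for k in known)
        if not conflict:
            kept.append(cand)
    return kept


# Hypothetical toy data: DomA and DomX are assumed to be incompatible.
exclusions = {frozenset(("DomA", "DomX"))}
kept = prune_conflicting(["DomA"], ["DomX", "DomB"], exclusions)
print(kept)
```

Such a filter would complement co-occurrence support: co-occurrence adds evidence for a candidate, while exclusion removes candidates that the protein's other domains make implausible.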

    Identification of divergent protein domains by combining HMM-HMM comparisons and co-occurrence detection.

    Identification of protein domains is a key step for understanding protein function. Hidden Markov Models (HMMs) have proved to be a powerful tool for this task. The Pfam database notably provides a large collection of HMMs which are widely used for the annotation of proteins in sequenced organisms. This is done via sequence/HMM comparisons. However, this approach may lack sensitivity when searching for domains in divergent species. Recently, methods for HMM/HMM comparisons have been proposed and proved to be more sensitive than sequence/HMM approaches in certain cases. However, these approaches are usually not used for protein domain discovery at a genome scale, and the benefit that could be expected from using them for this problem has not been investigated. Using proteins of P. falciparum and L. major as examples, we investigate the extent to which HMM/HMM comparisons can identify new domain occurrences not already identified by sequence/HMM approaches. We show that although HMM/HMM comparisons are much more sensitive than sequence/HMM comparisons, they are not sufficiently accurate to be used as a standalone complement of sequence/HMM approaches at the genome scale. Hence, we propose to use domain co-occurrence (the general tendency of a domain to appear preferentially along with some favorite domains in proteins) to improve the accuracy of the approach. We show that the combination of HMM/HMM comparisons and co-occurrence domain detection boosts protein annotations. At an estimated False Discovery Rate of 5%, it revealed 901 and 1098 new domains in Plasmodium and Leishmania proteins, respectively. Manual inspection of part of these predictions shows that they contain several domain families that were missing in the two organisms. All new domain occurrences have been integrated in the EuPathDomains database, along with the GO annotations that can be deduced.
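As a reading aid for the figures above: a false discovery rate is the expected fraction of reported predictions that are incorrect, so an estimated FDR of 5% implies that a few dozen of the reported new domains in each organism may be false positives. The short calculation below makes this explicit; the helper name `expected_false_discoveries` is hypothetical, and the result is an expectation under the stated FDR estimate, not a count reported by the authors.

```python
def expected_false_discoveries(n_predictions, fdr):
    """Expected number of incorrect predictions among n_predictions at a given FDR."""
    return n_predictions * fdr


# Counts and FDR taken from the abstract above.
new_domains = {"Plasmodium": 901, "Leishmania": 1098}
fdr = 0.05

for organism, count in new_domains.items():
    expected = expected_false_discoveries(count, fdr)
    print(f"{organism}: about {round(expected)} of {count} predictions expected to be false")
```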