19 research outputs found

    DIANA-miRGen v4 : indexing promoters and regulators for more than 1500 microRNAs

    Deregulation of microRNA (miRNA) expression plays a critical role in the transition from a physiological to a pathological state. Accurate identification of miRNA promoters in multiple cell types is fundamental to understanding and characterizing the mechanisms underlying both physiological and pathological conditions. DIANA-miRGen v4 (www.microrna.gr/mirgenv4) provides cell-type-specific miRNA transcription start sites (TSSs) for over 1500 miRNAs, retrieved from the analysis of >1000 cap analysis of gene expression (CAGE) samples corresponding to 133 tissues, cell lines and primary cells available in the FANTOM repository. MiRNA TSS locations were associated with transcription factor binding site (TFBS) annotation for >280 TFs, derived from analyzing the majority of ENCODE ChIP-Seq datasets. For the first time, clusters of cell types having common miRNA TSSs are characterized and provided through a user-friendly interface with multiple layers of customization. DIANA-miRGen v4 significantly improves our understanding of miRNA biogenesis regulation at the transcriptional level by providing a unique integration of high-quality annotations for hundreds of cell-specific miRNA promoters with experimentally derived TFBSs.

    Knowledge discovery from biological data

    Recent advances in technology have produced a wealth of digital machines and sensors which, along with recent advancements in biotechnology, and more specifically high-throughput sequencing methods, have led to an unprecedented explosion of data on every aspect of biology. Consequently, knowledge discovery and machine learning are today, more than ever, crucial for the intelligent analysis of biological data, the extraction of knowledge from it and, ultimately, the answering of fundamental questions in biology and medicine. The scope of this thesis is knowledge discovery from biological data. The thesis belongs to the research fields of knowledge discovery from databases, machine learning and bioinformatics, the last being a general term for any type of computational analysis applied to biological data. The contribution of this thesis lies in the field of knowledge discovery and machine learning. More specifically, the aim of the thesis is to create or extend methods for the analysis of biological data and then to apply them to extract valuable knowledge from data. A further aim is for the methods, tools and knowledge produced to be applied in research programs or in clinical practice. Initially, the thesis focuses on the analysis of population genomic data, and more specifically single nucleotide polymorphism (SNP) data, whose main feature is high dimensionality, concentrating on the problem of selecting the most informative markers for assigning individuals to populations of origin. A new method for feature selection based on frequent itemset theory is presented, which achieves much better results than the existing methods. Moreover, all well-known feature selection algorithms in the field, as well as algorithms for SNP dataset manipulation, are presented.
In the same area of population genetics, a microsatellite pattern discovery algorithm is developed along with the software application in which it is integrated. Later, methodologies are developed which integrate different immunogenetic and clinicobiological data sources and analyze them to study the patterns of mutations that occur through Somatic Hypermutation (SHM). All methods were applied to data from patients with Chronic Lymphocytic Leukemia (CLL). Moreover, a methodology based on social choice and voting theory is presented to investigate the potential ontogenetic transformation of genes towards other genes or gene families through SHM. Finally, a method for polyadenylation site prediction in RNA sequences is developed. This is a modular method consisting of two parts, one based on interesting emerging patterns and the other on a distance-based scoring method, achieving a high adjusted accuracy score.
In recent years, technological progress has delivered a multitude of digital machines and sensors which, combined with recent developments in biotechnology, and more specifically high-throughput sequencing methods, have contributed to an unprecedented explosion of data in every aspect of the science of biology. It is evident that the areas of knowledge discovery and machine learning are today, more than ever, necessary and important for the intelligent analysis of the available biological data, the extraction of valuable knowledge from them and, ultimately, the answering of fundamental questions in biology and medicine. The subject of this doctoral thesis is knowledge discovery from biological data. The thesis falls within the research areas of knowledge discovery from databases, machine learning and bioinformatics, the last being a general term used to describe any kind of computational analysis of biological data.
The aim of the doctoral thesis was to conduct research in, and contribute to, the area of knowledge discovery and machine learning: more specifically, on the one hand to create new methods or extend existing ones for the analysis of biological data, and on the other hand to apply them to extract valuable knowledge from these data. A further desired goal was for its results (methods, tools and knowledge) to be put to use by the scientific community, either in research programs or in clinical practice. Initially, the thesis focuses on the analysis of population genetics data, and more specifically single nucleotide polymorphisms (SNPs), whose main characteristic is high dimensionality, concentrating on the problem of selecting the most informative markers for assigning individuals to populations of origin. A new marker selection method based on frequent itemset theory is presented, which achieves much better results than the existing methods. In addition, the algorithms that were implemented and are used in the field for feature selection are presented, as well as algorithms for handling SNP datasets. In the same area of population genetics, an algorithm for finding microsatellites in genomes is presented, along with the integrated system in which it is included. Next, methodologies are presented for combining different sources of immunogenetic and clinicobiological data and analyzing them, with the aim of studying the patterns of mutations that occur during Somatic Hypermutation (SHM). The methods are applied to data from patients suffering from Chronic Lymphocytic Leukemia (CLL). Furthermore, a methodology based on social choice and voting theory is presented for investigating the manner of the potential ontogenetic transformation of genes towards other genes or gene families through SHM.
Finally, a method for locating the polyadenylation site in RNA sequences is presented. The proposed method is modular and consists of two parts, the first based on interesting emerging patterns and the second on scoring sequences according to their distance from the various classes and subclasses of sequences, achieving high levels of adjusted accuracy.

    Machine Learning and Data Mining Methods in Diabetes Research

    The remarkable advances in biotechnology and health sciences have led to a significant production of data, such as high-throughput genetic data and clinical information generated from large Electronic Health Records (EHRs). To this end, the application of machine learning and data mining methods in biosciences is presently, more than ever before, vital and indispensable in efforts to intelligently transform all available information into valuable knowledge. Diabetes mellitus (DM) is defined as a group of metabolic disorders exerting significant pressure on human health worldwide. Extensive research in all aspects of diabetes (diagnosis, etiopathophysiology, therapy, etc.) has led to the generation of huge amounts of data. The aim of the present study is to conduct a systematic review of the applications of machine learning, data mining techniques and tools in the field of diabetes research with respect to a) Prediction and Diagnosis, b) Diabetic Complications, c) Genetic Background and Environment, and d) Health Care and Management, with the first category appearing to be the most popular. A wide range of machine learning algorithms were employed. In general, 85% of those used were supervised learning approaches and 15% were unsupervised ones, more specifically association rules. Support vector machines (SVM) emerge as the most successful and widely used algorithm. Concerning the type of data, clinical datasets were mainly used. The applications in the selected articles demonstrate the usefulness of extracting valuable knowledge leading to new hypotheses targeting deeper understanding and further investigation in DM. Keywords: Machine learning, Data mining, Diabetes mellitus, Diabetic complications, Disease prediction models, Biomarker(s) identification

    Integrating multiple immunogenetic data sources for feature extraction and mining somatic hypermutation patterns : the case of "towards analysis" in chronic lymphocytic leukaemia

    Background: Somatic Hypermutation (SHM) refers to the introduction of mutations within rearranged V(D)J genes, a process that increases the diversity of Immunoglobulins (IGs). The analysis of SHM has offered critical insight into the physiology and pathology of B cells, leading to strong prognostication markers for clinical outcome in chronic lymphocytic leukaemia (CLL), the most frequent adult B-cell malignancy. In this paper we present a methodology for integrating multiple immunogenetic and clinicobiological data sources in order to extract features and create high-quality datasets for SHM analysis in IG receptors of CLL patients. This dataset is used as the basis for a higher-level integration procedure, inspired by social choice theory. This is applied in the Towards Analysis, our attempt to investigate the potential ontogenetic transformation of genes belonging to specific stereotyped CLL subsets towards other genes or gene families, through SHM. Results: The data integration process, followed by feature extraction, resulted in the generation of a dataset containing information about mutations occurring through SHM. The Towards Analysis, performed on the integrated dataset by applying voting techniques, revealed the distinct behaviour of subset #201 compared to other subsets as regards SHM-related movements among gene clans, both in allele-conserved and non-conserved gene areas. With respect to movement between genes, a high percentage of movement towards pseudogenes was found in all CLL subsets. Conclusions: This data integration and feature extraction process can set the basis for exploratory analysis or a fully automated computational data mining approach on many as yet unanswered, clinically relevant biological questions.

    DIANA-mAP: Analyzing miRNA from Raw NGS Data to Quantification

    MicroRNAs (miRNAs) are small non-coding RNAs (~22 nts) that are considered central post-transcriptional regulators of gene expression and key components in many pathological conditions. Next-Generation Sequencing (NGS) technologies have led to inexpensive, massive data production, revolutionizing every research aspect in the fields of biology and medicine. In particular, small RNA-Seq (sRNA-Seq) enables small non-coding RNA quantification on a high-throughput scale, providing a closer look into the expression profiles of these crucial regulators within the cell. Here, we present DIANA-microRNA-Analysis-Pipeline (DIANA-mAP), a fully automated computational pipeline that allows the user to perform miRNA NGS data analysis from raw sRNA-Seq libraries to quantification and Differential Expression Analysis in an easy, scalable, efficient, and intuitive way. Emphasis has been given to data pre-processing, an early step in the analysis that is critical for the robustness of the final results and conclusions. Through modularity, parallelizability and customization, DIANA-mAP produces high-quality expression results, reports and graphs for downstream data mining and statistical analysis. In an extended evaluation, the tool outperforms similar tools that provide pre-processing without any adapter knowledge. DIANA-mAP is freely available through GitHub, either dockerized with no dependency installations or standalone, accompanied by an installation manual.

    Data from: Greece: a Balkan subrefuge for a remnant red deer (Cervus elaphus) population

    A number of phylogeographic studies have revealed the existence of multiple ice age refugia within the Balkan Peninsula, marking it as a biodiversity hotspot. Greece has been reported to harbour lineages genetically differentiated from the rest of the Balkans for a number of mammal species. We therefore searched for distinct red deer lineages in Greece by analysing 78 samples originating from its last population on Parnitha Mountain (central Greece). Additionally, we tested the impact of human-induced practices on this population. The presence of two discrete mtDNA lineages was inferred: i) an abundant one not previously sampled in the Balkans and ii) a more restricted one shared with other Balkan populations, possibly the result of successful translocations of eastern European individuals. Microsatellite-based analyses of 14 loci strongly support the existence of two subpopulations with relative frequencies similar to those of the mitochondrial analyses. This study stresses the biogeographic importance of central Greece as a separate Last Glacial Maximum (LGM) refugium within the Balkans. It also delineates the possible effects that recent translocations of red deer populations had on the genetic structuring within Parnitha. We suggest that the Greek red deer population of Parnitha is genetically distinct and that restocking programs should take this genetic evidence into consideration.