    Exploring Patterns of Epigenetic Information With Data Mining Techniques

    Data mining, a part of the Knowledge Discovery in Databases (KDD) process, is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management. Analyses of epigenetic data have evolved towards genome-wide and high-throughput approaches, generating great amounts of data for which data mining is essential. Part of these data may contain patterns of epigenetic information which are mitotically and/or meiotically heritable, determining gene expression, cellular differentiation and cellular fate. Epigenetic lesions and genetic mutations are acquired by individuals during their life and accumulate with ageing. Both defects, either together or individually, can result in losing control over cell growth and thus cause cancer development. Data mining techniques could then be used to extract such patterns. This work reviews some of the most important applications of data mining to epigenetics.
    Funding: Programa Iberoamericano de Ciencia y Tecnología para el Desarrollo (209RT-0366); Galicia, Consellería de Economía e Industria (10SIN105004PR); Instituto de Salud Carlos III (RD07/0067/000)

    NOVEL ALGORITHMS AND TOOLS FOR LIGAND-BASED DRUG DESIGN

    Computer-aided drug design (CADD) has become an indispensable component in modern drug discovery projects. The prediction of the physicochemical and pharmacological properties of candidate compounds effectively increases the probability that drug candidates pass later phases of clinical trials. Ligand-based virtual screening exhibits advantages over structure-based drug design in terms of its wide applicability and high computational efficiency. The established chemical repositories and reported bioassays form a gigantic knowledge base from which to derive quantitative structure-activity relationships (QSAR) and structure-property relationships (QSPR). In addition, the rapid advance of machine learning techniques suggests new solutions for mining huge compound databases. In this thesis, a novel ligand classification algorithm, Ligand Classifier of Adaptively Boosting Ensemble Decision Stumps (LiCABEDS), was reported for the prediction of diverse categorical pharmacological properties. LiCABEDS was successfully applied to model 5-HT1A ligand functionality, ligand selectivity of cannabinoid receptor subtypes, and blood-brain-barrier (BBB) passage. LiCABEDS was implemented and integrated with a graphical user interface, data import/export, automated model training/prediction, and project management. Furthermore, a non-linear ligand classifier was proposed, using a novel Topomer kernel function in a support vector machine. With the emphasis on green high-performance computing, graphics processing units are alternative platforms for computationally expensive tasks. A novel GPU algorithm was designed and implemented to accelerate the calculation of chemical similarities with dense-format molecular fingerprints. Finally, a compound acquisition algorithm was reported to construct structurally diverse screening libraries in order to enhance hit rates in high-throughput screening.
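    The core learner named in the title, an adaptively boosted ensemble of decision stumps over molecular fingerprints, can be sketched with generic tools. Below is a minimal illustration (not the thesis code) using scikit-learn's AdaBoost, whose default base learner is a depth-1 decision tree, on synthetic 1024-bit fingerprints, together with the dense-fingerprint Tanimoto similarity that the GPU algorithm accelerates; all data, names, and ensemble sizes here are illustrative assumptions.

```python
# Sketch in the spirit of LiCABEDS (not its actual implementation):
# AdaBoost over decision stumps on synthetic binary fingerprints, plus a
# dense-format Tanimoto similarity like the one the GPU kernel computes.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 1024))  # 200 ligands, 1024-bit fingerprints
y = rng.integers(0, 2, size=200)          # categorical property, e.g. BBB+/BBB-

# AdaBoost's default base learner is a depth-1 decision tree (a stump);
# boosting adaptively reweights misclassified ligands at each round.
clf = AdaBoostClassifier(n_estimators=300).fit(X, y)
print(clf.predict(X[:5]))

def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    """Tanimoto coefficient for dense 0/1 fingerprint vectors."""
    ab = float(np.dot(a, b))
    return ab / (a.sum() + b.sum() - ab)

print(tanimoto(X[0], X[1]))
```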

    Computational methods for the identification of genetic variants in complex diseases

    Master's dissertation in Bioinformatics. Complex diseases, such as Type 2 Diabetes, are affected not only by environmental factors but also by genetic factors involving multiple variants and their interactions. Even so, the known risk factors are not sufficient to predict the manifestation of the disease. Some of these can be discovered with Genome-Wide Association Studies, which detect associations between variants, such as Single-Nucleotide Polymorphisms, and phenotypes, but other approaches, such as Machine Learning, are needed to identify their effects and interactions. Even though these methods can identify important patterns and produce good results, they are challenging to interpret. In this project, we developed a predictor for complex diseases that uses datasets from Genome-Wide Association Studies to help identify new genetic markers associated with Type 2 Diabetes. The pipeline integrates gene regions and protein-protein interaction networks into datasets of variants, extracts new features, and employs machine learning models to predict disease risk; a toy version of this pipeline is sketched below. This study showed that the models can predict disease risk and that using gene regions and protein-protein interaction networks improves the models and provides new information about the biology of the disease. From these models it was possible to identify new genes and pathways of interest which, with further investigation, could lead to the development of new strategies for the diagnosis, prevention and treatment of Type 2 Diabetes.
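    As promised above, here is a toy sketch of that kind of pipeline: SNP genotypes are aggregated into gene-level burden features, a gene-gene edge list stands in for a protein-protein interaction network, and a standard classifier predicts disease risk. The gene assignments, the network, and all data are fabricated for illustration and do not reproduce the dissertation's features or models.

```python
# Toy GWAS-to-risk pipeline: gene-region aggregation + a PPI-derived
# interaction feature + a standard classifier. All inputs are synthetic.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 300
snps = pd.DataFrame(rng.integers(0, 3, size=(n, 6)),     # 0/1/2 minor-allele counts
                    columns=[f"rs{i}" for i in range(6)])
snp_to_gene = {"rs0": "GENE_A", "rs1": "GENE_A", "rs2": "GENE_B",
               "rs3": "GENE_B", "rs4": "GENE_C", "rs5": "GENE_C"}
ppi_edges = {("GENE_A", "GENE_B")}                       # toy PPI network

# Gene-region feature: per-gene burden (sum of risk alleles per sample).
genes = snps.T.groupby(snp_to_gene).sum().T

# Network feature: combined burden of interacting gene pairs.
for g1, g2 in ppi_edges:
    genes[f"{g1}x{g2}"] = genes[g1] * genes[g2]

y = rng.integers(0, 2, size=n)                           # case/control labels
print(cross_val_score(LogisticRegression(max_iter=1000), genes, y, cv=5).mean())
```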

    Image Processing and Simulation Toolboxes of Microscopy Images of Bacterial Cells

    Recent advances in microscopy imaging technology have allowed the characterization of the dynamics of cellular processes at the single-cell and single-molecule level. Particularly in bacterial cell studies, using E. coli as a case study, these techniques have been used to detect and track internal cell structures, such as the nucleoid and the cell wall, and fluorescently tagged molecular aggregates, such as FtsZ proteins, Min system proteins, inclusion bodies and the different types of RNA molecules. These studies have been performed using multi-modal, multi-process, time-lapse microscopy, producing both morphological and functional images. To facilitate the finding of relationships between cellular processes, from small-scale, such as gene expression, to large-scale, such as cell division, an image processing toolbox was implemented with several automatic and/or manual features, such as cell segmentation and tracking, intra-modal and inter-modal image registration, and the detection, counting and characterization of several cellular components. Two segmentation algorithms for cellular components were implemented, the first based on the Gaussian distribution and the second based on thresholding and morphological structuring functions. These algorithms were used to segment nucleoids and to identify the different stages of FtsZ ring formation (allied with the use of machine learning algorithms), which made it possible to understand how temperature influences the physical properties of the nucleoid and to correlate those properties with the exclusion of protein aggregates from the centre of the cell. Another study used the segmentation algorithms to study how temperature affects the formation of the FtsZ ring. The validation of the developed image processing methods and techniques has been based on benchmark databases manually produced and curated by experts. When dealing with thousands of cells and hundreds of images, these manually generated datasets can become the biggest cost in a research project. To expedite these studies and lower the cost of manual labour, an image simulation toolbox was implemented to generate realistic artificial images. The proposed toolbox can generate biologically inspired objects that mimic the spatial and temporal organization of bacterial cells and their processes, such as cell growth and division, cell motility, and cell morphology (shape, size and cluster organization). The image simulation toolbox was shown to be useful in the validation of three cell tracking algorithms: simple nearest-neighbour, nearest-neighbour with morphology, and the DBSCAN cluster identification algorithm. It was shown that simple nearest-neighbour tracking still performed with great reliability when simulating objects with small velocities, while the other algorithms performed better for higher velocities and when larger clusters were present.
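    A minimal sketch of the second segmentation strategy described above, thresholding followed by morphological structuring, using Otsu's threshold and scikit-image primitives on a synthetic image; the toolbox's own parameters and its Gaussian-distribution-based method are not reproduced here.

```python
# Threshold-and-morphology segmentation sketch on a synthetic bacterial image.
import numpy as np
from skimage.filters import threshold_otsu, gaussian
from skimage.morphology import binary_opening, disk
from skimage.measure import label, regionprops

rng = np.random.default_rng(2)
img = rng.normal(0.1, 0.05, size=(128, 128))  # background noise
img[40:60, 30:90] += 0.8                      # a bright rod-shaped "cell"
img = gaussian(img, sigma=1)                  # smooth, as real images are blurred

mask = img > threshold_otsu(img)              # global intensity threshold
mask = binary_opening(mask, disk(2))          # morphological clean-up

for region in regionprops(label(mask)):       # per-object measurements
    print(region.area, region.centroid)
```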

    Regularized binormal ROC method in disease classification using microarray data

    BACKGROUND: An important application of microarrays is to discover genomic biomarkers, among tens of thousands of genes assayed, for disease diagnosis and prognosis. Thus it is of interest to develop efficient statistical methods that can simultaneously identify important biomarkers from such high-throughput genomic data and construct appropriate classification rules. It is also of interest to develop methods for evaluating classification performance and ranking identified biomarkers. RESULTS: The ROC (receiver operating characteristic) technique has been widely used in disease classification with low-dimensional biomarkers. Compared with the empirical ROC approach, the binormal ROC is computationally more affordable and robust in small sample size cases. We propose using the binormal AUC (area under the ROC curve) as the objective function for two-sample classification, and the scaled threshold gradient directed regularization method for regularized estimation and biomarker selection. Tuning parameter selection is based on V-fold cross validation. We develop Monte Carlo based methods for evaluating the stability of individual biomarkers and overall prediction performance. Extensive simulation studies show that the proposed approach can generate parsimonious models with excellent classification and prediction performance under most simulated scenarios, including model mis-specification. Application of the method to two cancer studies shows that the identified genes are reasonably stable, with satisfactory prediction performance and biologically sound implications. The overall classification performance is satisfactory, with small classification errors and large AUCs. CONCLUSION: In comparison to existing methods, the proposed approach is computationally more affordable without losing the optimality possessed by the standard ROC method.
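    For context, under the binormal model the control and case scores are taken as Gaussian, X0 ~ N(mu0, s0^2) and X1 ~ N(mu1, s1^2), so ROC(t) = Phi(a + b * Phi^{-1}(t)) with a = (mu1 - mu0)/s1 and b = s0/s1, and the AUC has the closed form Phi(a / sqrt(1 + b^2)). The snippet below is a direct plug-in estimate of that quantity on synthetic scores, not the paper's regularized threshold-gradient estimator, which additionally performs biomarker selection.

```python
# Plug-in binormal AUC: AUC = Phi(a / sqrt(1 + b^2)),
# a = (mu1 - mu0)/s1, b = s0/s1.
import numpy as np
from scipy.stats import norm

def binormal_auc(x0: np.ndarray, x1: np.ndarray) -> float:
    """Binormal AUC from control (x0) and case (x1) scores."""
    a = (x1.mean() - x0.mean()) / x1.std(ddof=1)
    b = x0.std(ddof=1) / x1.std(ddof=1)
    return float(norm.cdf(a / np.sqrt(1 + b**2)))

rng = np.random.default_rng(3)
print(binormal_auc(rng.normal(0, 1, 50), rng.normal(1, 1.2, 50)))
```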

    Deep Embedding Kernel

    Kernel methods and deep learning are two major branches of machine learning that have achieved numerous successes in both analytics and artificial intelligence. While having their own unique characteristics, both branches work through mapping data to a feature space that is supposedly more favorable towards the given task. This dissertation addresses the strengths and weaknesses of each mapping method by combining them, forming a family of novel deep architectures centered around the Deep Embedding Kernel (DEK). In short, DEK is a realization of a kernel function through a novel deep architecture. The mapping in DEK is both implicit (as in kernel methods) and learnable (as in deep learning). Prior to DEK, we proposed a less advanced architecture called Deep Kernel for the tasks of classification and visualization. More recently, we integrated DEK with the novel Dual Deep Learning framework to model big unstructured data. Using DEK as a core component, we further propose two machine learning models: Deep Similarity-Enhanced K Nearest Neighbors (DSE-KNN) and Recurrent Embedding Kernel (REK). Both models have their mappings trained towards optimizing data instances' neighborhoods in the feature space. REK is specifically designed for time series data. Experimental studies throughout the dissertation show that the proposed models have competitive performance to other commonly used and state-of-the-art machine learning models in their given tasks.
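    As a rough illustration of realizing a kernel through a deep architecture, the PyTorch sketch below shares one embedding network between both inputs (the learnable mapping) and applies a small pairwise head to a symmetric combination of the embeddings so that k(x, y) = k(y, x). The layer sizes and the pairing scheme are illustrative assumptions, not the dissertation's actual architecture.

```python
# Sketch of a learnable kernel k(x, y) realized by a deep network.
import torch
import torch.nn as nn

class DeepEmbeddingKernel(nn.Module):
    def __init__(self, in_dim: int, emb_dim: int = 32):
        super().__init__()
        self.embed = nn.Sequential(   # learnable mapping into feature space
            nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim))
        self.kernel = nn.Sequential(  # learnable similarity on embedded pairs
            nn.Linear(2 * emb_dim, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        ex, ey = self.embed(x), self.embed(y)
        # Symmetric pair features (product and absolute difference)
        # guarantee k(x, y) == k(y, x).
        pair = torch.cat([ex * ey, (ex - ey).abs()], dim=-1)
        return self.kernel(pair).squeeze(-1)

dek = DeepEmbeddingKernel(in_dim=10)
x, y = torch.randn(4, 10), torch.randn(4, 10)
print(dek(x, y))  # pairwise kernel values in (0, 1)
```

    The head would typically be trained with a neighborhood- or similarity-based loss, consistent with the dissertation's description of optimizing data instances' neighborhoods in the feature space.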