2,105 research outputs found

    Significant Pattern Discovery in Gene Location and Phylogeny


    A data mining framework based on boundary-points for gene selection from DNA-microarrays: Pancreatic Ductal Adenocarcinoma as a case study

    Gene selection (or feature selection) from DNA-microarray data can draw on different techniques, which generally involve statistical tests, data mining and machine learning. In recent years there has been increasing interest in using hybrid sets of techniques to tackle the problem of meaningful gene selection; nevertheless, this issue remains a challenge. To address it, this paper proposes a novel hybrid framework based on data mining techniques and tuned to select gene subsets that are meaningfully related to the target disease studied in DNA-microarray experiments. For this purpose, the framework combines approaches such as statistical significance tests, cluster analysis, evolutionary computation, visual analytics and boundary points. The latter is the core technique of our proposal, allowing the framework to define two methods of gene selection. Another novelty of this work is the inclusion of patient age as an additional factor in the analysis, which can lead to more insight into the disease. The results reached in this research have been very promising and have shown their biological validity. Hence, our proposal has resulted in a methodology that can be followed in the gene selection process from DNA-microarray data.

    Phylogeny Of Penaeid Shrimps And Population Genetic Structure Of The Green Tiger Prawn (Penaeus Semisulcatus) In Malaysian Waters

    This study reports on the phylogeny of selected penaeids and the population genetics of the commercially important Green Tiger Prawn (Penaeus semisulcatus) in Malaysian waters. In the first part of this study, morphological shape variations among 12 species of the family Penaeidae, mainly from the northwest coast of Peninsular Malaysia, were investigated. This was done using Geometric Morphometrics (GM) of 18 homologous landmarks, analysed with Principal Component Analysis (PCA) and Canonical Variate Analysis (CVA) in the MorphoJ software. The shape variations were attributed to body shape, carapace (head) and telson (tail) among individuals. The first four components accounted for 76.24% and 78.47% of the variation for PCA and CVA, respectively. There is a tendency for closely related species to cluster together, although this is not absolutely consistent. To assess the phylogenetic signal, the morphometric data were mapped onto three phylogenetic trees (Neighbour Joining, NJ; Maximum Likelihood, ML; and Bayesian Inference, BI) generated from the partial mitochondrial Cytochrome Oxidase Subunit 1 (COI) gene of the same 12 species. Results revealed no significant phylogenetic signal for NJ but a significant phylogenetic signal (evolutionary significance) for ML and BI, which suggests that shape differences among all 12 penaeid prawn species are related to their evolutionary history. This discrepancy could be explained by NJ trees being prone to errors when dealing with deeper divergence times, whereas ML and BI trees, which apply a model of sequence evolution to the data, are ideal for phylogenetic tree reconstruction.

    High Performance Computing Techniques to Better Understand Protein Conformational Space

    This thesis presents an amalgamation of high performance computing techniques to gain better insight into protein molecular dynamics. Key aspects of protein function and dynamics can be learned from their conformational space. Datasets that represent the complex nuances of a protein molecule are high dimensional. Efficient dimensionality reduction becomes indispensable for the analysis of such large datasets. Dimensionality reduction forms a substantial portion of this work, and its application has been explored for other datasets as well. It begins with the parallelization of a known non-linear feature reduction algorithm called Isomap. The code for the algorithm was rewritten in C, with portions of it parallelized using OpenMP. Next, a novel data instance reduction method was devised which evaluates the information content offered by each data point, which ultimately helps truncate the dataset to far fewer data points. Once a framework has been established to reduce the number of variables representing a dataset, the work is extended to explore algebraic topology techniques to extract meaningful information from these datasets. This step is the one that helps in sampling the conformations of interest of a protein molecule. The method employs hierarchical clustering to identify classes within a molecule; thereafter, algebraic topology is used to analyze these classes. Finally, the work concludes by presenting an approach to the open problem of protein folding. A Monte-Carlo based tree search algorithm is put forth to simulate the pathway that a certain protein conformation takes to reach another conformation. The dissertation, in its entirety, offers solutions to a few problems that hinder progress on the vast problem of understanding protein dynamics. The motion of a protein molecule is guided by changes in its energy profile. In the course of this motion the molecule gradually slips from one energy class to another. Structurally, this switch is transient, spanning milliseconds or less, and hence is difficult to capture solely by work in wet laboratories.
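As a rough illustration of the Isomap algorithm named in the abstract above, the sketch below builds a k-nearest-neighbour graph and computes all-pairs shortest-path ("geodesic") distances. It is written in plain Python for readability and is not the thesis's C/OpenMP code; the function name `geodesic_distances` and the parameter `k` are assumptions made for this example, and the final multidimensional-scaling embedding step of Isomap is omitted.

```python
import math

def geodesic_distances(points, k):
    """Approximate geodesic distances as used by Isomap (illustrative sketch).

    points: list of equal-length coordinate tuples
    k: number of nearest neighbours linked to each point (hypothetical parameter name)
    """
    n = len(points)
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        # Rank all other points by Euclidean distance and keep the k closest.
        ranked = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d_ij, j in ranked[:k]:
            # Treat the neighbourhood graph as undirected.
            dist[i][j] = min(dist[i][j], d_ij)
            dist[j][i] = min(dist[j][i], d_ij)
    # Floyd-Warshall: shortest paths through the neighbourhood graph.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                via = dist[i][m] + dist[m][j]
                if via < dist[i][j]:
                    dist[i][j] = via
    return dist
```

The triple loop costs O(n^3) time, which is one reason large conformational datasets motivate parallel implementations. Pairs left at `math.inf` indicate a disconnected neighbourhood graph, usually a sign that `k` is too small.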

    Statistical shape analysis of large molecular data sets

    Protein classification databases are widely used in the prediction of protein structure and function, and amongst these databases the manually-curated Structural Classification of Proteins database (SCOP) is considered to be a gold standard. In SCOP, functional relationships are described by hyperfamily and superfamily categories and structural relationships are described by family, species and protein categories. We present a method to calculate a difference measure between pairs of proteins that can be used to reproduce SCOP2 structural relationship classifications, and that can also be used to reproduce a subset of functional relationship classifications at the superfamily level. Calculating the difference measure requires first finding the best correspondence between atoms in two protein configurations. The problem of finding the best correspondence is known as the unlabelled, partial matching problem. We consider the unlabelled, partial matching problem through a detailed analysis of the approach presented in Green and Mardia (2006). Using this analysis, and applying domain-specific constraints, we develop a new algorithm called GProtA for protein structure alignment. The proposed difference measure is constructed from the root mean squared deviation of the aligned protein structures and a binary similarity measure, where the binary similarity measure takes into account the proportions of atoms matching from each configuration. The GProtA algorithm and difference measure are applied to protein structure data taken from the Protein Data Bank. The difference measure is shown to correctly classify 62 of a set of 72 proteins into the correct SCOP family categories when clustered. Of the remaining 9 proteins, 2 are assigned incorrectly and 7 are considered indeterminate. In addition, a method for deriving characteristic signatures for categories is proposed. The signatures offer a mechanism by which a single comparison can be made to judge similarity to a particular category. Comparison using characteristic signatures is shown to correctly delineate proteins at the family level, including the identification of both families for a subset of proteins described by two family level categories.
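The two ingredients the abstract names for the difference measure, the root mean squared deviation of aligned structures and the proportions of atoms matched from each configuration, can be sketched as follows. This is only an illustration under stated assumptions: the function names `rmsd` and `matched_proportions` are hypothetical, the structures are assumed to be already superposed with a known atom correspondence, and the way the thesis combines the two quantities into one measure is not given in the abstract, so it is not reproduced here.

```python
import math

def rmsd(coords_a, coords_b):
    """RMSD between matched atoms of two already-superposed structures.

    coords_a[i] and coords_b[i] are the coordinates of the i-th matched pair.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2
        for a, b in zip(coords_a, coords_b)
        for xa, xb in zip(a, b)
    )
    return math.sqrt(squared / len(coords_a))

def matched_proportions(n_matched, n_atoms_a, n_atoms_b):
    """Fraction of each configuration's atoms that take part in the matching."""
    return n_matched / n_atoms_a, n_matched / n_atoms_b
```

For example, `rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 2)])` averages the squared displacements 0 and 4 over two matched atoms before taking the square root, and `matched_proportions(2, 4, 8)` reports that half of one structure and a quarter of the other were matched.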