
    Mitochondrial DNA Haplogroup D4a Is a Marker for Extreme Longevity in Japan

    We report results from the analysis of complete mitochondrial DNA (mtDNA) sequences from 112 Japanese semi-supercentenarians (aged above 105 years), combined with previously published data from 96 individuals in each of three non-disease phenotypes (centenarians aged 99–105 years, healthy non-obese males, and obese young males) and four disease phenotypes (diabetics with and without angiopathy, and Alzheimer's and Parkinson's disease patients). We analyze the correlation between mitochondrial polymorphisms and the longevity phenotype using two different methods. We first use an exhaustive algorithm to identify all maximal patterns of polymorphisms shared by at least five individuals and define a significance score for enrichment of the patterns in each phenotype relative to healthy normals. Our study confirms the correlations observed in a previous study showing enrichment of a hierarchy of haplogroups in the D clade for longevity. For the extreme longevity phenotype we see a single statistically significant signal: a progressive enrichment of certain “beneficial” patterns in centenarians and semi-supercentenarians in the D4a haplogroup. We then use principal component spectral analysis of the SNP–SNP covariance matrix, comparing the measured eigenvalues to a null distribution of eigenvalues obtained from Gaussian datasets, to determine whether the correlations in the data (due to longevity) arise from some property of the mutations themselves or from population structure. The conclusion is that the correlations are entirely due to population structure (the phylogenetic tree). We find no signal for a functional mtDNA SNP correlated with longevity. The fact that the correlations come from population structure suggests that hitch-hiking on autosomal events is a possible explanation for the observed correlations.
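The spectral test described above can be illustrated with a minimal sketch: compare the top eigenvalue of the SNP–SNP covariance matrix against a null distribution built from Gaussian data of matched shape and per-site variance. All data and parameters here are hypothetical stand-ins, not the study's actual matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary SNP matrix: rows = individuals, columns = mtDNA sites.
n_ind, n_snp = 200, 50
X = rng.integers(0, 2, size=(n_ind, n_snp)).astype(float)

def top_eigenvalues(M, k=1):
    """Top-k eigenvalues of the SNP-SNP covariance matrix, largest first."""
    C = np.cov(M, rowvar=False)
    w = np.linalg.eigvalsh(C)  # ascending order
    return w[::-1][:k]

def null_top_eigenvalues(shape, stds, reps=100):
    """Null: top eigenvalue of Gaussian data matched in shape and
    per-column standard deviation to the observed matrix."""
    tops = np.empty(reps)
    for i in range(reps):
        G = rng.normal(0.0, stds, size=shape)
        tops[i] = top_eigenvalues(G)[0]
    return tops

obs = top_eigenvalues(X)[0]
null = null_top_eigenvalues(X.shape, X.std(axis=0, ddof=1), reps=100)

# An observed top eigenvalue far above the null's upper quantile indicates
# structure (e.g. phylogeny / population structure) beyond site-wise noise.
excess = obs > np.quantile(null, 0.95)
```

Because the toy matrix is i.i.d. noise, `excess` will usually be false; on real mtDNA data the study found large eigenvalues that were explained entirely by the phylogenetic tree, not by functional SNPs.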

    Breast cancer prognosis by combinatorial analysis of gene expression data

    INTRODUCTION: The potential of applying data-analysis tools to microarray data for diagnosis and prognosis is illustrated on the recent breast cancer dataset of van 't Veer and coworkers. We re-examine that dataset using the novel technique of logical analysis of data (LAD), with the double objective of discovering patterns characteristic of cases with good or poor outcomes and using them for accurate and justifiable predictions, and of deriving novel information about the role of genes, the existence of special classes of cases, and other factors. METHOD: Data were analyzed using the combinatorics- and optimization-based method of LAD, recently shown to provide highly accurate diagnostic and prognostic systems in cardiology, cancer proteomics, hematology, pulmonology, and other disciplines. RESULTS: LAD identified a subset of 17 of the 25,000 genes capable of fully distinguishing between patients with poor and good prognoses. An extensive list of 'patterns' or 'combinatorial biomarkers' (that is, combinations of genes and limitations on their expression levels) was generated, and 40 patterns were used to create a prognostic system, shown to have 100% and 92.9% weighted accuracy on the training and test sets, respectively. The prognostic system uses fewer genes than other methods and has similar or better accuracy than those reported in other studies. Of the 17 genes identified by LAD, three (respectively, five) were shown to play a significant role in determining poor (respectively, good) prognosis. Two new classes of patients (described by similar sets of covering patterns, gene-expression ranges, and clinical features) were discovered. As a by-product of the study, it is shown that the training and test sets of van 't Veer have differing characteristics.
CONCLUSION: The study shows that LAD provides an accurate and fully explanatory prognostic system for breast cancer using genomic data (that is, a system that, in addition to predicting good or poor prognosis, provides an individualized explanation of the reasons for that prognosis for each patient). Moreover, the LAD model provides valuable insights into the roles of individual and combinatorial biomarkers, allows the discovery of new classes of patients, and generates a vast library of biomedical research hypotheses.
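For readers unfamiliar with LAD, a minimal sketch of what a 'pattern' (combinatorial biomarker) is: a conjunction of interval conditions on expression levels that covers many cases of one class and none of the other. Gene names and values below are hypothetical, not the actual 17-gene model.

```python
def covers(pattern, obs):
    """pattern: dict feature -> (low, high) interval conditions;
    obs: dict feature -> expression value.
    The observation is covered iff it satisfies every condition."""
    return all(lo <= obs[f] <= hi for f, (lo, hi) in pattern.items())

# Toy data: two hypothetical genes; class labels are illustrative only.
good = [{"g1": 0.8, "g2": 0.2}, {"g1": 0.9, "g2": 0.1}]
poor = [{"g1": 0.2, "g2": 0.7}, {"g1": 0.3, "g2": 0.9}]

# A candidate "good prognosis" pattern: g1 high AND g2 low.
pattern = {"g1": (0.5, 1.0), "g2": (0.0, 0.4)}

coverage_good = sum(covers(pattern, o) for o in good)  # -> 2 (all good cases)
coverage_poor = sum(covers(pattern, o) for o in poor)  # -> 0 (no poor cases)
```

A prognostic system then combines many such patterns (40 in the study above), and the patterns covering a given patient supply the individualized explanation of the prediction.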

    Developed Algorithms for Maximum Pattern Generation in Logical Analysis of Data

    ABSTRACT: Data is at the heart of every industry and organization. Many companies hold large amounts of data yet fail to gain valuable insight from it, often because the data are not used productively. Making important and timely decisions requires suitable tools for extracting practical, reliable information from large amounts of data. As the amount and variety of data grow, traditional tools have been abandoned, and the importance of providing efficient and fruitful methods for data analysis keeps growing. Data classification is one way to meet this need. Logical Analysis of Data (LAD) is a data-analysis methodology that combines optimization, combinatorics, and Boolean logic, and is applicable to classification problems. Its aim is to discover hidden logical patterns that differentiate the observations of one class from all other observations. Patterns are the key building blocks of LAD.
Choosing a set of patterns capable of classifying observations correctly is the essential goal of LAD, and accuracy measures how successfully this goal is met. In this research study, one specific kind of pattern, called the maximum α-pattern, is considered. This particular pattern helps build highly accurate LAD classification models. Despite the various methodologies presented to generate maximum α-patterns, no meta-heuristic algorithm has yet been developed for this problem. This study therefore develops a meta-heuristic algorithm for generating maximum α-patterns that is effective both in terms of computational time and in terms of accuracy. The proposed algorithm is based on the Simulated Annealing approach and generates maximum α-patterns in a way that differs from the best linear approximation model proposed in the literature. The performance of the new algorithm is then evaluated. The results of the statistical Friedman test show that the algorithm developed here has the best performance in terms of computational time; moreover, its accuracy is competitive with other methods, with statistically high levels of confidence.
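A generic illustration of the idea, not the thesis's actual algorithm: simulated annealing over conjunctions of literals anchored at a seed observation, on toy Boolean data. Dropping a literal generalizes the pattern (more coverage); restoring one specializes it; negative coverage beyond the α allowance is penalized. All data, names, and parameters are assumptions for illustration.

```python
import math
import random

random.seed(1)

# Toy binary data: observations over 6 Boolean features.
pos = [(1, 1, 0, 1, 0, 1), (1, 1, 0, 0, 0, 1), (1, 0, 0, 1, 0, 1), (1, 1, 1, 1, 0, 1)]
neg = [(0, 1, 0, 1, 1, 0), (0, 0, 1, 0, 1, 1), (1, 0, 1, 0, 1, 0), (0, 1, 1, 1, 1, 0)]
seed_obs = pos[0]  # the pattern must keep covering this observation

def covers(term, obs):
    """term: set of (feature, value) literals forming a conjunction."""
    return all(obs[f] == v for f, v in term)

def score(term, alpha=0.0):
    """Positive coverage, heavily penalizing negatives above the alpha allowance."""
    p = sum(covers(term, o) for o in pos)
    n = sum(covers(term, o) for o in neg)
    return p - 10.0 * max(0, n - alpha * len(neg))

def anneal(steps=2000, t0=2.0):
    # Start from the full term describing the seed observation.
    term = {(f, seed_obs[f]) for f in range(len(seed_obs))}
    best = set(term)
    for k in range(steps):
        t = t0 * (1 - k / steps) + 1e-9  # linear cooling schedule
        f = random.randrange(len(seed_obs))
        lit = (f, seed_obs[f])
        cand = set(term)
        # Neighbor move: drop a literal (generalize) or restore it (specialize).
        if lit in cand and len(cand) > 1:
            cand.remove(lit)
        else:
            cand.add(lit)
        # Metropolis acceptance: always take improvements, sometimes worsenings.
        if score(cand) >= score(term) or random.random() < math.exp(
            (score(cand) - score(term)) / t
        ):
            term = cand
        if score(term) > score(best):
            best = set(term)
    return best

pattern = anneal()
```

Since every candidate term is a subset of the seed observation's literals, the returned pattern always covers the seed; the annealing pressure drives it toward covering as many positives as the α constraint allows.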

    Pattern-Based Feature Selection in Genomics and Proteomics

    Abstract. A major difficulty in data analysis is the size of the datasets, which frequently contain large numbers of irrelevant or redundant variables. In particular, in some of the most rapidly developing areas of bioinformatics, e.g., genomics and proteomics, the expression intensity levels of tens of thousands of genes or proteins are reported for each observation, even though very small subsets of these features are sufficient for distinguishing positive observations from negative ones. In this study, we describe a two-step procedure for feature selection. In a first “filtering” stage, a relatively small subset of relevant features is identified on the basis of several combinatorial, statistical, and information-theoretical criteria. In the second stage, the importance of the variables selected in the first step is evaluated based on the frequency of their participation in the set of all maximal patterns (defined as in the Logical Analysis of Data, and generated using an efficient total-polynomial-time algorithm), and low-impact variables are eliminated. This step is applied iteratively, until arriving at a Pareto-optimal “support set”, which balances the conflicting criteria of simplicity and accuracy.
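The second, iterative elimination stage might be sketched as follows. Toy patterns and the threshold are hypothetical; real maximal patterns carry interval conditions, here reduced to just the sets of features they use.

```python
from collections import Counter

# Toy stand-in: each maximal pattern represented by its feature set
# (feature names are hypothetical).
maximal_patterns = [
    {"g1", "g7"}, {"g1", "g7"}, {"g1", "g7"},
    {"g1", "g3"}, {"g3", "g9"},
]

def participation(patterns, feats):
    """Count how often each surviving feature appears in patterns
    that use only surviving features."""
    c = Counter()
    for p in patterns:
        if p <= feats:
            c.update(p)
    return c

def prune(patterns, feats, threshold):
    """Iteratively drop features participating in fewer than `threshold`
    maximal patterns; patterns using a dropped feature no longer count.
    Stops when the feature set is stable."""
    feats = set(feats)
    while True:
        counts = participation(patterns, feats)
        weak = {f for f in feats if counts[f] < threshold}
        if not weak:
            return feats, [p for p in patterns if p <= feats]
        feats -= weak

support, kept = prune(maximal_patterns, {"g1", "g3", "g7", "g9"}, threshold=3)
# Here "g9" (1 pattern) and then "g3" fall below the threshold,
# leaving the support set {"g1", "g7"} and its three covering patterns.
```

The stable set that survives plays the role of the “support set” above; in the actual procedure the threshold trades off simplicity against accuracy rather than being fixed in advance.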