75 research outputs found

    UPGMA and the normalized equidistant minimum evolution problem

    UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is a widely used clustering method. Here we show that UPGMA is a greedy heuristic for the normalized equidistant minimum evolution (NEME) problem, that is, finding a rooted tree that minimizes the minimum evolution score relative to the dissimilarity matrix among all rooted trees with the same leaf-set in which all leaves have the same distance to the root. We prove that the NEME problem is NP-hard. In addition, we present some heuristic and approximation algorithms for solving the NEME problem, including a polynomial time algorithm that yields a binary, rooted tree whose NEME score is within O(log^2 n) of the optimum.
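
For readers unfamiliar with the greedy merging step this abstract refers to, the following is a minimal sketch of textbook UPGMA in Python. It is not the paper's approximation algorithm and it does not compute a NEME score; the function name `upgma` and the input layout (a dict of pairwise dissimilarities keyed by label pairs) are assumptions made for illustration only.

```python
from itertools import combinations


def upgma(names, dist):
    """Textbook UPGMA (illustrative sketch, hypothetical interface).

    names: list of leaf labels.
    dist:  dict mapping a pair (a, b) of labels to their dissimilarity;
           every unordered pair of leaves must appear once.
    Returns (root, heights): the tree as nested tuples of labels, and a
    dict giving the height of every internal node.
    """
    size = {name: 1 for name in names}  # active clusters and their leaf counts
    d = {frozenset(pair): float(v) for pair, v in dist.items()}
    heights = {}
    while len(size) > 1:
        # Greedy step: merge the two closest active clusters.
        a, b = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        # Leaves are equidistant from the new node, at half the merge distance.
        heights[merged] = d[frozenset((a, b))] / 2.0
        # Size-weighted average gives the distance to every other cluster.
        for c in size:
            if c != a and c != b:
                d[frozenset((merged, c))] = (
                    size[a] * d[frozenset((a, c))] + size[b] * d[frozenset((b, c))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    (root,) = size
    return root, heights


# Toy input: an ultrametric on four leaves.
example = {("A", "B"): 2, ("C", "D"): 4,
           ("A", "C"): 6, ("A", "D"): 6, ("B", "C"): 6, ("B", "D"): 6}
tree, node_heights = upgma(["A", "B", "C", "D"], example)
print(tree)
```

On this toy input the closest pair is merged first, then the next closest, so the tree groups A with B and C with D before joining the two clusters at the root.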

    Reconstructing (super)trees from data sets with missing distances: Not all is lost

    The wealth of phylogenetic information accumulated over many decades of biological research, coupled with recent technological advances in molecular sequence generation, presents significant opportunities for researchers to investigate relationships across and within the kingdoms of life. However, to make best use of this data wealth, several problems must first be overcome. One key problem is finding effective strategies to deal with missing data. Here, we introduce Lasso, a novel heuristic approach for reconstructing rooted phylogenetic trees from distance matrices with missing values, for datasets where a molecular clock may be assumed. In contrast to other phylogenetic methods for partial datasets, Lasso possesses desirable properties such as its reconstructed trees being both unique and edge-weighted. These properties are achieved by Lasso restricting its leaf set to a large subset of all possible taxa, which in many practical situations is the entire taxa set. Furthermore, the Lasso approach is distance-based, rendering it very fast to run and suitable for datasets of all sizes, including large datasets such as those generated by modern Next Generation Sequencing technologies. To better understand the performance of Lasso, we assessed it by means of artificial and real biological datasets, showing its effectiveness in the presence of missing data. Furthermore, by formulating the supermatrix problem as a particular case of the missing data problem, we assessed Lasso's ability to reconstruct supertrees. We demonstrate that, although not specifically designed for such a purpose, Lasso performs better than or comparably with five leading supertree algorithms on a challenging biological data set. Finally, we make freely available a software implementation of Lasso so that researchers may, for the first time, perform both rooted tree and supertree reconstruction with branch lengths on their own partial datasets.

    The accuracy of several multiple sequence alignment programs for proteins

    BACKGROUND: There have been many algorithms and software programs implemented for the inference of multiple sequence alignments of protein and DNA sequences. The "true" alignment is usually unknown due to the incomplete knowledge of the evolutionary history of the sequences, making it difficult to gauge the relative accuracy of the programs. RESULTS: We tested nine of the most often used protein alignment programs and compared their results using sequences generated with the simulation software Simprot which creates known alignments under realistic and controlled evolutionary scenarios. We have simulated more than 30000 alignment sets using various evolutionary histories in order to define strengths and weaknesses of each program tested. We found that alignment accuracy is extremely dependent on the number of insertions and deletions in the sequences, and that indel size has a weaker effect. We also considered benchmark alignments from the latest version of BAliBASE and the results relative to BAliBASE- and Simprot-generated data sets were consistent in most cases. CONCLUSION: Our results indicate that employing Simprot's simulated sequences allows the creation of a more flexible and broader range of alignment classes than the usual methods for alignment accuracy assessment. Simprot also allows for a quick and efficient analysis of a wider range of possible evolutionary histories that might not be present in currently available alignment sets. Among the nine programs tested, the iterative approach available in Mafft (L-INS-i) and ProbCons were consistently the most accurate, with Mafft being the faster of the two

    Genomic Reconstruction of the Tree of Life

    A new methodology is presented for molecular phylogenetic analysis addressing a fundamental problem in biology, namely the reconstruction of the Tree of Life (TOL). Here, phylogenies are based on patterns of hybridization similarity in the organisms' DNA. Furthermore, phylogenies are based on a set of universal biomarkers (so-called nxh chips) chosen a priori, independently of the target group of organisms. Therefore, this methodology enables analyses of groups with biologically distant organisms, and hence could be scaled to obtain a universal tree of life. Unlike conventional molecular methods, it produces a hypothesis in a single run, without optimizing across numerous hypotheses for consensus. Prototype hypotheses agree with the biological Ground Truth in over 70% of the relationships. Higher quality nxh chips are likely to produce better hypotheses, but are more difficult to design.

    New Algorithms and Methodology for Analysing Distances

    Distances arise in a wide variety of different contexts, one of which is partitional clustering, that is, the problem of finding groups of similar objects within a set of objects. These groups are seemingly very easy to find for humans, but very difficult to find for machines as there are two major difficulties to be overcome: the first is defining an objective criterion for the vague notion of “groups of similar objects”, and the second is the computational complexity of finding such groups given a criterion. In the first part of this thesis, we focus on the first difficulty and show that even seemingly similar optimisation criteria used for partitional clustering can produce vastly different results. In the process of showing this we develop a new metric for comparing clustering solutions called the assignment metric. We then prove some new NP-completeness results for problems using two related “sum-of-squares” clustering criteria. Closely related to partitional clustering is the problem of hierarchical clustering. We extend and formalise this problem to the problem of constructing rooted edge-weighted X-trees, that is, trees with a leaf set X. It is well known that an X-tree can be uniquely reconstructed from a distance on X if the distance is an ultrametric. But in practice the complete distance on X may not always be available. In the second part of this thesis we look at some of the circumstances under which a tree can be uniquely reconstructed from incomplete distance information. We use a concept called a lasso and give some theoretical properties of a special type of lasso. We then develop an algorithm which can construct a tree together with a lasso from partial distance information and show how this can be applied to various incomplete datasets.
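
The ultrametric property mentioned in this abstract can be checked with the standard three-point condition: among the three pairwise distances of any triple, the two largest must be equal. The sketch below illustrates that check for a complete distance only; the function name `is_ultrametric` and the dict-based input are assumptions for illustration, and the thesis's lasso machinery for incomplete distances is not reproduced here.

```python
from itertools import combinations


def is_ultrametric(points, dist, tol=1e-9):
    """Check the three-point condition on a complete, symmetric distance.

    points: list of labels.
    dist:   dict mapping a pair (a, b) of labels to a distance, with every
            unordered pair present once (hypothetical input layout).
    """
    d = {frozenset(pair): float(v) for pair, v in dist.items()}
    for x, y, z in combinations(points, 3):
        # Sort the three pairwise distances of the triple.
        _, middle, largest = sorted(
            (d[frozenset((x, y))], d[frozenset((x, z))], d[frozenset((y, z))])
        )
        # Ultrametric: the two largest distances coincide.
        if largest - middle > tol:
            return False
    return True


ultra = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6}
not_ultra = {("A", "B"): 2, ("A", "C"): 5, ("B", "C"): 6}
print(is_ultrametric(["A", "B", "C"], ultra))
print(is_ultrametric(["A", "B", "C"], not_ultra))
```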

    Phylogenomic characterization of flaviviruses

    Background: The occurrences of global viral pandemics have been rising as increased travel between distant countries has introduced previously endemic viruses to new environments. Major contributors to global human hemorrhagic and neurological diseases with high mortality rates include half of the ca. 70 species of the genus Flavivirus. The most widespread and well-known flaviviruses are Dengue virus, Japanese encephalitis virus, West Nile virus and Zika virus. Although the transmission routes of major viruses are well-documented and thoroughly researched, the knowledge has been gained from past outbreaks, which has been a limitation in surveillance of novel flaviviruses. Thus, having early information about potential hosts is essential in controlling and preventing viral outbreaks. Aims: The goal of the master’s thesis is to characterize the codon and nucleotide compositions of flaviviruses and to assess their potential use for the identification of putative hosts. This methodology will be utilized to develop a new algorithm capable of identifying optimal hosts through a simple comparative codon usage analysis. This information will be highly valuable to estimate the risk of spread of a virus. Methods: The genomic characterization of flaviviruses was done with computational biology methods. Computed codon usages were analyzed with clustering methods to identify subgroups of viruses and their optimal hosts. The rationale behind this methodology was that codon usages vary among species and this variability is driven by the virus adaptation to the hosts. Results: (1) Genotypes of Zika viruses showed distinct codon usage patterns, which linked the origin of American and European virus cases to the Asian genotype. (2) Distinct usage patterns were similarly observed when the methodology was applied to other major flaviviruses. (3) Optimal hosts for mosquito-borne flaviviruses included vertebrates and Aedes mosquitoes, whereas tick-borne viruses were optimized to ticks. Aedes mosquitoes were also optimal for insect-only flaviviruses. Culex and Anopheles mosquitoes were suboptimal for all groups. Moreover, flaviviruses clustered based on established vector-based classification, host type preferences and phylogeny. The identified hosts were in accordance with previous studies done in the field and laboratory. Conclusions: The proposed methodology based on codon usages is able to estimate hosts for flaviviruses within a close range. The algorithm can be implemented in computationally weak equipment, thus it may be deployed fast and on-site during viral pandemics. In further studies this methodology, with minor modifications, could be utilized to predict putative hosts of other viruses. A scientific article describing the host identification algorithm is under preparation (appendix 4).
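
As a rough illustration of what a comparative codon usage analysis can involve, the sketch below computes relative codon frequencies for a coding sequence and a simple distance between two usage profiles. The abstract does not specify the thesis's actual algorithm, so the function names `codon_usage` and `usage_distance`, the choice of a Manhattan (L1) distance, and the toy sequences are all assumptions.

```python
from collections import Counter


def codon_usage(seq):
    """Relative frequency of each codon in an in-frame coding sequence."""
    seq = seq.upper().replace("U", "T")  # accept RNA as well as DNA
    usable = len(seq) - len(seq) % 3     # ignore a trailing partial codon
    codons = (seq[i:i + 3] for i in range(0, usable, 3))
    # Skip codons containing ambiguous bases such as N.
    counts = Counter(c for c in codons if set(c) <= set("ACGT"))
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}


def usage_distance(u, v):
    """Manhattan distance between two codon usage profiles (assumed metric)."""
    return sum(abs(u.get(c, 0.0) - v.get(c, 0.0)) for c in set(u) | set(v))


virus = codon_usage("ATGGCCGCCTAA")   # toy sequence, not real data
host_a = codon_usage("ATGGCCGCCTAA")  # identical usage
host_b = codon_usage("ATGGCTGCTTGA")  # different synonymous codons
print(usage_distance(virus, host_a), usage_distance(virus, host_b))
```

Under this assumed metric, a smaller distance means the two codon usage profiles are more alike; the clustering step described in the abstract would operate on profiles like these.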

    Analysis of the genetic diversity of kosambi (Schleichera oleosa (Lour.) Oken) in Malang Raya based on sequence-related amplified polymorphism (SRAP) molecular markers

    This study aims to determine the genetic diversity of Schleichera oleosa plants in Malang Raya using Sequence-Related Amplified Polymorphism (SRAP) molecular markers. The diversity of S. oleosa was analyzed to determine the parental origin of its distribution and the environmental factors that influence its genetic diversity in the Malang area. SRAP markers target the Open Reading Frame (ORF) regions of genomic DNA that are responsible for phenotypic gene expression. Abiotic factors such as temperature, light intensity, air humidity, soil moisture, soil pH and soil light intensity were also analyzed to see their effect on the genetic diversity of S. oleosa in the Malang area. This research uses descriptive, exploratory and quantitative methods. Exploration was carried out in Malang City and Malang Regency at altitudes varying between 366 and 640 masl. Molecular analysis was performed using 16 combinations of SRAP molecular markers (ME1-ME4 and EM1-EM4). The DNA amplification results were then visualized, and dendrograms were built with the UPGMA (Unweighted Pair Group with Arithmetic Average) method using the PAST program to see how the samples group. The molecular results showed that the S. oleosa samples formed 4 clusters. This grouping is not influenced by altitude but is influenced by the relationship between genetic and environmental factors. Environmental factors were analyzed using PCA (Principal Component Analysis). The most influential factors include temperature, air humidity and soil moisture. The results showed that the SRAP markers were able to reveal the genetic diversity of S. oleosa in the Malang Raya area and that there was a positive correlation between genetic factors (plant origin) and the environmental factors of the S. oleosa habitat.

    Dissecting multiple sequence alignment methods: the analysis, design and development of generic multiple sequence alignment components in SeqAn

    Multiple sequence alignments are an indispensable tool in bioinformatics. Many applications rely on accurate multiple alignments, including protein structure prediction, phylogeny and the modeling of binding sites. In this thesis we dissected and analyzed the crucial algorithms and data structures required to construct such a multiple alignment. Based upon that dissection, we present a novel graph-based multiple sequence alignment program and a new method for multi-read alignments occurring in assembly projects. The advantage of the graph-based alignment is that a single vertex can represent a single character, a large segment or even an abstract entity such as a gene. This gives rise to the opportunity to apply the consistency-based progressive alignment paradigm to alignments of genomic sequences. The proposed multi-read alignment method outperforms similar methods in terms of alignment quality and it is apparently one of the first methods that can readily be used for insert sequencing. An important aspect of this thesis was the design, the development and the integration of the essential multiple sequence alignment components in the SeqAn library. SeqAn is a software library for sequence analysis that provides the core algorithmic components required to analyze large-scale sequence data. SeqAn aims at bridging the current gap between algorithm theory and available practical implementations in bioinformatics. Hence, alongside the theoretical development of the methods, we always describe the actual implementation of the data structures and algorithms, in order to strengthen the use of SeqAn as an experimental platform for rapidly developing and testing applications. All presented methods are part of the open source SeqAn library that can be downloaded from our website, www.seqan.de.