30 research outputs found

    Application of bioinformatics algorithms for the detection of mutating cyberattacks

    Get PDF
    The functionality of any system can be represented as a set of commands that lead to a change in the state of the system. The intrusion detection problem for signature-based intrusion detection systems is equivalent to matching the sequences of operational commands executed by the protected system against known attack signatures. Various mutations in attack vectors (including replacing commands with equivalent ones, rearranging commands and blocks of commands, and adding garbage and empty commands to the sequence) reduce the effectiveness and accuracy of intrusion detection. The article analyzes existing solutions in the field of bioinformatics and considers their applicability to the problem of identifying polymorphic attacks with signature-based intrusion detection systems. A new approach to the detection of polymorphic attacks is discussed, based on the suffix-tree technology used in the assembly of genomic sequences and in checking their similarity. The use of bioinformatics technology achieves intrusion detection accuracy on par with modern intrusion detection systems (more than 0.90) while surpassing them in memory efficiency, speed, and robustness to changes in attack vectors. To improve accuracy, a number of modifications of the developed algorithm were carried out, as a result of which the detection accuracy increased to 0.95 for mutation levels of up to 10% in the sequence. The developed approach can be used for intrusion detection both in conventional computer networks and in modern reconfigurable network infrastructures with limited resources (Internet of Things, networks of cyber-physical objects, wireless sensor networks).
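    The core idea above (index a known signature so substring queries are fast, then score a mutated command stream by its shared substrings) can be illustrated with a toy sketch. The sketch below is not the paper's algorithm: the naive suffix trie, the window length k, and the single-character command encoding are all illustrative assumptions.

```python
# Toy illustration of suffix-tree-style signature matching (not the
# paper's algorithm). Commands are modeled as single characters here.

class SuffixTrie:
    """Naive suffix trie: O(n^2) construction, adequate for short signatures."""
    def __init__(self, text):
        self.root = {}
        for i in range(len(text)):
            node = self.root
            for ch in text[i:]:
                node = node.setdefault(ch, {})

    def contains(self, pattern):
        """True iff `pattern` is a substring of the indexed text."""
        node = self.root
        for ch in pattern:
            if ch not in node:
                return False
            node = node[ch]
        return True

def similarity(signature, observed, k=4):
    """Fraction of k-length windows of `observed` that occur in `signature`.
    Mutations (junk insertions, swaps) lower the score but rarely zero it."""
    trie = SuffixTrie(signature)
    windows = [observed[i:i + k] for i in range(len(observed) - k + 1)]
    if not windows:
        return 0.0
    return sum(trie.contains(w) for w in windows) / len(windows)

signature = "open;read;encode;send;close"
mutated = "open;;read;xx;encode;send;close"   # junk commands inserted
print(similarity(signature, mutated))          # ~0.7: most windows still match
```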

    Development of Copy Number Variation Detection Algorithms and Their Application to Genome Diversity Studies

    Full text link
    Copy number variation (CNV) is an important class of variation that contributes to genome evolution and disease. CNVs that become fixed in a species give rise to segmental duplications, and already duplicated sequence is prone to subsequent gain and loss, leading to additional copy-number variation. Multiple methods exist for defining CNV based on high-throughput sequencing data, including analysis of mapped read-depth. However, accurately assessing CNV can be computationally costly, and multi-mapping-based approaches may not specifically distinguish among paralogs or gene families. We present two rapid CNV estimation algorithms, QuicK-mer and fastCN, for second-generation short-read sequencing data. The QuicK-mer program is a paralog-sensitive CNV detector that relies on enumerating unique k-mers from a pre-tabulated reference genome. The latest version, QuicK-mer 2.0, utilizes a newly constructed k-mer counting core based on the DJB hash function and permits multithreaded CNV counting of a large input file. As a result, QuicK-mer 2.0 can produce copy-number profiles from a 10x-coverage mammalian genome in less than 5 minutes. The second CNV estimator, fastCN, is based on sequence mapping and tolerates mismatches. The pipeline is built around the mrsFAST read mapper and adds no extra runtime beyond the mrsFAST mapping process itself. We validated the accuracy of both approaches with existing data on human paralogous regions from the 1000 Genomes Project. We also employed QuicK-mer to assess copy number variation on chimpanzee and human Y chromosomes. CNV has also been associated with phenotypic changes during animal domestication; large-scale CNVs were previously observed in the domestication of cattle, pigs, and chickens. We assessed the role of CNV in dog domestication through a comparison of semi-feral village dogs and a global collection of wolves. Our CNV selection scan uncovered many previously confirmed duplications and deletions but did not identify fixed variants that may have contributed to the initial domestication process. During this selection study, we uncovered CNVs that are errors in the existing canine reference assembly. We attempted to complement the current CanFam3.1 reference with a de novo genome assembly of a Great Dane breed dog named Zoey. Sequencing to 50x coverage with PacBio long reads (median insert size 7.8 kbp) was conducted. The resulting assembly shows significant improvement, with a 20x increase in contiguity and a two-thirds reduction in unplaced contigs. The Zoey Great Dane assembly closes 80% of the CanFam3.1 gaps, where high GC content was the major culprit in the original assembly. Using unique k-mers assigned within these closed gaps, QuicK-mer found that many of these regions are fixed across dogs, while a small proportion shows variability.
    PhD, Human Genetics, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/150064/1/feichens_1.pdf
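    The abstract names two concrete ingredients of QuicK-mer 2.0: pre-tabulated unique k-mers and a counting core built on the DJB hash. The djb2 constants (seed 5381, multiplier 33) are the standard ones; everything else in the sketch below (table size, skipping of ambiguous bases) is an assumption for illustration, and the real tool is a multithreaded native implementation, not pure Python.

```python
from collections import defaultdict

def djb2(s):
    """Classic DJB (djb2) string hash: h = h*33 + c, seeded with 5381."""
    h = 5381
    for ch in s:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFFFFFFFFFF  # keep 64 bits
    return h

def count_kmers(sequence, k, table_bits=20):
    """Count k-mers by hashing them into a fixed-size table, roughly how a
    k-mer counting core works; collisions are simply ignored in this toy."""
    size = 1 << table_bits
    counts = defaultdict(int)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if "N" in kmer:                 # skip ambiguous bases
            continue
        counts[djb2(kmer) % size] += 1
    return counts

profile = count_kmers("ACGTACGTGGTTACGTACGT", k=5)
print(len(profile), "distinct table slots touched")
```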

    Reference-guided assembly of metagenomes

    Get PDF
    Microorganisms play an important role in all of the Earth's ecosystems, and are critical for the health of humans [1], plants, and animals. Most microbes are not easily cultured [2]; yet metagenomics, the analysis of organismal DNA sequences obtained directly from an environmental sample, enables the study of these microorganisms. Metagenomic assembly is a computational process aimed at reconstructing genes and genomes from metagenomic mixtures. The two main paradigms for this task are de novo assembly (i.e., reconstructing genomes directly from the read data) and reference-guided assembly (i.e., reconstructing genomes using closely related organisms). Because the latter paradigm has a high computational cost, due to the mapping of tens of millions of reads to thousands of full genome sequences, metagenomic studies have primarily relied on the former. However, the increased availability of high-throughput sequencing technologies has generated thousands of bacterial genomes, making reference-guided assembly a valuable resource despite its computational cost. Thus, this study describes a novel metagenome assembly approach, called MetaCompass, that combines reference-guided assembly and de novo assembly, and is organized in the following stages: (i) selecting reference genomes from a database using a metagenomic taxonomy classification software that combines gene and genome comparison methods, achieving species- and strain-level resolution; (ii) performing reference-guided assembly in a new manner, which uses the minimum set cover principle to remove redundancy in a metagenome read mapping while performing consensus calling; and (iii) performing de novo assembly using the reads that have not been mapped to any reference genome. We show that MetaCompass improves the most common metrics used to evaluate assembly quality (contiguity, consistency, and reference-based metrics) for both synthetic and real datasets, such as the ones gathered in the Human Microbiome Project (HMP) [3], and it also facilitates the assembly of low-abundance microorganisms retrieved with the reference-guided approach. Lastly, we used our HMP assembly results to characterize the relative advantages and limitations of de novo and reference-guided assembly approaches, thereby providing guidance on analytical strategies for characterizing the human-associated microbiota
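    Stage (ii) above rests on the minimum set cover principle: keep as few reference genomes as possible while still accounting for every mapped read. Exact set cover is NP-hard, so a greedy approximation is the textbook approach; the sketch below is that generic greedy rule with hypothetical read and genome names, not MetaCompass's implementation.

```python
def greedy_set_cover(reads_by_genome):
    """Repeatedly pick the genome covering the most still-uncovered reads
    (the classic greedy rule, within a ln(n) factor of the optimum)."""
    uncovered = set().union(*reads_by_genome.values())
    chosen = []
    while uncovered:
        best = max(reads_by_genome,
                   key=lambda g: len(reads_by_genome[g] & uncovered))
        gained = reads_by_genome[best] & uncovered
        if not gained:               # leftover reads map to no genome
            break
        chosen.append(best)
        uncovered -= gained
    return chosen

# Hypothetical mapping: which reads aligned to each candidate reference.
mapping = {
    "E_coli_K12":  {"r1", "r2", "r3", "r4"},
    "E_coli_O157": {"r3", "r4", "r5"},
    "S_enterica":  {"r5", "r6"},
}
print(greedy_set_cover(mapping))     # ['E_coli_K12', 'S_enterica']
```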

    A large-scale computational framework for comparative analyses in population genetics and metagenomics

    Get PDF
    Population genetics is the study of spatio-temporal genetic variation among individuals. Its purpose is to understand evolution: the change in the frequency of alleles over time. The effects of these alleles are expressed on different levels of biological organization, from molecular complexes to entire organisms. Eventually, they affect traits that can influence the survival and reproduction of organisms. Fitness is the probability of transferring alleles to subsequent generations through successful survival and reproduction. Due to differential fitness, phenotypic properties that confer beneficial effects on survival and reproduction may become prevalent in a population. Random mutations introduce new alleles into a population. The underlying changes in DNA sequences can be caused by replication errors, failures in DNA repair processes, or the insertion and deletion of transposable elements. In sexual organisms, genetic recombination randomly mixes the alleles on chromosomes, yielding new combinations of alleles even though it does not change the allele frequencies. On the molecular level, mutations at a set of loci may cause a gain or loss of function resulting in entirely different phenotypes, thereby influencing the survival of an organism. Despite the dominance of neutral mutations, the accumulation of small changes over time may affect fitness and further contribute to evolution. The goal of this study is to provide a framework for comparative analysis of large-scale genomic datasets, especially those covering a population within a species, such as the 1001 Genomes Project of Arabidopsis thaliana and the 1000 Genomes Project of humans, as well as metagenomics datasets. Algorithms have been developed to provide the following features: 1) denoising and improving the effective coverage of raw genomic datasets (Trowel), 2) performing multiple whole genome alignments (WGAs) and detecting small variations in a population (Kairos), 3) identifying structural variants (SVs) (Apollo), and 4) classifying microorganisms in metagenomics datasets (Poseidon). The algorithms do not furnish any interpretation of raw genomic data but provide analyses as a basis for biological hypotheses. With the advances in distributed and parallel computing, many modern bioinformatics algorithms have come to utilize multi-core processing on CPUs or GPUs. Increased computational capacity allows us to solve bigger and more complex problems. However, such hardware advances do not spontaneously give rise to improved utilization of large datasets and do not by themselves bring insights to biological questions. Smart data structures and algorithms are required in order to exploit the enhanced computing power and to extract high-quality information. For population genetics, an efficient representation of a pan-genome and the relevant formulas need to be established. On top of such a representation, sequence alignments play a pivotal role in solving biological problems: one may calculate allele frequencies, detect rare variants, associate genotypes with phenotypes, and infer the causality of certain diseases. To detect mutations in a population, the conventional alignment method is enhanced so that multiple genomes are aligned simultaneously. The number of complete genome sequences has steadily increased, but the analysis of large, complex datasets remains challenging. 
Next Generation Sequencing (NGS) technology is considered one of the great advances in modern biology, and has led to a dramatically more precise and detailed understanding of genomes and their activities. The contiguity and accuracy of sequencing reads have been improving, so that a complete genome sequence of a single cell may become obtainable from a sequencing library in the future. Though chemical and optical engineering are the main drivers of advances in sequencing technology, informatics and computer engineering have significantly influenced the quality of sequences. Genomic sequencing data contain errors in the form of substitutions, insertions, and deletions of nucleotides, and the read length is far shorter than a given genome. These problems can be alleviated by means of error correction and genome assembly, leading to more accurate downstream analyses. Short read aligners have been the key ingredient for measuring and observing genetic mutations using Illumina sequencing technology, the dominant technology of the last decade. As long reads from newer methods or assembled contigs become accessible, mapping schemes that capture long-range context without lingering in local matches should be devised. Parameters for short read aligners, such as the number of mismatches and the gap-opening and gap-extension penalties, are not directly applicable to long-read alignments. At the other end of the spectrum, whole genome aligners (WGA) attempt to solve the alignment problem in a much longer context, providing essential data for comparative studies. However, available WGA algorithms are not yet optimized for practical use in population genetics due to their high computational demands. Moreover, too little attention has been paid to defining an ideal data format for applications in comparative genomics. To deal with datasets representing a large population of diverse individuals, multiple sequence alignment (MSA) algorithms should be combined with WGA methods, an approach known as multiple whole genome alignment (MWGA). Though several MWGA algorithms have been proposed, their accuracy has not been clearly measured. In fact, known quality assessment tools have yielded highly fluctuating results depending on the selection of organisms and sequencing profiles. Of even more serious concern, the experiments used to measure the performance of MWGA methods have been only ambiguously described, making it difficult to interpret multiple alignment results. With precise locations of variants known from simulations, and with standardized statistics, I present a far more comprehensive method to measure the accuracy of an MWGA algorithm. Metagenomics is the study of the genetic composition of a given community (often predominantly microbial). It overcomes the limitation of having to culture each organism for genome sequencing, and also provides quantitative information on the composition of a community. Though an environmental sample provides more natural genetic material, the complexity of the analyses is greatly increased: the number of species can be very large, and only small portions of a genome may be sampled. I provide an algorithm, Poseidon, that classifies sequencing reads to taxonomy identifiers at species resolution and helps quantify their relative abundances in the samples. The interactions among individual bacteria in a population can result in both conflict and cooperation, and thus a mixture of diverse bacterial species shows a set of functional adaptations to a particular environment. 
The composition of species can be changed by distinct biotic or abiotic factors, which may lead to a subsequent alteration in the susceptibility of a host to a certain disease. In turn, the basic concerns of a metagenomics study are the accurate quantification of species and the deciphering of their functional roles in a given environment. In summary, this work presents advanced bioinformatics methods: Trowel, Kairos, Apollo, and Poseidon. Trowel corrects sequencing errors in reads by utilizing high-quality k-mer information. Kairos aligns query sequences against multiple genomes in a population of a single species. Apollo characterizes genome-wide genetic variants, from point mutations to large structural variants, on top of the alignments of Kairos. Poseidon classifies metagenomics datasets to taxonomy identifiers. Though the work does not directly address any specific biological questions, it provides preliminary materials for further downstream analyses.
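    Trowel's correction strategy is only summarized above ("high-quality k-mer information"), so the following is a generic sketch of the k-mer-spectrum idea this family of tools shares: k-mers seen often are trusted, and a read is repaired by the substitution that turns an untrusted k-mer into a trusted one. The one-pass, single-fix policy and the count threshold are assumptions of this toy, not Trowel's actual method.

```python
from collections import Counter

def kmer_spectrum(reads, k):
    """Count every k-mer across all reads."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def correct_read(read, counts, k, min_count=2):
    """Find the first untrusted k-mer and try the single substitution that
    makes it trusted. One pass, one fix: a toy policy for illustration."""
    for i in range(len(read) - k + 1):
        if counts[read[i:i + k]] >= min_count:
            continue
        for j in range(i, i + k):            # every position in the k-mer
            for b in "ACGT":
                cand = read[:j] + b + read[j + 1:]
                if counts[cand[i:i + k]] >= min_count:
                    return cand
        return read                          # no trusted fix found
    return read

reads = ["ACGTACGT", "ACGTACGT", "ACGAACGT"]  # third read carries an error
counts = kmer_spectrum(reads, k=4)
print(correct_read(reads[2], counts, k=4))    # -> ACGTACGT
```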

    Altitudinal adaptations of earthworms

    Get PDF
    To date, few studies have looked into how earthworms have adapted or acclimatised to the harsh and dynamic environment of high altitude. In this work, I explore earthworms, terrestrial invertebrates, found at high altitude on the volcanic island of Pico in the Azores (Portugal) and at Les Deux Alpes in the French Alps. I first identify species presence along an altitudinal transect and compare species diversity and lineage, before investigating gene regulatory control and genomic adaptation between high and low altitude populations, to determine whether high altitude populations have acquired a genetic advantage over their low altitude cousins, or whether all worms can survive if given time to acclimatise. Altitudinal transects of two temperate-zone mountains, at Les Deux Alpes and Pico, were conducted to identify the presence and abundance of species. The two most abundant species, Lumbricus terrestris and Aporrectodea caliginosa, were investigated to identify diversity and species lineage, and to determine which species better supports adaptation and acclimatisation investigations that are not heavily influenced by deeply rooted species diversity. Having identified A. caliginosa in Pico, with its low population diversity, as the most suitable candidate for investigating adaptation and acclimatisation, a de novo genome assembly was developed and annotated. Live individuals of A. caliginosa from a high and a low altitude site on Pico were acclimatised to standard laboratory conditions for six months prior to a two-week experimental exposure to six simulated climatic conditions, with temperature and oxygen as variables. RNAseq was performed on RNA taken from a body transect (including muscular, nerve and gut tissues) of the exposed experimental worms, and differential gene expression between the high and low altitude populations was calculated and explored. Despite both populations normalising in identical soils for six months, high altitude individuals showed a lower gene expression response than the low altitude individuals, suggesting an element of epigenetic conditioning or adaptation allowing a more plastic response to the changes in conditions. In particular, HMGB1, a gene known for its roles in regulating environmental responses, had a comparatively lower expression in the high altitude population than in the low altitude population when exposed to simulated high altitude climatic stressors. SNP analysis of the transcriptomic sequences revealed that the high altitude individuals had SNPs in genes directly linked to this gene, indicating a level of adaptation through SNPs and acclimatisation through potential epigenetic priming within the high altitude population

    Localization in 1D non-parametric latent space models from pairwise affinities

    Full text link
    We consider the problem of estimating latent positions in a one-dimensional torus from pairwise affinities. The observed affinity between a pair of items is modeled as a noisy observation of a function $f(x^*_i, x^*_j)$ of the latent positions $x^*_i, x^*_j$ of the two items on the torus. The affinity function $f$ is unknown, and it is only assumed to fulfill some shape constraints ensuring that $f(x, y)$ is large when the distance between $x$ and $y$ is small, and vice versa. This non-parametric modeling offers good flexibility to fit data. We introduce an estimation procedure that provably localizes all the latent positions with a maximum error of the order of $\sqrt{\log(n)/n}$, with high probability. This rate is proven to be minimax optimal. A computationally efficient variant of the procedure is also analyzed under some more restrictive assumptions. Our general results can be instantiated to the problem of statistical seriation, leading to new bounds for the maximum error in the ordering
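    In display form, the stated guarantee can be written as follows, where $\hat{x}_1, \dots, \hat{x}_n$ denote the estimated positions; the unit-circumference normalization of the torus, the constant $C$, and the usual identifiability caveat (positions recoverable only up to translation and reflection) are standard conventions assumed here rather than taken from the abstract:

```latex
% Distance on the torus of circumference 1 (assumed normalization):
d(x, y) = \min\bigl( |x - y|,\; 1 - |x - y| \bigr), \qquad x, y \in [0, 1)

% High-probability localization guarantee, minimax optimal in order:
\max_{1 \le i \le n} d\bigl( \hat{x}_i, \, x^{*}_i \bigr) \;\le\; C \sqrt{\frac{\log(n)}{n}}
```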