
    Modeling Drug Resistance Emergence and Transmission in HIV-1 in the UK

    A deeper understanding of HIV-1 transmission and drug resistance mechanisms can lead to improvements in current treatment policies. However, the rates at which HIV-1 drug resistance mutations (DRMs) are acquired, and which transmitted DRMs persist, are multi-factorial and vary considerably between mutations. We develop a method for estimating drug resistance acquisition and transmission patterns. The method uses maximum likelihood ancestral character reconstruction informed by treatment roll-out dates and allows the analysis of very large datasets. We apply it to transmission trees reconstructed from the UK HIV Drug Resistance Database to make predictions for known DRMs. Our results show important differences between DRMs, in particular between polymorphic and non-polymorphic DRMs and between subtypes B and C. Our estimates of reversion times, based on a very large number of sequences, are consistent with, but more precise than, those already available in the literature, with narrower confidence intervals. We consistently find that large resistance clusters are associated with polymorphic DRMs and with DRMs that have long loss times, which require special surveillance. As in other high-income countries (e.g., Switzerland), the prevalence of sequences with DRMs is decreasing, but among these, the fraction of transmitted resistance is clearly increasing relative to the fraction of acquired resistance mutations. All this indicates that efforts to monitor these mutations and the emergence of resistance clusters in the population must be maintained in the long term.
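
    The acquisition-versus-transmission distinction can be illustrated with a toy sketch (not the authors' maximum likelihood method; the tree, node names, and reconstructed states below are invented): once ancestral resistance states have been assigned to the nodes of a transmission tree, acquisition and onward-transmission events can be counted along its branches.

```python
# Toy sketch: given a transmission tree whose nodes carry a reconstructed
# resistance state (0 = sensitive, 1 = resistant), count acquisition events
# (sensitive -> resistant along a branch) versus onward transmissions of an
# already-resistant virus (resistant -> resistant). All values are invented.

parent = {"B": "A", "C": "A", "D": "B", "E": "B"}  # child -> parent edges
state = {"A": 0, "B": 1, "C": 0, "D": 1, "E": 1}   # reconstructed DRM state

acquired = sum(1 for c, p in parent.items() if state[p] == 0 and state[c] == 1)
transmitted = sum(1 for c, p in parent.items() if state[p] == 1 and state[c] == 1)

print(acquired, transmitted)  # 1 acquisition (A -> B), 2 onward transmissions
```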

    Genomic epidemiology of community-associated Staphylococcus aureus in northern Australia and Papua New Guinea

    Eike Steinig investigated the transmission dynamics of community-associated MRSA in Far North Queensland and Papua New Guinea. He developed bioinformatics methods based on portable nanopore sequencing technology to infer epidemiological outbreak parameters and clinical genotypes of bacterial pathogens, with applications in rural and remote hospitals that lack sequencing infrastructure.

    Privacy Preserving Data Mining

    Data mining techniques provide benefits in many areas, such as medicine, sports, marketing, signal processing, and data and network security. However, although data mining techniques are used in security applications such as intrusion detection, biometric authentication, and fraud and malware classification, privacy has become a serious concern, especially in data mining applications that involve the collection and sharing of personal data. For these reasons, the problem of protecting privacy in the context of data mining differs from traditional data privacy protection, as data mining can act as both friend and foe. The chapter covers previously developed privacy-preserving data mining techniques in two parts: (i) techniques proposed for the input data that will be subject to data mining, and (ii) techniques proposed for the processed data (the output of the data mining algorithms). It also presents attacks against the privacy of data mining applications. The chapter concludes with a discussion of next-generation privacy-preserving data mining applications at both the individual and organizational levels.
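
    As a concrete illustration of the first class of techniques (perturbing the input data before mining), here is a minimal sketch of classic randomized response; the technique is chosen for illustration, and the truth rate and truth-telling probability below are invented, not taken from the chapter.

```python
# Randomized response sketch: each record reports its true binary attribute
# with probability p and a fair coin flip otherwise, so any single reported
# value is deniable, while the population proportion remains estimable.
import random

random.seed(0)
p = 0.75  # probability of answering truthfully (illustrative choice)
truth = [random.random() < 0.3 for _ in range(100_000)]  # hidden true values

def respond(t: bool) -> bool:
    """Report the truth with probability p, otherwise a fair coin flip."""
    return t if random.random() < p else random.random() < 0.5

reported = [respond(t) for t in truth]

# Unbias the estimate: E[reported] = p*pi + (1-p)*0.5, so
# pi = (mean - (1-p)/2) / p
mean = sum(reported) / len(reported)
pi_hat = (mean - (1 - p) / 2) / p
print(round(pi_hat, 2))  # close to the true rate of 0.3
```

The miner never sees the raw attribute, yet aggregate statistics survive; this trade-off between individual deniability and population-level utility is the common thread of the input-perturbation family.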

    Design and Implementation of Hardware Accelerators Using Multilevel Parallelism and Application-Oriented Data Layouts

    Degree type: doctorate by coursework. Dissertation committee: (chair) Prof. 稲葉 雅幸, Prof. 須田 礼仁, Prof. 五十嵐 健夫, Prof. 山西 健司, Assoc. Prof. 稲葉 真理, and Lecturer 中山 英樹, all of the University of Tokyo.

    Generalized and efficient outlier detection for spatial, temporal, and high-dimensional data mining

    Knowledge Discovery in Databases (KDD) is the process of extracting non-trivial patterns from large databases, with the goal that these patterns be novel, potentially useful, statistically valid, and understandable. The process involves multiple phases, including selection, preprocessing, evaluation, and the analysis step known as data mining. One of the key tasks in data mining is outlier detection: the identification of observations that are unusual and seemingly inconsistent with the majority of the data. Such rare observations can have various causes: measurement errors, unusually extreme (but genuine) measurements, corrupted data, or even manipulated data. Over recent years, numerous outlier detection algorithms have been proposed that often appear to differ only slightly from one another, yet are presented as "clearly outperforming" the others in the experiments.
A key focus of this thesis is to unify and modularize the various approaches into a common formalism. This makes the analysis of the actual differences easier and at the same time increases the flexibility of the approaches, because modules can be added or replaced to adapt the methods to different requirements and data types. To show the benefits of the modularized structure, (i) several existing algorithms are formalized within the new framework; (ii) new modules are added that improve the robustness, efficiency, statistical validity, and usability of the scores, and that can be combined with existing methods; (iii) modules are modified to allow existing and new algorithms to run on other, often more complex data types, including spatial, temporal, and high-dimensional data; (iv) the combination of multiple algorithm instances into an ensemble method is discussed; and (v) the scalability to large data sets is improved using approximate as well as exact indexing. The starting point is the Local Outlier Factor (LOF) algorithm, which is first extended with slight modifications to increase the robustness and usability of the produced scores. To obtain the same benefits for other methods, these modifications are then abstracted into a general framework for local outlier detection. By abstracting from a single vector space to general data types, spatial and temporal relationships can also be analyzed. The use of subspace and correlation neighborhoods then allows the algorithms to detect new kinds of outliers in arbitrarily oriented projections. Improvements in score normalization restore the statistical intuition of probabilities to the outlier scores, which previously were only useful for ranking objects, while improved models also offer explanations of why an object was considered an outlier.
Subsequently, improved modules are presented that, for example, allow the same algorithms to run on significantly larger data sets -- in approximately linear instead of quadratic time -- by accepting approximate neighborhoods at little loss in precision and effectiveness. Additionally, multiple algorithms with different intuitions can be run at the same time and their results combined into an ensemble method that is able to detect outliers of different types. Finally, new outlier detection methods are constructed, customized to the specific problems of real data sets. These new methods yield insightful results that could not be obtained with the existing methods, and since they are built from the same building blocks, there is a strong and explicit connection to the previous approaches. By using the indexing strategies introduced earlier, the algorithms can be executed efficiently even on large data sets.
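
The LOF algorithm that serves as the thesis's starting point can be sketched compactly. This is a from-scratch illustration on an invented toy dataset, not the thesis's implementation; mature library versions (e.g. in ELKI or scikit-learn) are preferable in practice.

```python
# Minimal Local Outlier Factor (LOF) sketch: a point's score is the mean ratio
# of its neighbours' local reachability density to its own, so isolated points
# (much sparser than their neighbours) score well above 1.
import math

def lof_scores(points, k=2):
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    # k-nearest neighbours of each point (index 0 is the point itself)
    knn = [sorted(range(n), key=lambda j: d[i][j])[1:k + 1] for i in range(n)]
    kdist = [d[i][knn[i][-1]] for i in range(n)]  # distance to the k-th NN

    def lrd(i):
        # local reachability density: inverse mean reachability distance
        reach = [max(kdist[j], d[i][j]) for j in knn[i]]
        return k / sum(reach)

    dens = [lrd(i) for i in range(n)]
    return [sum(dens[j] for j in knn[i]) / (k * dens[i]) for i in range(n)]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (6, 6)]  # last point is isolated
scores = lof_scores(pts)
print(max(scores) == scores[-1])  # the isolated point gets the largest LOF
```

The quadratic distance matrix here is exactly the bottleneck the thesis attacks with approximate and exact indexing, which brings the runtime down to approximately linear.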

    Advances in knowledge discovery and data mining Part II

    19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II

    Phylogenomic and population genomic insights on the evolutionary history of coffee leaf rust within the rust fungi

    Doctoral thesis, Biology and Ecology of Global Change (Genome Biology and Evolution), Universidade de Lisboa, Faculdade de Ciências, 2018.
    Fungi are currently responsible for more than 30% of emerging diseases worldwide, and rust fungi (Pucciniales, Basidiomycota) are one of the most destructive groups of plant pathogens. In this thesis, two genomic approaches were pursued to further our knowledge of these pathogenic fungi: at the macro-evolutionary level, using phylogenomics, and at the micro-evolutionary level, using population genomics. At the macro-evolutionary level, a phylogenomics pipeline was developed to investigate the role of positive selection in the origin of the rusts, particularly in relation to their obligate biotrophic lifestyle and pathogenicity. With up to 30% of the ca. 1000 screened genes showing a signal of positive selection, the results revealed a pervasive role of natural selection in the origin of this fungal group, with an enrichment of functional classes involved in nutrient uptake and secondary metabolites. Furthermore, positive selection was detected at conserved amino acid sites, revealing an unexpected but potentially important role of natural selection in codon usage preferences. At the micro-evolutionary level, the focus shifted to the coffee rust, Hemileia vastatrix, the causal agent of coffee leaf rust disease and the main threat to Arabica coffee production worldwide. Using RAD sequencing to produce thousands of informative SNPs for a broad and unique sampling of this species, the aim was to investigate its evolutionary history and translate population genomic insights into recommendations for disease control. The results overturned most preconceptions about the pathogen by revealing that, instead of a single large unstructured population, H. vastatrix is most likely a complex of cryptic species with marked host specialization.
Moreover, genomic signatures of hybridization and introgression between these lineages were uncovered, raising the possibility that virulence factors may be exchanged quickly. The most recent "domesticated" lineage infects exclusively the most important coffee species, and SNP linkage analyses revealed recombination among isolates that were previously thought to be clonal. Altogether, these results considerably raise the evolutionary potential of this pathogen to overcome disease control measures in coffee crops. To undertake most of the tasks in this project, a new computational application called TriFusion was developed to streamline the gathering, processing, and visualization of large genomic datasets.

    Authenticated Outlier Mining for Outsourced Databases

    The Data-Mining-as-a-Service (DMaS) paradigm is becoming a focus of research, as it allows a data owner (the client) who lacks expertise and/or computational resources to outsource its data and mining needs to a third-party service provider (the server). Outsourcing, however, raises issues of result integrity: how can the client verify that the mining results returned by the server are both sound and complete? In this paper, we focus on outlier mining, an important mining task. Previous verification techniques use an authenticated data structure (ADS) for correctness authentication, which may incur considerable space and communication cost. We propose a novel solution that returns a probabilistic result-integrity guarantee at a much cheaper verification cost. The key idea is to insert a set of artificial records (ARs) into the dataset, from which a set of artificial outliers (AOs) and artificial non-outliers (ANOs) is constructed. The AOs and ANOs are used by the client to detect any incomplete and/or incorrect mining results with a probabilistic guarantee. The main challenge we address is how to construct ARs so that they do not change the (non-)outlierness of the original records, while guaranteeing that the client can identify AOs and ANOs without executing the mining itself. Furthermore, we build a strategic game and show that a Nash equilibrium exists only when the server returns correct outliers. Our implementation and experiments demonstrate that our verification solution is efficient and lightweight.
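
    The client-side check can be sketched at a high level (a simplified illustration with invented record IDs; the paper's actual contribution, constructing ARs without disturbing the outlierness of real records, is not reproduced here): the client knows by construction which planted records are outliers and which are not, and compares that knowledge against the server's answer.

```python
# Sketch of the verification step: the result is complete only if every
# artificial outlier (AO) appears in the server's answer, and sound only if
# no artificial non-outlier (ANO) does. Record IDs below are hypothetical.
def verify(server_outliers, artificial_outliers, artificial_non_outliers):
    sound = not (set(server_outliers) & set(artificial_non_outliers))
    complete = set(artificial_outliers) <= set(server_outliers)
    return sound and complete

aos, anos = {"ao1", "ao2"}, {"ano1"}
honest = verify({"r7", "ao1", "ao2"}, aos, anos)   # all AOs returned -> True
cheating = verify({"r7", "ao1"}, aos, anos)        # dropped an AO -> False
print(honest, cheating)
```

Because a lazy or malicious server cannot distinguish the planted records from real ones, omitting or corrupting results risks tripping this check, which is what yields the paper's probabilistic guarantee.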