6 research outputs found

    AIRR Data Under the EU Trade Secrets Directive: Aligning Scientific Practices with Commercial Realities

    Whether the E.U. Trade Secrets Directive sufficiently and appropriately covers cutting-edge complex technologies is of critical interest to policy-makers, scientists, and commercial developers alike. One such technology, adaptive immune receptor repertoire sequencing (AIRR-seq), raises difficult questions concerning what information is and should be protected under the new Directive, and how best to align scientific practices with commercial realities. The 'raw' form of AIRR-seq data, massive genetic datasets of hundreds of millions of individuals' immune cells, tends to be freely shared among academic researchers, typically destroying the protectability of the underlying information. But follow-on data, essentially information interpreting that raw data, is nonetheless protectable under the Directive because it is both economically valuable and not readily available from an examination of the raw data itself. Protecting this follow-on information while encouraging the free sharing of AIRR-seq data best accords with the purpose of the Trade Secrets Directive. Lessons from the case of AIRR-seq data also shed light on other puzzles concerning the tension between disclosure and various forms of legal protection, such as the mutual exclusivity of patents and trade secrets, the sharing of clinical trial data, and the protection of genetic diagnostics.

    A Computational Framework for Host-Pathogen Protein-Protein Interactions

    Infectious diseases cause millions of illnesses and deaths every year and raise great health concerns worldwide. How to monitor and cure infectious diseases has become a prevalent and intractable problem. Since host-pathogen interactions are considered the key infection processes of infectious diseases at the molecular level, a large body of research has focused on them to understand infection mechanisms and to develop novel therapeutic solutions. For years, the continuous development of biological technologies has benefitted wet-lab experiments, from small-scale biochemical, biophysical, and genetic experiments to large-scale methods such as yeast two-hybrid analysis and cryogenic electron microscopy. As a result of the past decades of effort, biological data have accumulated explosively, including multi-omics data such as genomics and proteomics data. Chapter 2 therefore reviews omics data, demonstrating recent developments in 'omics' studies with a particular focus on proteomics and genomics. With high-throughput technologies, the amount of 'omics' data, including genomics and proteomics, has grown even further, and the resulting surge of interest in data analytics within bioinformatics comes as no surprise to researchers from a variety of disciplines. Specifically, the astonishing rate at which genomics and proteomics data are generated leads researchers into the realm of 'Big Data' research. Chapter 2 thus provides an update on the omics background and the state-of-the-art developments in the omics area, with a focus on genomics data, from the perspective of big data analytics.

    A novel compression approach for mapped high-throughput sequencing data set

    A major challenge of current high-throughput sequencing (HTS) experiments is not only the generation of the sequencing data itself but also their processing, storage, and transmission. The enormous size of these data motivates the development of data compression algorithms usable for the implementation of the various storage policies that are applied to the produced intermediate and final result files. This thesis gives a brief introduction to the field of high-throughput nucleic acid sequencing and to current approaches for the compression of the data resulting from such experiments. In the main part of the thesis, NGC, a tool for the compression of mapped read data stored in the SAM format (one kind of HTS data), is presented. NGC enables lossless and lossy compression and introduces two novel ideas: first, a way to reduce the number of required code words by exploiting common features of the sequenced reads mapped to the same genomic positions; second, a highly configurable way to quantize per-base quality values that takes their influence on downstream analyses into account. NGC, evaluated with several real-world data sets, saves 33-66% of disk space using lossless and up to 98% of disk space using lossy compression. By applying two popular variant and genotype prediction tools to the decompressed data, we show that the lossy compression modes preserve over 99% of all called variants while outperforming comparable methods in some configurations.
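    To make the lossy mode's quality-value quantization concrete, the sketch below bins per-base Phred scores from a SAM QUAL field into a small set of representative levels, shrinking the symbol alphabet a downstream entropy coder has to encode. This is a minimal illustration in Python under assumed bin boundaries (QUALITY_BINS is hypothetical), not NGC's actual implementation.

    # Illustrative sketch of per-base quality-value quantization (not the NGC code).
    # Phred qualities are mapped to a few representative levels, which reduces the
    # number of distinct symbols (and hence the entropy) of the quality stream.

    # Hypothetical bins: (inclusive upper bound, representative quality)
    QUALITY_BINS = [(1, 1), (9, 6), (19, 15), (24, 22), (29, 27), (34, 33), (40, 37)]

    def quantize_quality(q: int) -> int:
        """Map a single Phred quality score to its bin's representative value."""
        for upper, representative in QUALITY_BINS:
            if q <= upper:
                return representative
        return QUALITY_BINS[-1][1]  # clamp scores above the last bin

    def quantize_sam_qual(qual_field: str) -> str:
        """Quantize the QUAL column of a SAM record (ASCII, Phred+33 encoding)."""
        return "".join(chr(quantize_quality(ord(c) - 33) + 33) for c in qual_field)

    if __name__ == "__main__":
        original = "IIIIHHGF@@@###"        # made-up quality string
        quantized = quantize_sam_qual(original)
        print(original, "->", quantized)   # fewer distinct symbols -> better compression

    How the bins are chosen determines both the compression gain and the effect on downstream variant calling, which is why NGC exposes the quantization of quality values as a configurable step.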