57 research outputs found

    Classifying distinct data types: textual streams, protein sequences and genomic variants

    Artificial Intelligence (AI) is an interdisciplinary field that combines different research areas with the end goal of automating processes in everyday life and industry. The fundamental components of AI applications are an “intelligent” model and a functional component defined by the end application. An intelligent model can be a statistical model that recognizes patterns in data instances in order to distinguish between those instances. For example, if AI is applied in car manufacturing, the model can decide from an image of a car part whether the part is in the front, middle or rear compartment of the car, as a human brain would. In the same application, the statistical model informs a mechanical arm, the functional component, of the current car compartment, and the arm in turn assembles this compartment based on predefined instructions, much as a human hand follows neural signals from the brain. A crucial step of AI applications is the classification of input instances by the intelligent model. The classification step in the intelligent model pipeline allows the subsequent steps to act in a similar fashion for instances belonging to the same category. We define classification as the module of the intelligent model that categorizes the input instances based on patterns that are either predefined by human experts or derived from data. Irrespective of the method used to find patterns in data, classification is composed of four distinct steps: (i) input representation, (ii) model building, (iii) model prediction and (iv) model assessment. Based on these classification steps, we argue that applying classification to distinct data types poses different challenges. In this thesis, I focus on challenges for three distinct classification scenarios: (i) Textual Streams: how can the model building step, commonly designed for a static data distribution, be advanced to classify textual posts with a transient data distribution?
    (ii) Protein Prediction: which biologically meaningful information can be used in the input representation step to overcome the challenge of limited training data? (iii) Human Variant Pathogenicity Prediction: how can a classification system for the functional impact of human variants be developed that provides standardized and well-accepted evidence for the classification outcome and thus enables the model assessment step? To answer these research questions, I present my contributions to classifying these different types of data. temporalMNB: I adapt the paradigm of sequential prediction with expert advice to optimally aggregate complementary distributions, so that a Naive Bayes model can adapt to the drifting distribution of the characteristics of textual posts. dom2vec: our proposal to learn embedding vectors for protein domains using self-supervision. Based on the high performance achieved by the dom2vec embeddings in a quantitative intrinsic assessment of the captured biological information, I provide example evidence for an analogy between the local linguistic features in natural languages and the domain structure and function information in domain architectures. Last, I describe the GenOtoScope bioinformatics software tool, which automates standardized evidence-based criteria for the pathogenicity impact of variants associated with hearing loss. Finally, to increase the practical use of this last contribution, I develop easy-to-use software interfaces for use by clinical diagnostics personnel in research settings.
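    The abstract names sequential prediction with expert advice as the basis for temporalMNB but gives no algorithmic detail. The following is a minimal sketch of one standard instance of that paradigm, the exponentially weighted (Hedge-style) forecaster, and not the thesis's actual method. The two "experts", their class probabilities, the learning rate `eta`, and the toy stream are all assumptions made for illustration; in a text-stream setting each expert could be a Naive Bayes model fitted on a different time window.

```python
import math


def hedge_aggregate(expert_probs, weights):
    """Return the weight-averaged class distribution of all experts."""
    total = sum(weights)
    n_classes = len(expert_probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, expert_probs)) / total
        for c in range(n_classes)
    ]


def hedge_update(expert_probs, weights, true_class, eta=1.0):
    """Scale each weight by exp(-eta * log-loss on the revealed label)."""
    eps = 1e-12  # guards against log(0)
    updated = []
    for w, p in zip(weights, expert_probs):
        loss = -math.log(max(p[true_class], eps))
        updated.append(w * math.exp(-eta * loss))
    return updated


# Hypothetical stream: (per-expert class probabilities, true label).
# Expert 0 suits the early distribution, expert 1 the drifted one.
stream = [
    ([[0.9, 0.1], [0.5, 0.5]], 0),
    ([[0.9, 0.1], [0.5, 0.5]], 0),
    ([[0.9, 0.1], [0.2, 0.8]], 1),
    ([[0.9, 0.1], [0.2, 0.8]], 1),
]

weights = [1.0, 1.0]
for expert_probs, label in stream:
    mixture = hedge_aggregate(expert_probs, weights)  # predict first
    weights = hedge_update(expert_probs, weights, label)  # then learn
```

    After the simulated drift, the weight shifts toward the expert that matches the new distribution, so the mixture follows the stream without retraining either expert.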

    Capturing protein domain structure and function using self-supervision on domain architectures

    Predicting biological properties of unseen proteins has been shown to improve with the use of protein sequence embeddings. However, these sequence embeddings have the caveat that biological metadata do not exist for each amino acid, so the quality of each learned embedding vector cannot be measured separately. Therefore, current sequence embeddings cannot be intrinsically evaluated in a quantitative manner on the degree of biological information they capture. We address this drawback with our approach, dom2vec, which learns vector representations for protein domains rather than for each amino acid, as biological metadata do exist for each domain separately. To perform a reliable quantitative intrinsic evaluation in terms of biological knowledge, we selected the metadata related to the most distinctive biological characteristics of a domain, which are its structure and its enzymatic and molecular function. Notably, dom2vec obtains an adequate level of performance in the intrinsic assessment; therefore, we can draw an analogy between the local linguistic features in natural languages and the domain structure and function information in domain architectures. Moreover, we demonstrate the applicability of dom2vec to protein prediction tasks by comparing it with state-of-the-art sequence embeddings in three downstream tasks. We show that dom2vec outperforms sequence embeddings for toxin and enzymatic function prediction and is comparable with sequence embeddings in cellular location prediction. © 2021 by the authors. Licensee MDPI, Basel, Switzerland.
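    The analogy with natural language can be made concrete by treating each domain architecture as a sentence and each domain as a word. The sketch below builds skip-gram style (target, context) pairs, the kind of self-supervised signal a word2vec-like model trains on. The abstract does not state the training objective, so the skip-gram setup, the `window` parameter, and the domain labels `DOM_A` to `DOM_D` are assumptions for illustration only.

```python
from collections import Counter


def skipgram_pairs(architecture, window=2):
    """List (target, context) pairs for domains within `window` positions."""
    pairs = []
    for i, target in enumerate(architecture):
        start = max(0, i - window)
        stop = min(len(architecture), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((target, architecture[j]))
    return pairs


# Hypothetical domain architectures: ordered domain labels per protein.
architectures = [
    ["DOM_A", "DOM_B", "DOM_C"],
    ["DOM_A", "DOM_B"],
    ["DOM_C", "DOM_D"],
]

# Co-occurrence counts over all proteins, using a window of one domain.
pair_counts = Counter(
    pair
    for arch in architectures
    for pair in skipgram_pairs(arch, window=1)
)
```

    Domains that repeatedly appear next to each other, such as `DOM_A` and `DOM_B` here, produce more training pairs, which is what would pull their learned vectors together in an embedding model.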

    Impact of system factors on the water saving efficiency of household grey water recycling

    Copyright © 2010 Taylor & Francis. This is an Author's Accepted Manuscript of an article published in Desalination and Water Treatment Volume 24, Issue 1-3 (2010), available online at: http://www.tandfonline.com/10.5004/dwt.2010.1542
    A general concern when considering the implementation of domestic grey water recycling is to understand the impacts of system factors on water saving efficiency. Key factors include household occupancy, storage volumes, treatment capacity and operating mode. Earlier investigations of the impacts of these key factors were based on a one-tank system only. This paper presents the results of an investigation into the effect of these factors on the performance of a more realistic ‘two tank’ system with treatment, using an object-based household water cycle model. A Monte-Carlo simulation technique was adopted to generate domestic water appliance usage data, which allow long-term predictions of the system's performance to be made. Model results reveal the constraints that treatment capacity, storage tank sizes and operating mode place on the percentage of potable water saved. A treatment capacity threshold has been discovered at which water saving efficiency is maximised for a given pair of grey and treated grey water tanks. Results from the analysis suggest that the previous one-tank model significantly underestimates the tank volumes required for a given target water saving efficiency.
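    To illustrate how a two-tank system with Monte-Carlo generated usage can be evaluated, here is a deliberately simplified daily mass-balance sketch. It is not the paper's object-based model: the uniform per-person volume ranges, the daily time step, the order of operations, and the definition of saving as recycled volume over total demand are all assumptions chosen for illustration.

```python
import random


def simulate_two_tank(days, occupants, grey_tank_l, treated_tank_l,
                      treatment_l_per_day, seed=0):
    """Return the fraction of household demand met by treated grey water."""
    rng = random.Random(seed)
    grey = 0.0      # litres held in the raw grey water tank
    treated = 0.0   # litres held in the treated grey water tank
    total_demand = 0.0
    recycled_used = 0.0
    for _ in range(days):
        # Assumed daily volumes per person in litres (hypothetical ranges).
        grey_in = sum(rng.uniform(40.0, 80.0) for _ in range(occupants))
        toilet = sum(rng.uniform(25.0, 45.0) for _ in range(occupants))
        # Grey water enters the first tank; any excess overflows to sewer.
        grey = min(grey_tank_l, grey + grey_in)
        # Treatment is limited by capacity, supply and free treated storage.
        moved = min(treatment_l_per_day, grey, treated_tank_l - treated)
        grey -= moved
        treated += moved
        # Toilet flushing draws treated water first, mains water otherwise.
        used = min(toilet, treated)
        treated -= used
        recycled_used += used
        total_demand += grey_in + toilet
    return recycled_used / total_demand


saving = simulate_two_tank(days=365, occupants=3, grey_tank_l=200.0,
                           treated_tank_l=200.0, treatment_l_per_day=150.0)
```

    Sweeping `treatment_l_per_day` for fixed tank sizes in a model of this kind is one way to look for the capacity threshold the abstract describes, beyond which additional treatment capacity no longer raises the saving.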

    Clinical development of new drug-radiotherapy combinations.

    In countries with the best cancer outcomes, approximately 60% of patients receive radiotherapy as part of their treatment, which is one of the most cost-effective cancer treatments. Notably, around 40% of cancer cures include the use of radiotherapy, either as a single modality or combined with other treatments. Radiotherapy can provide enormous benefit to patients with cancer. In the past decade, significant technical advances, such as image-guided radiotherapy, intensity-modulated radiotherapy, stereotactic radiotherapy, and proton therapy, have enabled higher doses of radiotherapy to be delivered to the tumour with significantly lower doses to normal surrounding tissues. However, apart from the combination of traditional cytotoxic chemotherapy with radiotherapy, little progress has been made in identifying and defining optimal targeted therapy and radiotherapy combinations to improve the efficacy of cancer treatment. The National Cancer Research Institute Clinical and Translational Radiotherapy Research Working Group (CTRad) formed a Joint Working Group with representatives from academia, industry, patient groups and regulatory bodies to address this lack of progress and to publish recommendations for future clinical research. Herein, we highlight the Working Group's consensus recommendations to increase the number of novel drugs being successfully registered in combination with radiotherapy to improve clinical outcomes for patients with cancer.
    National Institute for Health Research
    This is the final version of the article. It first appeared from Nature Publishing Group via http://dx.doi.org/10.1038/nrclinonc.2016.7

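    Beitraege zur Chemie der Tetraboran(10) transannular verbrueckte (μ2)2-Dimercaptotetraboran(10)- und Di-(μ2-Mercapto)Tetraboran(10)-Derivate [Contributions to the chemistry of tetraborane(10): transannularly bridged (μ2)2-dimercaptotetraborane(10) and di-(μ2-mercapto)tetraborane(10) derivatives]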

    SIGLE. Available from TIB Hannover: DW 3565 / FIZ - Fachinformationszentrum Karlsruhe / TIB - Technische Informationsbibliothek. DE. German.

    Use of a microbial sensor: inhibition effect of azo-reactive dyes on activated sludge

    Bioprocess performance, transformation pathway, and bacterial community dynamics in an immobilized cell bioreactor treating fludioxonil-contaminated wastewater under microaerophilic conditions

    Fludioxonil is a post-harvest fungicide contained in effluents produced by fruit packaging plants, which should be treated prior to environmental dispersal. We developed and evaluated an immobilized cell bioreactor, operated under microaerophilic conditions with hydraulic retention times (HRTs) gradually reduced from 10 to 3.9 days, for the biotreatment of fludioxonil-rich wastewater. Fludioxonil removal efficiency was consistently above 96%, even at the shortest HRT applied. A total of 12 transformation products were tentatively identified during fludioxonil degradation by using liquid chromatography coupled to quadrupole time-of-flight mass spectrometry (LC-QTOF-MS). The fludioxonil degradation pathway was initiated by successive hydroxylation and carbonylation of the pyrrole moiety and disruption of the oxidized cyanopyrrole ring at the NH-C bond. The detection of 2,2-difluoro-2H-1,3-benzodioxole-4-carboxylic acid verified the decyanation and deamination of the molecule, whereas its conversion to the tentatively identified compound 2,3-dihydroxybenzoic acid indicated its defluorination. High-throughput amplicon sequencing revealed that HRT shortening led to reduced α-diversity, significant changes in β-diversity, and a shift in the bacterial community composition from a community typical of the initial activated sludge system to one composed of bacterial taxa such as Clostridium, Oligotropha, Pseudomonas, and Terrimonas, which are capable of performing advanced degradation and/or aerobic denitrification. Overall, operating the immobilized cell bioreactor under microaerophilic conditions, which minimizes the cost of aeration, can provide a sustainable solution for the depuration of fludioxonil-contaminated agro-industrial effluents. © 2021, The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature
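    The two operating figures in this abstract follow from standard definitions: hydraulic retention time is reactor working volume divided by volumetric flow rate, and removal efficiency is the relative drop in concentration between influent and effluent. The short sketch below works through both. The reactor volume and the concentrations are hypothetical values picked for the arithmetic; the abstract reports only the HRT range and the removal threshold.

```python
def hydraulic_retention_time_days(volume_l, flow_l_per_day):
    """HRT = working volume / volumetric flow rate."""
    return volume_l / flow_l_per_day


def removal_efficiency_percent(c_in, c_out):
    """Percentage of the influent concentration removed by the reactor."""
    return 100.0 * (c_in - c_out) / c_in


volume_l = 10.0                   # hypothetical working volume
flow_at_start = volume_l / 10.0   # flow giving the 10-day HRT
flow_at_end = volume_l / 3.9      # flow giving the 3.9-day HRT

# Shortening the HRT from 10 to 3.9 days raises the flow about 2.56-fold.
flow_increase = flow_at_end / flow_at_start

# Hypothetical influent and effluent concentrations in mg/L.
efficiency = removal_efficiency_percent(c_in=50.0, c_out=2.0)
```

    With these assumed concentrations the efficiency comes out at 96%, the level the abstract reports as the lower bound across all HRTs.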