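Diagnóstico no invasivo de patologías humanas combinando análisis de aliento y modelización con redes neuronales [Non-invasive diagnosis of human pathologies combining breath analysis and neural network modeling]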
Unpublished thesis of the Universidad Complutense de Madrid, Facultad de Ciencias Químicas, defended on 09-09-2016.
It is well established that the moment at which a disease is detected or diagnosed directly affects its consequences for the patient, as early detection is generally linked to a more favorable outcome. This concept is the basis of the present research, whose main goal is the development of mathematical tools based on computational artificial intelligence to detect multiple diseases safely and non-invasively. To develop these tools, this research has focused on the breath analysis of patients with diverse diseases, using several analytical methodologies to extract the information contained in these samples, and multiple feature selection algorithms and neural networks for data analysis. Previous work has shown a correlation between the molecular composition of breath and the clinical status of a person, pointing to the existence of volatile biomarkers whose presence or amount can aid disease detection. During this research, two main types of analytical approaches were employed to study the gaseous samples: cross-reactive sensor arrays (based on silicon nanowire field-effect transistors (SiNW FETs) or gold nanoparticles (GNPs), both functionalized with organic chains) and proton transfer reaction-mass spectrometry (PTR-MS). The cross-reactive sensors analyze the bulk of the breath samples, offering global, fingerprint-like information, whereas PTR-MS quantifies the volatile molecules present in the samples. All of the analytical equipment employed generates large amounts of data per sample, so a meticulous mathematical analysis is needed to interpret the results adequately. In this work, two fundamental types of mathematical tools were used.
First, a set of five filter-based feature selection algorithms (χ2 (chi2) score, Fisher’s discriminant ratio, Kruskal-Wallis test, Relief-F algorithm, and information gain test) was employed to reduce the independent variables in the large databases to those with the greatest discriminative power for the subsequent modeling task. Second, for the mathematical modeling itself, artificial neural networks (ANNs), algorithms categorized as computational artificial intelligence, were employed. These non-linear tools were used to find the relations between the independent variables of a system and the dependent ones in order to perform estimations or classifications. The type of ANN used in this thesis is the one most commonly employed in research, the supervised multilayer perceptron (MLP), owing to its proven ability to create reliable models for many different applications...
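As an illustration of one of the filter-based selectors named in this abstract, the sketch below ranks features by a two-class Fisher discriminant ratio, (mean_a - mean_b)^2 / (var_a + var_b). It is a minimal sketch under stated assumptions, not the thesis's implementation: the function names, the use of population variance, and the toy data are all hypothetical.

```python
from statistics import mean, pvariance


def fisher_ratio(values_a, values_b):
    """Two-class Fisher discriminant ratio for a single feature."""
    spread = pvariance(values_a) + pvariance(values_b)
    gap = mean(values_a) - mean(values_b)
    if spread == 0:
        # Degenerate case: no within-class variance at all.
        return float("inf") if gap != 0 else 0.0
    return gap ** 2 / spread


def rank_features(samples, labels):
    """Return feature indices ordered from most to least discriminative.

    samples: list of equal-length feature vectors
    labels:  list of 0/1 class labels, one per sample
    """
    n_features = len(samples[0])
    scores = []
    for j in range(n_features):
        class0 = [row[j] for row, y in zip(samples, labels) if y == 0]
        class1 = [row[j] for row, y in zip(samples, labels) if y == 1]
        scores.append(fisher_ratio(class0, class1))
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)


# Hypothetical toy data: feature 0 separates the classes, feature 1 does not.
samples = [
    [1.0, 5.0],
    [1.2, 3.0],
    [0.9, 4.0],
    [3.0, 4.5],
    [3.1, 3.5],
    [2.9, 4.0],
]
labels = [0, 0, 0, 1, 1, 1]
ranking = rank_features(samples, labels)
print(ranking)
```

In a filter approach such as this one, only the top-ranked features would be passed on to the model (here, an MLP), which keeps the selection step independent of the classifier.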
A Study on Deep Learning for Bioinformatics
Bioinformatics, an interdisciplinary area of biology and computer science, handles large and complex data sets with linear and non-linear relationships between attributes. Deep learning has become increasingly important for handling such relationships. This paper analyses different deep learning architectures and their applications in bioinformatics. The paper also addresses the limitations and challenges of deep learning.
Machine Learning Prediction of COVID-19 Severity Levels From Salivaomics Data
The clinical spectrum of severe acute respiratory syndrome coronavirus 2
(SARS-CoV-2), the strain of coronavirus that caused the COVID-19 pandemic, is
broad, extending from asymptomatic infection to severe immunopulmonary
reactions that, if not categorized properly, may be life-threatening.
Researchers rate COVID-19 patients on a scale from 1 to 8 according to the
severity level of COVID-19, 1 being healthy and 8 being extremely sick, based
on a multitude of factors including number of clinic visits, days since the
first sign of symptoms, and more. However, there are two issues with the
current state of severity level designation. Firstly, there exists variation
among researchers in determining these patient scores, which may lead to
improper treatment. Secondly, researchers use a variety of metrics to determine
patient severity level, including metrics involving plasma collection that
require invasive procedures. This project aims to remedy both issues by
introducing a machine learning framework that unifies severity level
designations based on noninvasive saliva biomarkers. Our results show that we
can successfully use machine learning on salivaomics data to predict the
severity level of COVID-19 patients, indicating the presence of viral load
using saliva biomarkers
Machine learning application in personalised lung cancer recurrence and survivability prediction
Machine learning is an important artificial intelligence technique that is widely applied in cancer diagnosis and detection. More recently, with the rise of personalised and precision medicine, there is a growing trend towards machine learning applications for prognosis prediction. However, to date, building reliable prediction models of cancer outcomes in everyday clinical practice is still a hurdle. In this work, we integrate genomic, clinical and demographic data of lung adenocarcinoma (LUAD) and squamous cell carcinoma (LUSC) patients from The Cancer Genome Atlas (TCGA) and introduce copy number variation (CNV) and mutation information of 15 selected genes to generate predictive models for recurrence and survivability. We compare the accuracy and benefits of three well-established machine learning algorithms: decision tree methods, neural networks and support vector machines. Although the accuracy of predictive models using the decision tree method has no significant advantage, the tree models reveal the most important predictors among genomic information (e.g. KRAS, EGFR, TP53), clinical status (e.g. TNM stage and radiotherapy) and demographics (e.g. age and gender) and how they influence the prediction of recurrence and survivability for both early stage LUAD and LUSC. The machine learning models have the potential to help clinicians to make personalised decisions on aspects such as follow-up timeline and to assist with personalised planning of future social care needs
Automated Classification of Benign and Malignant Proliferative Breast Lesions
Misclassification of breast lesions can result in either cancer progression or unnecessary chemotherapy. Automated classification tools are seen as promising second opinion providers in reducing such errors. We have developed predictive algorithms that automate the categorization of breast lesions as either benign usual ductal hyperplasia (UDH) or malignant ductal carcinoma in situ (DCIS). From diagnosed breast biopsy images from two hospitals, we obtained 392 biomarkers using Dong et al.’s (2014) computational tools for nuclei identification and feature extraction. We implemented six machine learning models and enhanced them by reducing prediction variance, extracting active features, and combining multiple algorithms. We used the area under the curve (AUC) of the receiver operating characteristic (ROC) curve for performance evaluation. Our top-performing model, a Combined model with Active Feature Extraction (CAFE) consisting of two logistic regression algorithms, obtained an AUC of 0.918 when trained on data from one hospital and tested on samples of the other, a statistically significant improvement over Dong et al.’s AUC of 0.858. Pathologists can substantially improve their diagnoses by using it as an unbiased validator. In the future, our work can also serve as a valuable methodology for differentiating between low-grade and high-grade DCIS
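The evaluation metric mentioned in this abstract, the area under the ROC curve, can be computed directly as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below shows that calculation. The function name, the labels, and the scores are hypothetical and are not data from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scores from a model trained at one hospital and
# evaluated on samples from another (1 = DCIS, 0 = UDH).
test_labels = [1, 1, 1, 0, 0, 0]
test_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(test_labels, test_scores)
print(round(auc, 3))  # prints 0.889
```

The pairwise form is quadratic in the number of samples, which is acceptable for a sketch. A rank-based formulation gives the same value more efficiently on large test sets.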
Reconstrução e classificação de sequências de ADN desconhecidas [Reconstruction and classification of unknown DNA sequences]
Continuous advances in DNA sequencing technologies and metagenomics
techniques require reliable reconstruction and accurate classification
methodologies to increase the diversity of the natural repository while
contributing to the description and organization of organisms. However, after
sequencing and de-novo assembly, one of the most complex challenges
comes from DNA sequences that do not match or resemble any biological
sequence from the literature. Three main reasons contribute to this
exception: the organism's sequence diverges strongly from those of the
known organisms in the literature, an irregularity has been introduced in the
reconstruction process, or a new organism has been sequenced. The inability
to classify these unknown sequences efficiently increases uncertainty about
the sample's constitution and wastes the opportunity to discover
new species, since such sequences are often discarded.
In this context, the main objective of this thesis is the development and
validation of a tool that provides an efficient computational solution to
these three challenges based on an ensemble of experts, namely
compression-based predictors, the distribution of sequence content, and
normalized sequence lengths. The method uses both DNA and amino acid
sequences and provides efficient classification beyond standard referential
comparisons. Unusually, it classifies DNA sequences without resorting directly
to reference genomes, relying instead on features that the biological
sequences of a species share. Specifically, it uses only features extracted
individually from each genome, without sequence comparisons.
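The idea of describing each sequence by features computed from that sequence alone can be sketched as follows. This is only an illustration under loud assumptions: zlib stands in for whatever compression-based predictors the thesis actually uses, GC content stands in for "the distribution of sequence content", and the function name and the `max_length` parameter are hypothetical.

```python
import random
import zlib


def sequence_features(seq, max_length):
    """Reference-free features computed from a single DNA sequence.

    zlib and GC content are stand-ins chosen for this sketch; they are
    not the compressors or content statistics used in the thesis.
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    seq = seq.upper()
    raw = seq.encode("ascii")
    compressed = zlib.compress(raw, 9)
    return {
        # Lower ratio means more redundancy (more compressible sequence).
        "compression_ratio": len(compressed) / len(raw),
        "gc_content": (seq.count("G") + seq.count("C")) / len(seq),
        "normalized_length": len(seq) / max_length,
    }


# Hypothetical inputs: a highly repetitive sequence and a pseudo-random one.
repetitive = "ACGT" * 250
rng = random.Random(0)
shuffled = "".join(rng.choice("ACGT") for _ in range(1000))

rep = sequence_features(repetitive, max_length=1000)
rnd = sequence_features(shuffled, max_length=1000)
print(rep["compression_ratio"] < rnd["compression_ratio"])
```

Feature vectors of this kind can be fed to any standard classifier, and no alignment against a reference genome is needed at prediction time.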
RFSC was then created as a machine learning classification pipeline that
relies on an ensemble of experts to provide efficient classification in
metagenomic contexts. The pipeline is fully automatic and allows
reference-free reconstruction of genomes from FASTQ reads, with the
additional guarantee of secure storage of sensitive information. It was
tested on synthetic and real data, achieving in both cases precise and
accurate results that, at the time this thesis was developed, had not been
reported in the state of the art. Specifically, it achieved an accuracy of
approximately 97% in domain/type classification.
Master's degree in Computer and Telematics Engineering.