5 research outputs found

    Analysis of flow cytometry data using domain-adversarial autoencoders

    Master's thesis in Bioinformatics and Computational Biology. Machine Learning is a field of Artificial Intelligence focused on automatic data analysis. In the era of big data, algorithms have appeared that allow large quantities of data to be analyzed efficiently, incorporating more knowledge into our studies. One of the main fields of application for these algorithms is bioinformatics, where large amounts of high-dimensional data are typically analyzed. However, one of the main difficulties in the automatic analysis of data of biological origin is the inevitable variation in experimental conditions, which causes the well-known batch effects. These make it difficult to integrate data coming from different experimental sources, reducing the capacity for simultaneous analysis and losing relevant biological information. Focusing on flow cytometry data, in this work we propose a new unsupervised learning algorithm that aims to smooth the influence of batch effects simultaneously across an arbitrary number of experimental conditions. Applying state-of-the-art Machine Learning techniques, such as domain adaptation and adversarial learning, we present the domain-adversarial autoencoder (DAE). To validate the DAE as a domain adaptation or batch normalization algorithm, we carry out experiments with three datasets. The first two are simple, artificial datasets composed of beads that were passed through the cytometer in a controlled environment: in one of them, clogging or misalignment of the cytometer is artificially simulated; in the other, the same data are analyzed on two different machines. The third is a real dataset of mouse dendritic cells, also collected on two different cytometers. Firstly, we show how these batch effects influence the analyses typically applied by flow cytometry users, such as clustering with Phenograph or visualization with t-SNE.
Secondly, we see how the DAE manages to efficiently alleviate the batch effects in these examples and improve the clustering results, achieving a notable increase in the F1-score after the correction. In addition, we provide a visual evaluation of the two-dimensional representations learned with a standard autoencoder (SAE), t-SNE and a DAE. We also present a novel method to evaluate the quality of the batch normalization of data using statistical distances; in particular, we use the multidimensional version of the Kolmogorov-Smirnov distance between distributions. We show that the distribution of the data in the latent representation of the DAE is very similar when the data come from different experiments, presenting a smaller distance than in the case of the SAE, where the algorithm is not given domain information during training. This work therefore allows us to conclude that domain adaptation in flow cytometry data opens a new line of research, focused on developing tools for the integration of data from different experiments.
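The distance-based evaluation described in this abstract can be sketched in a few lines. The following is a minimal, single-orthant simplification of a multidimensional Kolmogorov-Smirnov statistic applied to synthetic "batches" (the full multivariate version considers all orthants); it is an illustrative assumption, not the thesis implementation.

```python
import numpy as np

def mv_ks(A, B):
    """Crude multivariate KS-style statistic: largest absolute difference
    between the empirical CDFs of A and B, evaluated at the pooled sample
    points (single-orthant simplification of the KS distance)."""
    pts = np.vstack([A, B])
    def ecdf(S):
        # fraction of rows of S dominated component-wise by each point
        return np.mean(np.all(S[None, :, :] <= pts[:, None, :], axis=2), axis=1)
    return float(np.max(np.abs(ecdf(A) - ecdf(B))))

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, (300, 2))   # "batch 1"
B = rng.normal(0.0, 1.0, (300, 2))   # "batch 2": same distribution
C = rng.normal(1.5, 1.0, (300, 2))   # "batch 3": shifted (a batch effect)
```

On data like this, `mv_ks(A, B)` stays near zero while `mv_ks(A, C)` is large, which is the behavior the abstract exploits to compare DAE and SAE latent spaces.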

    Analysis and implementation of different similarity measures for a global feature selection algorithm

    Feature selection is a paradigm in Machine Learning whose aim is to choose the most important or relevant features in a pattern recognition problem. It is of special interest today due to the appearance of high-dimensional problems, in which some features contribute no useful information and yet increase the computational cost and the difficulty of discovering the rules underlying the data. As a consequence, the literature that examines, proposes and compares feature selection methods, as well as the metrics they use, is very extensive. Especially popular are methods that try to minimize redundancy between features while maximizing their relevance to the class to be predicted; minimum Redundancy Maximum Relevance (mRMR) and Quadratic Programming Feature Selection (QPFS) fall into this category. The main objective of this work is to analyse the use of different similarity measures between random variables (i.e. metrics) in the global feature selection algorithm QPFS. The chosen similarity measures have proved effective in other works: Pearson Correlation, Mutual Information, Conditional Mutual Information, Distance Covariance and Distance Correlation. This work estimates the computational complexity added to QPFS by these measures and studies, on several datasets, the classification accuracy obtained when Naive Bayes is used as the classifier after QPFS-based feature selection.
    This work also proposes a new algorithm, DQPFS (Diagonal Quadratic Programming Feature Selection), to address a problem of the original QPFS implementation: when Mutual Information is used as the similarity measure, QPFS penalizes the most entropic features, which can lead to suboptimal results. Python was used as the programming language, together with some of its libraries (SciPy, NumPy, Pandas). Applying QPFS or DQPFS requires solving a QP (Quadratic Programming) problem, for which the CVXOPT library was employed. Exhaustive tests were carried out varying the parameters inherent to QPFS, DQPFS and the different similarity measures. The contribution of this TFG is a study of the most popular similarity measures between random variables, together with some that are gaining importance in the literature; these metrics had not previously been tested in a global feature selection algorithm. The results show a positive performance of the proposed similarity measures, with the good performance of the Conditional Mutual Information being especially remarkable.
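The QP formulation that QPFS solves can be sketched as follows. This is a minimal illustration on synthetic data, using SciPy's SLSQP solver instead of CVXOPT and the absolute Pearson correlation as the similarity measure; the data, the value of alpha and the solver choice are all assumptions for the example, not the thesis implementation.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: feature 0 drives the target, feature 1 is a redundant copy,
# feature 2 is pure noise (illustrative numbers only).
rng = np.random.default_rng(0)
n = 500
f0 = rng.normal(size=n)
y = f0 + 0.1 * rng.normal(size=n)
X = np.column_stack([f0,
                     f0 + 0.05 * rng.normal(size=n),  # redundant with f0
                     rng.normal(size=n)])             # irrelevant noise

def abs_pearson(a, b):
    return abs(np.corrcoef(a, b)[0, 1])

d = X.shape[1]
# Q: feature-feature similarity (redundancy); F: feature-class relevance
Q = np.array([[abs_pearson(X[:, i], X[:, j]) for j in range(d)] for i in range(d)])
F = np.array([abs_pearson(X[:, i], y) for i in range(d)])

# QPFS-style objective: trade redundancy against relevance on the simplex
alpha = 0.5
res = minimize(lambda w: 0.5 * (1 - alpha) * w @ Q @ w - alpha * F @ w,
               np.full(d, 1.0 / d),
               method="SLSQP",
               bounds=[(0.0, None)] * d,
               constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
weights = res.x  # features are ranked by these weights
```

On this toy problem the optimizer splits nearly all the weight between the two redundant, relevant features and assigns almost none to the noise feature, which is the ranking behavior the abstract describes.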

    Low-Rank Approximation and Diffusion Maps

    The theory of dimensionality reduction is fundamental for many problems in Machine Learning. There are many approaches, but this work focuses on manifold learning methods. The starting point is to assume that the data live on a manifold of lower dimension than the ambient one, in order to understand the underlying phenomenon that generated them. Within this field, the algorithm known as Diffusion Maps, the main object of this work, is of special interest due to its strong mathematical foundation. First we study Diffusion Maps, along with the mathematical theory necessary for their correct understanding, covering concepts such as similarity graphs and their Laplacians, and the diffusion distance. The main drawback of Diffusion Maps, as of other spectral algorithms, is that they require the diagonalization of a square matrix whose dimension is the number of examples; their computational cost is therefore O(N³), where N is the number of examples. Because of that, one of the objectives of this work is to compute a low-rank approximation of Diffusion Maps using the Nyström method. In addition, to evaluate the quality of the approximation, we propose a metric based on the reconstruction error of the diffusion matrix. A further problem arises when computing the embedding of an example that was not in the initial sample used to build the transformation: the spectral analysis of the matrix must be redone, which is especially critical in applications with real-time constraints.
    In this work we also analyze two proposals to alleviate this cost: learning the embedding with neural networks for regression (Diffusion Nets), which makes it possible to compute the embedding of a new example, and computing an extension of the embedding for a new point with the Nyström method. The results are presented on three datasets, one of them synthetic. For all of them, we compute the embedding into a lower-dimensional space via Diffusion Maps, with the objectives of extending it to out-of-sample (OOS) examples and of evaluating the low-rank approximation achieved by the Nyström method. Extensions for OOS patterns are computed with both Diffusion Nets and the Nyström method, and the results are compared visually. To sum up, regarding the quality of the low-rank approximation we see that, as expected, increasing the number of examples in the training set reduces the reconstruction error of the diffusion matrix. Regarding the methods for extending the embedding to OOS examples, we observe that both the Nyström method and Diffusion Nets obtain visually similar results, embedding examples of the same class in the same regions of the space. This work gives rise to new lines of research: as future work it is of special interest, among others, to be able to compare the quality of the extensions for OOS examples.
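The Nyström out-of-sample extension discussed in this abstract can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not the thesis code; the kernel bandwidth eps, the cluster geometry and the number of coordinates m are all assumptions for the example.

```python
import numpy as np

# Toy data: two well-separated Gaussian clusters (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(3.0, 0.3, (50, 2))])

eps = 1.0
def kernel(A, B):
    """Gaussian affinity between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / eps)

K = kernel(X, X)
P = K / K.sum(axis=1, keepdims=True)      # row-stochastic Markov matrix

# Spectral decomposition of P, eigenvalues sorted in decreasing order.
vals, vecs = np.linalg.eig(P)
order = np.argsort(-vals.real)
vals, vecs = vals[order].real, vecs[:, order].real

# Diffusion coordinates lambda_j * phi_j, skipping the trivial first pair.
m = 2
embed = vecs[:, 1:m + 1] * vals[1:m + 1]

# Nystrom extension: since P phi = lambda phi, one transition step from a
# new point against the training set reproduces lambda_j * phi_j(x_new).
x_new = np.array([[3.0, 3.1]])
k_new = kernel(x_new, X)
p_new = k_new / k_new.sum()
embed_new = p_new @ vecs[:, 1:m + 1]
```

A point drawn near the second cluster lands, after the extension, in the same region of the diffusion space as that cluster's training embeddings, which is the qualitative behavior reported for both Nyström and Diffusion Nets.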

    Real-life disease monitoring in follicular lymphoma patients using liquid biopsy ultra-deep sequencing and PET/CT

    In the present study, we screened 84 Follicular Lymphoma (FL) patients for somatic mutations suitable as liquid biopsy MRD biomarkers using a targeted next-generation sequencing (NGS) panel. We found trackable mutations in 95% of the lymph node samples and 80% of the liquid biopsy baseline samples. Then, we used an ultra-deep sequencing approach with 2×10⁻⁴ sensitivity (LiqBio-MRD) to track those mutations in 151 follow-up liquid biopsy samples from 54 treated patients. Positive LiqBio-MRD at first-line therapy correlated with a higher risk of progression both at the interim evaluation (HR 11.0, 95% CI 2.10–57.7, p = 0.005) and at the end of treatment (EOT; HR 19.1, 95% CI 4.10–89.4, p < 0.001). Similar results were observed with the PET/CT Deauville score, with a median PFS of 19 months vs. not reached (p < 0.001) at the interim evaluation and 13 months vs. not reached (p < 0.001) at EOT. LiqBio-MRD and PET/CT combined identified the patients who progressed in less than two years with 88% sensitivity and 100% specificity. Our results demonstrate that LiqBio-MRD is a robust and non-invasive approach, complementary to metabolic imaging, for identifying FL patients at high risk of failure during treatment, and it should be considered in future response-adapted clinical trials. This study has been funded by Instituto de Salud Carlos III (ISCIII) and co-funded by the European Union through the projects PI21/00314, PI19/01430, PI19/01518 and PI18/00295, PTQ2020-011372, CP19/00140, CP22/00082, Doctorado industrial CAM IND2020/TIC-17402 and the CRIS Cancer Foundation.

    Detection of kinase domain mutations in BCR::ABL1 leukemia by ultra-deep sequencing of genomic DNA

    Document written by a large number of authors; only the first author and the authors affiliated with UC3M are referenced. The screening of the BCR::ABL1 kinase domain (KD) mutation has become a routine analysis in case of warning/failure for chronic myeloid leukemia (CML) and Philadelphia (Ph)-positive B-cell precursor acute lymphoblastic leukemia (ALL) patients. In this study, we present a novel DNA-based next-generation sequencing (NGS) methodology for KD ABL1 mutation detection and monitoring with a 1.0E−4 sensitivity. This approach was validated against a well-established RNA-based nested NGS method. The correlation of the two techniques for the quantification of ABL1 mutations was high (Pearson r = 0.858, p < 0.001), with DNA-DeepNGS offering a sensitivity of 92% and a specificity of 82%. The clinical impact was studied in a cohort of 129 patients (n = 67 CML and n = 62 B-ALL). A total of 162 samples (n = 86 CML and n = 76 B-ALL) were studied. Of them, 27 out of 86 harbored mutations (6 in warning and 21 in failure) for CML patients, and 13 out of 76 (2 diagnostic and 11 relapse samples) did for B-ALL patients. In addition, mutations were detected in four cases despite BCR::ABL1 levels below 1%. In conclusion, we were able to detect KD ABL1 mutations with a 1.0E−4 sensitivity by NGS using DNA as starting material, even in patients with low levels of disease. This project was funded in part by the CRIS Cancer Foundation.
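The validation metrics quoted in this abstract (sensitivity, specificity, Pearson r) come from standard definitions, which the following sketch makes explicit. All counts and the simulated paired quantifications are hypothetical numbers chosen only to roughly reproduce the reported figures; they are not the study's data.

```python
import numpy as np

# Hypothetical confusion counts for DNA-DeepNGS against the reference
# RNA-based nested NGS method (illustrative numbers only).
tp, fn = 23, 2    # mutations confirmed / missed by DNA-DeepNGS
tn, fp = 18, 4    # wild-type calls confirmed / falsely positive

sensitivity = tp / (tp + fn)   # true-positive rate
specificity = tn / (tn + fp)   # true-negative rate

# Agreement between the two quantification methods is summarized with a
# Pearson correlation on paired mutation-burden estimates (simulated here).
rng = np.random.default_rng(0)
burden_dna = rng.uniform(0.0001, 0.5, 30)              # simulated DNA-based values
burden_rna = burden_dna * rng.normal(1.0, 0.1, 30)     # noisy RNA-based values
r = np.corrcoef(burden_dna, burden_rna)[0, 1]
```

With these hypothetical counts, sensitivity evaluates to 0.92 and specificity to about 0.82, matching the scale of the figures reported in the abstract.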