Sparsity in machine learning: theory and practice.
The thesis explores sparse machine learning algorithms for supervised (classification and regression) and unsupervised (subspace methods) learning. For classification, we review the set covering machine (SCM) and propose new algorithms that directly minimise the SCM's sample compression generalisation error bounds during the training phase. Two of the resulting algorithms are proved to produce optimal or near-optimal solutions with respect to the loss bounds they minimise. One of the SCM loss bounds is shown to be incorrect, and a corrected derivation of the sample compression bound is given, along with a framework for allowing asymmetrical loss in sample compression risk bounds. In regression, we analyse the kernel matching pursuit (KMP) algorithm and derive a loss bound that takes into account the dual sparse basis vectors. We make connections to a sparse kernel principal components analysis (sparse KPCA) algorithm and bound its future loss using a sample compression argument. This investigation suggests a similar argument for kernel canonical correlation analysis (KCCA), and applying a similar sparsity algorithm gives rise to the sparse KCCA algorithm. We also propose a loss bound for sparse KCCA using the novel technique developed for KMP. All of the algorithms and bounds proposed in the thesis are elucidated with experiments.
Sparse Dimensionality Reduction Methods: Algorithms and Applications
Ph.D. (Doctor of Philosophy)
An ANOVA approach for statistical comparisons of brain networks
The study of brain networks has developed extensively over the last couple of decades. By contrast, techniques for the statistical analysis of these networks are less developed. In this paper, we focus on the statistical comparison of brain networks in a nonparametric framework and discuss the associated detection and identification problems. We test network differences between groups with an analysis of variance (ANOVA) test that we developed specifically for networks. We also propose and analyse the behaviour of a new statistical procedure designed to identify different subnetworks. As an example, we show the application of this tool to resting-state fMRI data obtained from the Human Connectome Project. We find that, among other variables, the amount of sleep in the days before the scan is a relevant variable that must be controlled for. Finally, we discuss the potential bias in neuroimaging findings that is generated by some behavioural and brain structure variables. Our method can also be applied to other kinds of networks, such as protein interaction networks, gene networks or social networks.
Affiliation: Fraiman Borrazás, Daniel Edmundo. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad de San Andrés. Departamento de Matemáticas y Ciencias; Argentina
Affiliation: Fraiman, Jacob Ricardo. Universidad de la República; Uruguay. Instituto Pasteur de Montevideo; Uruguay
Iterative Random Forests to detect predictive and stable high-order interactions
Genomics has revolutionized biology, enabling the interrogation of whole
transcriptomes, genome-wide binding sites for proteins, and many other
molecular processes. However, individual genomic assays measure elements that
interact in vivo as components of larger molecular machines. Understanding how
these high-order interactions drive gene expression presents a substantial
statistical challenge. Building on Random Forests (RF), Random Intersection
Trees (RITs), and through extensive, biologically inspired simulations, we
developed the iterative Random Forest algorithm (iRF). iRF trains a
feature-weighted ensemble of decision trees to detect stable, high-order
interactions with the same order of computational cost as RF. We demonstrate the
utility of iRF for high-order interaction discovery in two prediction problems:
enhancer activity in the early Drosophila embryo and alternative splicing of
primary transcripts in human-derived cell lines. In Drosophila, among the 20
pairwise transcription factor interactions iRF identifies as stable (returned
in more than half of bootstrap replicates), 80% have been previously reported
as physical interactions. Moreover, novel third-order interactions, e.g.
between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order
relationships that are candidates for follow-up experiments. In human-derived
cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated
splicing regulation, and identified novel 5th and 6th order interactions,
indicative of multi-valent nucleosomes with specific roles in splicing
regulation. By decoupling the order of interactions from the computational cost
of identification, iRF opens new avenues of inquiry into the molecular
mechanisms underlying genome biology.
Data integration in inflammatory bowel disease
INTRODUCTION: Inflammatory bowel disease is a complex intestinal disease with several genetic and environmental factors that can influence its course. The etiology and pathophysiology of the disease are not fully understood. There is some evidence that the microbiome can play a role. Finding relationships between the microbiome and the host's mucosa could help advance prevention, diagnosis or treatment.
METHODS: We based our analysis on intestinal bacterial 16S rRNA and human transcriptome data from biopsies taken at multiple timepoints and intestinal segments. We extended regularized generalized canonical correlation analysis to find models that are coherent with previous knowledge of the disease, taking into account the samples' information. Multiple inflammatory bowel disease datasets covering different treatments and conditions were analysed, and the models defining those datasets were compared. The results were compared with multiple co-inertia analysis.
RESULTS: Splitting sample variables into different blocks results in models of these relationships that show differences in the genes and microorganisms selected. The models generated using our new method, inteRmodel, outperformed multiple co-inertia analysis in classifying the samples according to their location. Despite being used on datasets from different sources, the resulting models show similar relationships between variables.
DISCUSSION: Comparing multiple models helps uncover the relationships within datasets. Our method finds how strong the relationships are between the microbiome, the transcriptome and environmental variables. The genes selected were common across different datasets. This approach is robust and flexible across different datasets and settings.
CONCLUSION: With inteRmodel we found that the microbiome relates more closely to the sample location than to disease, but the transcriptome is highly related to the location of the sample in the intestine. There is a common transcriptome between datasets, while the microorganisms depend on the dataset. We can improve sample classification by taking into account both bacterial 16S rRNA and the host transcriptome.
A survey on wearable sensor modality centred human activity recognition in health care
Increased life expectancy coupled with declining birth rates is leading to an aging population structure. Aging-related changes, such as physical or cognitive decline, can affect people's quality of life and result in injuries, mental health problems or a lack of physical activity. Sensor-based human activity recognition (HAR) is one of the most promising assistive technologies for supporting older people's daily life, and it has shown enormous potential in human-centred applications. Recent surveys of HAR focus either only on deep learning approaches or on one specific sensor modality. This survey aims to provide a more comprehensive introduction to HAR for newcomers and researchers. We first introduce the state-of-the-art sensor modalities in HAR. We then look in more detail at the techniques involved in each step of wearable sensor modality centred HAR in terms of sensors, activities, data pre-processing, feature learning and classification, including both conventional approaches and deep learning methods. In the feature learning section, we cover both hand-crafted features and features learned automatically by deep networks. We also present ambient-sensor-based HAR, including camera-based systems, and systems that combine wearable and ambient sensors. Finally, we identify the corresponding challenges in HAR to pose research problems for further improvement.