
    Swarm-Organized Topographic Mapping

    Topography-preserving mappings attempt to project high-dimensional or complex data sets onto a low-dimensional output space in such a way that the topography of the data is reproduced sufficiently well. The quality of such a mapping usually depends on the neighborhood concept employed by the constructing algorithm. Swarm-Organized Projection solves this parametrization problem by applying techniques from swarm intelligence. The practical applicability of the method was demonstrated in two applications, one in molecular biology and one in financial analytics.
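    The neighborhood concept referred to above can be made concrete with a minimal self-organizing map, the classical topography-preserving mapping whose hand-tuned neighborhood parameters (`sigma0`, `lr0` below) the swarm-organized approach is designed to make unnecessary. This is an illustrative NumPy sketch, not the paper's swarm algorithm; all names and parameter values are assumptions.

```python
import numpy as np

def train_som(data, grid_h=8, grid_w=8, epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """Train a minimal self-organizing map: high-dimensional inputs are
    mapped onto a 2-D grid whose neighborhood function preserves topography."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    weights = rng.random((grid_h, grid_w, dim))
    # Grid coordinates, used to measure neighborhood distances on the map.
    gy, gx = np.mgrid[0:grid_h, 0:grid_w]
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            # Decay learning rate and neighborhood radius over time.
            frac = step / n_steps
            lr = lr0 * (1 - frac)
            sigma = sigma0 * (1 - frac) + 0.5
            # Best-matching unit: the grid node closest to the input.
            d = np.linalg.norm(weights - x, axis=2)
            by, bx = np.unravel_index(np.argmin(d), d.shape)
            # Gaussian neighborhood on the grid pulls nearby nodes toward x.
            h = np.exp(-((gy - by) ** 2 + (gx - bx) ** 2) / (2 * sigma ** 2))
            weights += lr * h[:, :, None] * (x - weights)
            step += 1
    return weights

def project(weights, x):
    """Map an input to its best-matching grid coordinate."""
    d = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(np.argmin(d), d.shape)
```

    Inputs that are close in the data space end up at nearby grid coordinates, which is the topography-preservation property the abstract describes; the quality of that preservation is exactly what depends on the neighborhood parametrization.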

    Learning metrics and discriminative clustering

    In this work methods have been developed to extract relevant information from large, multivariate data sets in a flexible, nonlinear way. The techniques are applicable especially at the initial, explorative phase of data analysis, in cases where an explicit indicator of relevance is available as part of the data set. The unsupervised learning methods popular in data exploration often rely on a distance measure defined for data items. Selection of the distance measure, part of which is feature selection, is therefore fundamentally important. The learning metrics principle is introduced to complement manual feature selection by enabling automatic modification of a distance measure on the basis of available relevance information. Two applications of the principle are developed. The first emphasizes relevant aspects of the data by directly modifying distances between data items, and is usable, for example, in information visualization with self-organizing maps. The other method, discriminative clustering, finds clusters that are internally homogeneous with respect to the interesting variation of the data. The techniques have been applied to text document analysis, gene expression clustering, and charting the bankruptcy sensitivity of companies. In the first, more straightforward approach, a new local metric of the data space measures changes in the conditional distribution of the relevance-indicating data by the Fisher information matrix, a local approximation of the Kullback-Leibler distance. Discriminative clustering, on the other hand, directly minimizes a Kullback-Leibler-based distortion measure within the clusters, or equivalently maximizes the mutual information between the clusters and the relevance indicator. A finite-data algorithm for discriminative clustering is also presented. It maximizes a partially marginalized posterior probability of the model and is asymptotically equivalent to maximizing mutual information.
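    The mutual-information objective that discriminative clustering maximizes can be made concrete with a small sketch. The `mutual_information` helper below is hypothetical, not the thesis implementation: it estimates I(C; Y) between a cluster assignment and a relevance indicator from empirical frequencies, so a clustering that lines up with the indicator scores high while an independent clustering scores near zero.

```python
import numpy as np

def mutual_information(clusters, labels):
    """Empirical mutual information I(C; Y) in nats between a cluster
    assignment and a relevance indicator, from joint frequencies."""
    clusters = np.asarray(clusters)
    labels = np.asarray(labels)
    mi = 0.0
    for c in np.unique(clusters):
        for y in np.unique(labels):
            p_cy = np.mean((clusters == c) & (labels == y))  # joint probability
            if p_cy > 0:
                p_c = np.mean(clusters == c)  # cluster marginal
                p_y = np.mean(labels == y)    # indicator marginal
                mi += p_cy * np.log(p_cy / (p_c * p_y))
    return mi
```

    A discriminative clustering algorithm would search over cluster assignments to maximize this quantity; the sketch only evaluates the objective for a fixed assignment.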

    History and Theoretical Basics of Hidden Markov Models


    Simple but Not Simplistic: Reducing the Complexity of Machine Learning Methods

    Programa Oficial de Doutoramento en Computación. 5009V01
    [Abstract] The advent of Big Data and the explosion of the Internet of Things have brought unprecedented challenges to Machine Learning researchers, making the learning task even more complex. Real-world machine learning problems usually have inherent complexities, such as the intrinsic characteristics of the data, large numbers of instances, high input dimensionality, dataset shift, etc. All these aspects matter, and call for new models that can confront these situations. Thus, in this thesis, we have addressed all these issues, simplifying the machine learning process in the current scenario. First, we carry out a complexity analysis to see how it influences the classification task, and whether applying a feature selection preprocessing step might decrease that complexity. Then, we address the process of simplifying the learning phase with the divide-and-conquer philosophy of the distributed approach. Next, we apply that same philosophy to the feature selection process. Finally, we opt for a different approach following the current Edge Computing philosophy, which allows the data produced by Internet of Things devices to be processed closer to where they were created. The proposed approaches have demonstrated their capability to reduce the complexity of traditional machine learning methods, and thus it is expected that the contribution of this thesis will open the doors to the development of new machine learning methods that are simpler, more robust, and more computationally efficient.
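    The divide-and-conquer simplification described above can be sketched, under heavy assumptions, as distributed feature selection: each data partition ranks features independently and the rankings are then merged. The Fisher-score criterion and all names below are illustrative choices, not the thesis's actual methods.

```python
import numpy as np

def fisher_score(X, y):
    """Per-feature Fisher score: between-class over within-class variance."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    between = sum(np.mean(y == c) * (X[y == c].mean(axis=0) - overall) ** 2
                  for c in classes)
    within = sum(np.mean(y == c) * X[y == c].var(axis=0) for c in classes)
    return between / (within + 1e-12)

def distributed_select(X, y, k=5, n_parts=4, seed=0):
    """Divide-and-conquer: rank features on each partition independently,
    then merge the per-partition rankings by accumulated rank."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    ranks = np.zeros(X.shape[1])
    for part in np.array_split(idx, n_parts):
        scores = fisher_score(X[part], y[part])
        # argsort of argsort yields each feature's rank (higher score = higher rank).
        ranks += np.argsort(np.argsort(scores))
    return np.argsort(ranks)[::-1][:k]  # top-k features by merged rank
```

    Each partition only ever sees a fraction of the samples, which is what reduces the per-node cost relative to running the selector on the full data set.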

    E2M: A deep learning framework for associating combinatorial methylation patterns with gene expression

    We focus on the new problem of determining which methylation patterns in gene promoters strongly associate with gene expression in cancer cells of different types. Although a number of results regarding the influence of methylation on expression data have been reported in the literature, our approach is unique insofar as it retrospectively predicts the combinations of methylated sites in promoter regions of genes that are reflected in the expression data. Reversing the traditional prediction order in many cases makes estimation of the model parameters easier, as real-valued data are used to predict categorical data, rather than vice versa; in addition, our approach allows one to better assess the overall influence of methylation in modulating expression via state-of-the-art learning methods. For this purpose, we developed a novel neural network learning framework termed E2M (Expression-to-Methylation) to predict the status of different methylation sites in promoter regions of several biomarker genes based on sufficient statistics of the whole gene expression captured through Landmark genes. We ran our experiments on unquantized and quantized expression sets and neural network weights to illustrate the robustness of the method and reduce the storage footprint of the processing pipeline. We implemented a number of machine learning algorithms to address the new problem of methylation pattern inference, including multiclass regression, canonical correlation analysis (CCA), naive fully connected neural networks and inception neural networks. Inception neural networks such as E2M learners outperform all other techniques and offer an average prediction accuracy of 82% when tested on 3,671 pan-cancer samples including low grade glioma, glioblastoma, lung adenocarcinoma, lung squamous cell carcinoma, and stomach adenocarcinoma.
As an illustrative example, one can increase the prediction accuracy for the methylation pattern in the promoter of gene GATA6 in glioblastoma samples by 20% when using inception rather than simple fully connected neural networks. These performance guarantees remain largely unchanged even when both expression values and network weights are quantized. Our work also provides new insight about the importance of specific methylation site patterns on expression variations for different genes. In this context, we identified genes for which the overwhelming majority of patients exhibit one methylation pattern, and other genes with three or more significant classes of methylation patterns. Inception networks identify such patterns with high accuracy and suggest possible stratification of cancers based on methylation pattern profiles. The E2M code and datasets are freely available at https://github.com/jianhao2016/E2
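    E2M itself is an inception-style network; as a toy stand-in, the reversed prediction order the abstract emphasizes (real-valued expression features predicting a categorical methylation status) can be illustrated with plain logistic regression in NumPy. Function names and the synthetic setup are assumptions, not the released E2M code.

```python
import numpy as np

def train_logreg(X, y, lr=0.1, epochs=200):
    """Minimal logistic regression: real-valued expression features predict
    a binary methylation status (real -> categorical, the reversed order)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        z = np.clip(X @ w + b, -30, 30)       # clip for numerical stability
        p = 1 / (1 + np.exp(-z))              # predicted P(site methylated)
        grad_w = X.T @ (p - y) / len(y)       # gradient of cross-entropy loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(w, b, X):
    """Threshold the predicted probability into a methylation call."""
    z = np.clip(X @ w + b, -30, 30)
    return (1 / (1 + np.exp(-z)) > 0.5).astype(int)
```

    Because the inputs are continuous and the target is categorical, the gradient is smooth in the parameters, which is the estimation advantage the abstract attributes to reversing the prediction direction.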

    Classification of Primary versus Metastatic Pancreatic Tumor Cells Using Multiple Biomarkers and Whole Slide Imaging

    Pancreatic cancer is a challenging cancer with a high mortality rate and a 5-year survival rate between 2% and 9%. The role of biomarkers is crucial in cancer prognosis, diagnosis, and predicting the possible responses to a specific therapy. The discovery and development of various types of biomarkers have been studied intensively in the hope of determining the best treatment approaches, better management, and possibly a cure for this deadly cancer. However, metastasis, responsible for about 90% of cancer deaths, is still poorly understood. The few studies that have investigated the expression of a particular biomarker or a panel of biomarkers in the primary and secondary (metastatic) tumor demonstrate that the expression of different biomarkers in the primary and secondary tumor sites is not necessarily the same, even though the primary and metastatic tumor cells originate from the same organ. In this project, we aim to design a classifier to distinguish between primary and secondary tumor cells based on their uptake of different biomarkers, using immunofluorescence whole slide imaging. For this purpose, we first register consecutive images of the same slide together to be able to locate multiple biomarkers that belong to a cell, and then we design our classifier based on vectors that record the presence or absence of multiple antibodies in addition to the amount of each antibody in a tumor cell. Advisor: Khalid Sayoo
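    As a hypothetical sketch of the final step, a classifier over per-cell feature vectors (presence/absence of each antibody plus its measured amount) could be as simple as a nearest-centroid rule; the project's actual classifier design is not specified in the abstract, so everything below is an illustrative assumption.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Compute one centroid per class from per-cell biomarker vectors
    (presence/absence flags concatenated with intensity values)."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(classes, centroids, X):
    """Assign each cell to the class with the closest centroid."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]
```

    In practice the intensity features would need scaling so they do not dominate the binary presence flags, and a held-out split would be used to estimate accuracy.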

    A Novel Model-Free Data Analysis Technique Based on Clustering in a Mutual Information Space: Application to Resting-State fMRI

    Non-parametric data-driven analysis techniques can be used to study datasets with few assumptions about the data and underlying experiment. Variants of independent component analysis (ICA) have been the methods most commonly used on fMRI data, e.g., in finding resting-state networks thought to reflect the connectivity of the brain. Here we present a novel data analysis technique and demonstrate it on resting-state fMRI data. It is a generic method with few underlying assumptions about the data. The results are built from the statistical relations between all input voxels, resulting in a whole-brain analysis at the voxel level. It has good scalability properties, and the parallel implementation is capable of handling large datasets and databases. From the mutual information between the activities of the voxels over time, a distance matrix is created for all voxels in the input space. Multidimensional scaling is used to place the voxels in a lower-dimensional space reflecting the dependency relations encoded in the distance matrix. By performing clustering in this space we can find the strong statistical regularities in the data, which for the resting-state data turn out to be the resting-state networks. The decomposition is performed in the last step of the algorithm and is computationally simple. This opens the way for rapid analysis and visualization of the data at different spatial levels, as well as automatically finding a suitable number of decomposition components.
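    The pipeline described above (mutual information between voxel time series, a derived distance matrix, multidimensional scaling, then clustering) can be sketched in NumPy. The histogram MI estimator and classical (Torgerson) MDS below are illustrative choices; the published method's estimators and its parallel implementation may differ.

```python
import numpy as np

def mutual_info(x, y, bins=8):
    """Histogram estimate of I(X;Y) in nats for two time series."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])))

def mi_distance_matrix(series, bins=8):
    """Distance d(i,j) = H(i) + H(j) - 2 I(i,j): small for dependent voxels.
    Uses H(X) = I(X;X) under the same binning."""
    n = len(series)
    h = np.array([mutual_info(s, s, bins) for s in series])
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = h[i] + h[j] - 2 * mutual_info(series[i], series[j], bins)
    return d

def classical_mds(d, dim=2):
    """Embed a distance matrix into `dim` dimensions (Torgerson scaling)."""
    n = len(d)
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j                      # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(b)
    order = np.argsort(vals)[::-1][:dim]             # keep top eigenpairs
    return vecs[:, order] * np.sqrt(np.maximum(vals[order], 0))
```

    Any off-the-shelf clustering (e.g., k-means) can then be run on the embedded coordinates; statistically dependent series land close together, so their clusters correspond to the strong regularities the abstract describes.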