10 research outputs found

    Supervised and Semi-Supervised Self-Organizing Maps for Regression and Classification Focusing on Hyperspectral Data

    Get PDF
    Machine learning approaches are valuable methods in hyperspectral remote sensing, especially for the classification of land cover or for the regression of physical parameters. While the recording of hyperspectral data has become affordable with innovative technologies, the acquisition of reference data (ground truth) has remained expensive and time-consuming. There is a need for methodological approaches that can handle datasets with significantly more hyperspectral input data than reference data. We introduce the Supervised Self-organizing Maps (SuSi) framework, which can perform unsupervised, supervised and semi-supervised classification as well as regression on high-dimensional data. The methodology of the SuSi framework is presented and compared to other frameworks. Its different parts are evaluated on two hyperspectral datasets. The results of the evaluations can be summarized in four major findings: (1) The supervised and semi-supervised self-organizing maps (SOM) outperform a random forest in the regression of soil moisture. (2) In the classification of land cover, the supervised and semi-supervised SOM reveal great potential. (3) The unsupervised SOM is a valuable tool for understanding the data. (4) The SuSi framework is versatile, flexible, and easy to use. The SuSi framework is provided as an open-source Python package on GitHub.
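
    As a hedged illustration of the scikit-learn-style interface the SuSi package exposes, the minimal Python sketch below fits a SOM regressor on synthetic data; the grid size, iteration counts and data are illustrative assumptions, not values from the paper.

        # Minimal sketch of the SuSi workflow (pip install susi); grid size,
        # iteration counts and the synthetic data are illustrative assumptions.
        import numpy as np
        import susi

        rng = np.random.default_rng(42)
        X = rng.random((200, 100))   # 200 samples, 100 hyperspectral bands
        y = rng.random(200)          # physical target, e.g. soil moisture

        som = susi.SOMRegressor(n_rows=15, n_columns=15,
                                n_iter_unsupervised=5000,
                                n_iter_supervised=5000,
                                random_state=42)
        som.fit(X, y)
        print(som.predict(X[:5]))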

    Searching for new phenotype profiles of diabetic complications and identifying between-group genetic components using machine learning methods

    Get PDF
    Patients with Type 1 diabetes (T1D) may develop a wide variety of additional, slowly progressing complications, which have been shown to be partly heritable and to correlate with each other. However, the genetic and biological mechanisms behind them are still mostly unknown. The goal of this work was to use machine learning and data mining approaches that could capture the progressive nature of multiple complications simultaneously, and to create novel phenotype classes that could help to unravel the pathogenesis and genetics of diabetic complications. To achieve this, a dual-layer self-organizing map (SOM) was trained using clinical and environmental patient data from the FinnDiane study, and the trained SOM node prototypes were clustered into classes using agglomerative hierarchical clustering. The genetic differences between the created classes were evaluated using heritability estimates, and the genetic markers associated with the class assignments showing significant heritability were analysed in a genome-wide association study (GWAS). The created class assignments were biologically plausible and were estimated to be up to 42% genetically determined. The GWAS analyses detected a genetic marker (rs202095311, located in the last intron of the gene NRIP1) associated with one of the created class assignments at genome-wide significance (p < 5×10^-8). In addition, the GWAS detected multiple other genetic regions with suggestive p-values, containing mostly genes and processes previously linked to diabetic complications or their risk factors. Overall, the new approach to studying the genetics of complex diseases was found to perform well in the case of T1D and its complications, and could also be used to study other complex traits and diseases.
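
    The two-stage procedure described above (train a SOM, then cluster its node prototypes with agglomerative hierarchical clustering) can be sketched in Python as follows; minisom and scikit-learn stand in for the authors' dual-layer implementation, and all dimensions are illustrative assumptions.

        # Hedged sketch: train a SOM on patient data, then cluster the learned
        # node prototypes with agglomerative hierarchical clustering.
        import numpy as np
        from minisom import MiniSom
        from sklearn.cluster import AgglomerativeClustering

        rng = np.random.default_rng(0)
        X = rng.random((500, 20))                # 500 patients, 20 clinical variables

        som = MiniSom(10, 10, X.shape[1], random_seed=0)
        som.train_random(X, 10_000)              # first layer: organize patients

        # Second stage: hierarchical clustering of the 100 node prototypes.
        prototypes = som.get_weights().reshape(-1, X.shape[1])
        node_class = AgglomerativeClustering(n_clusters=5).fit_predict(prototypes)

        # Each patient inherits the class of its best-matching SOM node.
        patient_class = [node_class[i * 10 + j] for i, j in (som.winner(x) for x in X)]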

    Development and Applications of Machine Learning Methods for Hyperspectral Data

    Get PDF
    Hyperspectral remote sensing of the Earth relies on data from passive optical sensors mounted on platforms such as satellites and unmanned aerial vehicles. Hyperspectral data contain information for identifying materials and for monitoring environmental variables such as soil texture, soil moisture, chlorophyll a, and land cover. Data analysis methods are required to extract information from hyperspectral data, and a powerful tool in this analysis is machine learning, a subfield of artificial intelligence. Machine learning methods can resolve non-linear correlations and scale with growing data volumes. Every dataset and every machine learning method brings new challenges that require innovative solutions. The goal of this thesis is the development and application of machine learning methods for hyperspectral remote sensing data. The thesis presents studies addressing three major challenges: (I) datasets that contain only few data points with associated target values, (II) the limited potential of shallow (non-deep) machine learning methods on hyperspectral data, and (III) differences between the distributions of the training and test datasets. The studies on challenge (I) lead to the development and release of a framework of self-organizing maps (SOMs) for unsupervised, supervised and semi-supervised learning. The SOM is applied to a hyperspectral dataset for the (semi-)supervised regression of soil moisture and outperforms a standard machine learning method. The SOM framework shows adequate performance in the (semi-)supervised classification of land cover and offers additional visualization capabilities that improve the understanding of the underlying dataset. In the studies addressing challenge (II), three innovative one-dimensional convolutional neural network (CNN) architectures are developed. The CNNs are applied to a freely available hyperspectral dataset for soil texture classification, and their performance is compared with two existing CNN approaches and a random forest. The two most important findings can be summarized as follows: first, the CNN approaches clearly outperform the applied shallow random forest approach; second, adding information about hyperspectral band numbers to the input layer of a CNN improves the per-class performance. The studies on challenge (III) are based on a dataset recorded in five different measurement areas in Peru in 2019. The differences between the areas are analysed with qualitative methods and with unsupervised machine learning methods such as principal component analysis and autoencoders. Based on the results, a supervised regression of soil moisture is performed for different combinations of measurement areas. In addition, the dataset is augmented with Monte Carlo methods to study the effects of distribution shifts on the regression. The applied SOM regressor is relatively robust to the noise of the soil moisture sensor and performs well on small datasets, while the applied random forest performs best on the complete dataset. The distribution shift makes this regression task difficult; some combinations of measurement areas form a considerably more meaningful training dataset than others. Overall, the presented studies addressing the three major challenges show promising results. Finally, the thesis indicates how the developed machine learning methods can be further improved in future research.
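
    One of the findings above, that adding band-number information to the input layer of a one-dimensional CNN improves per-class performance, can be illustrated with the hedged PyTorch sketch below; the architecture is a simplified stand-in, not one of the thesis's three networks, and all sizes are illustrative.

        # Hedged sketch of a 1D CNN for per-pixel hyperspectral classification,
        # with the normalized band index appended as a second input channel.
        import torch
        import torch.nn as nn

        class Spectral1DCNN(nn.Module):
            def __init__(self, n_bands: int, n_classes: int):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv1d(2, 32, kernel_size=5, padding=2), nn.ReLU(),
                    nn.MaxPool1d(2),
                    nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
                    nn.AdaptiveAvgPool1d(1),
                )
                self.classifier = nn.Linear(64, n_classes)
                # Normalized band indices, used as the extra input channel.
                self.register_buffer("band_idx", torch.linspace(0, 1, n_bands))

            def forward(self, spectra: torch.Tensor) -> torch.Tensor:
                # spectra: (batch, n_bands) reflectance values
                idx = self.band_idx.expand(spectra.shape[0], -1)
                x = torch.stack([spectra, idx], dim=1)   # (batch, 2, n_bands)
                return self.classifier(self.features(x).squeeze(-1))

        model = Spectral1DCNN(n_bands=125, n_classes=4)
        logits = model(torch.rand(8, 125))               # 8 pixels, 125 bands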

    Computer aided identification of biological specimens using self-organizing maps

    Get PDF
    For scientific or socio-economic reasons it is often necessary or desirable that biological material be identified. Given that there are an estimated 10 million living species on Earth, the identification of biological material can be problematic, and the services of specialist taxonomists are often required. However, if such expertise is not readily available, it is necessary to attempt an identification using an alternative method. Some of these alternative methods are unsatisfactory or can lead to a wrong identification. One of the most common problems encountered when identifying specimens is that important diagnostic features are often not easily observed, or may even be completely absent. A number of techniques can be used to try to overcome this problem. One of these, the Self-Organizing Map (SOM), is a particularly appealing technique because of its ability to handle missing data. This thesis explores the use of SOMs as a technique for the identification of indigenous trees of the genus Acacia in KwaZulu-Natal, South Africa. The ability of the SOM technique to perform exploratory data analysis through data clustering is utilized and assessed, as is its usefulness for visualizing the results of the analysis of numerical, multivariate botanical data sets. The SOM's ability to investigate, discover and interpret relationships within these data sets is examined, and the technique's ability to identify tree species successfully is tested. These data sets are also tested using the C5 and CN2 classification techniques, and the results from both are compared with those obtained using a commercial SOM package. The results indicate that the application of the SOM to the problem of biological identification could provide the start of the long-awaited breakthrough in computerized identification that biologists have been seeking.
    Dissertation (MSc), University of Pretoria, 2011.
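
    The SOM's tolerance of missing data, which motivates its use here, typically comes from finding the best-matching unit over the observed dimensions only. The NumPy sketch below illustrates this masked-distance rule; it is a common SOM convention, not necessarily the exact scheme used in the thesis, and the data are illustrative.

        # Hedged sketch: best-matching-unit search that ignores missing features.
        import numpy as np

        def best_matching_unit(codebook: np.ndarray, sample: np.ndarray) -> int:
            """codebook: (n_nodes, n_features); sample may contain NaN for
            missing diagnostic features. Returns the index of the nearest node."""
            observed = ~np.isnan(sample)
            diff = codebook[:, observed] - sample[observed]
            return int(np.argmin((diff ** 2).sum(axis=1)))

        codebook = np.random.rand(25, 6)    # 5x5 map, 6 botanical features
        specimen = np.array([0.8, np.nan, 0.3, np.nan, 0.5, 0.1])  # 2 unobserved
        print(best_matching_unit(codebook, specimen))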

    Vector Quantization Techniques for Approximate Nearest Neighbor Search on Large-Scale Datasets

    Get PDF
    The technological developments of the last twenty years are leading the world to a new era. The invention of the internet, mobile phones and smart devices is resulting in an exponential increase in data. As the data grows every day, finding similar patterns or matching samples to a query is no longer a simple task because of its computational costs and storage limitations. Special signal processing techniques are required in order to handle the growth in data, as simply adding more and more computers cannot keep up.
    Nearest neighbor search, also called similarity search, proximity search or near item search, is the problem of finding the item that is nearest or most similar to a query according to a distance or similarity measure. When the reference set is very large, or the distance or similarity calculation is complex, performing the nearest neighbor search can be computationally demanding. Considering today's ever-growing datasets, where the number of samples also keeps increasing, a growing interest in approximate methods has emerged in the research community.
    Vector Quantization for Approximate Nearest Neighbor Search (VQ for ANN) has proven to be one of the most efficient and successful methods targeting this problem. It proposes to compress vectors into binary strings and to approximate the distances between vectors using look-up tables. With this approach, the approximation of distances is very fast, while the storage space requirement of the dataset is minimized thanks to the extreme compression levels. The distance approximation performance of VQ for ANN has been shown to be sufficient for retrieval and classification tasks, demonstrating that VQ for ANN techniques can be a good replacement for exact distance calculation methods.
    This thesis contributes to the VQ for ANN literature by proposing five advanced techniques, which aim to provide fast and efficient approximate nearest neighbor search on very large-scale datasets. The proposed methods can be divided into two groups. The first group consists of two techniques that introduce subspace clustering to VQ for ANN; these are shown to give state-of-the-art performance in tests on prevalent large-scale benchmarks. The second group consists of three methods that propose improvements on residual vector quantization; these are also shown to outperform their predecessors. Apart from these, a sixth contribution of this thesis is a demonstration of VQ for ANN in an application of image classification on large-scale datasets. It is shown that a k-NN classifier based on VQ for ANN performs on par with exact k-NN classifiers, but requires much less storage space and computation.
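
    The core mechanism, compressing vectors into short codes and approximating distances with look-up tables, can be illustrated with the hedged sketch below. It implements plain product quantization, the baseline that the thesis's techniques improve on; all sizes and the use of scikit-learn's KMeans are illustrative assumptions.

        # Hedged sketch of VQ for ANN via product quantization: split vectors
        # into subvectors, quantize each to a codeword index, and approximate
        # query distances by summing precomputed per-subspace table entries.
        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(1)
        D, M, K = 16, 4, 32                  # dimension, subspaces, codewords each
        data = rng.random((1000, D))
        sub = D // M

        codebooks, codes = [], np.empty((len(data), M), dtype=np.int64)
        for m in range(M):
            km = KMeans(n_clusters=K, n_init=4, random_state=1).fit(
                data[:, m * sub:(m + 1) * sub])
            codebooks.append(km.cluster_centers_)
            codes[:, m] = km.labels_          # each vector stored as M small ints

        def approx_distances(query):
            # One look-up table per subspace; a distance is a sum of M entries.
            luts = [((codebooks[m] - query[m * sub:(m + 1) * sub]) ** 2).sum(axis=1)
                    for m in range(M)]
            return sum(luts[m][codes[:, m]] for m in range(M))

        print(np.argsort(approx_distances(rng.random(D)))[:5])  # 5 approximate NNs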

    Content-based visualisation to aid common navigation of musical audio

    Get PDF

    Self-organizing Map Initialization

    No full text
    The solution obtained by a Self-Organizing Map (SOM) depends strongly on the initial cluster centers. However, no existing SOM initialization method guarantees a better final solution. These methods can generally be grouped into two classes: random initialization and initialization based on data analysis. This work proposes an improvement of the linear projection initialization method, which belongs to the second class. Instead of using a regular rectangular grid, our method combines the linear projection technique with an irregular rectangular grid; in this way, the distribution of the results produced by the linear projection technique is taken into account. The experiments confirm that the proposed method gives better solutions than its original version.
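
    For context, the regular-grid linear projection initialization that the proposed method improves on can be sketched as below: the codebook is laid out along the first two principal components of the data, so training starts from an organized map. This NumPy sketch shows only that baseline, under illustrative assumptions; the paper's irregular-grid variant is not reproduced.

        # Hedged sketch of linear-projection (PCA-based) SOM initialization
        # on a regular rectangular grid.
        import numpy as np

        def pca_grid_init(data: np.ndarray, rows: int, cols: int) -> np.ndarray:
            mean = data.mean(axis=0)
            # First two principal directions, scaled by their singular values.
            _, s, vt = np.linalg.svd(data - mean, full_matrices=False)
            span = s[:2, None] * vt[:2] / np.sqrt(len(data))
            r = np.linspace(-1, 1, rows)[:, None, None]
            c = np.linspace(-1, 1, cols)[None, :, None]
            return mean + r * span[0] + c * span[1]   # (rows, cols, n_features)

        codebook = pca_grid_init(np.random.rand(300, 10), rows=8, cols=12)
        print(codebook.shape)                          # (8, 12, 10)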

    Advancing Neuro-Fuzzy Algorithm for Automated Classification in Large-scale Forensic and Cybercrime Investigations: Adaptive Machine Learning for Big Data Forensic

    No full text
    Abstract Cybercrime investigators are challenged by the huge amount and complexity of digital data seized in criminal cases. Human experts are present in the court of law and make decisions with respect to the digital data and evidence found. It is therefore necessary to combine automated analysis with a human-understandable representation of digital data and evidence. Machine Learning methods such as Artificial Neural Networks, Support Vector Machines and Bayesian Networks have been applied successfully in Digital Investigation & Forensics. The challenge, however, is that these methods neither provide precise, human-explainable models nor work without prior knowledge. Our research is inspired by the emerging area of Computational Forensics. We focus on the Neuro-Fuzzy rule-extraction classification method, a promising Hybrid Intelligence model. The contribution goes towards improving the performance of Neuro-Fuzzy in extracting accurate fuzzy rules that are human-explainable. Such rules can be presented and explained in a court of law, which is better than a set of numerical parameters obtained from more abstract Machine Learning models. In our initial research on the Neuro-Fuzzy method, we found that its application in Digital Forensics was promising, but suffered from a number of drawbacks. These include (i) poor performance in learning from real-world data in comparison to other state-of-the-art Machine Learning methods, (ii) a number of output fuzzy rules so large that no human expert can understand them, (iii) strong model overfitting caused by the huge number of fuzzy rules, and (iv) an intrinsic learning procedure that neglects part of the data and therefore becomes inaccurate. Because of these drawbacks, the Neuro-Fuzzy method's latent potential has not yet been widely exploited in this area. The contribution of this work is (1) theoretical, in the improvement of the Neuro-Fuzzy method, and (2) empirical, in the experimental design using large-scale datasets from the Digital Forensics domain. The entire study was conducted during 2013-2017 at the NTNU Digital Forensics Group.
    Add. 1. The Neuro-Fuzzy method was revised, contributing first to the Machine Learning domain and subsequently to the large-scale Digital Forensics application. In particular, (i) we proposed exploratory data analysis to improve Self-Organizing Map initialization and the generalization of the Neuro-Fuzzy method targeting large-scale datasets; (ii) we improved the compactness and generalization of fuzzy patches through a chi-square goodness-of-fit test, resulting in increased accuracy and robustness of the method; (iii) we constructed a new membership function, based on a Gaussian multinomial distribution, that treats the fuzzy-patch representation as a statistically estimated hyperellipsoid; (iv) we reformulated the application of Neuro-Fuzzy to solve multi-class problems rather than conventional two-class problems; (v) finally, we designed a new approach to modelling non-linear data using Deep Learning together with the Neuro-Fuzzy method, resulting in a Deep Neuro-Fuzzy architecture.
    Add. 2. The experimental study includes an extended evaluation of the proposed improvements with respect to the challenges and requirements of a variety of real-world applications, including: (i) state-of-the-art datasets such as an Android malware dataset, the KDD CUP 1999 network intrusion detection dataset and the PKDD 2007 web application firewall dataset.
    Moreover, community-accepted datasets from the UCI collection were also used, including large-scale datasets such as SUSY and HIGGS. (ii) A novel large-scale collection of Windows Portable Executable 32-bit malware files was also composed as part of this PhD work. It consists of 328,000 labelled malware samples representing 10,362 families and 35 categories; these were further tested as non-trivial multi-class problems that had been neither sufficiently studied in the literature nor explored previously.
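
    The hyperellipsoid intuition behind contribution (iii) can be illustrated with the hedged NumPy sketch below: a fuzzy patch is summarized by a mean and covariance, and a sample's membership decays with its Mahalanobis distance to the patch. This shows the geometric idea only, under illustrative assumptions, not the thesis's exact Gaussian-multinomial formulation.

        # Hedged sketch: fuzzy-patch membership as a statistically estimated
        # hyperellipsoid (Gaussian decay of squared Mahalanobis distance).
        import numpy as np

        def patch_membership(x: np.ndarray, mean: np.ndarray,
                             cov: np.ndarray) -> float:
            d2 = (x - mean) @ np.linalg.inv(cov) @ (x - mean)
            return float(np.exp(-0.5 * d2))              # membership in (0, 1]

        patch = np.random.rand(100, 3)       # samples covered by one fuzzy rule
        mu, sigma = patch.mean(axis=0), np.cov(patch, rowvar=False)
        print(patch_membership(np.array([0.5, 0.5, 0.5]), mu, sigma))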