Parallelizing support vector machines for scalable image annotation
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. Machine learning techniques have facilitated image retrieval by automatically classifying and annotating images with keywords. Among them, Support Vector Machines (SVMs) are used extensively because of their generalization properties. However, SVM training is computationally intensive, especially when the training dataset is large.
In this thesis, distributed computing paradigms are investigated to speed up SVM training by partitioning a large training dataset into small data chunks and processing each chunk in parallel on a cluster of computers. A resource-aware parallel SVM algorithm is introduced for large-scale image annotation. A genetic-algorithm-based load balancing scheme is designed to optimize the performance of the algorithm in heterogeneous computing environments.
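As a rough illustration of the chunk-and-train idea described above, the following Python sketch splits a dataset into chunks, trains one linear SVM per chunk concurrently, and averages the resulting weight vectors. It is not the thesis's algorithm: the Pegasos-style solver, the round-robin split, the weight averaging, the thread pool standing in for a cluster, the synthetic data, and all function names are assumptions chosen for brevity. Resource awareness and genetic-algorithm load balancing are not modelled.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def train_linear_svm(data, lam=0.01, epochs=20, seed=0):
    """Pegasos-style SGD for a linear SVM without bias term.

    data is a list of (x, y) pairs with y in {-1, +1}.
    """
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            x, y = data[i]
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            w = [(1.0 - eta * lam) * wi for wi in w]  # regularization shrink
            if margin < 1.0:  # hinge loss is active for this sample
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w


def parallel_svm(data, n_chunks=4):
    """Train one SVM per chunk concurrently and average the weights (assumed scheme)."""
    chunks = [data[i::n_chunks] for i in range(n_chunks)]  # round-robin split
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        models = list(pool.map(train_linear_svm, chunks))
    dim = len(models[0])
    return [sum(m[j] for m in models) / len(models) for j in range(dim)]


def make_blobs(n, seed=0):
    """Hypothetical toy data: two Gaussian blobs centred at (2, 2) and (-2, -2)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.choice([-1, 1])
        data.append(([rng.gauss(2.0 * y, 0.5), rng.gauss(2.0 * y, 0.5)], y))
    return data


def accuracy(w, data):
    """Fraction of samples on the correct side of the hyperplane w."""
    hits = sum(1 for x, y in data if y * sum(wi * xi for wi, xi in zip(w, x)) > 0)
    return hits / len(data)
```

Calling `parallel_svm(make_blobs(400, seed=1))` returns a two-element weight vector. On a real cluster the thread pool would be replaced by a distributed executor, and the chunk sizes would depend on each node's capacity.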
SVM was initially designed for binary classification. However, most classification problems in domains such as image annotation involve more than two classes. A resource-aware parallel multiclass SVM algorithm is therefore introduced for large-scale image annotation on a cluster of computers.
Combining classifiers leads to a substantial reduction of classification error in a wide range of applications. Among such combinations, SVM ensembles with bagging have been shown to outperform a single SVM in terms of classification accuracy. However, training an SVM ensemble is computationally intensive, especially when the number of replicated samples generated by bootstrapping is large. A distributed SVM ensemble algorithm for image annotation is introduced which re-samples the training data by bootstrapping and trains an SVM on each sample in parallel on a cluster of computers.
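The bagging scheme described above can be sketched in a few lines of Python: draw bootstrap replicates of the training set, train one SVM per replicate concurrently, and combine the predictions by majority vote. This is a minimal sketch under stated assumptions, not the thesis's implementation: the Pegasos-style linear solver, the majority vote, the thread pool standing in for a cluster, the synthetic data, and every function name are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_linear_svm(data, lam=0.01, epochs=10, seed=0):
    """Pegasos-style SGD for a linear SVM without bias; labels are -1 or +1."""
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    t = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            x, y = data[i]
            t += 1
            eta = 1.0 / (lam * t)
            margin = y * dot(w, x)
            w = [(1.0 - eta * lam) * wi for wi in w]
            if margin < 1.0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w


def bootstrap(data, rng):
    """Sample len(data) items with replacement."""
    return [rng.choice(data) for _ in data]


def train_ensemble(data, n_models=5, seed=0):
    """Train one SVM per bootstrap replicate, concurrently."""
    rng = random.Random(seed)
    samples = [bootstrap(data, rng) for _ in range(n_models)]
    with ThreadPoolExecutor(max_workers=n_models) as pool:
        return list(pool.map(train_linear_svm, samples))


def predict(models, x):
    """Majority vote over the ensemble; use an odd n_models to avoid ties."""
    votes = sum(1 if dot(w, x) >= 0 else -1 for w in models)
    return 1 if votes >= 0 else -1


def make_blobs(n, seed=0):
    """Hypothetical toy data: two Gaussian blobs centred at (2, 2) and (-2, -2)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.choice([-1, 1])
        data.append(([rng.gauss(2.0 * y, 0.5), rng.gauss(2.0 * y, 0.5)], y))
    return data
```

The cost grows linearly with the number of replicates, which is why the replicates are a natural unit of parallel work: each one can be trained independently of the others.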
The above algorithms are evaluated in both experimental and simulation environments, showing that the distributed SVM algorithm, the distributed multiclass SVM algorithm, and the distributed SVM ensemble algorithm reduce the training time significantly while maintaining a high level of classification accuracy.
Online Analysis of Dynamic Streaming Data
The thesis "Online Analysis of Dynamic Streaming Data" addresses distance measurement for dynamic, semi-structured data in continuous data streams, so that these data structures can be analyzed at runtime. To this end, it introduces a formalization of distance computation for static and dynamic trees and extends it with an explicit treatment of the dynamics of the attributes of individual tree nodes. The real-time analysis based on this distance measurement is complemented by density-based clustering to demonstrate applications to clustering, classification, and anomaly detection.
The results of this thesis are based on a theoretical analysis of the introduced formalization of distance measurements for dynamic trees. These analyses are supported by empirical measurements on monitoring data of batch jobs from the batch system of the GridKa data and computing center. The evaluation of the proposed formalization and of the real-time analysis methods built on it shows the efficiency and scalability of the approach. It also shows that considering attributes and attribute statistics is particularly important for the quality of analysis results on dynamic, semi-structured data. In addition, the evaluation shows that the quality of the results can be further improved by an independent combination of several distances. In particular, the results of this thesis enable the analysis of data that changes over time.
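To make the combination of a tree distance with a density-based analysis more concrete, here is a small Python sketch. The abstract does not give the thesis's formalization, so the distance used here (the symmetric multiset difference of root-to-node label paths) is only a stand-in, and the density test is a simplified DBSCAN-style check for non-core points rather than a full clustering. The tree encoding as `(label, [children])` tuples and all names are assumptions; node attributes and streaming updates are not modelled.

```python
from collections import Counter


def path_profile(tree, prefix=()):
    """Multiset of root-to-node label paths; tree is (label, [children])."""
    label, children = tree
    path = prefix + (label,)
    profile = Counter([path])
    for child in children:
        profile += path_profile(child, path)
    return profile


def tree_distance(a, b):
    """Size of the symmetric multiset difference of the two path profiles."""
    pa, pb = path_profile(a), path_profile(b)
    # Counter subtraction keeps only positive counts, so each side
    # contributes the paths it has in excess of the other tree.
    return sum(((pa - pb) + (pb - pa)).values())


def flag_outliers(trees, eps, min_pts):
    """Flag trees with fewer than min_pts trees (itself included) within eps."""
    flags = []
    for t in trees:
        neighbours = sum(1 for u in trees if tree_distance(t, u) <= eps)
        flags.append(neighbours < min_pts)
    return flags


# Hypothetical process trees of three batch jobs.
job_a = ("job", [("proc", []), ("proc", [])])
job_b = ("job", [("proc", []), ("proc", []), ("proc", [])])
job_c = ("job", [("proc", [("thread", [])]), ("io", []), ("net", []), ("db", [])])
```

With these toy trees, `tree_distance(job_a, job_b)` is 1 and `tree_distance(job_a, job_c)` is 5, so `flag_outliers([job_a, job_b, job_c], eps=2, min_pts=2)` marks only `job_c` as an outlier.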
Symmetry-Adapted Machine Learning for Information Security
Symmetry-adapted machine learning has shown an encouraging ability to mitigate security risks in information and communication technology (ICT) systems. It is a subset of artificial intelligence (AI) that relies on the principle of processing future events by learning from past events or historical data. The autonomous nature of symmetry-adapted machine learning supports effective data processing and analysis for security detection in ICT systems without human intervention. Many industries are developing machine-learning-adapted solutions to support security for smart hardware, distributed computing, and the cloud. In our Special Issue book, we focus on the deployment of symmetry-adapted machine learning for information security in various application areas. This security approach can support effective methods for handling the dynamic nature of security attacks by extracting and analyzing data to identify hidden patterns. The main topics of this Issue include malware classification, intrusion detection systems, image watermarking, color image watermarking, a battlefield target aggregation behavior recognition model, IP cameras, Internet of Things (IoT) security, service function chains, indoor positioning systems, and crypto-analysis.
Fast Machine Learning Algorithms for Massive Datasets with Applications in the Biomedical Domain
The continuous increase in the size of datasets introduces computational challenges for machine learning algorithms. In this dissertation, we cover machine learning algorithms and applications in large-scale data analysis in manufacturing and healthcare. We begin by introducing a multilevel framework to scale the support vector machine (SVM), a popular supervised learning algorithm with few tunable hyperparameters and highly accurate prediction. The computational complexity of the nonlinear SVM is prohibitive on large-scale datasets compared to the linear SVM, which is more scalable for massive datasets. The nonlinear SVM has been shown to produce significantly higher classification quality on complex and highly imbalanced datasets. However, this higher classification quality requires a computationally expensive quadratic programming solver and extra kernel parameters for model selection. We introduce a generalized fast multilevel framework for regular, weighted, and instance-weighted SVM that achieves similar or better classification quality compared to state-of-the-art SVM libraries such as LIBSVM. Our framework improves the runtime by more than two orders of magnitude on some well-known benchmark datasets. We cover multiple versions of our proposed framework and its implementation in detail. The framework is implemented using the PETSc library, which allows easy integration with scientific computing tasks. Next, we propose an adaptive multilevel learning framework for SVM to reduce the variance between prediction qualities across the levels, improve the overall prediction accuracy, and reduce the runtime. We implement multi-threaded support for parameter fitting, which results in a speed-up of more than an order of magnitude. We design an early stopping criterion to avoid extra computational cost once the expected prediction quality is achieved. This approach provides a significant speed-up, especially for massive datasets.
Finally, we propose an efficient low-dimensional feature extraction over massive knowledge networks. Knowledge networks are becoming more popular in the biomedical domain for knowledge representation. Each layer in a knowledge network can store information from one or multiple sources of data. The relationships between concepts or between layers represent valuable information. The proposed feature engineering approach provides an efficient and highly accurate prediction of the relationship between biomedical concepts on massive datasets. Our proposed approach uses semantics and probabilities to reduce the potential search space that machine learning algorithms must explore. The calculation of probabilities is highly scalable with the size of the knowledge network. The number of features is fixed and equal to the number of relationships or classes in the data. A comprehensive comparison of well-known classifiers such as random forest, SVM, and deep learning over various features extracted from the same dataset provides an overview of performance and computational trade-offs. Our source code, documentation and parameters will be available at https://github.com/esadr/
Machine Learning and Integrative Analysis of Biomedical Big Data
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.