4,925 research outputs found
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values
This work is motivated by the needs of predictive analytics on healthcare
data as represented by Electronic Medical Records. Such data is invariably
problematic: noisy, with missing entries, with imbalance in classes of
interests, leading to serious bias in predictive modeling. Since standard data
mining methods often produce poor performance measures, we argue for
development of specialized techniques of data-preprocessing and classification.
In this paper, we propose a new method to simultaneously classify large
datasets and reduce the effects of missing values. It is based on a multilevel
framework of the cost-sensitive SVM and the expected maximization imputation
method for missing values, which relies on iterated regression analyses. We
compare classification results of multilevel SVM-based algorithms on public
benchmark datasets with imbalanced classes and missing values as well as real
data in health applications, and show that our multilevel SVM-based method
produces fast, and more accurate and robust classification results.Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625
Pedestrian Attribute Recognition: A Survey
Recognizing pedestrian attributes is an important task in computer vision
community due to it plays an important role in video surveillance. Many
algorithms has been proposed to handle this task. The goal of this paper is to
review existing works using traditional methods or based on deep learning
networks. Firstly, we introduce the background of pedestrian attributes
recognition (PAR, for short), including the fundamental concepts of pedestrian
attributes and corresponding challenges. Secondly, we introduce existing
benchmarks, including popular datasets and evaluation criterion. Thirdly, we
analyse the concept of multi-task learning and multi-label learning, and also
explain the relations between these two learning algorithms and pedestrian
attribute recognition. We also review some popular network architectures which
have widely applied in the deep learning community. Fourthly, we analyse
popular solutions for this task, such as attributes group, part-based,
\emph{etc}. Fifthly, we shown some applications which takes pedestrian
attributes into consideration and achieve better performance. Finally, we
summarized this paper and give several possible research directions for
pedestrian attributes recognition. The project page of this paper can be found
from the following website:
\url{https://sites.google.com/view/ahu-pedestrianattributes/}.Comment: Check our project page for High Resolution version of this survey:
https://sites.google.com/view/ahu-pedestrianattributes
Data mining for detecting Bitcoin Ponzi schemes
Soon after its introduction in 2009, Bitcoin has been adopted by
cyber-criminals, which rely on its pseudonymity to implement virtually
untraceable scams. One of the typical scams that operate on Bitcoin are the
so-called Ponzi schemes. These are fraudulent investments which repay users
with the funds invested by new users that join the scheme, and implode when it
is no longer possible to find new investments. Despite being illegal in many
countries, Ponzi schemes are now proliferating on Bitcoin, and they keep
alluring new victims, who are plundered of millions of dollars. We apply data
mining techniques to detect Bitcoin addresses related to Ponzi schemes. Our
starting point is a dataset of features of real-world Ponzi schemes, that we
construct by analysing, on the Bitcoin blockchain, the transactions used to
perform the scams. We use this dataset to experiment with various machine
learning algorithms, and we assess their effectiveness through standard
validation protocols and performance metrics. The best of the classifiers we
have experimented can identify most of the Ponzi schemes in the dataset, with a
low number of false positives
A Novel Approach For Identifying Cloud Clusters Developing Into Tropical Cyclones
Providing advance notice of rare events, such as a cloud cluster (CC) developing into a tropical cyclone (TC), is of great importance. Having advance warning of such rare events possibly can help avoid or reduce the risk of damages and allow emergency responders and the affected community enough time to respond appropriately. Considering this, forecasters need better data mining and data driven techniques to identify developing CCs. Prior studies have attempted to predict the formation of TCs using numerical weather prediction models as well as satellite and radar data. However, refined observational data and forecasting techniques are not always available or accurate in areas such as the North Atlantic Ocean where data are sparse. Consequently, this research provides the predictive features that contribute to a CC developing into a TC using only global gridded satellite data that are readily available. This was accomplished by identifying and tracking CCs objectively where no expert knowledge is required to investigate the predictive features of developing CCs. We have applied the proposed oversampling technique named the Selective Clustering based Oversampling Technique (SCOT) to reduce the bias of the non-developing CCs when using standard classifiers. Our approach identifies twelve predictive features for developing CCs and demonstrates predictive skill for 0 - 48 hours prior to development. The results confirm that the proposed technique can satisfactorily identify developing CCs for each of the nine forecasts using standard classifiers such as Classification and Regression Trees (CART), neural networks, and support vector machines (SVM) and ten-fold cross validation. These results are based on the geometric mean values and are further verified using seven case studies such as Hurricane Katrina (2005). These results demonstrate that our proposed approach could potentially improve weather prediction and provide advance notice of a developing CC by using solely gridded satellite data
A Few-Shot Learning-Based Siamese Capsule Network for Intrusion Detection with Imbalanced Training Data
Network intrusion detection remains one of the major challenges in cybersecurity. In recent years, many machine-learning-based methods have been designed to capture the dynamic and complex intrusion patterns to improve the performance of intrusion detection systems. However, two issues, including imbalanced training data and new unknown attacks, still hinder the development of a reliable network intrusion detection system. In this paper, we propose a novel few-shot learning-based Siamese capsule network to tackle the scarcity of abnormal network traffic training data and enhance the detection of unknown attacks. In specific, the well-designed deep learning network excels at capturing dynamic relationships across traffic features. In addition, an unsupervised subtype sampling scheme is seamlessly integrated with the Siamese network to improve the detection of network intrusion attacks under the circumstance of imbalanced training data. Experimental results have demonstrated that the metric learning framework is more suitable to extract subtle and distinctive features to identify both known and unknown attacks after the sampling scheme compared to other supervised learning methods. Compared to the state-of-the-art methods, our proposed method achieves superior performance to effectively detect both types of attacks
- …