1,606 research outputs found
Automatic Unsupervised Tensor Mining with Quality Assessment
A popular tool for unsupervised modelling and mining multi-aspect data is
tensor decomposition. In an exploratory setting, where no labels or ground
truth are available, how can we automatically decide how many components to
extract? How can we assess the quality of our results, so that a domain expert
can factor this quality measure into the interpretation of our results? In this
paper, we introduce AutoTen, a novel automatic unsupervised tensor mining
algorithm with minimal user intervention, which leverages and improves upon
heuristics that assess the result quality. We extensively evaluate AutoTen's
performance on synthetic data, outperforming existing baselines on this very
hard problem. Finally, we apply AutoTen to a variety of real datasets,
providing insights and discoveries. We view this work as a step towards a fully
automated, unsupervised tensor mining tool that can be easily adopted by
practitioners in academia and industry.
Improving Decision Analytics with Deep Learning: The Case of Financial Disclosures
Decision analytics commonly focuses on the text mining of financial news
sources in order to provide managerial decision support and to predict stock
market movements. Existing predictive frameworks almost exclusively apply
traditional machine learning methods, whereas recent research indicates that
traditional machine learning methods are not sufficiently capable of extracting
suitable features and capturing the non-linear nature of complex tasks. As a
remedy, novel deep learning models aim to overcome this issue by extending
traditional neural network models with additional hidden layers. Indeed, deep
learning has been shown to outperform traditional methods in terms of
predictive performance. In this paper, we adapt the novel deep learning
technique to financial decision support. In this instance, we aim to predict
the direction of stock movements following financial disclosures. As a result,
we show how deep learning can outperform the accuracy of random forests as a
benchmark for machine learning by 5.66%.
Recent Research Advances on Interactive Machine Learning
Interactive Machine Learning (IML) is an iterative learning process that
tightly couples a human with a machine learner and is widely used by
researchers and practitioners to effectively solve a wide variety of real-world
application problems. Although recent years have witnessed the proliferation of
IML in the field of visual analytics, most recent surveys either focus on a
specific area of IML or aim to summarize a visualization field that is too
generic for IML. In this paper, we systematically review the recent literature
on IML and classify it into a task-oriented taxonomy that we construct. We conclude
the survey with a discussion of open challenges and research opportunities that
we believe are inspiring for future work in IML.
A literature survey of matrix methods for data science
Efficient numerical linear algebra is a core ingredient in many applications
across almost all scientific and industrial disciplines. With this survey we
want to illustrate that numerical linear algebra has played and is playing a
crucial role in enabling and improving data science computations with many new
developments being fueled by the availability of data and computing resources.
We highlight the role of various factorizations and the power of
changing the representation of the data, and we discuss topics such as
randomized algorithms, functions of matrices, and high-dimensional problems. We
briefly touch upon the role of techniques from numerical linear algebra used
within deep learning.
Walking the Tightrope: An Investigation of the Convolutional Autoencoder Bottleneck
In this paper, we present an in-depth investigation of the convolutional
autoencoder (CAE) bottleneck. Autoencoders (AE), and especially their
convolutional variants, play a vital role in the current deep learning toolbox.
Researchers and practitioners employ CAEs for a variety of tasks, ranging from
outlier detection and compression to transfer and representation learning.
Despite their widespread adoption, we have limited insight into how the
bottleneck shape impacts the emergent properties of the CAE. We demonstrate
that increasing the height and width of the bottleneck drastically improves
generalization, which in turn leads to better performance of the latent codes
in downstream transfer learning tasks. The number of channels in the
bottleneck, on the other hand, is secondary in importance. Furthermore, we show
empirically that, contrary to popular belief, CAEs do not learn to copy their
input, even when the bottleneck has the same number of neurons as there are
pixels in the input. Copying does not occur, despite training the CAE for 1,000
epochs on a tiny (600 images) dataset. We believe that the findings
in this paper are directly applicable and will lead to improvements in models
that rely on CAEs.
Comment: code available at https://github.com/IljaManakov/WalkingTheTightrop
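The distinction the abstract draws between bottleneck shape (height and width versus channels) and bottleneck size (total neurons) can be illustrated with a little arithmetic. The sketch below is only an illustration: the 32x32 single-channel input and the three candidate shapes are hypothetical values, not configurations taken from the paper. It shows that bottlenecks with very different shapes can hold the same number of neurons as there are pixels in the input.

```python
from typing import NamedTuple


class Shape(NamedTuple):
    """Height, width, and channel count of an image or a bottleneck tensor."""
    height: int
    width: int
    channels: int

    def size(self) -> int:
        # Total number of values (neurons or pixels) in the tensor.
        return self.height * self.width * self.channels


def bottleneck_ratio(bottleneck: Shape, image: Shape) -> float:
    """Bottleneck neurons divided by input values (1.0 means no compression)."""
    return bottleneck.size() / image.size()


# Hypothetical example values, chosen for illustration only.
image = Shape(32, 32, 1)
candidates = [Shape(4, 4, 64), Shape(8, 8, 16), Shape(16, 16, 4)]

for candidate in candidates:
    print(candidate, candidate.size(), bottleneck_ratio(candidate, image))
```

All three candidates contain 1,024 neurons, matching the 1,024 input pixels, yet they trade spatial resolution against channel count in different ways; this is the kind of comparison in which shape and size can be varied independently.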
Online Machine Learning in Big Data Streams
The area of online machine learning in big data streams covers algorithms
that (1) are distributed and (2) work from data streams with only a limited
possibility to store past data. The first requirement mostly concerns software
architectures and efficient algorithms. The second one also imposes nontrivial
theoretical restrictions on the modeling methods: In the data stream model,
older data is no longer available to revise earlier suboptimal modeling
decisions as the fresh data arrives.
In this article, we provide an overview of distributed software architectures
and libraries as well as machine learning models for online learning. We
highlight the most important ideas for classification, regression,
recommendation, and unsupervised modeling from streaming data, and we show how
they are implemented in various distributed data stream processing systems.
This article is reference material, not a survey. We do not attempt to
be comprehensive in describing all existing methods and solutions; rather, we
give pointers to the most important resources in the field. All related
sub-fields (online algorithms, online learning, and distributed data processing)
are highly active in current research and development, with conceptually new
research results and software components emerging at the time of writing. In
this article, we refer to several survey results, both for distributed data
processing and for online machine learning. Compared to past surveys, our
article is different because we discuss recommender systems in extended detail.
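The data stream restriction described above (each example is seen once and cannot be revisited) can be sketched with a simple online learner. The following is a minimal, single-machine illustration and not a method from the article: the class `OnlineLogisticRegression`, the generator `synthetic_stream`, and the test-then-train loop `prequential_accuracy` are all hypothetical names introduced here, and the distributed aspect is deliberately left out.

```python
import math
import random
from typing import Iterable, Iterator, List, Tuple

Example = Tuple[List[float], int]


class OnlineLogisticRegression:
    """Logistic regression updated by one SGD step per incoming example."""

    def __init__(self, n_features: int, learning_rate: float = 0.1) -> None:
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x: List[float]) -> float:
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def learn_one(self, x: List[float], y: int) -> None:
        # Gradient of the log loss for a single example; x is not stored.
        error = self.predict_proba(x) - y
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


def synthetic_stream(n: int, seed: int = 0) -> Iterator[Example]:
    """Hypothetical stream: label is 1 when the two features sum to > 0."""
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        yield x, int(x[0] + x[1] > 0)


def prequential_accuracy(model: OnlineLogisticRegression,
                         stream: Iterable[Example]) -> float:
    """Test-then-train: predict on each example first, then learn from it."""
    correct = total = 0
    for x, y in stream:
        correct += int((model.predict_proba(x) >= 0.5) == bool(y))
        model.learn_one(x, y)  # after this call the example is discarded
        total += 1
    return correct / total if total else 0.0


if __name__ == "__main__":
    model = OnlineLogisticRegression(n_features=2)
    print(prequential_accuracy(model, synthetic_stream(2000)))
```

Because the model keeps only its weights and bias, memory use stays constant regardless of stream length, and an early poor update can only be corrected by later examples, never by revisiting old ones.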
Transfer Metric Learning: Algorithms, Applications and Outlooks
Distance metric learning (DML) aims to find an appropriate way to reveal the
underlying data relationship. It is critical in many machine learning, pattern
recognition and data mining algorithms, and usually requires a large amount of
label information (such as class labels or pair/triplet constraints) to achieve
satisfactory performance. However, the label information may be insufficient in
real-world applications due to the high labeling cost, and DML may fail in this
case. Transfer metric learning (TML) is able to mitigate this issue for DML in
the domain of interest (target domain) by leveraging knowledge/information from
other related domains (source domains). Although it has achieved a certain level of
development, TML has had limited success in various aspects such as selective
transfer, theoretical understanding, handling complex data, big data and
extreme cases. In this survey, we present a systematic review of the TML
literature. In particular, we group TML into different categories according to
different settings and metric transfer strategies, such as direct metric
approximation, subspace approximation, distance approximation, and distribution
approximation. We summarize and discuss the various TML
approaches and their applications. Finally, we indicate some
challenges and provide possible future directions.
Comment: 14 pages, 5 figure
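As a rough illustration of what a learned distance metric looks like, the sketch below uses one common formulation, assumed here rather than taken from the abstract: a linear map L defines the distance d(x, y) = ||L(x - y)||, which corresponds to a Mahalanobis-type metric with M = L^T L. The function names and the example matrices are hypothetical; learning L from labels, and transferring it between domains, is not shown.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def project(L: Matrix, x: Vector) -> List[float]:
    """Apply the linear map L to the vector x."""
    return [sum(l_ij * x_j for l_ij, x_j in zip(row, x)) for row in L]


def metric_distance(L: Matrix, x: Vector, y: Vector) -> float:
    """Distance ||L(x - y)||_2, a Mahalanobis-type metric with M = L^T L."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    return math.sqrt(sum(v * v for v in project(L, diff)))


# Hypothetical example: the identity map recovers the Euclidean distance,
# while a map that shrinks the second feature down-weights it.
identity = [[1.0, 0.0], [0.0, 1.0]]
shrink_second = [[1.0, 0.0], [0.0, 0.5]]

x, y = [0.0, 0.0], [3.0, 4.0]
print(metric_distance(identity, x, y))
print(metric_distance(shrink_second, x, y))
```

In this view, a metric learning method chooses L (or M) so that similar points end up close and dissimilar points far apart; the second matrix shows how a different L changes which feature differences matter.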
Event Prediction in the Big Data Era: A Systematic Survey
Events are occurrences at specific locations and times, with specific semantics, that
nontrivially impact either society or nature, such as civil unrest,
system failures, and epidemics. It is highly desirable to be able to anticipate
the occurrence of such events in advance in order to reduce the potential
social upheaval and damage caused. Event prediction, which has traditionally
been prohibitively challenging, is now becoming a viable option in the big data
era and is thus experiencing rapid growth. There is a large amount of existing
work that focuses on addressing the challenges involved, including
heterogeneous multi-faceted outputs, complex dependencies, and streaming data
feeds. Most existing event prediction methods were initially designed to deal
with specific application domains, though the techniques and evaluation
procedures utilized are usually generalizable across different domains.
However, it is imperative yet difficult to cross-reference the techniques
across different domains, given the absence of a comprehensive literature
survey for event prediction. This paper aims to provide a systematic and
comprehensive survey of the technologies, applications, and evaluations of
event prediction in the big data era. First, a systematic categorization and
summary of existing techniques is presented, which facilitates domain experts'
searches for suitable techniques and helps model developers consolidate their
research at the frontiers. Then, a comprehensive categorization and summary of
major application domains is provided. Evaluation metrics and procedures are
summarized and standardized to unify the understanding of model performance
among stakeholders, model developers, and domain experts in various application
domains. Finally, open problems and future directions for this promising and
important domain are elucidated and discussed.
Machine Learning and Visualization in Clinical Decision Support: Current State and Future Directions
Deep learning, an area of machine learning, is set to revolutionize patient
care. But it is not yet part of the standard of care, especially when it comes to
individual patient care. In fact, it is unclear to what extent data-driven
techniques are being used for clinical decision support (CDS).
Heretofore, there has not been a review of ways in which research in machine
learning and other types of data-driven techniques can contribute effectively
to clinical care and the types of support they can bring to clinicians. In this
paper, we consider ways in which two data-driven domains, machine learning and
data visualization, can contribute to the next generation of clinical
decision support systems. We review the literature regarding the ways heuristic
knowledge, machine learning, and visualization are, and can be, applied to
three types of CDS. There has been substantial research into the use of
predictive modeling for alerts; however, current CDS systems are not utilizing
these methods. Approaches that leverage interactive visualizations and
machine-learning inferences to organize and review patient data are gaining
popularity but are still at the prototype stage and are not yet in use. CDS
systems that could benefit from prescriptive machine learning (e.g., treatment
recommendations for specific patients) have not yet been developed. We discuss
potential reasons for the lack of deployment of data-driven methods in CDS and
directions for future research.
Targeted Sentiment Analysis: A Data-Driven Categorization
Targeted sentiment analysis (TSA), also known as aspect-based sentiment
analysis (ABSA), aims at detecting fine-grained sentiment polarity towards
targets in a given opinion document. Due to the lack of labeled datasets and
effective technology, TSA was intractable for many years. The newly
released datasets and the rapid development of deep learning technologies are
key enablers for the recent significant progress made in this area. However,
the TSA tasks have been defined in various ways with different understandings
of basic concepts like 'target' and 'aspect'. In this paper, we categorize
the different tasks and highlight the differences in the available datasets and
their specific tasks. We then further discuss the challenges related to data
collection and data annotation, which are overlooked in many previous studies.
Comment: Draf