1,658 research outputs found
Scalable aggregation predictive analytics: a query-driven machine learning approach
We introduce a predictive modeling solution that provides high quality predictive analytics over aggregation queries in Big Data environments. Our predictive methodology is generally applicable in environments in which large-scale data owners may or may not restrict access to their data and allow only aggregation operators like COUNT to be executed over their data. In this context, our methodology is based on historical queries and their answers to accurately predict ad-hoc queries’ answers. We focus on the widely used set-cardinality, i.e., COUNT, aggregation query, as COUNT is a fundamental operator for both internal data system optimizations and for aggregation-oriented data exploration and predictive analytics. We contribute a novel, query-driven Machine Learning (ML) model whose goals are to: (i) learn the query-answer space from past issued queries, (ii) associate the query space with local linear regression & associative function estimators, (iii) define query similarity, and (iv) predict the cardinality of the answer set of unseen incoming queries, referred to the Set Cardinality Prediction (SCP) problem. Our ML model incorporates incremental ML algorithms for ensuring high quality prediction results. The significance of contribution lies in that it (i) is the only query-driven solution applicable over general Big Data environments, which include restricted-access data, (ii) offers incremental learning adjusted for arriving ad-hoc queries, which is well suited for query-driven data exploration, and (iii) offers a performance (in terms of scalability, SCP accuracy, processing time, and memory requirements) that is superior to data-centric approaches. We provide a comprehensive performance evaluation of our model evaluating its sensitivity, scalability and efficiency for quality predictive analytics. In addition, we report on the development and incorporation of our ML model in Spark showing its superior performance compared to the Spark’s COUNT method
Visualizing the semantic content of large text databases using text maps
A methodology for generating text map representations of the semantic content of text databases is presented. Text maps provide a graphical metaphor for conceptualizing and visualizing the contents and data interrelationships of large text databases. Described are a set of experiments conducted against the TIPSTER corpora of Wall Street Journal articles. These experiments provide an introduction to current work in the representation and visualization of documents by way of their semantic content
Data Mining and Machine Learning in Astronomy
We review the current state of data mining and machine learning in astronomy.
'Data Mining' can have a somewhat mixed connotation from the point of view of a
researcher in this field. If used correctly, it can be a powerful approach,
holding the potential to fully exploit the exponentially increasing amount of
available data, promising great scientific advance. However, if misused, it can
be little more than the black-box application of complex computing algorithms
that may give little physical insight, and provide questionable results. Here,
we give an overview of the entire data mining process, from data collection
through to the interpretation of results. We cover common machine learning
algorithms, such as artificial neural networks and support vector machines,
applications from a broad range of astronomy, emphasizing those where data
mining techniques directly resulted in improved science, and important current
and future directions, including probability density functions, parallel
algorithms, petascale computing, and the time domain. We conclude that, so long
as one carefully selects an appropriate algorithm, and is guided by the
astronomical problem at hand, data mining can be very much the powerful tool,
and not the questionable black box.Comment: Published in IJMPD. 61 pages, uses ws-ijmpd.cls. Several extra
figures, some minor additions to the tex
Video browsing interfaces and applications: a review
We present a comprehensive review of the state of the art in video browsing and retrieval systems, with special emphasis on interfaces and applications. There has been a significant increase in activity (e.g., storage, retrieval, and sharing) employing video data in the past decade, both for personal and professional use. The ever-growing amount of video content available for human consumption and the inherent characteristics of video data—which, if presented in its raw format, is rather unwieldy and costly—have become driving forces for the development of more effective solutions to present video contents and allow rich user interaction. As a result, there are many contemporary research efforts toward developing better video browsing solutions, which we summarize. We review more than 40 different video browsing and retrieval interfaces and classify them into three groups: applications that use video-player-like interaction, video retrieval applications, and browsing solutions based on video surrogates. For each category, we present a summary of existing work, highlight the technical aspects of each solution, and compare them against each other
Visualization of Very Large High-Dimensional Data Sets as Minimum Spanning Trees
The chemical sciences are producing an unprecedented amount of large,
high-dimensional data sets containing chemical structures and associated
properties. However, there are currently no algorithms to visualize such data
while preserving both global and local features with a sufficient level of
detail to allow for human inspection and interpretation. Here, we propose a
solution to this problem with a new data visualization method, TMAP, capable of
representing data sets of up to millions of data points and arbitrary high
dimensionality as a two-dimensional tree (http://tmap.gdb.tools).
Visualizations based on TMAP are better suited than t-SNE or UMAP for the
exploration and interpretation of large data sets due to their tree-like
nature, increased local and global neighborhood and structure preservation, and
the transparency of the methods the algorithm is based on. We apply TMAP to the
most used chemistry data sets including databases of molecules such as ChEMBL,
FDB17, the Natural Products Atlas, DSSTox, as well as to the MoleculeNet
benchmark collection of data sets. We also show its broad applicability with
further examples from biology, particle physics, and literature.Comment: 33 pages, 14 figures, 1 table, supplementary information include
Machine Learning
Machine Learning can be defined in various ways related to a scientific domain concerned with the design and development of theoretical and implementation tools that allow building systems with some Human Like intelligent behavior. Machine learning addresses more specifically the ability to improve automatically through experience
- …