SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary
The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is
considered the "de facto" standard in the framework of learning from imbalanced data. This
is due to the simplicity of its design, as well as its robustness when applied
to different types of problems. Since its publication in 2002, SMOTE has proven
successful in a variety of applications from several different domains. SMOTE has also inspired
several approaches to counter the issue of class imbalance, and has significantly
contributed to new supervised learning paradigms, including multilabel classification, incremental
learning, semi-supervised learning, and multi-instance learning, among others. It is a
standard benchmark for learning from imbalanced data. It is also featured in a number of
different software packages, from open source to commercial. In this paper, marking the
fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current
state of affairs with SMOTE and its applications, and identify the next set of challenges
to extend SMOTE for Big Data problems.
This work has been partially supported by the Spanish Ministry of Science and Technology
under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project
887 BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016;
and the National Science Foundation (NSF) Grant IIS-1447795.
Data geo-Science Approach for Modelling Unconventional Petroleum Ecosystems and their Visual Analytics
Storage, integration and interoperability are critical
challenges in unconventional exploration data
management. In a quest to explore unconventional
hydrocarbons, in particular shale gas from fractured shales,
we investigate new petroleum data geoscience
approaches. Data geo-science describes the
integration of geoscience-domain expertise with
mathematical concepts, computing algorithms and machine learning
tools, including data and business analytics.
Further, to strengthen data-science services among
producing companies, we propose an integrated
multidimensional repository system, for which factual
instances are acquired on gas shales, to store, process and
deliver fractured-data views in new knowledge domains.
Data dimensions are categorized to examine their
suitability in the integrated prototype articulations that use
fracture-networks and attribute dimension model
descriptions. The factual instances are typically drawn from
seismic attributes, seismically interpreted geological
structures and reservoirs, well logs and production
data entities. For designing and developing
multidimensional repository systems, we create various
artefacts, describing conceptual, logical and physical
models. To explore the connectivity between seismic
and geology entities, multidimensional ontology models
are constructed using fracture network attribute dimensions
and their instances. Data warehousing and mining techniques
support the management of ontologies that
bring together the data instances of fractured shales, to unify and
explore the associativity between densely fractured
shales and their orientations.
The models depicting collaboration of geology,
geophysics, reservoir engineering and geo-mechanics
entities and their dimensions can substantially reduce the
risk and uncertainty involved in modelling and interpreting
shale- and tight-gas reservoirs, including traps associated
with Coal Bed Methane (CBM). Anisotropy, Poisson's
ratio and Young's modulus properties corroborate the
interpretation of stress images from the 3D acoustic
characterization of shale reservoirs. Statistical analysis
of data views, their correlations and patterns further
helps us visualize and interpret geoscientific
metadata meticulously. The data geo-science guided integrated
methodology can be applied in any basin, including frontier
basins.
Cashtag piggybacking: uncovering spam and bot activity in stock microblogs on Twitter
Microblogs are increasingly exploited for predicting prices and traded
volumes of stocks in financial markets. However, it has been demonstrated that
much of the content shared in microblogging platforms is created and publicized
by bots and spammers. Yet, the presence (or lack thereof) and the impact of
fake stock microblogs have never been systematically investigated before. Here,
we study 9M tweets related to stocks of the 5 main financial markets in the US.
By comparing tweets with financial data from Google Finance, we highlight
important characteristics of Twitter stock microblogs. More importantly, we
uncover a malicious practice - referred to as cashtag piggybacking -
perpetrated by coordinated groups of bots and likely aimed at promoting
low-value stocks by exploiting the popularity of high-value ones. Among the
findings of our study is that as much as 71% of the authors of suspicious
financial tweets are classified as bots by a state-of-the-art spambot detection
algorithm. Furthermore, 37% of them were suspended by Twitter a few months
after our investigation. Our results call for the adoption of spam and bot
detection techniques in all studies and applications that exploit
user-generated content for predicting the stock market.
Advances in knowledge discovery and data mining Part II
19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II
Multi-Target Prediction: A Unifying View on Problems and Methods
Multi-target prediction (MTP) is concerned with the simultaneous prediction
of multiple target variables of diverse type. Due to its enormous application
potential, it has developed into an active and rapidly expanding research field
that combines several subfields of machine learning, including multivariate
regression, multi-label classification, multi-task learning, dyadic prediction,
zero-shot learning, network inference, and matrix completion. In this paper, we
present a unifying view on MTP problems and methods. First, we formally discuss
commonalities and differences between existing MTP problems. To this end, we
introduce a general framework that covers the above subfields as special cases.
As a second contribution, we provide a structured overview of MTP methods. This
is accomplished by identifying a number of key properties, which distinguish
such methods and determine their suitability for different types of problems.
Finally, we also discuss a few challenges for future research.
- …