
    SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary

    The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data. This is due to the simplicity of its design, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data. It is also featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and identify the next set of challenges to extend SMOTE for Big Data problems. This work has been partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project 887 BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016; and the National Science Foundation (NSF) Grant IIS-1447795.

    Data geo-Science Approach for Modelling Unconventional Petroleum Ecosystems and their Visual Analytics

    Storage, integration and interoperability are critical challenges in unconventional exploration data management. In a quest to explore unconventional hydrocarbons, in particular shale gas from fractured shales, we aim to investigate new petroleum data geoscience approaches. Data geo-science describes the integration of geoscience-domain expertise with mathematical concepts, computing algorithms and machine learning tools, including data and business analytics. Further, to strengthen data-science services among producing companies, we propose an integrated multidimensional repository system, for which factual instances are acquired on gas shales, to store, process and deliver fractured-data views in new knowledge domains. Data dimensions are categorized to examine their suitability in the integrated prototype articulations that use fracture-network and attribute dimension model descriptions. The factual instances are typically drawn from seismic attributes, seismically interpreted geological structures and reservoirs, well logs, and production data entities. For designing and developing multidimensional repository systems, we create various artefacts describing conceptual, logical and physical models. To explore the connectivity between seismic and geology entities, multidimensional ontology models are constructed using fracture-network attribute dimensions and their instances. Data warehousing and mining techniques additionally support the management of ontologies, bringing together the data instances of fractured shales to unify and explore the associativity between densely fractured shales and their orientations. The models depicting the collaboration of geology, geophysics, reservoir engineering and geo-mechanics entities and their dimensions can substantially reduce the risk and uncertainty involved in modelling and interpreting shale- and tight-gas reservoirs, including traps associated with Coal Bed Methane (CBM). Anisotropy, Poisson's ratio and Young's modulus properties corroborate the interpretation of stress images from the 3D acoustic characterization of shale reservoirs. The statistical analysis of data views, their correlations and patterns further helps us to visualize and interpret geoscientific metadata meticulously. The data geo-science guided integrated methodology can be applied in any basin, including frontier basins.

    Cashtag piggybacking: uncovering spam and bot activity in stock microblogs on Twitter

    Microblogs are increasingly exploited for predicting prices and traded volumes of stocks in financial markets. However, it has been demonstrated that much of the content shared in microblogging platforms is created and publicized by bots and spammers. Yet, the presence (or lack thereof) and the impact of fake stock microblogs have never been systematically investigated before. Here, we study 9M tweets related to stocks of the 5 main financial markets in the US. By comparing tweets with financial data from Google Finance, we highlight important characteristics of Twitter stock microblogs. More importantly, we uncover a malicious practice - referred to as cashtag piggybacking - perpetrated by coordinated groups of bots and likely aimed at promoting low-value stocks by exploiting the popularity of high-value ones. Among the findings of our study is that as much as 71% of the authors of suspicious financial tweets are classified as bots by a state-of-the-art spambot detection algorithm. Furthermore, 37% of them were suspended by Twitter a few months after our investigation. Our results call for the adoption of spam and bot detection techniques in all studies and applications that exploit user-generated content for predicting the stock market.

    Advances in knowledge discovery and data mining Part II

    19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II

    Multi-Target Prediction: A Unifying View on Problems and Methods

    Multi-target prediction (MTP) is concerned with the simultaneous prediction of multiple target variables of diverse type. Due to its enormous application potential, it has developed into an active and rapidly expanding research field that combines several subfields of machine learning, including multivariate regression, multi-label classification, multi-task learning, dyadic prediction, zero-shot learning, network inference, and matrix completion. In this paper, we present a unifying view on MTP problems and methods. First, we formally discuss commonalities and differences between existing MTP problems. To this end, we introduce a general framework that covers the above subfields as special cases. As a second contribution, we provide a structured overview of MTP methods. This is accomplished by identifying a number of key properties that distinguish such methods and determine their suitability for different types of problems. Finally, we discuss a few challenges for future research.