Search CORE

473,743 research outputs found

A Taxonomy of Big Data for Optimal Predictive Machine Learning and Data Mining

Author: Fokoue Ernest
Publication venue
Publication date: 01/01/2014
Field of study

Big data comes in various ways, types, shapes, forms and sizes. Indeed, almost all areas of science, technology, medicine, public health, economics, business, linguistics and social science are bombarded by ever increasing flows of data begging to analyzed efficiently and effectively. In this paper, we propose a rough idea of a possible taxonomy of big data, along with some of the most commonly used tools for handling each particular category of bigness. The dimensionality p of the input space and the sample size n are usually the main ingredients in the characterization of data bigness. The specific statistical machine learning technique used to handle a particular big data set will depend on which category it falls in within the bigness taxonomy. Large p small n data sets for instance require a different set of tools from the large n small p variety. Among other tools, we discuss Preprocessing, Standardization, Imputation, Projection, Regularization, Penalization, Compression, Reduction, Selection, Kernelization, Hybridization, Parallelization, Aggregation, Randomization, Replication, Sequentialization. Indeed, it is important to emphasize right away that the so-called no free lunch theorem applies here, in the sense that there is no universally superior method that outperforms all other methods on all categories of bigness. It is also important to stress the fact that simplicity in the sense of Ockham's razor non plurality principle of parsimony tends to reign supreme when it comes to massive data. We conclude with a comparison of the predictive performance of some of the most commonly used methods on a few data sets.Comment: 18 pages, 2 figures 3 table

arXiv.org e-Print Archive

Bulgarian Digital Mathematics Library at IMI-BAS

Data Mining Applications in Big Data

Author: Wang Guanghui
Wang Lidong
Publication venue: 'Faculty of Computer Science, Sriwijaya University'
Publication date: 20/10/2015
Field of study

Data mining is a process of extracting hidden, unknown, but potentially useful information from massive data. Big Data has great impacts on scientific discoveries and value creation. This paper introduces methods in data mining and technologies in Big Data. Challenges of data mining and data mining with big data are discussed. Some technology progress of data mining and data mining with big data are also presented

ComEngApp-Journal

Directory of Open Access Journals

Computer Engineering and Applications Journal

Computer Engineering and Applications Journal (ComEngApp, Universitas Sriwijaya)

Recommended from our members

Big Chord Data Extraction and Mining

Author: Barthet M.
Dykes J.
Kachkaev A.
Plumbley M. D.
Weyde T.
Wolff D.
Publication venue
Publication date: 01/01/2014
Field of study

Harmonic progression is one of the cornerstones of tonal music composition and is thereby essential to many musical styles and traditions. Previous studies have shown that musical genres and composers could be discriminated based on chord progressions modeled as chord n-grams. These studies were however conducted on small-scale datasets and using symbolic music transcriptions. In this work, we apply pattern mining techniques to over 200,000 chord progression sequences out of 1,000,000 extracted from the I Like Music (ILM) commercial music audio collection. The ILM collection spans 37 musical genres and includes pieces released between 1907 and 2013. We developed a single program multiple data parallel computing approach whereby audio feature extraction tasks are split up and run simultaneously on multiple cores. An audio-based chord recognition model (Vamp plugin Chordino) was used to extract the chord progressions from the ILM set. To keep low-weight feature sets, the chord data were stored using a compact binary format. We used the CM-SPADE algorithm, which performs a vertical mining of sequential patterns using co-occurence information, and which is fast and efﬁcient enough to be applied to big data collections like the ILM set. In orderto derive key-independent frequent patterns, the transition between chords are modeled by changes of qualities (e.g. major, minor, etc.) and root keys (e.g. fourth, ﬁfth, etc.). The resulting key-independent chord progression patterns vary in length (from 2 to 16) and frequency (from 2 to 19,820) across genres. As illustrated by graphs generated to represent frequent 4-chord progressions, some patterns like circle-of-ﬁfths movements are well represented in most genres but in varying degrees. These large-scale results offer the opportunity to uncover similarities and discrepancies between sets of musical pieces and therefore to build classiﬁers for search and recommendation. They also support the empirical testing of music theory. It is however more difﬁcult to derive new hypotheses from such dataset due to its size. This can be addressed by using pattern detection algorithms or suitable visualisation which we present in a companion study

City Research Online

Surrey Research Insight

The Importance and Problems of Big Data

Author: Grudev Anton
Korotenko L.M.
Nechai N.M.
Publication venue: Видавництво НГУ
Publication date: 01/01/2017
Field of study

In the era of high-tech we can hear the term Big Data more and more often. This fact indicates that the importance of Big Data constantly increases. This term is also used with related concepts such as Business Intelligence or data mining. But what does that mean

eLibrary National Mining University

Apache Mahout’s k-Means vs. fuzzy k-Means performance evaluation

Author: Barolli Leonard
Bogza Adriana
Caballé Llobet Santiago
Xhafa Xhafa Fatos
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016
Field of study

(c) 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other users, including reprinting/ republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.The emergence of the Big Data as a disruptive technology for next generation of intelligent systems, has brought many issues of how to extract and make use of the knowledge obtained from the data within short times, limited budget and under high rates of data generation. The foremost challenge identified here is the data processing, and especially, mining and analysis for knowledge extraction. As the 'old' data mining frameworks were designed without Big Data requirements, a new generation of such frameworks is being developed fully implemented in Cloud platforms. One such frameworks is Apache Mahout aimed to leverage fast processing and analysis of Big Data. The performance of such new data mining frameworks is yet to be evaluated and potential limitations are to be revealed. In this paper we analyse the performance of Apache Mahout using large real data sets from the Twitter stream. We exemplify the analysis for the case of two clustering algorithms, namely, k-Means and Fuzzy k-Means, using a Hadoop cluster infrastructure for the experimental study.Peer ReviewedPostprint (author's final draft

Crossref

UPCommons. Portal del coneixement obert de la UPC

UPCommons (Universitat Politècnica de Catalunya)