55,040 research outputs found
Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data
Use of socially generated "big data" to access information about collective
states of the minds in human societies has become a new paradigm in the
emerging field of computational social science. A natural application of this
would be the prediction of the society's reaction to a new product in the sense
of popularity and adoption rate. However, bridging the gap between "real time
monitoring" and "early predicting" remains a big challenge. Here we report on
an endeavor to build a minimalistic predictive model for the financial success
of movies based on collective activity data of online users. We show that the
popularity of a movie can be predicted much before its release by measuring and
analyzing the activity level of editors and viewers of the corresponding entry
to the movie in Wikipedia, the well-known online encyclopedia.Comment: 13 pages, Including Supporting Information, 7 Figures, Download the
dataset from: http://wwm.phy.bme.hu/SupplementaryDataS1.zi
High Accuracy Phishing Detection Based on Convolutional Neural Networks
The persistent growth in phishing and the rising volume of phishing websites has led to individuals and organizations worldwide becoming increasingly exposed to various cyber-attacks. Consequently, more effective phishing detection is required for improved cyber defence. Hence, in this paper we present a deep learning-based approach to enable high accuracy detection of phishing sites. The proposed approach utilizes convolutional neural networks (CNN) for high accuracy classification to distinguish genuine sites from phishing sites. We evaluate the models using a dataset obtained from 6,157 genuine and 4,898 phishing websites. Based on the results of extensive experiments, our CNN based models proved to be highly effective in detecting unknown phishing sites. Furthermore, the CNN based approach performed better than traditional machine learning classifiers evaluated on the same dataset, reaching 98.2% phishing detection rate with an F1-score of 0.976. The method presented in this pa-per compares favourably to the state-of-the art in deep learning based phishing website detection
The Implications of Diverse Applications and Scalable Data Sets in Benchmarking Big Data Systems
Now we live in an era of big data, and big data applications are becoming
more and more pervasive. How to benchmark data center computer systems running
big data applications (in short big data systems) is a hot topic. In this
paper, we focus on measuring the performance impacts of diverse applications
and scalable volumes of data sets on big data systems. For four typical data
analysis applications---an important class of big data applications, we find
two major results through experiments: first, the data scale has a significant
impact on the performance of big data systems, so we must provide scalable
volumes of data sets in big data benchmarks. Second, for the four applications,
even all of them use the simple algorithms, the performance trends are
different with increasing data scales, and hence we must consider not only
variety of data sets but also variety of applications in benchmarking big data
systems.Comment: 16 pages, 3 figure
BigDataBench: a Big Data Benchmark Suite from Internet Services
As architecture, systems, and data management communities pay greater
attention to innovative big data systems and architectures, the pressure of
benchmarking and evaluating these systems rises. Considering the broad use of
big data systems, big data benchmarks must include diversity of data and
workloads. Most of the state-of-the-art big data benchmarking efforts target
evaluating specific types of applications or system software stacks, and hence
they are not qualified for serving the purposes mentioned above. This paper
presents our joint research efforts on this issue with several industrial
partners. Our big data benchmark suite BigDataBench not only covers broad
application scenarios, but also includes diverse and representative data sets.
BigDataBench is publicly available from http://prof.ict.ac.cn/BigDataBench .
Also, we comprehensively characterize 19 big data workloads included in
BigDataBench with varying data inputs. On a typical state-of-practice
processor, Intel Xeon E5645, we have the following observations: First, in
comparison with the traditional benchmarks: including PARSEC, HPCC, and
SPECCPU, big data applications have very low operation intensity; Second, the
volume of data input has non-negligible impact on micro-architecture
characteristics, which may impose challenges for simulation-based big data
architecture research; Last but not least, corroborating the observations in
CloudSuite and DCBench (which use smaller data inputs), we find that the
numbers of L1 instruction cache misses per 1000 instructions of the big data
applications are higher than in the traditional benchmarks; also, we find that
L3 caches are effective for the big data applications, corroborating the
observation in DCBench.Comment: 12 pages, 6 figures, The 20th IEEE International Symposium On High
Performance Computer Architecture (HPCA-2014), February 15-19, 2014, Orlando,
Florida, US
- …