Data mining for detecting Bitcoin Ponzi schemes
Soon after its introduction in 2009, Bitcoin was adopted by
cyber-criminals, who rely on its pseudonymity to implement virtually
untraceable scams. One of the typical scams operating on Bitcoin is the
so-called Ponzi scheme: a fraudulent investment that repays users
with the funds invested by new users who join the scheme, and implodes when it
is no longer possible to find new investments. Despite being illegal in many
countries, Ponzi schemes are now proliferating on Bitcoin, and they keep
luring new victims, who are defrauded of millions of dollars. We apply data
mining techniques to detect Bitcoin addresses related to Ponzi schemes. Our
starting point is a dataset of features of real-world Ponzi schemes, which we
construct by analysing, on the Bitcoin blockchain, the transactions used to
perform the scams. We use this dataset to experiment with various machine
learning algorithms, and we assess their effectiveness through standard
validation protocols and performance metrics. The best of the classifiers we
experimented with can identify most of the Ponzi schemes in the dataset, with a
low number of false positives.
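The performance metrics mentioned in the abstract above can be illustrated with a minimal sketch. It assumes binary labels (1 for a Ponzi-related address, 0 otherwise); the function name `binary_metrics` and the toy labels are hypothetical and are not taken from the paper, which does not specify its metrics in this abstract.

```python
def binary_metrics(y_true, y_pred):
    """Count confusion-matrix cells and derive precision and recall.

    y_true, y_pred: equal-length sequences of 0/1 labels (1 = Ponzi).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # Ponzi, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not Ponzi, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # Ponzi, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # not Ponzi, not flagged
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "true_positives": tp,
        "false_positives": fp,
        "false_negatives": fn,
        "true_negatives": tn,
        "precision": precision,
        "recall": recall,
    }


# Toy example (made-up labels): 3 Ponzi addresses, 5 others.
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])
print(metrics)
```

Because Ponzi addresses are typically rare relative to ordinary ones, recall and the false-positive count tend to be more informative here than plain accuracy.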
Bibliographic Analysis on Research Publications using Authors, Categorical Labels and the Citation Network
Bibliographic analysis considers the author's research areas, the citation
network and the paper content among other things. In this paper, we combine
these three in a topic model that produces a bibliographic model of authors,
topics and documents, using a nonparametric extension of a combination of the
Poisson mixed-topic link model and the author-topic model. This gives rise to
the Citation Network Topic Model (CNTM). We propose a novel and efficient
inference algorithm for the CNTM to explore subsets of research publications
from CiteSeerX. The publication datasets are organised into three corpora,
totalling about 168k publications with about 62k authors. The queried
datasets are made available online. In three publicly available corpora in
addition to the queried datasets, our proposed model demonstrates an improved
performance in both model fitting and document clustering, compared to several
baselines. Moreover, our model allows extraction of additional useful knowledge
from the corpora, such as the visualisation of the author-topics network.
Additionally, we propose a simple method to incorporate supervision into topic
modelling to achieve further improvement on the clustering task.
Comment: Preprint for Journal Machine Learning
Continuous Representation of Location for Geolocation and Lexical Dialectology using Mixture Density Networks
We propose a method for embedding two-dimensional locations in a continuous
vector space using a neural network-based model incorporating mixtures of
Gaussian distributions, presenting two model variants for text-based
geolocation and lexical dialectology. Evaluated over Twitter data, the proposed
model outperforms conventional regression-based geolocation and provides a
better estimate of uncertainty. We also show the effectiveness of the
representation for predicting words from location in lexical dialectology, and
evaluate it using the DARE dataset.
Comment: Conference on Empirical Methods in Natural Language Processing (EMNLP
2017), September 2017, Copenhagen, Denmark
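As a rough illustration of how a mixture of Gaussians can score a two-dimensional location, the sketch below computes the negative log-likelihood of a point under a mixture. It assumes isotropic components (one standard deviation per component) and strictly positive mixture weights; the function name and parameterisation are hypothetical and may differ from the formulation used in the paper.

```python
import math


def mixture_nll(point, weights, means, sigmas):
    """Negative log-likelihood of a 2D point under a Gaussian mixture.

    Assumes isotropic components: component k has mean means[k] = (mx, my)
    and standard deviation sigmas[k]. Weights must be positive and sum to 1.
    """
    x, y = point
    log_terms = []
    for w, (mx, my), s in zip(weights, means, sigmas):
        sq_dist = (x - mx) ** 2 + (y - my) ** 2
        # log of w * N(point | mean, s^2 I) for a 2D isotropic Gaussian
        log_terms.append(
            math.log(w) - sq_dist / (2 * s * s) - math.log(2 * math.pi * s * s)
        )
    # log-sum-exp keeps the computation stable when densities are tiny
    top = max(log_terms)
    log_density = top + math.log(sum(math.exp(t - top) for t in log_terms))
    return -log_density


# Two hypothetical components; a point near the first mean scores lower (better).
weights = [0.7, 0.3]
means = [(0.0, 0.0), (5.0, 5.0)]
sigmas = [1.0, 2.0]
print(mixture_nll((0.1, -0.2), weights, means, sigmas))
print(mixture_nll((10.0, 10.0), weights, means, sigmas))
```

In a mixture density network, the weights, means and standard deviations would be produced by the network rather than fixed by hand, and a loss of this kind would be minimised during training.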
An Overview of the Use of Neural Networks for Data Mining Tasks
In recent years, the area of data mining has experienced considerable demand for technologies that extract knowledge from large and complex data sources. There is substantial commercial interest, as well as research activity, aimed at developing new and improved approaches for extracting information, relationships, and patterns from datasets. Artificial Neural Networks (NN) are popular, biologically inspired intelligent methodologies whose classification, prediction and pattern recognition capabilities have been used successfully in many areas, including science, engineering, medicine, business, banking and telecommunication. This paper highlights, from a data mining perspective, the implementation of NN using supervised and unsupervised learning for pattern recognition, classification, prediction and cluster analysis, and focuses the discussion on their usage in bioinformatics and financial data analysis tasks.
Testing stock market convergence: a non-linear factor approach
This paper applies the Phillips and Sul (Econometrica 75(6):1771–1855, 2007) method for testing convergence in stock returns to an extensive dataset including monthly stock price indices for five EU countries (Germany, France, the Netherlands, Ireland and the UK) as well as the US between 1973 and 2008. We carry out the analysis on both sectors and individual industries within sectors. As a first step, we use the Stock and Watson (J Am Stat Assoc 93(441):349–358, 1998) procedure to filter the data in order to extract the long-run component of the series; then, following Phillips and Sul (Econometrica 75(6):1771–1855, 2007), we estimate the relative transition parameters. In the case of sectoral indices, we find convergence in the middle of the sample period, followed by divergence, and detect four (two large and two small) clusters. The analysis at a disaggregated, industry level again points to convergence in the middle of the sample and subsequent divergence, but a much larger number of clusters is now found. Splitting the cross-section into two subgroups including euro area countries, the UK and the US respectively, provides evidence of a global convergence/divergence process not obviously influenced by EU policies.
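The relative transition parameter used in the Phillips and Sul approach measures each series relative to the cross-sectional average, h_it = X_it / (N^-1 * sum_i X_it), and convergence implies that the cross-sectional variance of h_it around 1 shrinks over time. The sketch below computes both quantities for a small panel. The function names and the toy numbers are hypothetical, and the filtering step and the formal log t regression test are not shown.

```python
def relative_transition(panel):
    """Compute h[i][t] = X[i][t] / cross-sectional mean of X at time t.

    panel: list of N series, each a list of T (already filtered) values.
    """
    n = len(panel)
    periods = len(panel[0])
    h = [[0.0] * periods for _ in range(n)]
    for t in range(periods):
        cross_mean = sum(panel[i][t] for i in range(n)) / n
        for i in range(n):
            h[i][t] = panel[i][t] / cross_mean
    return h


def cross_sectional_variance(h):
    """H[t] = (1/N) * sum_i (h[i][t] - 1)^2; it tends to 0 under convergence."""
    n = len(h)
    periods = len(h[0])
    return [sum((h[i][t] - 1.0) ** 2 for i in range(n)) / n for t in range(periods)]


# Toy panel (made-up values): two series that meet in the second period.
panel = [[1.0, 2.0], [3.0, 2.0]]
h = relative_transition(panel)
print(h)
print(cross_sectional_variance(h))
```

A variance path that falls and then rises again would correspond to the convergence-then-divergence pattern described in the abstract.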
Bayesian nonparametric sparse VAR models
High dimensional vector autoregressive (VAR) models require a large number of
parameters to be estimated and may suffer from inferential problems. We propose a
new Bayesian nonparametric (BNP) Lasso prior (BNP-Lasso) for high-dimensional
VAR models that can improve estimation efficiency and prediction accuracy. Our
hierarchical prior overcomes overparametrization and overfitting issues by
clustering the VAR coefficients into groups and by shrinking the coefficients
of each group toward a common location. Clustering and shrinking effects
induced by the BNP-Lasso prior are well suited for the extraction of causal
networks from time series, since they account for some stylized facts in
real-world networks, namely sparsity, community structure and
heterogeneity in edge intensity. In order to fully capture the richness of
the data and to achieve a better understanding of financial and macroeconomic
risk, it is therefore crucial that the model used to extract the network accounts
for these stylized facts.
Comment: Forthcoming in "Journal of Econometrics" ---- Revised Version of the
paper "Bayesian nonparametric Seemingly Unrelated Regression Models" ----
Supplementary Material available on request