    Data mining for detecting Bitcoin Ponzi schemes

    Soon after its introduction in 2009, Bitcoin was adopted by cyber-criminals, who rely on its pseudonymity to implement virtually untraceable scams. One of the typical scams that operate on Bitcoin is the so-called Ponzi scheme. These are fraudulent investments which repay users with the funds invested by new users that join the scheme, and implode when it is no longer possible to find new investments. Despite being illegal in many countries, Ponzi schemes are now proliferating on Bitcoin, and they keep luring new victims, who are plundered of millions of dollars. We apply data mining techniques to detect Bitcoin addresses related to Ponzi schemes. Our starting point is a dataset of features of real-world Ponzi schemes, which we construct by analysing, on the Bitcoin blockchain, the transactions used to perform the scams. We use this dataset to experiment with various machine learning algorithms, and we assess their effectiveness through standard validation protocols and performance metrics. The best of the classifiers we have experimented with can identify most of the Ponzi schemes in the dataset, with a low number of false positives.
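
The payout mechanism described in this abstract (earlier users are repaid from the deposits of newer users, and the scheme implodes once new investments dry up) can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the authors' dataset or detection method: the function name `simulate_ponzi`, the fixed deposit size, and the fixed per-round payout are all hypothetical.

```python
def simulate_ponzi(new_investors_per_round, deposit=100, payout_per_round=20):
    """Toy model of a Ponzi scheme (all parameters are illustrative assumptions).

    Each round, new investors deposit `deposit` units, and every investor who
    joined in an earlier round is owed `payout_per_round` units. Payouts are
    funded only by the pooled deposits. Returns the 1-based round in which the
    pool can no longer cover what is owed, or None if that never happens.
    """
    balance = 0      # funds currently held by the scheme
    investors = 0    # investors who joined in earlier rounds
    for round_no, joining in enumerate(new_investors_per_round, start=1):
        balance += joining * deposit            # inflow from new users
        owed = investors * payout_per_round     # promised payouts to earlier users
        if owed > balance:
            return round_no                     # scheme implodes
        balance -= owed
        investors += joining
    return None


# Steady inflow keeps the scheme solvent over these rounds ...
print(simulate_ponzi([10] * 5))
# ... but it collapses some rounds after new investments stop.
print(simulate_ponzi([10, 10, 10] + [0] * 10))
```

Integer amounts are used so that the bookkeeping is exact; a more realistic model would need variable deposits and payout rules.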

    Bibliographic Analysis on Research Publications using Authors, Categorical Labels and the Citation Network

    Bibliographic analysis considers the author's research areas, the citation network and the paper content, among other things. In this paper, we combine these three in a topic model that produces a bibliographic model of authors, topics and documents, using a nonparametric extension of a combination of the Poisson mixed-topic link model and the author-topic model. This gives rise to the Citation Network Topic Model (CNTM). We propose a novel and efficient inference algorithm for the CNTM to explore subsets of research publications from CiteSeerX. The publication datasets are organised into three corpora, totalling about 168k publications with about 62k authors. The queried datasets are made available online. On three publicly available corpora, in addition to the queried datasets, our proposed model demonstrates improved performance in both model fitting and document clustering compared to several baselines. Moreover, our model allows extraction of additional useful knowledge from the corpora, such as the visualisation of the author-topics network. Additionally, we propose a simple method to incorporate supervision into topic modelling to achieve further improvement on the clustering task. Comment: Preprint for Journal Machine Learning

    Continuous Representation of Location for Geolocation and Lexical Dialectology using Mixture Density Networks

    We propose a method for embedding two-dimensional locations in a continuous vector space using a neural network-based model incorporating mixtures of Gaussian distributions, presenting two model variants for text-based geolocation and lexical dialectology. Evaluated over Twitter data, the proposed model outperforms conventional regression-based geolocation and provides a better estimate of uncertainty. We also show the effectiveness of the representation for predicting words from location in lexical dialectology, and evaluate it using the DARE dataset. Comment: Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), September 2017, Copenhagen, Denmark
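
To make the idea of representing a two-dimensional location with a mixture of Gaussians concrete, the sketch below evaluates such a mixture density at a point. It is an illustration only: in the abstract's model a neural network supplies the mixture parameters, whereas here they are hard-coded, and the isotropic (single `sigma` per component) form and the name `mixture_density` are assumptions made for brevity.

```python
import math


def mixture_density(point, components):
    """Density of an isotropic 2D Gaussian mixture at `point`.

    `components` is a list of (weight, (mean_x, mean_y), sigma) tuples whose
    weights are assumed to sum to 1. The isotropic covariance is a simplifying
    assumption for this sketch.
    """
    x, y = point
    total = 0.0
    for weight, (mean_x, mean_y), sigma in components:
        sq_dist = (x - mean_x) ** 2 + (y - mean_y) ** 2
        norm = 1.0 / (2.0 * math.pi * sigma ** 2)   # 2D Gaussian normaliser
        total += weight * norm * math.exp(-sq_dist / (2.0 * sigma ** 2))
    return total


# Two hypothetical components; a bimodal density like this can express
# uncertainty between two candidate locations, unlike a single point estimate.
components = [
    (0.7, (0.0, 0.0), 1.0),
    (0.3, (5.0, 5.0), 2.0),
]
print(mixture_density((0.0, 0.0), components))
print(mixture_density((5.0, 5.0), components))
```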

    An Overview of the Use of Neural Networks for Data Mining Tasks

    In recent years the area of data mining has experienced a considerable demand for technologies that extract knowledge from large and complex data sources. There is substantial commercial interest, as well as research activity, in the area, aimed at developing new and improved approaches for extracting information, relationships, and patterns from datasets. Artificial Neural Networks (NN) are popular biologically inspired intelligent methodologies, whose classification, prediction and pattern recognition capabilities have been utilised successfully in many areas, including science, engineering, medicine, business, banking, telecommunication, and many other fields. This paper highlights, from a data mining perspective, the implementation of NN, using supervised and unsupervised learning, for pattern recognition, classification, prediction and cluster analysis, and focuses the discussion on their usage in bioinformatics and financial data analysis tasks.

    Testing stock market convergence: a non-linear factor approach

    This paper applies the Phillips and Sul (Econometrica 75(6):1771–1855, 2007) method to test for convergence in stock returns to an extensive dataset including monthly stock price indices for five EU countries (Germany, France, the Netherlands, Ireland and the UK) as well as the US between 1973 and 2008. We carry out the analysis on both sectors and individual industries within sectors. As a first step, we use the Stock and Watson (J Am Stat Assoc 93(441):349–358, 1998) procedure to filter the data in order to extract the long-run component of the series; then, following Phillips and Sul (Econometrica 75(6):1771–1855, 2007), we estimate the relative transition parameters. In the case of sectoral indices we find convergence in the middle of the sample period, followed by divergence, and detect four (two large and two small) clusters. The analysis at a disaggregate, industry level again points to convergence in the middle of the sample, and subsequent divergence, but a much larger number of clusters is now found. Splitting the cross-section into two subgroups including euro area countries, the UK and the US respectively, provides evidence of a global convergence/divergence process not obviously influenced by EU policies.

    Bayesian nonparametric sparse VAR models

    High-dimensional vector autoregressive (VAR) models require a large number of parameters to be estimated and may suffer from inferential problems. We propose a new Bayesian nonparametric (BNP) Lasso prior (BNP-Lasso) for high-dimensional VAR models that can improve estimation efficiency and prediction accuracy. Our hierarchical prior overcomes overparametrization and overfitting issues by clustering the VAR coefficients into groups and by shrinking the coefficients of each group toward a common location. Clustering and shrinking effects induced by the BNP-Lasso prior are well suited for the extraction of causal networks from time series, since they account for some stylized facts in real-world networks, which are sparsity, community structures and heterogeneity in the edge intensity. In order to fully capture the richness of the data and to achieve a better understanding of financial and macroeconomic risk, it is therefore crucial that the model used to extract the network accounts for these stylized facts. Comment: Forthcoming in "Journal of Econometrics" ---- Revised Version of the paper "Bayesian nonparametric Seemingly Unrelated Regression Models" ---- Supplementary Material available on request