Multi-label Ferns for Efficient Recognition of Musical Instruments in Recordings
In this paper we introduce multi-label ferns and apply the technique to the
automatic classification of musical instruments in audio recordings. We compare
the performance of our proposed method to a set of binary random ferns, using
jazz recordings as input data. Our main results are much faster
classification and a higher F-score. We also achieve a substantial reduction in
model size.
High-dimensional approximate nearest neighbor: k-d Generalized Randomized Forests
We propose a new data structure, the generalized randomized k-d forest, or
k-d GeRaF, for approximate nearest neighbor searching in high dimensions. In
particular, we introduce new randomization techniques to specify a set of
independently constructed trees where search is performed simultaneously, hence
increasing accuracy. We omit backtracking, and we optimize distance
computations, thus accelerating queries. We release public domain software
GeRaF and we compare it to existing implementations of state-of-the-art methods
including BBD-trees, Locality Sensitive Hashing, randomized kd forests, and
product quantization. Experimental results indicate that our method would be
the method of choice in dimensions around 1,000, and probably up to 10,000, and
pointsets of cardinality up to a few hundred thousand or even one million;
this range of inputs is encountered in many critical applications today. For
instance, we handle a real dataset of images represented in 960
dimensions with a query time of less than sec on average and 90% of responses
being true nearest neighbors.
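The abstract above describes searching several independently built randomized trees at once and skipping backtracking. The following is a minimal sketch of that general idea, not the authors' GeRaF implementation: the abstract does not spell out its randomization techniques, so the choice here (pick the split coordinate at random among the few highest-variance coordinates) is an assumption, and all function names and parameters are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_tree(points, indices, rng, leaf_size=4, top_dims=3):
    """Build one randomized k-d tree over the given point indices.

    Assumption: the split coordinate is drawn at random from the
    `top_dims` coordinates with the highest variance on this subset.
    """
    if len(indices) <= leaf_size:
        return ("leaf", indices)

    def variance(d):
        vals = [points[i][d] for i in indices]
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals) / len(vals)

    ranked = sorted(range(len(points[0])), key=variance, reverse=True)
    split_dim = rng.choice(ranked[:top_dims])
    ordered = sorted(indices, key=lambda i: points[i][split_dim])
    mid = len(ordered) // 2
    split_val = points[ordered[mid]][split_dim]
    left = build_tree(points, ordered[:mid], rng, leaf_size, top_dims)
    right = build_tree(points, ordered[mid:], rng, leaf_size, top_dims)
    return ("node", split_dim, split_val, left, right)


def build_forest(points, num_trees=4, seed=0):
    """Build several trees independently, each with its own random choices."""
    rng = random.Random(seed)
    all_indices = list(range(len(points)))
    return [build_tree(points, all_indices, rng) for _ in range(num_trees)]


def descend(tree, query):
    """Follow a single root-to-leaf path; no backtracking is performed."""
    while tree[0] == "node":
        _, split_dim, split_val, left, right = tree
        tree = left if query[split_dim] < split_val else right
    return tree[1]


def forest_query(points, forest, query):
    """Pool the leaf candidates from every tree and return the closest index."""
    candidates = set()
    for tree in forest:
        candidates.update(descend(tree, query))
    return min(candidates, key=lambda i: sq_dist(points[i], query))


def brute_force(points, query):
    """Exact nearest neighbor, for checking the approximate answer."""
    return min(range(len(points)), key=lambda i: sq_dist(points[i], query))
```

Because each tree contributes only one leaf, the answer is approximate: adding trees enlarges the candidate pool and tends to raise accuracy at the cost of more distance computations.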
Making the Most of Tweet-Inherent Features for Social Spam Detection on Twitter
Social spam produces a great amount of noise on social media services such as
Twitter, which reduces the signal-to-noise ratio that both end users and data
mining applications observe. Existing techniques for social spam detection have
focused primarily on the identification of spam accounts by using extensive
historical and network-based data. In this paper we focus on the detection of
spam tweets, which optimises the amount of data that needs to be gathered by
relying only on tweet-inherent features. This enables the application of the
spam detection system to a large set of tweets in a timely fashion, potentially
applicable in a real-time or near real-time setting. Using two large
hand-labelled datasets of tweets containing spam, we study the suitability of
five classification algorithms and four different feature sets to the social
spam detection task. Our results show that, by using the limited set of
features readily available in a tweet, we can achieve encouraging results which
are competitive when compared against existing spammer detection systems that
make use of additional, costly user features. Our study is the first that
attempts to generalise conclusions on the optimal classifiers and sets of
features for social spam detection over different datasets.
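To illustrate what relying only on tweet-inherent features can look like, here is a minimal sketch that derives a feature vector from the tweet text alone, with no account history or network data. The specific features and the name `tweet_features` are hypothetical and chosen for illustration; they are not the four feature sets evaluated in the paper.

```python
import re

# Simple patterns for entities that appear inside the tweet text itself.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def tweet_features(text):
    """Return a dict of features computed from the tweet text only."""
    letters = [c for c in text if c.isalpha()]
    upper = sum(1 for c in letters if c.isupper())
    return {
        "length": len(text),
        "num_words": len(text.split()),
        "num_urls": len(URL_RE.findall(text)),
        "num_mentions": len(MENTION_RE.findall(text)),
        "num_hashtags": len(HASHTAG_RE.findall(text)),
        "num_digits": sum(1 for c in text if c.isdigit()),
        # Share of alphabetic characters that are upper case.
        "upper_ratio": upper / len(letters) if letters else 0.0,
    }
```

A dictionary like this could be passed to any standard classifier; since every value comes from a single tweet, it can be computed as tweets arrive.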