Adaptive Matrix Completion for the Users and the Items in Tail
Recommender systems are widely used to recommend the most appealing items to
users. These recommendations can be generated by applying collaborative
filtering methods. The low-rank matrix completion method is the
state-of-the-art collaborative filtering method. In this work, we show that the
skewed distribution of ratings in the user-item rating matrix of real-world
datasets affects the accuracy of matrix-completion-based approaches. Also, we
show that the number of ratings that an item or a user has positively
correlates with the ability of low-rank matrix-completion-based approaches to
predict the ratings for the item or the user accurately. Furthermore, we use
these insights to develop four matrix completion-based approaches, i.e.,
Frequency Adaptive Rating Prediction (FARP), Truncated Matrix Factorization
(TMF), Truncated Matrix Factorization with Dropout (TMF + Dropout) and Inverse
Frequency Weighted Matrix Factorization (IFWMF), that outperform traditional
matrix-completion-based approaches for the users and the items with few ratings
in the user-item rating matrix.
Comment: 7 pages, 3 figures, ACM WWW'1
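The claim that prediction accuracy tracks the number of ratings an item has can be checked by splitting test error into head and tail items. The following is a minimal sketch of such a check, not the authors' evaluation code: the function name `rmse_by_item_frequency`, the tuple layouts, and the `boundary` cut-off are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def rmse_by_item_frequency(train, test_predictions, boundary):
    """Report RMSE separately for frequently and infrequently rated items.

    train: list of (user, item, rating) tuples (hypothetical layout).
    test_predictions: list of (item, true_rating, predicted_rating) tuples.
    boundary: items with fewer training ratings than this count as "tail".
    """
    # Count how many training ratings each item received.
    counts = Counter(item for _, item, _ in train)

    # Collect squared errors per bucket.
    squared_errors = defaultdict(list)
    for item, true_rating, predicted in test_predictions:
        bucket = "tail" if counts[item] < boundary else "head"
        squared_errors[bucket].append((true_rating - predicted) ** 2)

    return {
        bucket: sqrt(sum(errors) / len(errors))
        for bucket, errors in squared_errors.items()
    }
```

The same bucketing could be applied to users instead of items by counting on the first field of each training tuple.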
Text segmentation on multilabel documents: A distant-supervised approach
Segmenting text into semantically coherent segments is an important task with
applications in information retrieval and text summarization. Developing
accurate topical segmentation requires the availability of training data with
ground truth information at the segment level. However, generating such labeled
datasets, especially for applications in which the meaning of the labels is
user-defined, is expensive and time-consuming. In this paper, we develop an
approach that, instead of using segment-level ground truth information, uses
the set of labels associated with a document. Such labels are easier to
obtain, as the training data essentially corresponds to a multilabel
dataset. Our method, which can be thought of as an instance of distant
supervision, improves upon the previous approaches by exploiting the fact that
consecutive sentences in a document tend to talk about the same topic, and
hence, probably belong to the same class. Experiments on the text segmentation
task on a variety of datasets show that the segmentation produced by our method
beats the competing approaches on four out of five datasets and performs at par
on the fifth dataset. On the multilabel text classification task, our method
performs at par with the competing approaches, while requiring significantly
less time to estimate than the competing approaches.
Comment: Accepted in 2018 IEEE International Conference on Data Mining (ICDM
Extending Input Contexts of Language Models through Training on Segmented Sequences
Effectively training language models on long inputs poses many technical
challenges. As a cost consideration, language models are pretrained on a fixed
sequence length before being adapted to longer sequences. We explore various
methods for adapting models to longer inputs by training on segmented sequences
and an interpolation-based method for extending absolute positional embeddings.
We develop a training procedure to extend the input context size of pretrained
models with no architectural changes and no memory costs beyond those of
training on the original input lengths. By sub-sampling segments from long
inputs while maintaining their original positions, the model is able to learn
new positional interactions. Our method benefits both models trained with
absolute positional embeddings, by extending their input contexts, and models
that use popular relative positional embedding methods, which show reduced
perplexity on sequences longer than those they were trained on. We demonstrate
that our method can extend input
contexts by a factor of 4x while improving perplexity.
Comment: 11 pages, 3 figure
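The idea of sub-sampling segments while keeping their original positions can be illustrated with a short sketch. This is an assumption-laden illustration rather than the paper's procedure: the abstract does not say how segments are chosen, so the chunk-aligned, non-overlapping sampling below and the names `sample_segment_positions` and `subsample` are hypothetical.

```python
import random


def sample_segment_positions(seq_len, num_segments, segment_len, rng):
    """Choose non-overlapping segments and return their original position ids."""
    # Assumption: segments are aligned to a grid of width segment_len.
    num_chunks = seq_len // segment_len
    if num_segments > num_chunks:
        raise ValueError("not enough chunks to sample from")

    chosen = sorted(rng.sample(range(num_chunks), num_segments))
    positions = []
    for chunk in chosen:
        start = chunk * segment_len
        positions.extend(range(start, start + segment_len))
    return positions


def subsample(tokens, num_segments, segment_len, seed=0):
    """Return the sampled tokens together with their positions in the long input."""
    rng = random.Random(seed)
    positions = sample_segment_positions(
        len(tokens), num_segments, segment_len, rng
    )
    # The model would see len(positions) tokens, but each token keeps the
    # position id it had in the full-length input.
    return [tokens[p] for p in positions], positions
```

With this layout the training batch has the short length `num_segments * segment_len`, while the position ids passed alongside it span the full long input.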
An efficient algorithm for discovering frequent subgraphs
Abstract: Over the years, frequent itemset discovery algorithms have been used to find interesting patterns in various application areas. However, as data mining techniques are increasingly applied to non-traditional domains, existing frequent pattern discovery approaches cannot be used. This is because the transaction framework assumed by these algorithms cannot effectively model the datasets in these domains. An alternate way of modeling the objects in these datasets is to represent them using graphs. Within that model, one way of formulating the frequent pattern discovery problem is as that of discovering subgraphs that occur frequently over the entire set of graphs. In this paper, we present a computationally efficient algorithm, called FSG, for finding all frequent subgraphs in large graph datasets. We experimentally evaluate the performance of FSG using a variety of real and synthetic datasets. Our results show that, despite the underlying complexity associated with frequent subgraph discovery, FSG is effective in finding all frequently occurring subgraphs in datasets containing over 200,000 graph transactions and scales linearly with respect to the size of the dataset.
Index Terms: Data mining, scientific datasets, frequent pattern discovery, chemical compound datasets
Efficient identification of Tanimoto nearest neighbors; All Pairs Similarity Search Using the Extended Jaccard Coefficient
Tanimoto, or extended Jaccard, is an important similarity measure which has seen prominent use in fields such as data mining and chemoinformatics. Many of the existing state-of-the-art methods for market basket analysis, plagiarism and anomaly detection, compound database search, and ligand-based virtual screening rely heavily on identifying Tanimoto nearest neighbors. Given the rapidly increasing size of data that must be analyzed, new algorithms are needed that can speed up nearest neighbor search while still providing reliable results. While many search algorithms address the complexity of the task by retrieving only some of the nearest neighbors, we propose a method that finds all of the exact nearest neighbors efficiently by leveraging recent advances in similarity search filtering. We provide tighter filtering bounds for the Tanimoto coefficient and show that our method, TAPNN, greatly outperforms existing baselines across a variety of real-world datasets and similarity thresholds.
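For reference, the Tanimoto (extended Jaccard) coefficient of two vectors a and b is a·b / (‖a‖² + ‖b‖² − a·b), which reduces to the ordinary Jaccard index for binary vectors. The sketch below computes it for sparse vectors and runs an exhaustive all-pairs search. It is the brute-force baseline, not TAPNN: the filtering bounds the abstract mentions are not reproduced here, and the dict-based vector layout and function names are assumptions.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient of two sparse vectors stored as {feature: value}."""
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = sum(value * value for value in a.values())
    norm_b = sum(value * value for value in b.values())
    denominator = norm_a + norm_b - dot
    # Two all-zero vectors have no defined similarity; treat it as 0.
    return dot / denominator if denominator else 0.0


def all_pairs(vectors, threshold):
    """Exhaustively return (i, j, similarity) for pairs at or above threshold."""
    matches = []
    for i, j in combinations(range(len(vectors)), 2):
        similarity = tanimoto(vectors[i], vectors[j])
        if similarity >= threshold:
            matches.append((i, j, similarity))
    return matches
```

This exhaustive loop compares every pair, so its cost grows quadratically with the number of vectors; it is useful mainly as a correctness check for faster exact methods.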