Tutorial: Are You My Neighbor?: Bringing Order to Neighbor Computing Problems

Author: Anastasiu David C.
Catanese Helen N.
David
Li Yuliang
Publication venue: SJSU ScholarWorks
Publication date: 01/08/2019

Finding nearest neighbors is an important topic that has attracted much attention over the years and has applications in many fields, such as market basket analysis, plagiarism and anomaly detection, community detection, ligand-based virtual screening, etc. As data are easier and easier to collect, finding neighbors has become a potential bottleneck in analysis pipelines. Performing pairwise comparisons given the massive datasets of today is no longer feasible. The high computational complexity of the task has led researchers to develop approximate methods, which find many but not all of the nearest neighbors. Yet, for some types of data, efficient exact solutions have been found by carefully partitioning or filtering the search space in a way that avoids most unnecessary comparisons. In recent years, there have been several fundamental advances in our ability to efficiently identify appropriate neighbors, especially in non-traditional data, such as graphs or document collections. In this tutorial, we provide an in-depth overview of recent methods for finding (nearest) neighbors, focusing on the intuition behind choices made in the design of those algorithms and on the utility of the methods in real-world applications. Our tutorial aims to provide a unifying view of neighbor computing problems, spanning from numerical data to graph data, from categorical data to sequential data, and related application scenarios. For each type of data, we will review the current state-of-the-art approaches used to identify neighbors and discuss how neighbor search methods are used to solve important problems.

Crossref

Scholar Commons - Santa Clara University

SJSU ScholarWorks

Conversion Prediction Using Multi-task Conditional Attention Networks to Support the Creation of Effective Ad Creative

Author: Bahdanau Dzmitry
Kingma Diederik P
Kudo Taku
Lin Zhouhan
Luong Thang
Thomaidou Stamatina
Xu Kelvin
Yang Hongxia
Publication venue: Association for Computing Machinery (ACM)
Publication date: 17/05/2019

Accurately predicting conversions in advertisements is generally a challenging task, because such conversions do not occur frequently. In this paper, we propose a new framework to support creating high-performing ad creatives, including the accurate prediction of ad creative text conversions before delivering to the consumer. The proposed framework includes three key ideas: multi-task learning, conditional attention, and attention highlighting. Multi-task learning is an idea for improving the prediction accuracy of conversion, which predicts clicks and conversions simultaneously, to solve the difficulty of data imbalance. Furthermore, conditional attention focuses attention of each ad creative with the consideration of its genre and target gender, thus improving conversion prediction accuracy. Attention highlighting visualizes important words and/or phrases based on conditional attention. We evaluated the proposed framework with actual delivery history data (14,000 creatives displayed more than a certain number of times from Gunosy Inc.), and confirmed that these ideas improve the prediction performance of conversions, and visualize noteworthy words according to the creatives' attributes.

Comment: 9 pages, 6 figures. Accepted at The 25th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2019) as an applied data science paper.

arXiv.org e-Print Archive

Crossref

Network Density of States

Author: Benson Austin R.
Bindel David
Dong Kun
Publication venue: Association for Computing Machinery (ACM)
Publication date: 23/05/2019

Spectral analysis connects graph structure to the eigenvalues and eigenvectors of associated matrices. Much of spectral graph theory descends directly from spectral geometry, the study of differentiable manifolds through the spectra of associated differential operators. But the translation from spectral geometry to spectral graph theory has largely focused on results involving only a few extreme eigenvalues and their associated eigenvectors. Unlike in geometry, the study of graphs through the overall distribution of eigenvalues - the spectral density - is largely limited to simple random graph models. The interior of the spectrum of real-world graphs remains largely unexplored, difficult to compute and to interpret. In this paper, we delve into the heart of spectral densities of real-world graphs. We borrow tools developed in condensed matter physics, and add novel adaptations to handle the spectral signatures of common graph motifs. The resulting methods are highly efficient, as we illustrate by computing spectral densities for graphs with over a billion edges on a single compute node. Beyond providing visually compelling fingerprints of graphs, we show how the estimation of spectral densities facilitates the computation of many common centrality measures, and use spectral densities to estimate meaningful information about graph structure that cannot be inferred from the extremal eigenpairs alone.

Comment: 10 pages, 7 figures.

arXiv.org e-Print Archive

Crossref

Pb-Hash: Partitioned b-bit Hashing

Author: Li Ping
Zhao Weijie
Publication date: 28/06/2023

Many hashing algorithms including minwise hashing (MinHash), one permutation hashing (OPH), and consistent weighted sampling (CWS) generate integers of $B$ bits. With $k$ hashes for each data vector, the storage would be $B\times k$ bits; and when used for large-scale learning, the model size would be $2^B\times k$, which can be expensive. A standard strategy is to use only the lowest $b$ bits out of the $B$ bits and somewhat increase $k$, the number of hashes. In this study, we propose to re-use the hashes by partitioning the $B$ bits into $m$ chunks, e.g., $b\times m = B$. Correspondingly, the model size becomes $m\times 2^b \times k$, which can be substantially smaller than the original $2^B\times k$. Our theoretical analysis reveals that by partitioning the hash values into $m$ chunks, the accuracy would drop. In other words, using $m$ chunks of $B/m$ bits would not be as accurate as directly using $B$ bits. This is due to the correlation from re-using the same hash. On the other hand, our analysis also shows that the accuracy would not drop much for (e.g.,) $m=2\sim 4$. In some regions, Pb-Hash still works well even for $m$ much larger than 4. We expect Pb-Hash would be a good addition to the family of hashing methods/applications and benefit industrial practitioners. We verify the effectiveness of Pb-Hash in machine learning tasks, for linear SVM models as well as deep learning models. Since the hashed data are essentially categorical (ID) features, we follow the standard practice of using embedding tables for each hash. With Pb-Hash, we need to design an effective strategy to combine $m$ embeddings. Our study provides an empirical evaluation on four pooling schemes: concatenation, max pooling, mean pooling, and product pooling. There is no definite answer which pooling would be always better and we leave that for future study.
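The bit-partitioning idea in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `split_hash` and `model_size` are hypothetical, the chunk order (lowest bits first) is an assumption, and the sketch assumes that m divides B evenly. It only shows how one B-bit hash value might be split into m chunks of b = B/m bits, and how the model-size figures 2^B × k and m × 2^b × k compare.

```python
from typing import List


def split_hash(h: int, B: int, m: int) -> List[int]:
    """Split a B-bit hash value into m chunks of b = B // m bits.

    Chunks are returned lowest bits first (an assumption of this sketch).
    """
    if B % m != 0:
        raise ValueError("this sketch assumes m divides B")
    b = B // m
    mask = (1 << b) - 1  # keeps the lowest b bits
    return [(h >> (i * b)) & mask for i in range(m)]


def model_size(B: int, m: int, k: int) -> int:
    """Model size m * 2^b * k from the abstract, with b = B // m.

    With m = 1 this reduces to the original 2^B * k.
    """
    b = B // m
    return m * (1 << b) * k


if __name__ == "__main__":
    B, m, k = 16, 2, 4          # illustrative values, not from the paper
    h = 0xABCD                  # one 16-bit hash value
    print(split_hash(h, B, m))  # two 8-bit chunks
    print(model_size(B, 1, k))  # original size, 2^B * k
    print(model_size(B, m, k))  # partitioned size, m * 2^b * k
```

With these illustrative values the partitioned size is far smaller than the original, which matches the saving the abstract describes. The accuracy cost of re-using the same hash is not modeled here.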

arXiv.org e-Print Archive

Topic-enhanced memory networks for personalised point-of-interest recommendation

Author: Mascolo C
Zhao Z
Zhou X
Publication venue: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Publication date: 01/01/2019

Point-of-Interest (POI) recommender systems play a vital role in people's lives by recommending unexplored POIs to users and have drawn extensive attention from both academia and industry. Despite their value, however, they still suffer from the challenges of capturing complicated user preferences and fine-grained user-POI relationship for spatio-temporal sensitive POI recommendation. Existing recommendation algorithms, including both shallow and deep approaches, usually embed the visiting records of a user into a single latent vector to model user preferences: this has limited power of representation and interpretability. In this paper, we propose a novel topic-enhanced memory network (TEMN), a deep architecture to integrate the topic model and memory network capitalising on the strengths of both the global structure of latent patterns and local neighbourhood-based features in a nonlinear fashion. We further incorporate a geographical module to exploit user-specific spatial preference and POI-specific spatial influence to enhance recommendations. The proposed unified hybrid model is widely applicable to various POI recommendation scenarios. Extensive experiments on real-world WeChat datasets demonstrate its effectiveness (improvement ratio of 3.25% and 29.95% for context-aware and sequential recommendation, respectively). Also, qualitative analysis of the attention weights and topic modeling provides insight into the model's recommendation process and results.

China Scholarship Council and Cambridge Trust

arXiv.org e-Print Archive

Apollo (Cambridge)