218 research outputs found

    Soft Seeded SSL Graphs for Unsupervised Semantic Similarity-based Retrieval

    Full text link
    Semantic similarity based retrieval is playing an increasingly important role in many IR systems such as modern web search, question-answering, similar document retrieval etc. Improvements in retrieval of semantically similar content are very significant to applications like Quora, Stack Overflow, Siri etc. We propose a novel unsupervised model for semantic similarity based content retrieval, where we construct semantic flow graphs for each query, and introduce the concept of "soft seeding" in graph based semi-supervised learning (SSL) to convert this into an unsupervised model. We demonstrate the effectiveness of our model on an equivalent question retrieval problem on the Stack Exchange QA dataset, where our unsupervised approach significantly outperforms the state-of-the-art unsupervised models, and produces comparable results to the best supervised models. Our research provides a method to tackle semantic similarity based retrieval without any training data, and allows seamless extension to different domain QA communities, as well as to other semantic equivalence tasks.Comment: Published in Proceedings of the 2017 ACM Conference on Information and Knowledge Management (CIKM '17

    Feature extraction and classification of movie reviews

    Get PDF

    Approximately Minwise Independence with Twisted Tabulation

    Full text link
    A random hash function hh is ε\varepsilon-minwise if for any set SS, S=n|S|=n, and element xSx\in S, Pr[h(x)=minh(S)]=(1±ε)/n\Pr[h(x)=\min h(S)]=(1\pm\varepsilon)/n. Minwise hash functions with low bias ε\varepsilon have widespread applications within similarity estimation. Hashing from a universe [u][u], the twisted tabulation hashing of P\v{a}tra\c{s}cu and Thorup [SODA'13] makes c=O(1)c=O(1) lookups in tables of size u1/cu^{1/c}. Twisted tabulation was invented to get good concentration for hashing based sampling. Here we show that twisted tabulation yields O~(1/u1/c)\tilde O(1/u^{1/c})-minwise hashing. In the classic independence paradigm of Wegman and Carter [FOCS'79] O~(1/u1/c)\tilde O(1/u^{1/c})-minwise hashing requires Ω(logu)\Omega(\log u)-independence [Indyk SODA'99]. P\v{a}tra\c{s}cu and Thorup [STOC'11] had shown that simple tabulation, using same space and lookups yields O~(1/n1/c)\tilde O(1/n^{1/c})-minwise independence, which is good for large sets, but useless for small sets. Our analysis uses some of the same methods, but is much cleaner bypassing a complicated induction argument.Comment: To appear in Proceedings of SWAT 201

    Considerations about multistep community detection

    Full text link
    The problem and implications of community detection in networks have raised a huge attention, for its important applications in both natural and social sciences. A number of algorithms has been developed to solve this problem, addressing either speed optimization or the quality of the partitions calculated. In this paper we propose a multi-step procedure bridging the fastest, but less accurate algorithms (coarse clustering), with the slowest, most effective ones (refinement). By adopting heuristic ranking of the nodes, and classifying a fraction of them as `critical', a refinement step can be restricted to this subset of the network, thus saving computational time. Preliminary numerical results are discussed, showing improvement of the final partition.Comment: 12 page
    corecore