Search CORE

218 research outputs found

Soft Seeded SSL Graphs for Unsupervised Semantic Similarity-based Retrieval

Author: Datt Madhav
Srivastava Avikalp
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 15/12/2017

Semantic similarity based retrieval is playing an increasingly important role in many IR systems such as modern web search, question answering, and similar-document retrieval. Improvements in the retrieval of semantically similar content are very significant to applications like Quora, Stack Overflow, and Siri. We propose a novel unsupervised model for semantic similarity based content retrieval, in which we construct semantic flow graphs for each query and introduce the concept of "soft seeding" in graph-based semi-supervised learning (SSL) to convert this into an unsupervised model. We demonstrate the effectiveness of our model on an equivalent question retrieval problem on the Stack Exchange QA dataset, where our unsupervised approach significantly outperforms the state-of-the-art unsupervised models and produces results comparable to the best supervised models. Our research provides a method to tackle semantic similarity based retrieval without any training data, and allows seamless extension to QA communities in different domains, as well as to other semantic equivalence tasks.

Comment: Published in Proceedings of the 2017 ACM Conference on Information and Knowledge Management (CIKM '17)

arXiv.org e-Print Archive

Crossref

Feature extraction and classification of movie reviews

Author: Awukam Awukam Ojang
Mtetwa Nhamo
Yousefi Mehdi
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 02/05/2019

ResearchOnline@GCU

Approximately Minwise Independence with Twisted Tabulation

Author: A. Broder
A.Z. Broder
E. Cohen
M. Datar
M. Pǎtraşcu
R.E. Fan
Y. Bachrach
Publication date: 01/01/2014

A random hash function $h$ is $\varepsilon$-minwise if for any set $S$, $|S|=n$, and element $x\in S$, $\Pr[h(x)=\min h(S)]=(1\pm\varepsilon)/n$. Minwise hash functions with low bias $\varepsilon$ have widespread applications within similarity estimation. Hashing from a universe $[u]$, the twisted tabulation hashing of Pǎtraşcu and Thorup [SODA'13] makes $c=O(1)$ lookups in tables of size $u^{1/c}$. Twisted tabulation was invented to get good concentration for hashing based sampling. Here we show that twisted tabulation yields $\tilde O(1/u^{1/c})$-minwise hashing. In the classic independence paradigm of Wegman and Carter [FOCS'79], $\tilde O(1/u^{1/c})$-minwise hashing requires $\Omega(\log u)$-independence [Indyk SODA'99]. Pǎtraşcu and Thorup [STOC'11] had shown that simple tabulation, using the same space and lookups, yields $\tilde O(1/n^{1/c})$-minwise independence, which is good for large sets, but useless for small sets. Our analysis uses some of the same methods, but is much cleaner, bypassing a complicated induction argument.

Comment: To appear in Proceedings of SWAT 201
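To make the similarity-estimation use of minwise hashing concrete, here is a minimal Python sketch. It uses simple tabulation (the XOR of c table lookups, the scheme the abstract attributes to STOC'11), not the twisted variant analysed in the paper. The function names, the 8-bit characters, the 32-bit outputs, and the choice of 200 hash functions are all assumptions made for illustration.

```python
import random


def make_simple_tabulation(c=4, char_bits=8, out_bits=32, seed=0):
    """Build one simple tabulation hash for keys of c * char_bits bits.

    Assumption: each key is split into c characters of char_bits bits,
    and the hash is the XOR of one random table entry per character.
    """
    rng = random.Random(seed)
    tables = [[rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
              for _ in range(c)]
    mask = (1 << char_bits) - 1

    def h(x):
        value = 0
        for i in range(c):
            # Look up the i-th character of x in the i-th table.
            value ^= tables[i][(x >> (i * char_bits)) & mask]
        return value

    return h


def minhash_signature(items, hashes):
    """For each hash function, keep the minimum hash value over the set."""
    return [min(h(x) for x in items) for h in hashes]


def estimate_jaccard(a, b, hashes):
    """Fraction of hash functions whose minima agree on both sets."""
    sig_a = minhash_signature(a, hashes)
    sig_b = minhash_signature(b, hashes)
    matches = sum(1 for p, q in zip(sig_a, sig_b) if p == q)
    return matches / len(hashes)


# Illustrative data: two overlapping integer ranges.
hashes = [make_simple_tabulation(seed=s) for s in range(200)]
A = set(range(0, 1000))
B = set(range(500, 1500))
estimate = estimate_jaccard(A, B, hashes)
exact = len(A & B) / len(A | B)
print(f"estimated Jaccard = {estimate:.3f}, exact = {exact:.3f}")
```

If the hash were exactly minwise, each signature position would agree with probability equal to the Jaccard similarity; a bias $\varepsilon$ in the hash family shifts that probability, which is why low bias matters for this application.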

arXiv.org e-Print Archive

CiteSeerX

Crossref

Copenhagen University Research Information System

Methods of the automated search of plagiarism in electronic documents

Author: Bugalskyy D. I.
Lupenko S. A.
Бугальський Дмитро Іванович
Лупенко Сергій Анатолійович
Publication venue: TNTU
Publication date: 19/11/2014

Electronic archive of Ternopil National Ivan Puluj Technical University

Considerations about multistep community detection

Author: A Broder
A Clauset
A Lancichinetti
AL Barabási
BH Good
FD Malliaros
HP Kriegel
J Reichardt
JC Bezdek
L Danon
M Belkin
M Girvan
ME Newman
P Krapivsky
R Kannan
S Fortunato
TF Chan
VD Blondel
W Zhang
Publication date: 27/02/2014

The problem of community detection in networks, and its implications, has attracted considerable attention because of its important applications in both the natural and social sciences. A number of algorithms have been developed to solve this problem, targeting either speed or the quality of the computed partitions. In this paper we propose a multi-step procedure bridging the fastest but less accurate algorithms (coarse clustering) with the slowest, most effective ones (refinement). By adopting a heuristic ranking of the nodes and classifying a fraction of them as 'critical', the refinement step can be restricted to this subset of the network, thus saving computational time. Preliminary numerical results are discussed, showing improvement of the final partition.

Comment: 12 pages
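The multi-step idea in this abstract can be sketched in Python as follows. The abstract does not specify the ranking heuristic or the refinement algorithm, so both are assumptions here: nodes are ranked by the fraction of their neighbours lying outside their own community, and refinement moves a critical node to the strict majority community among its neighbours. All function names and the toy graph are hypothetical.

```python
from collections import Counter


def external_fraction(node, adjacency, labels):
    """Assumed ranking heuristic: share of neighbours in another community."""
    neighbours = adjacency[node]
    if not neighbours:
        return 0.0
    outside = sum(1 for v in neighbours if labels[v] != labels[node])
    return outside / len(neighbours)


def refine_critical(adjacency, coarse_labels, critical_fraction=0.25):
    """Refine only the top-ranked ('critical') fraction of the nodes."""
    labels = dict(coarse_labels)
    ranked = sorted(
        adjacency,
        key=lambda n: (-external_fraction(n, adjacency, labels), n),
    )
    k = max(1, int(critical_fraction * len(ranked)))
    critical = ranked[:k]
    for node in critical:
        if not adjacency[node]:
            continue
        counts = Counter(labels[v] for v in adjacency[node])
        best_label, best_count = counts.most_common(1)[0]
        # Assumed refinement rule: move only on a strict majority.
        if best_count > counts[labels[node]]:
            labels[node] = best_label
    return labels, critical


# Toy graph: two 4-cliques joined by the edge (3, 4).
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
         (4, 5), (4, 6), (4, 7), (5, 6), (5, 7), (6, 7),
         (3, 4)]
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, []).append(v)
    adjacency.setdefault(v, []).append(u)

# A deliberately imperfect coarse partition: node 4 is misassigned.
coarse = {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 1}
refined, critical = refine_critical(adjacency, coarse)
print("critical nodes:", critical)
print("refined labels:", refined)
```

Only the critical nodes are revisited, which is where the saving in computational time described in the abstract would come from; the rest of the coarse partition is left untouched.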

arXiv.org e-Print Archive

Crossref

Archivio Istituzionale della Ricerca- Università del Salento

The problem of plagiarism in higher education institutions and ways to solve it

Author: Матасар Євген Ігнатович
Publication venue: Київський Університет імені Бориса Грінченка
Publication date: 23/05/2014

Borys Grinchenko Kyiv University Institutional repository