22 research outputs found

    Link-based similarity search to fight web spam

    www.ilab.sztaki.hu/websearch
    We investigate the usability of similarity search in fighting Web spam, based on the assumption that an unknown spam page is more similar to certain known spam pages than to honest pages. To be successful, search engine spam never appears in isolation: we observe link farms and alliances created for the sole purpose of search engine ranking manipulation. Their artificial nature and strong internal connectedness, however, have given rise to successful algorithms for identifying search engine spam. One example is trust and distrust propagation, an idea originating in recommender systems and P2P networks, which yields spam classifiers by spreading information along hyperlinks from whitelists and blacklists. While most previous results use PageRank variants for propagation, we form classifiers by investigating the similarity top lists of an unknown page under various measures such as co-citation, companion, nearest neighbors in low-dimensional projections, and SimRank. We test our method on two data sets previously used to measure spam filtering algorithms.
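
    The co-citation measure mentioned above can be illustrated with a minimal sketch: two pages are similar if many of the same pages link to both, and an unknown page is scored by how many known spam pages appear in its similarity top list. The inputs in_links and blacklist are hypothetical stand-ins for the Web graph and the labeled training set; this is only an illustration of one of the measures, not the paper's full method (which also uses companion, low-dimensional projections and SimRank).

        def cocitation_similarity(page_u, page_v, in_links):
            # Jaccard similarity of the sets of pages linking to u and to v
            a = in_links.get(page_u, set())
            b = in_links.get(page_v, set())
            if not a or not b:
                return 0.0
            return len(a & b) / len(a | b)

        def spam_score(unknown, candidates, in_links, blacklist, k=10):
            # fraction of known spam among the k pages most co-cited with the unknown page
            ranked = sorted(candidates,
                            key=lambda p: cocitation_similarity(unknown, p, in_links),
                            reverse=True)
            top = ranked[:k]
            return sum(1 for p in top if p in blacklist) / max(len(top), 1)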

    Methods for large scale SVD with missing values

    No full text

    Performing cross-language retrieval with Wikipedia

    No full text
    Abstract. We demonstrate a twofold use of Wikipedia for cross-lingual information retrieval. As our main contribution, we exploit Wikipedia hyperlinkage for query term disambiguation. We also use bilingual Wikipedia articles for dictionary extension. Our method is based on translation disambiguation; we combine the Wikipedia-based technique with a method based on bigram statistics of pairs formed by translations of different source-language terms.
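
    As a rough illustration of the bigram-statistics component mentioned above, the sketch below greedily picks, for each pair of adjacent source terms, the translation pair with the highest bigram frequency in a target-language corpus. The dictionaries translations and bigram_count are hypothetical inputs, and the Wikipedia-hyperlinkage disambiguation step is not shown.

        from itertools import product

        def disambiguate(source_terms, translations, bigram_count):
            # choose one translation per source term by scoring adjacent
            # translation pairs with their bigram frequency in a target corpus
            if len(source_terms) < 2:
                return [translations.get(t, [t])[0] for t in source_terms]
            chosen = []
            for left, right in zip(source_terms, source_terms[1:]):
                pairs = product(translations.get(left, [left]),
                                translations.get(right, [right]))
                best = max(pairs, key=lambda pair: bigram_count.get(pair, 0))
                if not chosen:
                    chosen.append(best[0])
                chosen.append(best[1])
            return chosen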

    Semi-supervised learning: a comparative study for web spam and telephone user churn

    No full text
    Abstract. We compare a wide range of semi-supervised learning techniques both for Web spam filtering and for telephone user churn classification. Semi-supervised learning rests on the assumption that the label of a node in a graph is similar to those of its neighbors. In this paper we measure this phenomenon both for Web spam and for telco churn. We conclude that spam is often linked to spam while honest pages are linked to honest ones; similarly, churn occurs in bursts within groups of a social network.
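
    The neighborhood assumption above underlies simple label propagation, sketched below under assumed inputs: graph is an adjacency dictionary and labels holds the known 0/1 labels (honest/spam, or stay/churn). This is a generic illustration, not the specific set of techniques compared in the paper.

        def propagate_labels(graph, labels, iterations=20):
            # graph: node -> list of neighbours; labels: node -> 0.0 or 1.0 for labeled nodes
            scores = {node: labels.get(node, 0.5) for node in graph}
            for _ in range(iterations):
                updated = {}
                for node, neighbours in graph.items():
                    if node in labels:                 # seed labels stay fixed
                        updated[node] = labels[node]
                    elif neighbours:
                        updated[node] = sum(scores.get(n, 0.5) for n in neighbours) / len(neighbours)
                    else:
                        updated[node] = scores[node]
                scores = updated
            return scores                              # threshold at 0.5 to classify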

    Cross-language retrieval with Wikipedia

    No full text
    We describe a method that translates queries extended by narrative information from one language to another, with the help of an appropriate machine-readable dictionary and the Wikipedia on-line encyclopedia. Processing occurs in three steps: first, we look up possible translations phrase by phrase using both the dictionary and the cross-lingual links provided by Wikipedia; second, improbable translations, detected by a simple language model computed over a large corpus of documents written in the target language, are eliminated; and finally, further filtering is applied by matching Wikipedia concepts against the query narrative and removing translations not related to the overall query topic. Experiments performed on the Los Angeles Times 2002 corpus, translating from Hungarian to English, showed that while queries generated at the end of the second step were roughly only half as effective as original queries, primarily due to the limitations of our tools, after the third step precision improved significantly, reaching 60% of the native English level.
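
    The three-step pipeline can be sketched as a single orchestration function. The helpers passed in (lookup, lm_probability, narrative_concepts) are hypothetical stand-ins for the dictionary/Wikipedia lookup, the target-language model, and the narrative concept matcher; the threshold and fallbacks are illustrative only.

        def translate_query(phrases, narrative, lookup, lm_probability,
                            narrative_concepts, lm_threshold=1e-7):
            # Step 1: phrase-by-phrase lookup via dictionary and Wikipedia cross-lingual links
            candidates = {p: lookup(p) for p in phrases}
            # Step 2: drop translations the target-language model finds improbable
            candidates = {p: [t for t in ts if lm_probability(t) >= lm_threshold] or ts
                          for p, ts in candidates.items()}
            # Step 3: keep translations sharing a Wikipedia concept with the query narrative
            topic = narrative_concepts(narrative)       # assumed to return a set of concepts
            translated = []
            for p, ts in candidates.items():
                kept = [t for t in ts if narrative_concepts(t) & topic]
                translated.extend(kept or ts[:1])       # fall back to the best remaining candidate
            return translated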

    Spectral clustering in telephone call graphs

    No full text
    We evaluate various heuristics for hierarchical spectral clustering in large telephone call graphs. Spectral clustering without additional heuristics often produces very uneven cluster sizes or low-quality clusters that may consist of several disconnected components, a fact that appears to be common for several data sources but, to our knowledge, is not described in the literature. Divide-and-Merge, a recently described post-filtering procedure, may be used to eliminate bad-quality branches in a binary tree hierarchy. We propose an alternative solution that enables k-way cuts in each step by immediately filtering unbalanced or low-quality clusters before splitting them further. Our experiments are performed on graphs with various weightings and normalizations built from call detail records. We investigate a period of eight months covering more than two million Hungarian landline telephone users. We measure clustering quality both by cluster ratio and by the geographic homogeneity of the clusters obtained from telephone location data. Although Divide-and-Merge optimizes its clusters for cluster ratio, our method produces clusters of similar ratio much faster; furthermore, it gives geographically much more homogeneous clusters, with a cluster size distribution resembling that of the settlement structure.
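
    A minimal sketch of one level of such k-way splitting, under simplifying assumptions (a dense NumPy adjacency matrix, scikit-learn KMeans in the spectral embedding): each candidate cluster is checked for size and cluster ratio before it would be split further, which is the filtering idea described above. The thresholds min_size and max_ratio are hypothetical.

        import numpy as np
        from sklearn.cluster import KMeans

        def spectral_split(adjacency, k=4):
            # k-way cut via the k smallest eigenvectors of the normalized Laplacian
            degree = adjacency.sum(axis=1)
            d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(degree, 1e-12)))
            laplacian = np.eye(len(adjacency)) - d_inv_sqrt @ adjacency @ d_inv_sqrt
            _, eigvecs = np.linalg.eigh(laplacian)      # eigenvalues in ascending order
            embedding = eigvecs[:, :k]
            return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)

        def cluster_ratio(adjacency, members):
            # cut weight leaving the cluster, normalized by |S| * |V \ S| (smaller is better)
            mask = np.zeros(len(adjacency), dtype=bool)
            mask[members] = True
            cut = adjacency[mask][:, ~mask].sum()
            return cut / max(int(mask.sum()) * int((~mask).sum()), 1)

        def split_if_good(adjacency, k=4, min_size=50, max_ratio=1e-3):
            # split into k clusters, then keep only balanced, well-separated ones
            # as candidates for further splitting
            labels = spectral_split(adjacency, k)
            keep = []
            for c in range(k):
                members = np.where(labels == c)[0]
                if len(members) >= min_size and cluster_ratio(adjacency, members) <= max_ratio:
                    keep.append(members)
            return keep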