22 research outputs found

    Link-based similarity search to fight web spam

    www.ilab.sztaki.hu/websearch
    We investigate the usability of similarity search in fighting Web spam, based on the assumption that an unknown spam page is more similar to certain known spam pages than to honest pages. To be successful, search engine spam never appears in isolation: we observe link farms and alliances created for the sole purpose of search engine ranking manipulation. Their artificial nature and strong internal connectedness, however, have given rise to successful algorithms for identifying search engine spam. One example is trust and distrust propagation, an idea originating in recommender systems and P2P networks, which yields spam classifiers by spreading information along hyperlinks from whitelists and blacklists. While most previous results use PageRank variants for propagation, we form classifiers by investigating the similarity top lists of an unknown page under various measures such as co-citation, companion, nearest neighbors in low-dimensional projections, and SimRank. We test our method on two data sets previously used to measure spam filtering algorithms.
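
    The co-citation measure mentioned above can be illustrated with a minimal sketch: two pages are similar if many of the same pages link to both, and an unknown page is scored by how many known spam pages appear in its similarity top list. The inputs in_links and blacklist are hypothetical stand-ins for the Web graph and the labeled training set; this is only an illustration of one of the measures, not the paper's full method (which also uses companion, low-dimensional projections and SimRank).

        def cocitation_similarity(page_u, page_v, in_links):
            # Jaccard similarity of the sets of pages linking to u and to v
            a = in_links.get(page_u, set())
            b = in_links.get(page_v, set())
            if not a or not b:
                return 0.0
            return len(a & b) / len(a | b)

        def spam_score(unknown, candidates, in_links, blacklist, k=10):
            # fraction of known spam among the k pages most co-cited with the unknown page
            ranked = sorted(candidates,
                            key=lambda p: cocitation_similarity(unknown, p, in_links),
                            reverse=True)
            top = ranked[:k]
            return sum(1 for p in top if p in blacklist) / max(len(top), 1)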

    Methods for large scale SVD with missing values

    No full text

    Performing cross-language retrieval with Wikipedia

    No full text
    Abstract. We demonstrate a twofold use of Wikipedia for cross-lingual information retrieval. As our main contribution, we exploit Wikipedia hyperlinkage for query term disambiguation. We also use bilingual Wikipedia articles for dictionary extension. Our method is based on translation disambiguation; we combine the Wikipedia-based technique with a method based on bigram statistics of pairs formed by translations of different source-language terms.
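
    As a rough illustration of the bigram-statistics component mentioned above, the sketch below greedily picks, for each pair of adjacent source terms, the translation pair with the highest bigram frequency in a target-language corpus. The dictionaries translations and bigram_count are hypothetical inputs, and the Wikipedia-hyperlinkage disambiguation step is not shown.

        from itertools import product

        def disambiguate(source_terms, translations, bigram_count):
            # choose one translation per source term by scoring adjacent
            # translation pairs with their bigram frequency in a target corpus
            if len(source_terms) < 2:
                return [translations.get(t, [t])[0] for t in source_terms]
            chosen = []
            for left, right in zip(source_terms, source_terms[1:]):
                pairs = product(translations.get(left, [left]),
                                translations.get(right, [right]))
                best = max(pairs, key=lambda pair: bigram_count.get(pair, 0))
                if not chosen:
                    chosen.append(best[0])
                chosen.append(best[1])
            return chosen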

    Semi-supervised learning: a comparative study for web spam and telephone user churn

    No full text
    Abstract. We compare a wide range of semi-supervised learning techniques both for Web spam filtering and for telephone user churn classification. Semi-supervised learning rests on the assumption that the label of a node in a graph is similar to those of its neighbors. In this paper we measure this phenomenon both for Web spam and for telco churn. We conclude that spam is often linked to spam while honest pages are linked to honest ones; similarly, churn occurs in bursts within groups of a social network.
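
    The neighborhood assumption above underlies simple label propagation, sketched below under assumed inputs: graph is an adjacency dictionary and labels holds the known 0/1 labels (honest/spam, or stay/churn). This is a generic illustration, not the specific set of techniques compared in the paper.

        def propagate_labels(graph, labels, iterations=20):
            # graph: node -> list of neighbours; labels: node -> 0.0 or 1.0 for labeled nodes
            scores = {node: labels.get(node, 0.5) for node in graph}
            for _ in range(iterations):
                updated = {}
                for node, neighbours in graph.items():
                    if node in labels:                 # seed labels stay fixed
                        updated[node] = labels[node]
                    elif neighbours:
                        updated[node] = sum(scores.get(n, 0.5) for n in neighbours) / len(neighbours)
                    else:
                        updated[node] = scores[node]
                scores = updated
            return scores                              # threshold at 0.5 to classify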

    Cross-language retrieval with Wikipedia

    No full text
    We describe a method that translates queries extended by narrative information from one language to another, with the help of an appropriate machine-readable dictionary and the Wikipedia on-line encyclopedia. Processing occurs in three steps: first, we look up possible translations phrase by phrase using both the dictionary and the cross-lingual links provided by Wikipedia; second, improbable translations, detected by a simple language model computed over a large corpus of documents written in the target language, are eliminated; and finally, further filtering is applied by matching Wikipedia concepts against the query narrative and removing translations not related to the overall query topic. Experiments performed on the Los Angeles Times 2002 corpus, translating from Hungarian to English, showed that while queries generated at the end of the second step were roughly only half as effective as original queries, primarily due to the limitations of our tools, after the third step precision improved significantly, reaching 60% of the native English level.
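
    The three-step pipeline can be sketched as a single orchestration function. The helpers passed in (lookup, lm_probability, narrative_concepts) are hypothetical stand-ins for the dictionary/Wikipedia lookup, the target-language model, and the narrative concept matcher; the threshold and fallbacks are illustrative only.

        def translate_query(phrases, narrative, lookup, lm_probability,
                            narrative_concepts, lm_threshold=1e-7):
            # Step 1: phrase-by-phrase lookup via dictionary and Wikipedia cross-lingual links
            candidates = {p: lookup(p) for p in phrases}
            # Step 2: drop translations the target-language model finds improbable
            candidates = {p: [t for t in ts if lm_probability(t) >= lm_threshold] or ts
                          for p, ts in candidates.items()}
            # Step 3: keep translations sharing a Wikipedia concept with the query narrative
            topic = narrative_concepts(narrative)       # assumed to return a set of concepts
            translated = []
            for p, ts in candidates.items():
                kept = [t for t in ts if narrative_concepts(t) & topic]
                translated.extend(kept or ts[:1])       # fall back to the best remaining candidate
            return translated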

    Spectral clustering in telephone call graphs

    No full text
    We evaluate various heuristics for hierarchical spectral clustering in large telephone call graphs. Spectral clustering without additional heuristics often produces very uneven cluster sizes or low-quality clusters that may consist of several disconnected components, a fact that appears to be common for several data sources but, to our knowledge, is not described in the literature. Divide-and-Merge, a recently described post-filtering procedure, may be used to eliminate bad-quality branches in a binary tree hierarchy. We propose an alternative solution that enables k-way cuts in each step by immediately filtering unbalanced or low-quality clusters before splitting them further. Our experiments are performed on graphs with various weightings and normalizations built from call detail records. We investigate a period of eight months covering more than two million Hungarian landline telephone users. We measure clustering quality both by cluster ratio and by the geographic homogeneity of the clusters obtained from telephone location data. Although Divide-and-Merge optimizes its clusters for cluster ratio, our method produces clusters of similar ratio much faster; furthermore, it gives geographically much more homogeneous clusters, with a cluster size distribution resembling that of the settlement structure.
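
    A minimal sketch of one level of such k-way splitting, under simplifying assumptions (a dense NumPy adjacency matrix, scikit-learn KMeans in the spectral embedding): each candidate cluster is checked for size and cluster ratio before it would be split further, which is the filtering idea described above. The thresholds min_size and max_ratio are hypothetical.

        import numpy as np
        from sklearn.cluster import KMeans

        def spectral_split(adjacency, k=4):
            # k-way cut via the k smallest eigenvectors of the normalized Laplacian
            degree = adjacency.sum(axis=1)
            d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(degree, 1e-12)))
            laplacian = np.eye(len(adjacency)) - d_inv_sqrt @ adjacency @ d_inv_sqrt
            _, eigvecs = np.linalg.eigh(laplacian)      # eigenvalues in ascending order
            embedding = eigvecs[:, :k]
            return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)

        def cluster_ratio(adjacency, members):
            # cut weight leaving the cluster, normalized by |S| * |V \ S| (smaller is better)
            mask = np.zeros(len(adjacency), dtype=bool)
            mask[members] = True
            cut = adjacency[mask][:, ~mask].sum()
            return cut / max(int(mask.sum()) * int((~mask).sum()), 1)

        def split_if_good(adjacency, k=4, min_size=50, max_ratio=1e-3):
            # split into k clusters, then keep only balanced, well-separated ones
            # as candidates for further splitting
            labels = spectral_split(adjacency, k)
            keep = []
            for c in range(k):
                members = np.where(labels == c)[0]
                if len(members) >= min_size and cluster_ratio(adjacency, members) <= max_ratio:
                    keep.append(members)
            return keep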