
    Identification of Web Spam through Clustering of Website Structures

    Spam websites are domains whose owners are not interested in using them as gateways for their own activities; instead, the domains are parked to be sold on the secondary market of web domains. To turn the cost of the annual registration fees into an opportunity for revenue, spam websites most often host a large amount of ads in the hope that someone who lands on the site by chance clicks on some of them. Since parking has become a widespread activity, a large number of specialized companies have emerged and made parking a straightforward task that simply requires setting the domain's name servers appropriately. Although parking is a legal activity, spam websites have a deeply negative impact on the information quality of the web and can significantly degrade the performance of most web mining tools. For example, these websites can influence search engine results or introduce an extra burden for crawling systems. In addition, spam websites represent a cost for ad bidders, who are obliged to pay for impressions or clicks that have a negligible probability of producing revenue. In this paper, we experimentally show that spam websites hosted by the same service provider tend to have a similar look-and-feel. Exploiting this structural similarity, we address the problem of automatically identifying spam websites. In addition, we use the outcome of the classification to compile the list of name servers used by spam websites, so that these sites can be discarded right after the first DNS query, before the first connection is made. A dump of our dataset (including web pages and meta information) and the corresponding manual classification is freely available upon request.
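
    The pipeline suggested by the abstract — cluster sites by structural look-and-feel, then harvest the name servers of the sites classified as spam — can be pictured with a small sketch. The sketch below is only an assumption of how such a step might look: the tag-sequence shingles, the greedy single-pass clustering and the 0.8 threshold are illustrative choices, not the features or the classifier actually used in the paper.

    from collections import defaultdict
    from html.parser import HTMLParser

    class TagSequenceParser(HTMLParser):
        """Collects the sequence of opening tags as a crude structural fingerprint."""
        def __init__(self):
            super().__init__()
            self.tags = []

        def handle_starttag(self, tag, attrs):
            self.tags.append(tag)

    def structural_shingles(html, k=4):
        parser = TagSequenceParser()
        parser.feed(html)
        seq = parser.tags
        return {tuple(seq[i:i + k]) for i in range(max(len(seq) - k + 1, 1))}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    def cluster_by_structure(pages, threshold=0.8):
        """Greedy single-pass clustering: assign each site to the first cluster
        whose representative fingerprint is similar enough (hypothetical scheme)."""
        clusters = []  # list of (representative shingle set, [domains])
        for domain, html in pages.items():
            shingles = structural_shingles(html)
            for rep, members in clusters:
                if jaccard(shingles, rep) >= threshold:
                    members.append(domain)
                    break
            else:
                clusters.append((shingles, [domain]))
        return [members for _, members in clusters]

    def suspicious_name_servers(spam_domains, name_servers_of):
        """Compile the name servers used by domains classified as spam,
        most frequently used first."""
        counts = defaultdict(int)
        for domain in spam_domains:
            for ns in name_servers_of.get(domain, []):
                counts[ns] += 1
        return sorted(counts, key=counts.get, reverse=True)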

    Dynamic User-Defined Similarity Searching in Semi-Structured Text Retrieval

    Modern text retrieval systems often provide a similarity search utility that allows the user to efficiently find a fixed number k of documents in the data set that are most similar to a given query (here a query is either a simple sequence of keywords or the identifier of a full document, found in previous searches, that is considered of interest). We consider the case of a textual database made of semi-structured documents. For example, in a corpus of bibliographic records any record may be structured into three fields: title, authors and abstract, where each field is an unstructured free text. Each field, in turn, is modelled with a specific vector space. The problem becomes more complex when we also allow each such vector space to have an associated user-defined dynamic weight that influences its contribution to the overall dynamic aggregated and weighted similarity. This dynamic problem was tackled in a recent paper by Singitham et al. in VLDB 2004. Their proposed solution, which we take as a baseline, is a variant of the cluster-pruning technique that has the potential to scale to very large corpora of documents and is far more efficient than naive exhaustive search. We devise an alternative way of embedding weights in the data structure, coupled with a non-trivial application of a clustering algorithm based on the furthest-point-first heuristic for the metric k-center problem. The validity of our approach is demonstrated experimentally: we significantly improve the trade-off between query time and output quality with respect to the baseline method of VLDB 2004, and also with respect to a novel method by Chierichetti et al., to appear in ACM PODS 2007. We also speed up the pre-processing time by a factor of at least thirty.
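
    Two ingredients mentioned above lend themselves to a compact illustration: a per-field similarity aggregated with user-defined dynamic weights, and the furthest-point-first heuristic for the metric k-center problem. The sketch below assumes cosine similarity per field and plain dicts as sparse vectors; it does not reproduce the paper's data structure for embedding weights, only the general idea.

    import math

    def cosine(u, v):
        """Cosine similarity between two sparse vectors given as term -> weight dicts."""
        dot = sum(u.get(t, 0.0) * w for t, w in v.items())
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def weighted_similarity(doc, query, weights):
        """Aggregate per-field similarities with user-defined dynamic weights.
        doc and query map field name -> term-weight dict; weights maps field -> float."""
        total = sum(weights.values())
        if not total:
            return 0.0
        return sum(weights[f] * cosine(doc.get(f, {}), query.get(f, {}))
                   for f in weights) / total

    def fpf_centers(points, k, distance):
        """Furthest-point-first heuristic for the metric k-center problem:
        repeatedly pick the point farthest from the centers chosen so far."""
        if not points:
            return []
        centers = [points[0]]
        dist = [distance(p, centers[0]) for p in points]
        while len(centers) < min(k, len(points)):
            i = max(range(len(points)), key=lambda j: dist[j])
            centers.append(points[i])
            dist = [min(dist[j], distance(points[j], centers[-1]))
                    for j in range(len(points))]
        return centers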

    Lumbricus webis: a parallel and distributed crawling architecture for the Italian web

    Web crawlers have become popular tools for gathering large portions of the web that can be used for many tasks, from statistics to structural analysis of the web. Due to the amount of data and the heterogeneity of the tasks to manage, it is essential for crawlers to have a modular and distributed architecture. In this paper we describe Lumbricus webis (L.webis for short), a modular crawling infrastructure built to mine data from the ccTLD .it web domain and from portions of the web reachable from this domain. Its purpose is to support the gathering of advanced statistics and the application of advanced analytic tools on the content of the Italian Web. This paper describes the architectural features of L.webis and its performance. L.webis can currently download a mid-sized ccTLD such as ".it" in about one week.
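
    As an illustration of the kind of component a modular crawler is assembled from, the sketch below shows a per-host frontier that restricts the crawl to the .it ccTLD and enforces a simple politeness delay. The class and its parameters are hypothetical and are not taken from the L.webis architecture.

    import time
    from collections import deque
    from urllib.parse import urlparse

    class PerHostFrontier:
        """Hypothetical crawl frontier: one queue per host, simple politeness delay."""
        def __init__(self, allowed_suffix=".it", delay_seconds=2.0):
            self.allowed_suffix = allowed_suffix
            self.delay = delay_seconds
            self.queues = {}        # host -> deque of URLs still to fetch
            self.last_fetch = {}    # host -> timestamp of the last download
            self.seen = set()

        def add(self, url):
            host = urlparse(url).hostname or ""
            if not host.endswith(self.allowed_suffix) or url in self.seen:
                return
            self.seen.add(url)
            self.queues.setdefault(host, deque()).append(url)

        def next_url(self):
            """Return a URL from some host whose politeness delay has elapsed,
            or None if every host must still wait."""
            now = time.time()
            for host, queue in self.queues.items():
                if queue and now - self.last_fetch.get(host, 0.0) >= self.delay:
                    self.last_fetch[host] = now
                    return queue.popleft()
            return None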

    Cluster Generation and Cluster Labelling for Web Snippets: A Fast and Accurate Hierarchical Solution

    This paper describes Armil, a meta-search engine that groups into disjoint labelled clusters the Web snippets returned by auxiliary search engines. The cluster labels generated by Armil provide the user with a compact guide to assessing the relevance of each cluster to her information need. Striking the right balance between running time and cluster well-formedness was a key point in the design of our system. Both the clustering and the labelling tasks are performed on the fly by processing only the snippets provided by the auxiliary search engines, and use no external sources of knowledge. Clustering is performed by means of a fast version of the furthest-point-first algorithm for metric k-center clustering. Cluster labelling is achieved by combining intra-cluster and inter-cluster term extraction based on a variant of the information gain measure. We have tested the clustering effectiveness of Armil against Vivisimo, the de facto industrial standard in Web snippet clustering, using as benchmark a comprehensive set of snippets obtained from the Open Directory Project hierarchy. According to two widely accepted 'external' metrics of clustering quality, Armil achieves better performance levels by 10%. We also report the results of a thorough user evaluation of both the clustering and the cluster labelling algorithms. On a standard 1GHz machine, Armil performs clustering and labelling altogether in less than one second.
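
    The cluster-labelling step described above can be pictured with a small sketch that contrasts a term's frequency inside a cluster with its frequency in the rest of the collection, in the spirit of an information-gain score. The smoothing and the exact scoring formula below are assumptions for illustration; Armil's actual variant of information gain is not reproduced here.

    import math
    from collections import Counter

    def label_cluster(cluster_docs, other_docs, top_k=3):
        """cluster_docs / other_docs: lists of token lists.
        Returns the top_k terms that best separate the cluster from the rest."""
        in_counts = Counter(t for doc in cluster_docs for t in set(doc))
        out_counts = Counter(t for doc in other_docs for t in set(doc))
        n_in, n_out = len(cluster_docs), len(other_docs)

        def score(term):
            # Smoothed document frequencies inside and outside the cluster;
            # higher when the term is frequent in the cluster and rare elsewhere.
            p_in = (in_counts[term] + 0.5) / (n_in + 1.0)
            p_out = (out_counts[term] + 0.5) / (n_out + 1.0)
            return p_in * math.log(p_in / p_out)

        ranked = sorted(in_counts, key=score, reverse=True)
        return ranked[:top_k]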

    Packet Classification via Improved Space Decomposition Techniques

    Packet classification is a common task in modern Internet routers. The goal is to classify packets into "classes" or "flows" according to a ruleset that looks at multiple fields of each packet. Differentiated actions can then be applied to the traffic depending on the result of the classification. Even though rulesets can be expressed in a relatively compact way by using high-level languages, the resulting decision trees can partition the search space (the set of possible attribute values) into a potentially very large number of regions. This calls for methods that scale to such large problem sizes, though the only scalable proposal in the literature so far is the one based on a Fat Inverted Segment Tree [1]. In this paper we propose a new geometric technique called G-filter for packet classification in multiple dimensions. G-filter is based on an improved space decomposition technique. In addition to a theoretical analysis of G-filter's classification time complexity and of its slightly super-linear space in the number of rules, we provide thorough experiments showing that the constants involved are extremely small on a wide range of problem sizes, and that G-filter improves on the best results in the literature for large problem sizes, while being competitive for small sizes as well.
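
    To make the notion of a space decomposition technique concrete, the toy classifier below recursively splits a two-dimensional search space into quadrants, stores a rule at the regions it fully covers, and classifies a packet by walking down the decomposition. This is only a didactic sketch under simplifying assumptions; it is not G-filter and does not achieve G-filter's bounds.

    class Node:
        """Quadtree-style decomposition of a 2D search space; rules are
        (x0, y0, x1, y1, priority, action) half-open rectangles."""
        def __init__(self, x0, y0, x1, y1, depth=0, max_depth=6):
            self.box = (x0, y0, x1, y1)
            self.rules = []
            self.children = None
            self.depth = depth
            self.max_depth = max_depth

        def insert(self, rule):
            rx0, ry0, rx1, ry1, *_ = rule
            x0, y0, x1, y1 = self.box
            covers = rx0 <= x0 and ry0 <= y0 and rx1 >= x1 and ry1 >= y1
            if covers or self.depth == self.max_depth:
                self.rules.append(rule)   # keep the rule at the region it covers
                return
            if self.children is None:
                mx, my = (x0 + x1) // 2, (y0 + y1) // 2
                self.children = [Node(x0, y0, mx, my, self.depth + 1, self.max_depth),
                                 Node(mx, y0, x1, my, self.depth + 1, self.max_depth),
                                 Node(x0, my, mx, y1, self.depth + 1, self.max_depth),
                                 Node(mx, my, x1, y1, self.depth + 1, self.max_depth)]
            for child in self.children:
                cx0, cy0, cx1, cy1 = child.box
                if rx0 < cx1 and rx1 > cx0 and ry0 < cy1 and ry1 > cy0:
                    child.insert(rule)    # push the rule only into overlapping quadrants

        def classify(self, x, y):
            """Return the highest-priority rule whose rectangle contains (x, y)."""
            best = max((r for r in self.rules
                        if r[0] <= x < r[2] and r[1] <= y < r[3]),
                       key=lambda r: r[4], default=None)
            if self.children:
                mx, my = self.children[0].box[2], self.children[0].box[3]
                child = self.children[(1 if x >= mx else 0) + (2 if y >= my else 0)]
                deeper = child.classify(x, y)
                if deeper is not None and (best is None or deeper[4] > best[4]):
                    best = deeper
            return best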

    On the Benefits of Keyword Spreading in Sponsored Search Auctions: An Experimental Analysis

    Sellers of goods or services wishing to participate in sponsored search auctions must define a pool of keywords that are matched on-line to the queries submitted by users to a search engine. Sellers must also define the value of their bid to the search engine for showing their advertisements in case of a query-keyword match. In order to optimize its revenue, a seller might decide to substitute a high-cost keyword, likely to be the object of intense competition, with a set of related keywords that collectively have a lower cost while capturing an equivalent volume of user clicks. This technique is called keyword spreading and has recently attracted the attention of several researchers in the area of sponsored search auctions. In this paper we describe an experimental benchmark that, through large-scale realistic simulations, allows us to pinpoint the potential benefits and drawbacks of keyword spreading for the players using this technique, for those not using it, and for the search engine itself. Experimental results reveal that keyword spreading is generally convenient (or non-damaging) to all parties involved.
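
    The keyword-spreading decision itself — trading one expensive keyword for a set of cheaper related ones with comparable click volume — can be sketched as a simple greedy selection. The function and the example numbers below are hypothetical and are not the strategy or data used in the benchmark.

    def spread_keyword(candidates, target_clicks):
        """candidates: list of (keyword, expected_clicks, cost_per_click) tuples.
        Greedily pick the cheapest keywords (by cost per click) until the
        target click volume is reached. Returns (chosen keywords, clicks, cost)."""
        chosen, clicks, cost = [], 0.0, 0.0
        for keyword, expected_clicks, cpc in sorted(candidates, key=lambda t: t[2]):
            if clicks >= target_clicks:
                break
            chosen.append(keyword)
            clicks += expected_clicks
            cost += expected_clicks * cpc
        return chosen, clicks, cost

    # Hypothetical numbers: spreading the head keyword "running shoes"
    # over longer, cheaper queries that together reach 600 clicks.
    spread, clicks, cost = spread_keyword(
        [("trail running shoes", 300, 0.40),
         ("lightweight running shoes", 250, 0.35),
         ("marathon shoes", 200, 0.50)],
        target_clicks=600)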

    Fast exact computation of betweenness centrality in social networks

    Social networks have proved in the last few years to be a powerful and flexible concept for representing and analyzing data. They borrow some basic concepts from sociology in order to model how people (or data items) establish relationships with each other. The study of these relationships can provide a deeper understanding of many emergent global phenomena. The amount of data available in the form of social networks is growing by the day, and this poses many computationally challenging problems for their analysis. In fact, many analysis tools suitable for small to medium-sized networks are inefficient for large social networks. In this paper we present a novel approach to the computation of betweenness centrality, which considerably speeds up Brandes' algorithm in the context of social networking. Our algorithm exploits the natural sparsity of the data to algebraically (and efficiently) determine the betweenness of those nodes organized as trees embedded in the social network. Moreover, for the residual network, which is often much smaller, we modify Brandes' algorithm so that nodes already processed can be removed and the shortest-path computation is performed only for the remaining nodes. We tested our algorithm using a set of 18 real sparse large social networks provided by Sistemi Territoriali, an Italian ICT company specialized in Business Intelligence. Our tests show that our algorithm consistently runs more than an order of magnitude faster than Brandes' procedure on such sparse networks.
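
    For reference, the baseline that the paper speeds up is Brandes' algorithm; a straightforward implementation for unweighted, undirected graphs is sketched below. The tree-based algebraic elimination and the node-removal modification described in the abstract are not reproduced here.

    from collections import deque

    def brandes_betweenness(adj):
        """adj: dict mapping node -> iterable of neighbours (undirected, unweighted)."""
        bc = {v: 0.0 for v in adj}
        for s in adj:
            # Single-source shortest paths by BFS, counting path multiplicities.
            stack, pred = [], {v: [] for v in adj}
            sigma = {v: 0 for v in adj}
            dist = {v: -1 for v in adj}
            sigma[s], dist[s] = 1, 0
            queue = deque([s])
            while queue:
                v = queue.popleft()
                stack.append(v)
                for w in adj[v]:
                    if dist[w] < 0:
                        dist[w] = dist[v] + 1
                        queue.append(w)
                    if dist[w] == dist[v] + 1:
                        sigma[w] += sigma[v]
                        pred[w].append(v)
            # Accumulate dependencies in order of non-increasing distance.
            delta = {v: 0.0 for v in adj}
            while stack:
                w = stack.pop()
                for v in pred[w]:
                    delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
                if w != s:
                    bc[w] += delta[w]
        for v in bc:      # each undirected pair was counted twice
            bc[v] /= 2.0
        return bc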