85 research outputs found

    Metadata indexing in a structured peer-to-peer network

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007. Includes bibliographical references (p. 69-77). Peer-to-peer networks require an efficient means of searching for files by metadata keywords. Unfortunately, current methods usually sacrifice either scalability or recall. Arpeggio is a peer-to-peer file-sharing network that uses the Chord lookup primitive as the basis for constructing a distributed keyword-set index, augmented with index-side filtering, to address this problem. We introduce index gateways, a technique for minimizing index maintenance overhead. Arpeggio also includes a content distribution system for finding source peers for a file; we present a novel system that uses Chord subrings to track live source peers without the cost of inserting the data itself into the network, and that supports postfetching: using information in the index to improve the availability of rare files. The result is a system that provides efficient query operations with the scalability and reliability advantages of full decentralization. We use analysis and simulation results to show that our indexing system has reasonable storage and bandwidth costs, and improves load distribution. By Dan R. K. Ports. M.Eng.
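As a rough illustration of the keyword-set indexing idea described above (a minimal single-machine sketch under assumptions, not Arpeggio's actual implementation): each file's metadata keywords are expanded into every subset up to a small size K, each subset is hashed to an index key that stands in for a Chord identifier, and a query is routed to one such key and then filtered index-side. The dict standing in for the DHT, the parameter K, and all function names below are hypothetical.

```python
import hashlib
from itertools import combinations

K = 3  # assumed maximum size of indexed keyword subsets

def subset_key(keywords):
    """Hash a sorted keyword tuple to an index key (stands in for a Chord ID)."""
    return hashlib.sha1(" ".join(keywords).encode("utf-8")).hexdigest()

def index_keys(keywords, k_max=K):
    """All keyword-subset keys under which a file's metadata is registered."""
    kws = sorted({w.lower() for w in keywords})
    for size in range(1, min(k_max, len(kws)) + 1):
        for subset in combinations(kws, size):
            yield subset_key(subset)

def index_file(dht, keywords, metadata):
    """Register metadata under every keyword-subset key; `dht` is a plain
    dict here but stands in for a Chord put()/get() interface."""
    kws = frozenset(w.lower() for w in keywords)
    for key in index_keys(keywords):
        dht.setdefault(key, []).append((kws, metadata))

def search(dht, query_keywords):
    """Route the query to a single keyword-subset key, then filter index-side
    so only entries containing *all* query keywords are returned."""
    q = sorted({w.lower() for w in query_keywords})
    key = subset_key(tuple(q[:K]))          # one lookup instead of per-keyword lookups
    return [meta for kws, meta in dht.get(key, [])
            if all(w in kws for w in q)]    # index-side filtering

# Hypothetical usage:
dht = {}
index_file(dht, ["distributed", "systems", "chord"], {"file": "paper.pdf"})
print(search(dht, ["chord", "distributed"]))   # -> [{'file': 'paper.pdf'}]
```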

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Based on the information provided by European projects and national initiatives related to multimedia search, as well as domain experts who participated in the CHORUS think-tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and socio-economic perspective. The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives that measure the performance of multimedia search engines. From a socio-economic perspective, we inventory the impact and legal consequences of these technical advances and point out future directions of research.

    Web service QoS prediction using improved software source code metrics

    Due to the popularity of Web-based applications, various developers have provided an abundance of Web services with similar functionality. Such similarity makes it challenging for users to discover, select, and recommend appropriate Web services for service-oriented systems. Quality of Service (QoS) has become a vital criterion for service discovery, selection, and recommendation. Unfortunately, service registries cannot ensure the validity of the quality values of the Web services available online. Consequently, predicting Web services' QoS values has become a vital way to find the most appropriate services. In this paper, we propose a novel methodology for predicting Web service QoS using source code metrics. The core component aggregates software metrics using an inequality distribution, from the micro level of individual classes to the macro level of the entire Web service. We use the correlation between QoS and software metrics to train the learning machine. We validate and evaluate our approach using three sets of software quality metrics. Our results show that the proposed methodology can help improve the efficiency of predicting a Web service's QoS properties from its source code metrics.
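The aggregation-plus-regression idea can be sketched roughly as follows (a minimal illustration under assumptions, not the paper's actual pipeline): class-level metrics are summarized at the service level, here with mean, max, and a Gini coefficient as one possible inequality-based aggregation, and a standard regressor is then trained against measured QoS values. The metric names, the Gini choice, and the random forest are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def gini(values):
    """Gini coefficient: one possible inequality measure for aggregating
    class-level metric values into a single service-level figure."""
    v = np.sort(np.asarray(values, dtype=float))
    n = v.size
    if n == 0 or v.sum() == 0:
        return 0.0
    cum = np.cumsum(v)
    return (n + 1 - 2 * (cum / cum[-1]).sum()) / n

def service_features(class_metrics):
    """class_metrics: {class_name: {metric_name: value}}.
    Aggregate each metric across classes with mean, max and Gini."""
    names = sorted(next(iter(class_metrics.values())).keys())
    feats = []
    for m in names:
        col = [c[m] for c in class_metrics.values()]
        feats += [np.mean(col), np.max(col), gini(col)]
    return np.array(feats)

# Tiny synthetic example: 3 services, 2 classes each, 2 metrics (cbo, wmc)
services = [
    {"A": {"cbo": 3, "wmc": 10}, "B": {"cbo": 7, "wmc": 25}},
    {"A": {"cbo": 1, "wmc": 4},  "B": {"cbo": 2, "wmc": 6}},
    {"A": {"cbo": 9, "wmc": 40}, "B": {"cbo": 8, "wmc": 35}},
]
qos = [120.0, 45.0, 300.0]   # made-up response times in ms
X = np.vstack([service_features(s) for s in services])
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, qos)
print(model.predict(X[:1]))   # predicted QoS for the first service
```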

    Parallel and Distributed Statistical-based Extraction of Relevant Multiwords from Large Corpora

    The amount of information available through the Internet has shown significant growth in the last decade. This information can come from various sources, such as scientific experiments in particle accelerators, recorded flight data from a commercial aircraft, or sets of documents from a given domain such as medical articles, news headlines from a newspaper, or social network content. Due to the volume of data that must be analyzed, it is necessary to endow search engines with new tools that allow the user to obtain the desired information in a timely and accurate manner. One approach is the annotation of documents with their relevant expressions. The extraction of relevant expressions from natural language text documents can be accomplished with semantic, syntactic, or statistical techniques. Although the latter tend to be less accurate, they have the advantage of being language-independent. This investigation was performed in the context of LocalMaxs, a statistical and thus language-independent method capable of extracting relevant expressions from natural language corpora. However, due to the large volume of data involved, sequential implementations of the above techniques have severe limitations in both execution time and memory space. In this thesis we propose a distributed architecture and strategies for parallel implementations of statistical extraction of relevant expressions from large corpora. A methodology was developed for modeling and evaluating those strategies, based on empirical and theoretical approaches to estimating the statistical distribution of n-grams in natural language corpora. These approaches were applied to guide the design, and to evaluate the behavior, of LocalMaxs parallel and distributed implementations on cluster and cloud computing platforms. The implementation alternatives were compared with regard to their precision and recall and their performance metrics, namely execution time, parallel speedup, and sizeup. The performance results indicate almost linear speedup and sizeup over the range of large corpus sizes.
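For intuition, the statistical core can be sketched on a single machine (this is a toy approximation, not the thesis's LocalMaxs definition or its distributed implementation): n-gram "glue" is estimated with a symmetric-conditional-probability-style measure, and an n-gram is kept as a relevant expression only when its glue is not exceeded by the glue of its contained (n-1)-grams nor by the (n+1)-grams containing it. The exact glue measure and local-max criterion used by LocalMaxs differ in detail.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_counts(tokens, max_n):
    return {n: Counter(ngrams(tokens, n)) for n in range(1, max_n + 1)}

def prob(counts, gram, total):
    return counts[len(gram)][gram] / total

def glue_scp(counts, gram, total):
    """SCP-style glue (assumed form): p(gram)^2 over the average, across
    split points, of p(prefix) * p(suffix)."""
    n = len(gram)
    p = prob(counts, gram, total)
    denom = sum(prob(counts, gram[:i], total) * prob(counts, gram[i:], total)
                for i in range(1, n)) / (n - 1)
    return p * p / denom if denom else 0.0

def local_maxs(tokens, max_n=3):
    """Simplified LocalMaxs-style filter over n-grams of length 2..max_n."""
    total = len(tokens)
    counts = build_counts(tokens, max_n + 1)
    glue = {g: glue_scp(counts, g, total)
            for n in range(2, max_n + 2) for g in counts[n]}
    relevant = []
    for n in range(2, max_n + 1):
        for g in counts[n]:
            parts = [g[:-1], g[1:]] if n > 2 else []
            supers = [s for s in counts[n + 1] if g == s[:-1] or g == s[1:]]
            if all(glue[g] >= glue[p] for p in parts) and \
               all(glue[g] > glue[s] for s in supers):
                relevant.append(" ".join(g))
    return relevant

tokens = "the new york stock exchange is based in new york city".split()
print(local_maxs(tokens))
```

Parallelizing this amounts to distributing the corpus, counting n-grams per partition, merging the counts, and evaluating the local-max condition per n-gram, which is what makes cluster and cloud implementations attractive for large corpora.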

    Location based services in wireless ad hoc networks

    In this dissertation, we investigate location based services in wireless ad hoc networks from four different aspects: i) location privacy in wireless sensor networks (privacy), ii) end-to-end secure communication in randomly deployed wireless sensor networks (security), iii) the quality versus latency trade-off in content retrieval under ad hoc node mobility (performance), and iv) location clustering based Sybil attack detection in vehicular ad hoc networks (trust). The first contribution of this dissertation addresses location privacy in wireless sensor networks. We propose a non-cooperative sensor localization algorithm showing how an external entity can stealthily invade the location privacy of sensors in a network. We then design a location privacy preserving tracking algorithm for defending against such adversarial localization attacks. Next, we investigate secure end-to-end communication in randomly deployed wireless sensor networks. Here, due to the lack of control over sensors' locations post deployment, pre-fixing pairwise keys between sensors is not feasible, especially under larger-scale random deployments. Given this, we propose differentiated key pre-distribution for secure end-to-end communication and show how it improves existing routing algorithms. Our next contribution addresses the quality versus latency trade-off in content retrieval under ad hoc node mobility. We propose a two-tiered architecture for efficient content retrieval in such environments. Finally, we investigate Sybil attack detection in vehicular ad hoc networks. A Sybil attacker can create and use multiple counterfeit identities, risking the trust of a vehicular ad hoc network, and then easily escape the location of the attack, avoiding detection. We propose a location based clustering of nodes that leverages vehicle platoon dispersion for detection of Sybil attacks in vehicular ad hoc networks --Abstract, page iii

    Managing cache for efficient query processing

    Ph.D. - Doctor of Philosophy

    Click fraud: how to spot it, how to stop it?

    Online search advertising is currently the greatest source of revenue for many Internet giants such as Google™, Yahoo!™, and Bing™. The increased number of specialized websites and modern profiling techniques have all contributed to an explosion of the income ad brokers draw from online advertising. The single biggest threat to this growth, however, is click fraud. Trained botnets and even individuals are hired by click-fraud specialists in order to maximize the revenue certain users earn from the ads they publish on their websites, or to launch an attack between competing businesses. Most academics and consultants who study online advertising estimate that 15% to 35% of ads in pay-per-click (PPC) online advertising systems are not authentic. In the first two quarters of 2010, US marketers alone spent $5.7 billion on PPC ads, where PPC ads are between 45 and 50 percent of all online ad spending. On average, about $1.5 billion is wasted due to click fraud. These fraudulent clicks are believed to be initiated by users in poor countries, or by botnets, who are trained to click on specific ads. For example, according to a 2010 study from Information Warfare Monitor, the operators of Koobface, a program that installed malicious software to participate in click fraud, made over $2 million in just over a year. The process of making such illegitimate clicks to generate revenue is called click fraud. Search engines claim they filter out most questionable clicks and either do not charge for them or reimburse advertisers that have been wrongly billed. However, this is a hard task, despite claims that brokers' efforts are satisfactory. In the simplest scenario, a publisher continuously clicks on the ads displayed on his own website in order to make revenue. In a more complicated scenario, a travel agent may hire a large, globally distributed botnet to click on its competitor's ads, hence depleting their daily budget. We analyzed these different types of click fraud methods and proposed new methodologies to detect and prevent them in real time. While traditional commercial approaches detect only some specific types of click fraud, the Collaborative Click Fraud Detection and Prevention (CCFDP) system, an architecture that we have implemented based on the proposed methodologies, can detect and prevent all major types of click fraud. The proposed solution collaboratively analyzes detailed user activity on both the server side and the client side to better describe the intention of the click. Data fusion techniques are developed to combine evidence from several data mining models and to obtain a better estimation of the quality of the click traffic. Our ideas are tested through the development of the Collaborative Click Fraud Detection and Prevention (CCFDP) system. Experimental results show that the CCFDP system is better than the existing commercial click fraud solution in three major aspects: 1) detecting more click fraud, especially clicks generated by software; 2) providing prevention ability; and 3) proposing the concept of a click quality score for click quality estimation. In the initial version of CCFDP, we analyzed the performance of the click fraud detection and prediction model using a rule-based algorithm, which is similar to most existing systems.
    We assigned a quality score to each click instead of classifying the click as fraudulent or genuine, because it is hard to obtain solid evidence of click fraud based only on the data collected, and it is difficult to determine the real intention of the users who make the clicks. Results from the initial version revealed that the diversity of click fraud attack types makes it hard for a single countermeasure to prevent click fraud. It is therefore important to be able to combine multiple measures capable of effective protection against click fraud. Accordingly, in the improved version of CCFDP, we provide the traffic quality score as a combination of evidence from several data mining algorithms. We tested the system with data from an actual ad campaign in 2007 and 2008 and compared the results with Google AdWords reports for the same campaign. The results show that a higher percentage of click fraud is present even with the most popular search engine. The multiple-model-based CCFDP always estimated less valid traffic compared to Google; sometimes the difference is as high as 53%. Fast and efficient detection of duplicates is one of the most important requirements of any click fraud solution. Usually duplicate detection algorithms run in real time. In order to provide real-time results, solution providers should utilize data structures that can be updated in real time, and the space required to hold the data should be minimal. In this dissertation, we also addressed the problem of detecting duplicate clicks in pay-per-click streams. We proposed a simple data structure, the Temporal Stateful Bloom Filter (TSBF), an extension of the regular Bloom Filter and Counting Bloom Filter in which the bit vector is replaced with a status vector. Duplicate detection results of the TSBF method are compared with the Buffering, FPBuffering, and CBF methods. The false positive rate of the TSBF is less than 1% and it has no false negatives. The space requirement of the TSBF is the smallest among these solutions. Even though Buffering has neither false positives nor false negatives, its space requirement increases exponentially with the size of the stream data. When the false positive rate of FPBuffering is set to 1%, its false negative rate jumps to around 5%, which will not be tolerated by most streaming data applications. We also compared the TSBF results with the CBF: the TSBF uses only half the space, or less, of a standard CBF with the same false positive probability. One of the biggest successes of CCFDP is the discovery of a new mercantile click bot, the Smart ClickBot. We present a Bayesian approach for detecting Smart ClickBot-type clicks. The system combines evidence extracted from web server sessions to determine the final class of each click. Some of this evidence can be used alone, while some can be used in combination with other features for click bot detection. During training and testing we also addressed the class imbalance problem. Our best classifier shows a recall of 94% and a precision of 89%, with an F1 measure of 92%. The high accuracy of our system proves the effectiveness of the proposed methodology. Since the Smart ClickBot is a sophisticated click bot that manipulates every possible parameter to go undetected, the techniques discussed here can lead to the detection of other types of software bots as well.
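The TSBF idea can be illustrated with a simplified sketch (an assumption-laden approximation, not the dissertation's exact data structure): instead of single bits, each cell stores a small state, here the last-seen timestamp, so a click is reported as a duplicate only when all of its cells were touched within the current time window. The cell layout, hashing, and windowing policy below are assumptions.

```python
import hashlib
import time

class TemporalStatefulBloomFilter:
    """Simplified duplicate-click detector: each cell holds the last time a
    hash touched it; a key is a probable duplicate only if every cell was
    touched within `window` seconds. False positives are possible; false
    negatives within the window are not."""

    def __init__(self, size=1 << 20, num_hashes=4, window=60.0):
        self.size = size
        self.num_hashes = num_hashes
        self.window = window
        self.cells = [0.0] * size          # status vector: last-seen timestamps

    def _indexes(self, key):
        for i in range(self.num_hashes):
            h = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def seen_recently(self, key, now=None):
        """Return True if `key` looks like a duplicate inside the window,
        then record the occurrence either way."""
        now = time.time() if now is None else now
        idxs = list(self._indexes(key))
        duplicate = all(now - self.cells[i] <= self.window for i in idxs)
        for i in idxs:
            self.cells[i] = now
        return duplicate

# Hypothetical usage over a stream of (ip, ad_id) click events:
tsbf = TemporalStatefulBloomFilter(window=30.0)
for click in [("10.0.0.1", "ad42"), ("10.0.0.2", "ad42"), ("10.0.0.1", "ad42")]:
    print(click, "duplicate" if tsbf.seen_recently("|".join(click)) else "new")
```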
    Despite the enormous capabilities of modern machine learning and data mining techniques in modeling complicated problems, most available click fraud detection systems are rule-based. Click fraud solution providers keep their rules as a secret weapon and bargain with others to prove their superiority. We proposed a validation framework that acquires another model of the click data which is not rule dependent, a model that learns the inherent statistical regularities of the data; the outputs of the two models are then compared. Due to the uniqueness of the CCFDP system architecture, it is better than current commercial solutions and search engine/ISP solutions. The system protects pay-per-click advertisers from click fraud and improves their return on investment (ROI). The system can also provide an arbitration mechanism for the advertiser and the PPC publisher whenever a click fraud dispute arises. Advertisers can gain confidence in PPC advertising by having a channel through which to argue traffic quality with big search engine publishers. The results of this system will bolster the internet economy by eliminating this shortcoming of the PPC business model. General consumers will gain confidence in the internet business model as fraudulent activities, which are numerous in the current virtual internet world, are reduced.
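The data fusion step mentioned above, combining evidence from several models into a single traffic quality score, can be illustrated with a deliberately simple rule (weighted log-odds averaging); this is only an illustrative fusion scheme under assumptions, not CCFDP's actual one, and the weights and model names are hypothetical.

```python
import math

def fuse_scores(model_probs, weights=None):
    """Combine per-model fraud probabilities into one click quality score
    in [0, 1] (1 = likely genuine) via weighted log-odds averaging."""
    weights = weights or [1.0] * len(model_probs)
    eps = 1e-6  # guard against log(0)
    logit = sum(w * math.log((p + eps) / (1 - p + eps))
                for w, p in zip(weights, model_probs)) / sum(weights)
    fraud_prob = 1 / (1 + math.exp(-logit))
    return 1.0 - fraud_prob

# e.g. rule-based, Bayesian, and duplicate-detection models each score one click:
print(round(fuse_scores([0.9, 0.7, 0.2], weights=[1.0, 2.0, 1.0]), 3))
```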

    Approximate algorithms for efficient indexing, clustering, and classification in Peer-to-peer networks

    [no abstract]

    Photorealistic physically based render engines: a comparative study

    Pérez Roig, F. (2012). Photorealistic physically based render engines: a comparative study. http://hdl.handle.net/10251/14797