
    Web-Scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine

    We describe a new Bayesian click-through rate (CTR) prediction algorithm used for Sponsored Search in Microsoft's Bing search engine. The algorithm is based on a probit regression model that maps discrete or real-valued input features to probabilities. It maintains Gaussian beliefs over the weights of the model and performs Gaussian online updates derived from approximate message passing. Scalability of the algorithm is ensured through a principled weight-pruning procedure and an approximate parallel implementation. We discuss the challenges arising from evaluating and tuning the predictor as part of the complex sponsored search system, where the predictions made by the algorithm determine the composition of future training samples. Finally, we show experimental results from the production system and compare them to a calibrated Naïve Bayes algorithm.
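
    As a concrete illustration, here is a minimal sketch of the kind of online Bayesian probit update the abstract describes, with a Gaussian belief per weight and a closed-form update after each impression. It assumes binary (one-hot) features; the class name, the prior values, and the use of scipy are assumptions of this sketch, not details of the production system.

```python
import math
from scipy.stats import norm

BETA = 1.0  # scale of the latent probit noise (assumed prior)

class OnlineProbit:
    def __init__(self):
        # Gaussian belief N(mu, sigma^2) per weight, created lazily.
        self.mu = {}   # feature -> mean
        self.var = {}  # feature -> variance

    def _belief(self, f):
        return self.mu.get(f, 0.0), self.var.get(f, 1.0)

    def predict(self, features):
        # p(click) = Phi(total_mean / sqrt(beta^2 + total_var))
        m = sum(self._belief(f)[0] for f in features)
        v = sum(self._belief(f)[1] for f in features)
        return norm.cdf(m / math.sqrt(BETA**2 + v))

    def update(self, features, clicked):
        # Online Gaussian update from one observation (y = +1 click, -1 no click).
        y = 1.0 if clicked else -1.0
        m = sum(self._belief(f)[0] for f in features)
        v = sum(self._belief(f)[1] for f in features)
        s = math.sqrt(BETA**2 + v)
        t = y * m / s
        # Truncated-Gaussian correction terms from the message-passing derivation
        # (floor avoids division by zero for extreme t).
        v_fn = norm.pdf(t) / max(norm.cdf(t), 1e-12)
        w_fn = v_fn * (v_fn + t)
        for f in features:
            mu_f, var_f = self._belief(f)
            self.mu[f] = mu_f + y * (var_f / s) * v_fn
            self.var[f] = var_f * (1.0 - (var_f / s**2) * w_fn)
```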

    User Response Learning for Directly Optimizing Campaign Performance in Display Advertising

    Learning and predicting user responses, such as clicks and conversions, are crucial for many Internet-based businesses including web search, e-commerce, and online advertising. Typically, a user response model is established by optimizing prediction accuracy, e.g., minimizing the error between the prediction and the ground-truth user response. In many practical cases, however, predicting user responses is only part of a larger predictive or optimization task: on one hand, the accuracy of a user response prediction determines the final (expected) utility to be optimized; on the other hand, its learning may also be influenced by the follow-up stochastic process. It is thus of great interest to optimize the entire process as a whole rather than treat its parts independently or sequentially. In this paper, we take real-time display advertising as an example, where the predicted click-through rate (CTR) of a user on an ad is employed to calculate a bid for an ad impression in a second-price auction. We reformulate a common logistic regression CTR model by putting it back into its subsequent bidding context: rather than minimizing the prediction error, the model parameters are learned directly by optimizing campaign profit. The gradient update resulting from our formulation naturally fine-tunes the cases where market competition is high, leading to more cost-effective bidding. Our experiments demonstrate that, while maintaining comparable CTR prediction accuracy, our proposed user response learning leads to campaign profit gains of as much as 78.2% in offline tests and 25.5% in an online A/B test over strong baselines.
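
    To make the reformulation concrete, the sketch below performs gradient ascent on a smoothed per-impression profit instead of minimizing log loss. The smoothed auction-win indicator, the truthful bid rule, and all names (value, market_price, TAU) are illustrative assumptions of this sketch, not the paper's exact formulation.

```python
import numpy as np

TAU = 0.1  # temperature of the smoothed auction-win indicator (assumed)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def profit_gradient_step(theta, x, y, value, market_price, lr=0.01):
    """One gradient-ascent step on smoothed expected profit for one impression.

    x: feature vector; y: 1 if the impression was clicked, else 0;
    value: advertiser's value per click; market_price: observed second price.
    """
    p = sigmoid(theta @ x)                      # predicted CTR
    bid = value * p                             # truthful bid: value times pCTR
    win = sigmoid((bid - market_price) / TAU)   # smoothed win indicator
    # Profit if we win: value of the realized click minus the second price paid.
    payoff = value * y - market_price
    # Chain rule: d win / d theta = win'((bid-z)/tau) * (1/tau) * d bid / d theta.
    dwin = win * (1 - win) / TAU * value * p * (1 - p) * x
    grad = payoff * dwin                        # d(expected profit)/d theta
    return theta + lr * grad
```

    Note how the gradient is largest when the bid sits near the market price, i.e. where competition is tight, which matches the abstract's observation that the update concentrates on highly competed cases.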

    Click fraud: how to spot it, how to stop it?

    Online search advertising is currently the greatest source of revenue for many Internet giants such as Google™, Yahoo!™, and Bing™. The increased number of specialized websites and modern profiling techniques have all contributed to an explosion of the income of ad brokers from online advertising. The single biggest threat to this growth, however, is click fraud. Trained botnets and even individuals are hired by click-fraud specialists in order to maximize the revenue certain users earn from the ads they publish on their websites, or to launch an attack between competing businesses. Most academics and consultants who study online advertising estimate that 15% to 35% of ads in pay-per-click (PPC) online advertising systems are not authentic. In the first two quarters of 2010, US marketers alone spent $5.7 billion on PPC ads, where PPC ads are between 45 and 50 percent of all online ad spending. On average, about $1.5 billion is wasted due to click fraud. These fraudulent clicks are believed to be initiated by users in poor countries, or by botnets, who are trained to click on specific ads. For example, according to a 2010 study from Information Warfare Monitor, the operators of Koobface, a program that installed malicious software to participate in click fraud, made over $2 million in just over a year. The process of making such illegitimate clicks to generate revenue is called click fraud. Search engines claim they filter out most questionable clicks and either do not charge for them or reimburse advertisers that have been wrongly billed. However, this is a hard task, despite claims that brokers' efforts are satisfactory. In the simplest scenario, a publisher continuously clicks on the ads displayed on his own website in order to make revenue. In a more complicated scenario, a travel agent may hire a large, globally distributed botnet to click on its competitor's ads, hence depleting their daily budget. We analyzed these different types of click fraud methods and proposed new methodologies to detect and prevent them in real time. While traditional commercial approaches detect only some specific types of click fraud, the Collaborative Click Fraud Detection and Prevention (CCFDP) system, an architecture that we have implemented based on the proposed methodologies, can detect and prevent all major types of click fraud. The proposed solution analyzes detailed user activities on both the server side and the client side collaboratively to better describe the intention of the click. Data fusion techniques are developed to combine evidence from several data mining models and to obtain a better estimate of the quality of the click traffic; a sketch of this idea follows. Our ideas are tested through the development of the CCFDP system. Experimental results show that the CCFDP system is better than an existing commercial click fraud solution in three major aspects: 1) it detects more click fraud, especially clicks generated by software; 2) it provides prevention ability; 3) it introduces the concept of a click quality score for click quality estimation. In the initial version of CCFDP, we analyzed the performance of the click fraud detection and prediction model using a rule-based algorithm, which is similar to most existing systems.
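
    As an illustration of the data-fusion idea, here is a minimal sketch that combines fraud-probability estimates from several models into a single click quality score. The log-odds averaging scheme and the weights are assumptions of this sketch, not CCFDP's actual fusion rule.

```python
import math

def quality_score(fraud_probs, weights=None):
    """Combine per-model fraud probabilities into one quality score in [0, 1].

    fraud_probs: list of P(fraud) estimates from independent models.
    Returns a value near 1.0 for clearly genuine traffic, near 0.0 for fraud.
    """
    if weights is None:
        weights = [1.0] * len(fraud_probs)
    eps = 1e-6  # guards the log against p = 0 or p = 1
    # Average the models' evidence in log-odds space, then map back.
    logit = sum(w * math.log((p + eps) / (1 - p + eps))
                for w, p in zip(weights, fraud_probs)) / sum(weights)
    p_fraud = 1.0 / (1.0 + math.exp(-logit))
    return 1.0 - p_fraud

# e.g. three models judge one click: quality_score([0.9, 0.7, 0.95]) -> low score
```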
We assigned a quality score to each click instead of classifying the click as fraudulent or genuine, because it is hard to get solid evidence of click fraud based only on the data collected, and it is difficult to determine the real intention of the users who make the clicks. Results from the initial version revealed that the diversity of click fraud attack types makes it hard for a single countermeasure to prevent click fraud; it is therefore important to be able to combine multiple measures capable of effective protection from click fraud. In the improved version of CCFDP, we therefore provide the traffic quality score as a combination of evidence from several data mining algorithms. We tested the system with data from an actual ad campaign in 2007 and 2008 and compared the results with Google AdWords reports for the same campaign. The results show that a high percentage of click fraud is present even with the most popular search engine. The multiple-model-based CCFDP always estimated less valid traffic than Google; sometimes the difference is as high as 53%. Fast and efficient duplicate detection is one of the most important requirements in any click fraud solution. Duplicate detection algorithms usually run in real time; to provide real-time results, solution providers should use data structures that can be updated in real time, and the space required to hold the data should be minimal. In this dissertation, we also addressed the problem of detecting duplicate clicks in pay-per-click streams. We proposed a simple data structure, the Temporal Stateful Bloom Filter (TSBF), an extension of the regular Bloom Filter and the Counting Bloom Filter in which the bit vector of the Bloom Filter is replaced with a status vector. Duplicate detection results of the TSBF method are compared with the Buffering, FPBuffering, and CBF methods. The false positive rate of TSBF is less than 1%, and it has no false negatives. The space requirement of TSBF is the smallest among these solutions. Even though Buffering has neither false positives nor false negatives, its space requirement increases exponentially with the size of the stream. When the false positive rate of FPBuffering is set to 1%, its false negative rate jumps to around 5%, which will not be tolerated by most streaming data applications. We also compared the TSBF results with CBF: TSBF uses at most half the space of a standard CBF with the same false positive probability. One of the biggest successes of CCFDP is the discovery of a new mercantile click bot, the Smart ClickBot. We presented a Bayesian approach for detecting clicks of the Smart ClickBot type. The system combines evidence extracted from web server sessions to determine the final class of each click. Some of this evidence can be used alone, while some can be used in combination with other features for click bot detection. During training and testing we also addressed the class imbalance problem. Our best classifier shows a recall of 94% and a precision of 89%, with an F1 measure of 92%. The high accuracy of our system demonstrates the effectiveness of the proposed methodology. Since the Smart ClickBot is a sophisticated click bot that manipulates every possible parameter to go undetected, the techniques discussed here can lead to the detection of other types of software bots too.
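
    The following sketch illustrates the TSBF idea of replacing the Bloom filter's bit vector with a status vector of last-seen timestamps, so that stale entries expire instead of accumulating. The sizes, the hashing scheme, and the expiry rule are illustrative assumptions, not the dissertation's exact design.

```python
import hashlib

class TemporalStatefulBloomFilter:
    def __init__(self, size=1 << 20, num_hashes=4, window=3600.0):
        self.size = size
        self.num_hashes = num_hashes
        self.window = window        # duplicate window in seconds
        self.status = [0.0] * size  # last-seen timestamp per slot

    def _slots(self, key):
        # Derive num_hashes independent slot indices from the key.
        for i in range(self.num_hashes):
            h = hashlib.sha1(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def seen_recently(self, key, now):
        """Return True if `key` was (probably) seen within the window,
        then record this occurrence. No false negatives within the window;
        a small false-positive rate arises from hash collisions."""
        slots = list(self._slots(key))
        duplicate = all(now - self.status[s] <= self.window for s in slots)
        for s in slots:
            self.status[s] = max(self.status[s], now)
        return duplicate
```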
Despite the enormous capabilities of modern machine learning and data mining techniques in modeling complicated problems, most available click fraud detection systems are rule-based. Click fraud solution providers keep their rules as a secret weapon and bargain with others to prove their superiority. We proposed a validation framework that acquires another model of the click data that is not rule-dependent, a model that learns the inherent statistical regularities of the data; the outputs of the two models are then compared. Due to the uniqueness of its architecture, the CCFDP system is better than current commercial solutions and search engine/ISP solutions. The system protects pay-per-click advertisers from click fraud and improves their return on investment (ROI). It can also provide an arbitration mechanism for advertisers and PPC publishers whenever a click fraud dispute arises. Advertisers can gain confidence in PPC advertising by having a channel through which to contest traffic quality with big search engine publishers. The results of this system will bolster the internet economy by addressing this shortcoming of the PPC business model, and general consumers will gain confidence in internet business models as the fraudulent activities that are numerous in the current virtual internet world are reduced.

    ATTRIBUTION MODELING AND MARKETING RESOURCE ALLOCATION IN AN ONLINE ENVIRONMENT

    This dissertation contains one conceptual framework and two essays on attribution modeling and marketing resource allocation in digital marketing. Chapter II presents the conceptual framework for attribution modeling and hypotheses related to the carryover and spillover effects that information collected during a customer's prior visits to a firm's website through different marketing channels has on subsequent visits and purchases. In Chapter III, I propose a method to measure the incremental value of individual marketing channels in an online multi-channel environment. The method includes a three-level measurement model of customers' consideration of online channels, their visits through these channels, and their subsequent purchases at the firm's website. Based on an analysis of customers' visits and purchases at a hospitality firm's website, I find significant carryover and spillover effects across different marketing channels. According to the estimation results, the relative contributions of each channel differ significantly from the estimates of the widely used "last-click" metric. A field study, in which the firm turned off paid search for a week, was conducted to validate the proposed approach's ability to estimate the incremental impact of a channel on conversions. The method can also be applied to targeting customers with different patterns of touches and to identifying cases where e-mail retargeting may actually decrease conversion probabilities. Chapter IV analyzes the impact of the attribution metric on the overall effectiveness of keyword investments in search campaigns. Different attribution metrics assign different conversion credits to the search keywords clicked along a consumer's purchase journey; these attribution-based credits affect the advertiser's future bidding and budget allocation for keywords, and in turn the overall return on investment (ROI) of future search campaigns. Using six months of panel data on 476 keywords from an online jewelry retailer, I empirically model the relationship among the advertiser's bidding decision, the search engine's ranking decision, and the click-through and conversion rates, and I analyze the impact of the attribution metric on the overall ROI of search campaigns. The focal advertiser changed its attribution metric from last-click to first-click halfway through the data window, which allows me to estimate the impact of the two attribution metrics on budget allocation, which in turn influences the realized ROI under the different attribution regimes. Given the mix of keywords bid on by the advertiser, the results show that first-click leads to lower overall revenues, and this effect is stronger for more specific keywords. A policy simulation shows that the advertiser could improve overall revenue by more than 5% by appropriately changing the attribution metric for individual keywords to account for their actual contribution.
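
    For concreteness, the sketch below contrasts the two attribution metrics discussed in Chapter IV: last-click credits the final touch before a conversion, while first-click credits the initial one. The journey representation and channel names are illustrative assumptions of this sketch.

```python
from collections import defaultdict

def attribute(journeys, metric="last-click"):
    """Assign conversion credit to channels under a single-touch metric.

    journeys: list of (touch_sequence, converted) pairs, where
    touch_sequence is an ordered list of channel names.
    """
    credit = defaultdict(float)
    for touches, converted in journeys:
        if not converted or not touches:
            continue
        channel = touches[-1] if metric == "last-click" else touches[0]
        credit[channel] += 1.0
    return dict(credit)

journeys = [(["display", "email", "paid_search"], True),
            (["paid_search", "email"], True),
            (["display"], False)]
# last-click credits paid_search and email for the two conversions;
# first-click credits display and paid_search for the same conversions.
```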

    DESIGN WITH EMOTION: IMPROVING WEB SEARCH EXPERIENCE FOR OLDER ADULTS

    Research indicates that, prior to making decisions, older adults search for about 15% less information overall than younger adults. Prior research attributed this behavior mainly to age-related cognitive difficulties; however, recent studies indicate that emotion influences search decision quality. This research approaches questions about why older adults search less and how this search behavior could be improved. The research is motivated by the broader issues of older users' search behavior, while focusing on the emotional usability of search engine user interfaces. It therefore attempts to accomplish three objectives: a) to explore the use of low-level design elements as emotion manipulation tools, b) to seamlessly integrate these design elements into existing search engine interfaces, and c) to evaluate the impact of emotional design elements on search performance and user satisfaction. To achieve these objectives, two usability studies were conducted. The aim of the first study was to explore the emotion induction capabilities of colors, shapes, and combinations of both, and to determine whether the proposed design elements have strong mood induction capabilities. The results demonstrated that low-level design elements such as color and shape have strong visceral effects that could serve as viable means of inducing users' emotional states without the users being aware of their presence. The purpose of the second study was to evaluate alternative search engine user interfaces, derived from this research, for search thoroughness and user preference. In general, search-based performance variables showed that participants searched more thoroughly with interface types that integrate angular shape features. In addition, user preference variables indicated that participants seemed to enjoy search tasks more when using search engine interfaces with color/shape combinations. Overall, the results indicated that seamless integration of low-level emotional design elements into existing search engine interfaces could potentially improve the web search experience.

    Interactive Temporal Feature Construction: A User-Driven Approach to Predictive Model Development

    Predictive modeling combined with visualization techniques can revolutionize the way businesses operate. Analyzing large datasets on high-compute machines makes it possible to use advanced technologies to support data-driven decision making. A wide range of domains deal with data that contain random sequences of events (such as real-time verification or health care), and the temporal relationships between these events can be highly predictive. However, existing methods of feature selection make it difficult to identify temporal relationships that enhance the predictive power of models; often, a domain expert's knowledge is required to identify realistic patterns. Interactive Temporal Feature Construction (ITFC), a visual analytics workflow, is designed to enable effective data-driven temporal feature construction. The application provides a new interactive workflow for model building and refinement, along with visual representations to support that workflow. Use cases demonstrate how ITFC can produce more accurate predictive models when applied to complex cohorts of electronic health data.
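
    As an example of the kind of temporal feature ITFC is designed to let a user construct interactively, the sketch below computes a binary indicator that one event type is followed by another within a time window. The event representation and names are illustrative assumptions, not ITFC's actual interface.

```python
def followed_within(events, first, second, window):
    """events: list of (timestamp, event_type) pairs, sorted by timestamp.
    Returns 1 if some `first` event is followed by a `second` event
    within `window` time units, else 0."""
    pending = []  # timestamps of `first` events awaiting a match
    for t, etype in events:
        if etype == first:
            pending.append(t)
        elif etype == second:
            if any(0 <= t - t0 <= window for t0 in pending):
                return 1
    return 0

# e.g. "discharge followed by readmission within 30 days" as a model feature:
# followed_within(patient_events, "discharge", "readmission", 30)
```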

    Prospects of Mobile Search

    Search faces (at least) two major challenges. One is to improve the efficiency of retrieving relevant content across all digital formats (images, audio, video, 3D shapes, etc.). The second is making relevant information retrievable on a range of platforms, particularly high-diffusion ones such as mobiles. The two challenges are interrelated but distinct. This report aims to assess the potential of future mobile search. Two broad groups of search-based applications can be identified. The first is the adaptation and emulation of web search processes and services for the mobile environment. The second is services exploiting the unique features of mobile devices and mobile environments; examples of these context-aware services include location-based services and interfaces to the internet of things (RFID networks). The report starts by providing an introduction to mobile search, highlighting differences and commonalities with search technologies on other platforms (Chapter 1). Chapter 2 is devoted to the supply side of mobile search markets: it describes mobile markets, presents key figures, and outlines the main business models and players. Chapter 3 is dedicated to the demand side of the market, studying user acceptance and demand using the results of a case study in Sweden. Chapter 4 presents emerging trends in technology and markets that could shape mobile search; it reflects the authors' view after discussions with many experts. One input to this discussion was the analysis of forward-looking scenarios for mobile search developed by the authors (Chapter 5), which experts were asked to evaluate. Another input was a questionnaire to which 61 experts responded. Drivers, barriers, and enablers for mobile search have been synthesised into a SWOT analysis. The report concludes with some policy recommendations in view of the likely socio-economic implications of mobile search in Europe.

    What’s wrong with Automated Influence

    Automated Influence is the use of AI to collect, integrate, and analyse people's data in order to deliver targeted interventions that shape their behaviour. We consider three central objections to Automated Influence, focusing on privacy, exploitation, and manipulation, showing in each case how a structural version of that objection has more purchase than its interactional counterpart. By rejecting the interactional focus of 'AI Ethics' in favour of a more structural, political philosophy of AI, we show that the real problem with Automated Influence is the crisis of legitimacy that it precipitates.

    Contributions to the Detection of Distributed Denial-of-Service (DDoS) Attacks at the Application Layer

    Six aspects of DDoS attack detection were analyzed: techniques, variables, tools, implementation location, point in time, and detection accuracy. This analysis made it possible to contribute usefully to the design of an adequate strategy for neutralizing these attacks. In recent years, these attacks have been directed at the application layer, a phenomenon mainly due to the large number of tools available for generating this type of attack. This work therefore also proposes a detection alternative based on the dynamism of the web user. To this end, characteristics of user dynamism extracted from mouse and keyboard activity were evaluated. Finally, this work proposes a low-cost detection approach consisting of two steps: first, the user's characteristics are extracted in real time while the web application is being browsed; second, each extracted characteristic is used by a constant-order (O(1)) algorithm to differentiate a real user from a DDoS attack. Test results with the attack tools LOIC, OWASP, and GoldenEye show that the proposed method has a detection efficacy of 100% and that the web user's dynamism characteristics make it possible to differentiate between a real user and a bot.
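
    The sketch below illustrates the proposed two-step idea: dynamism features are extracted client-side while the user browses, and each request is screened by a constant-time test that separates human input dynamics from bot traffic. The specific features and thresholds are illustrative assumptions of this sketch, not the thesis's tuned values.

```python
def is_human(event):
    """event: dict of per-request dynamism features extracted client-side.
    Each check is constant-time, so the whole test is O(1) per request."""
    # Flooding tools such as LOIC or GoldenEye issue requests with no
    # accompanying input dynamics at all.
    if event.get("mouse_moves", 0) == 0 and event.get("keystrokes", 0) == 0:
        return False
    # Scripted typing has near-uniform inter-keystroke times; human timing varies.
    if (event.get("keystrokes", 0) > 1
            and event.get("keystroke_time_stddev_ms", 0.0) < 5.0):
        return False
    # Bot mouse paths tend to be straight-line jumps; human paths curve.
    if (event.get("mouse_moves", 0) > 1
            and event.get("mouse_path_curvature", 0.0) < 0.01):
        return False
    return True
```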

    Quantitative Assessment of Factors in Sentiment Analysis

    Sentiment can be defined as a tendency to experience certain emotions in relation to a particular object or person. Sentiment may be expressed in writing, in which case determining that sentiment algorithmically is known as sentiment analysis. Sentiment analysis is often applied to Internet texts such as product reviews, websites, blogs, or tweets, where automatically determining published feeling towards a product or service is very useful to marketers or opinion analysts. The main goal of sentiment analysis is to identify the polarity of natural language text. This thesis sets out to examine quantitatively the factors that affect sentiment analysis. The factors commonly involved in sentiment analysis are text features, sentiment lexica or resources, and the machine learning algorithms employed. The main aim of this thesis is to investigate systematically the interaction between sentiment analysis factors and machine learning algorithms in order to improve sentiment analysis performance as compared with the opinions of human assessors. A software system known as TJP was designed and developed to support this investigation. The research reported here has three main parts. Firstly, the role of data pre-processing was investigated with TJP using a combination of features together with publicly available datasets. This considers the relationship and relative importance of superficial text features such as emoticons, n-grams, negations, hashtags, repeated letters, special characters, slang, and stopwords. The resulting statistical analysis suggests that a combination of all these features achieves better accuracy on the dataset and has a considerable effect on system performance. Secondly, the effect of human-marked-up training data was considered, since this is required by supervised machine learning algorithms. The results gained from TJP suggest that training data greatly improve sentiment analysis performance, and that the combination of training data and sentiment lexica seems to provide optimal performance. Nevertheless, one particular sentiment lexicon, AFINN, contributed more than the others in the absence of training data and would therefore be appropriate for unsupervised approaches to sentiment analysis. Finally, the performance of two sophisticated ensemble machine learning algorithms was investigated. Both the Arbiter Tree and the Combiner Tree were chosen since neither has previously been used for sentiment analysis. The objective here was to demonstrate their applicability and effectiveness compared with that of the leading single machine learning algorithms, Naïve Bayes and Support Vector Machines. The results showed that, whilst either can be applied to sentiment analysis, the Arbiter Tree ensemble algorithm achieved better accuracy than either the Combiner Tree or any single machine learning algorithm.
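
    As an illustration of the superficial text features the thesis combines, the sketch below normalizes emoticons, hashtags, repeated letters, and negations, removes stopwords, and emits unigram and bigram features. The regular expressions, the tiny stopword list, and the feature set are assumptions of this sketch, not the TJP system's actual pipeline.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "to", "and"}  # tiny illustrative list

def preprocess(text):
    text = text.lower()
    text = re.sub(r"[:;]-?[)d]", " EMOTICON_POS ", text)          # :) ;-) :D
    text = re.sub(r":-?\(", " EMOTICON_NEG ", text)               # :( :-(
    text = re.sub(r"#(\w+)", r"HASHTAG_\1", text)                 # keep hashtags
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)                   # coool -> cool
    text = re.sub(r"\b(not|no|never)\s+(\w+)", r"NOT_\2", text)   # mark negation
    tokens = [t for t in re.findall(r"\w+", text) if t not in STOPWORDS]
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

# preprocess("Not good, just soooo bad :(") yields tokens like
# ['NOT_good', 'just', 'soo', 'bad', 'EMOTICON_NEG'] plus their bigrams.
```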