3,929 research outputs found

    Profiling user activities with minimal traffic traces

    Full text link
    Understanding user behavior is essential to personalize and enrich a user's online experience. While there are significant benefits to be accrued from the pursuit of personalized services based on a fine-grained behavioral analysis, care must be taken to address user privacy concerns. In this paper, we consider the use of web traces with truncated URLs - each URL is trimmed to only contain the web domain - for this purpose. While such truncation removes the fine-grained sensitive information, it also strips the data of many features that are crucial to the profiling of user activity. We show how to overcome the severe handicap of lack of crucial features for the purpose of filtering out the URLs representing a user activity from the noisy network traffic trace (including advertisement, spam, analytics, webscripts) with high accuracy. This activity profiling with truncated URLs enables the network operators to provide personalized services while mitigating privacy concerns by storing and sharing only truncated traffic traces. In order to offset the accuracy loss due to truncation, our statistical methodology leverages specialized features extracted from a group of consecutive URLs that represent a micro user action like web click, chat reply, etc., which we call bursts. These bursts, in turn, are detected by a novel algorithm which is based on our observed characteristics of the inter-arrival time of HTTP records. We present an extensive experimental evaluation on a real dataset of mobile web traces, consisting of more than 130 million records, representing the browsing activities of 10,000 users over a period of 30 days. Our results show that the proposed methodology achieves around 90% accuracy in segregating URLs representing user activities from non-representative URLs

    Precursors and Laggards: An Analysis of Semantic Temporal Relationships on a Blog Network

    Full text link
    We explore the hypothesis that it is possible to obtain information about the dynamics of a blog network by analysing the temporal relationships between blogs at a semantic level, and that this type of analysis adds to the knowledge that can be extracted by studying the network only at the structural level of URL links. We present an algorithm to automatically detect fine-grained discussion topics, characterized by n-grams and time intervals. We then propose a probabilistic model to estimate the temporal relationships that blogs have with one another. We define the precursor score of blog A in relation to blog B as the probability that A enters a new topic before B, discounting the effect created by asymmetric posting rates. Network-level metrics of precursor and laggard behavior are derived from these dyadic precursor score estimations. This model is used to analyze a network of French political blogs. The scores are compared to traditional link degree metrics. We obtain insights into the dynamics of topic participation on this network, as well as the relationship between precursor/laggard and linking behaviors. We validate and analyze results with the help of an expert on the French blogosphere. Finally, we propose possible applications to the improvement of search engine ranking algorithms

    Modeling Temporal Evidence from External Collections

    Full text link
    Newsworthy events are broadcast through multiple mediums and prompt the crowds to produce comments on social media. In this paper, we propose to leverage on this behavioral dynamics to estimate the most relevant time periods for an event (i.e., query). Recent advances have shown how to improve the estimation of the temporal relevance of such topics. In this approach, we build on two major novelties. First, we mine temporal evidences from hundreds of external sources into topic-based external collections to improve the robustness of the detection of relevant time periods. Second, we propose a formal retrieval model that generalizes the use of the temporal dimension across different aspects of the retrieval process. In particular, we show that temporal evidence of external collections can be used to (i) infer a topic's temporal relevance, (ii) select the query expansion terms, and (iii) re-rank the final results for improved precision. Experiments with TREC Microblog collections show that the proposed time-aware retrieval model makes an effective and extensive use of the temporal dimension to improve search results over the most recent temporal models. Interestingly, we observe a strong correlation between precision and the temporal distribution of retrieved and relevant documents.Comment: To appear in WSDM 201

    Fast Data in the Era of Big Data: Twitter's Real-Time Related Query Suggestion Architecture

    Full text link
    We present the architecture behind Twitter's real-time related query suggestion and spelling correction service. Although these tasks have received much attention in the web search literature, the Twitter context introduces a real-time "twist": after significant breaking news events, we aim to provide relevant results within minutes. This paper provides a case study illustrating the challenges of real-time data processing in the era of "big data". We tell the story of how our system was built twice: our first implementation was built on a typical Hadoop-based analytics stack, but was later replaced because it did not meet the latency requirements necessary to generate meaningful real-time results. The second implementation, which is the system deployed in production, is a custom in-memory processing engine specifically designed for the task. This experience taught us that the current typical usage of Hadoop as a "big data" platform, while great for experimentation, is not well suited to low-latency processing, and points the way to future work on data analytics platforms that can handle "big" as well as "fast" data
    • …
    corecore