
    Toward Entity-Aware Search

    As the Web has evolved into a data-rich repository, current search engines, built around the standard "page view," are becoming increasingly inadequate for a wide range of query tasks. While we often search for various data "entities" (e.g., phone number, paper PDF, date), today's engines only take us indirectly to pages. In my Ph.D. study, we focus on a novel type of Web search that is aware of data entities inside pages, a significant departure from traditional document retrieval. We study the various essential aspects of supporting entity-aware Web search. To begin with, we tackle the core challenge of ranking entities by distilling its underlying conceptual model, the Impression Model, and developing a probabilistic ranking framework, EntityRank, that seamlessly integrates both local and global information in ranking. We also report a prototype system built to show the initial promise of the proposal. Then, we aim at distilling and abstracting the essential computation requirements of entity search. From the dual views of reasoning (entity as input and entity as output), we propose a dual-inversion framework, with two indexing and partition schemes, toward efficient and scalable query processing. Further, to recognize more entity instances, we study the problem of entity synonym discovery through mining query log data. The results we have obtained so far show clear promise for entity-aware search in its usefulness, effectiveness, efficiency, and scalability.
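    To make the local/global integration concrete, here is a minimal Python sketch of entity ranking. This is a toy illustration only, not the paper's actual EntityRank model: it rewards entity instances that appear close to the query keywords within a page (local evidence) and that are confirmed by many distinct pages (global evidence). All names and the scoring formula are hypothetical.

```python
import math
from collections import defaultdict

def rank_entities(occurrences, total_pages):
    """Toy entity ranker: combine per-page (local) proximity evidence
    with cross-page (global) frequency evidence. Illustrative only."""
    local = defaultdict(float)   # accumulated proximity evidence
    pages = defaultdict(set)     # distinct pages each entity was seen on

    for entity, page_id, dist in occurrences:
        local[entity] += 1.0 / (1.0 + dist)   # closer to keywords => stronger
        pages[entity].add(page_id)

    def score(entity):
        # Global evidence: entities confirmed by many pages rank higher.
        global_ev = math.log(1 + len(pages[entity])) / math.log(1 + total_pages)
        return local[entity] * global_ev

    return sorted(local, key=score, reverse=True)

# Example: phone-number instances found near the query keywords,
# as (entity, page_id, distance_to_keywords) tuples.
hits = [("800-123-4567", "p1", 2), ("800-123-4567", "p2", 5),
        ("555-000-1111", "p3", 1)]
print(rank_entities(hits, total_pages=1000))
```

    The multiplicative combination means an entity needs both kinds of evidence: a single page with a perfect keyword match cannot outrank an instance corroborated across many pages, and vice versa.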

    DOBBS: Towards a Comprehensive Dataset to Study the Browsing Behavior of Online Users

    The investigation of the browsing behavior of users provides useful information to optimize web site design, web browser design, search engine offerings, and online advertisement. This has been a topic of active research since the Web started, and a large body of work exists. However, new online services as well as advances in Web and mobile technologies have clearly changed the meaning of "browsing the Web" and require a fresh look at the problem and the research, specifically with respect to whether the models in use are still appropriate. Platforms such as YouTube, Netflix, or last.fm have started to replace the traditional media channels (cinema, television, radio) and media distribution formats (CD, DVD, Blu-ray). Social networks (e.g., Facebook) and platforms for browser games have attracted whole new, often less tech-savvy, audiences. Furthermore, advances in mobile technologies and devices have made browsing "on the move" the norm and changed user behavior, since mobile browsing is often influenced by the user's location and context in the physical world. Commonly used datasets, such as web server access logs or search engine transaction logs, are inherently incapable of capturing the browsing behavior of users in all these facets. DOBBS (DERI Online Behavior Study) is an effort to create such a dataset in a non-intrusive, completely anonymous, and privacy-preserving way. To this end, DOBBS provides a browser add-on that users can install, which keeps track of their browsing behavior (e.g., how much time they spend on the Web, how long they stay on a website, how often they visit a website, how they use their browser, etc.). In this paper, we outline the motivation behind DOBBS, describe the add-on and the captured data in detail, and present some first results to highlight the strengths of DOBBS.
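    To make the privacy-preserving capture concrete, here is a minimal sketch of what an anonymized browsing-event record could look like: the URL is one-way hashed with a per-user salt so repeat visits can be correlated without exposing the address itself. The field names and hashing scheme are assumptions for illustration, not the DOBBS add-on's actual format.

```python
import hashlib
import json
import time

def make_event(user_salt, url, event_type, dwell_seconds):
    """Illustrative anonymized browsing event (hypothetical schema,
    not DOBBS's real one): only a salted hash of the URL is stored."""
    return {
        "site": hashlib.sha256((user_salt + url).encode()).hexdigest()[:16],
        "event": event_type,       # e.g., "page_load", "tab_switch"
        "dwell_s": dwell_seconds,  # time spent on the page, in seconds
        "ts": int(time.time()),    # event timestamp
    }

print(json.dumps(make_event("salt123", "https://example.org", "page_load", 42)))
```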

    A Brief History of Web Crawlers

    Web crawlers visit internet applications, collect data, and learn about new web pages from the pages they visit. Web crawlers have a long and interesting history. Early web crawlers collected statistics about the web. In addition to collecting statistics about the web and indexing applications for search engines, modern crawlers can be used to perform accessibility and vulnerability checks on an application. The rapid expansion of the web and the complexity added to web applications have made crawling a very challenging process. Throughout the history of web crawling, many researchers and industrial groups have addressed the different issues and challenges that web crawlers face. Different solutions have been proposed to reduce the time and cost of crawling. Performing an exhaustive crawl remains a challenging problem. Additionally, capturing the model of a modern web application and extracting data from it automatically is another open question. What follows is a brief history of the different techniques and algorithms used from the early days of crawling up to the present. We introduce criteria to evaluate the relative performance of web crawlers. Based on these criteria, we plot the evolution of web crawlers and compare their performance.
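    The crawl loop described in the opening sentence (visit a page, collect data, learn about new pages) reduces to a frontier queue plus a visited set. The following is a generic textbook sketch in Python, not any specific crawler from the survey; a real crawler would add politeness delays, robots.txt handling, a proper HTML parser, and content-based deduplication.

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

def crawl(seed, max_pages=10):
    """Minimal breadth-first crawler: fetch a page, extract its links,
    enqueue unseen URLs, repeat until the page budget is spent."""
    frontier, seen, visited = deque([seed]), {seed}, []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue  # unreachable or broken page; skip it
        visited.append(url)
        # Naive link extraction; a real crawler would parse the DOM.
        for href in re.findall(r'href="(http[^"]+)"', html):
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

# Example: print(crawl("https://example.org"))
```

    Swapping the deque for a priority queue turns this breadth-first crawl into a focused or importance-ordered crawl, which is one axis along which the surveyed crawlers differ.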

    Trusting (and Verifying) Online Intermediaries' Policing

    All is not well in the land of online self-regulation. However competently internet intermediaries police their sites, nagging questions will remain about their fairness and objectivity in doing so. Is Comcast blocking BitTorrent to stop infringement, to manage traffic, or to decrease access to content that competes with its own for viewers? How much digital due process does Google need to give a site it accuses of harboring malware? If Facebook censors a video of war carnage, is that a token of respect for the wounded or one more reflexive effort of a major company to ingratiate itself with the Washington establishment? Questions like these will persist, and erode the legitimacy of intermediary self-policing, as long as key operations of leading companies are shrouded in secrecy. Administrators must develop an institutional competence for continually monitoring rapidly changing business practices. A trusted advisory council charged with assisting the Federal Trade Commission (FTC) and Federal Communications Commission (FCC) could help courts and agencies adjudicate controversies concerning intermediary practices. An Internet Intermediary Regulatory Council (IIRC) would spur the development of the expertise necessary to understand whether companies' controversial decisions are socially responsible or purely self-interested. Monitoring is a prerequisite for assuring a level playing field online.

    FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search

    We present FLASH (Fast LSH Algorithm for Similarity search accelerated with HPC), a similarity search system for ultra-high-dimensional datasets on a single machine that does not require similarity computations and is tailored for high-performance computing platforms. By leveraging an LSH-style randomized indexing procedure and combining it with several principled techniques, such as reservoir sampling, recent advances in one-pass minwise hashing, and count-based estimations, we reduce the computational and parallelization costs of similarity search while retaining sound theoretical guarantees. We evaluate FLASH on several real, high-dimensional datasets from different domains, including text, malicious URLs, click-through prediction, social networks, etc. Our experiments shed new light on the difficulties associated with datasets having several million dimensions. Current state-of-the-art implementations either fail at the presented scale or are orders of magnitude slower than FLASH. FLASH is capable of computing an approximate k-NN graph, from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than 10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam dataset using brute force (n^2 D operations) would require at least 20 teraflops. We provide CPU and GPU implementations of FLASH for replicability of our results.
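    The ingredients named in the abstract (LSH-style bucketing via minwise hashing, reservoir sampling to bound bucket sizes, and returning candidates without any similarity computation) can be illustrated with a small conceptual Python sketch. The hash functions, table layout, and parameters below are assumptions for illustration only; the actual system uses one-pass minwise hashing and parallel CPU/GPU kernels and is far more involved.

```python
import random

def minhash_signature(item_set, num_hashes, seed=0):
    """Naive per-hash minwise hashing (conceptual only; the paper uses
    faster one-pass variants). Similar sets get equal signatures with
    probability tied to their Jaccard similarity."""
    rng = random.Random(seed)  # fixed seed => consistent hash family
    params = [(rng.randrange(1, 2**31), rng.randrange(2**31))
              for _ in range(num_hashes)]
    return tuple(min((a * x + b) % (2**31 - 1) for x in item_set)
                 for a, b in params)

class LSHIndex:
    """Hypothetical LSH table: bucket items by minhash signature and cap
    each bucket with reservoir sampling so memory stays bounded."""
    def __init__(self, num_hashes=4, reservoir_size=8):
        self.num_hashes = num_hashes
        self.k = reservoir_size
        self.buckets = {}   # signature -> reservoir of item ids
        self.counts = {}    # signature -> total items ever hashed here

    def insert(self, item_id, item_set):
        sig = minhash_signature(item_set, self.num_hashes)
        bucket = self.buckets.setdefault(sig, [])
        n = self.counts.get(sig, 0) + 1
        self.counts[sig] = n
        if len(bucket) < self.k:
            bucket.append(item_id)      # reservoir not yet full
        else:
            j = random.randrange(n)     # classic reservoir-sampling step
            if j < self.k:
                bucket[j] = item_id

    def candidates(self, item_set):
        """Near-neighbor candidates: items sharing the query's bucket.
        Note that no similarity is ever computed."""
        return self.buckets.get(
            minhash_signature(item_set, self.num_hashes), [])

idx = LSHIndex()
idx.insert("a", {1, 2, 3, 4})
idx.insert("b", {1, 2, 3, 5})
print(idx.candidates({1, 2, 3, 4}))  # always contains "a"
```

    Capping each bucket with a fixed-size reservoir is what keeps bucket scans cheap regardless of data skew: a bucket for a very common signature would otherwise grow without bound, while the reservoir keeps an unbiased sample of it.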