    BlogForever: D2.5 Weblog Spam Filtering Report and Associated Methodology

    This report is a first attempt to define the BlogForever spam detection strategy. It comprises a survey of weblog spam technology and approaches to its detection. While the report was written to help identify possible approaches to spam detection as a component within the BlogForever software, the discussion has been extended to include observations on the historical, social and practical value of spam, and proposals for other ways of dealing with spam within the repository without necessarily removing it. It contains a general overview of spam types, ready-made anti-spam APIs available for weblogs, possible methods that have been suggested for preventing the introduction of spam into a blog, and research related to spam, focusing on spam that appears in the weblog context. It concludes with a proposal for a spam detection workflow that might form the basis of the spam detection component of the BlogForever software.
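    The proposed workflow itself is not reproduced in the abstract. As a minimal sketch of the flag-and-retain approach it alludes to (the check names, blocklist terms, and link threshold below are illustrative assumptions, not taken from the report):

```python
# Hypothetical sketch: a layered spam-flagging pipeline that tags
# suspected spam instead of deleting it, so the material remains in
# the repository for later study.

import re
from dataclasses import dataclass, field

BLOCKLIST = {"viagra", "casino", "payday loan"}  # illustrative terms only

@dataclass
class Comment:
    text: str
    flags: list = field(default_factory=list)

def keyword_check(comment):
    # Flag comments containing known spam vocabulary.
    hits = [w for w in BLOCKLIST if w in comment.text.lower()]
    if hits:
        comment.flags.append("blocklist:" + ",".join(hits))

def link_density_check(comment, max_links=3):
    # Flag comments stuffed with hyperlinks (threshold is assumed).
    links = re.findall(r"https?://\S+", comment.text)
    if len(links) > max_links:
        comment.flags.append(f"link-heavy:{len(links)}")

def classify(comment):
    for check in (keyword_check, link_density_check):
        check(comment)
    return comment  # flagged, never removed

spam = classify(Comment("Cheap casino wins! http://a http://b http://c http://d"))
print(spam.flags)  # ['blocklist:casino', 'link-heavy:4']
```

    Flagging rather than deleting mirrors the report's suggestion that spam may carry historical and social value worth preserving in the archive.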

    Improving the evaluation of web search systems

    Linkage analysis as an aid to web search has been assumed to be of significant benefit, and we know it is implemented by many major search engines. Why, then, have few TREC participants been able to demonstrate the benefits of linkage analysis scientifically over the past three years? In this paper we put forward reasons why results have been disappointing, and we identify the linkage density a dataset requires to faithfully support experiments into linkage analysis. We also report a series of linkage-based retrieval experiments on a more densely linked dataset culled from the TREC web documents.
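    The abstract does not define its linkage density measure. One plausible reading counts only hyperlinks whose targets also belong to the test collection; the function name and data layout below are assumptions for illustration:

```python
# Hypothetical sketch of a linkage-density measurement for a test
# collection: the average number of out-links per document whose
# targets are themselves documents inside the collection.

def linkage_density(collection):
    """collection: dict mapping doc URL -> set of outgoing link URLs."""
    in_collection = set(collection)
    internal_links = sum(
        len(targets & in_collection) for targets in collection.values()
    )
    return internal_links / len(collection)

docs = {
    "a.html": {"b.html", "http://example.com/x"},  # one internal link
    "b.html": {"a.html", "c.html"},                # two internal links
    "c.html": set(),
}
print(linkage_density(docs))  # 1.0 -> 3 internal links over 3 documents
```

    A collection too sparse by such a measure cannot reward link-based ranking, which is one candidate explanation for the disappointing TREC results.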

    Intelligent Web Crawler for Semantic Search Engine

    A Semantic Search Engine (SSE) is a program that produces semantic-oriented concepts from the Internet. A web crawler is the front end of our SSE; its primary goal is to supply important and necessary information to the data analysis component of the SSE. The main function of the analysis component is to produce concepts (moderately frequent finite sequences of keywords) from the input; it uses variants of TF-IDF as its primary tool to remove stop words. Filtering out stop words with TF-IDF, however, is very expensive. The goal of this project is to improve the efficiency of the SSE by avoiding feeding junk data (stop words) to it. We formally classify stop words into three classes: English-grammar-based stop words, Metadata stop words, and Topic-specific stop words. To remove English-grammar-based stop words, we simply use a stop word list that can be found on the Internet. For Metadata stop words, we create a simple web crawler and add a modified HTML parser to it; the parser identifies and removes Metadata stop words, so our web crawler can remove most of them and reduce the processing time of the SSE. Topic-specific stop words, however, are not known in advance, so they are identified from a randomly selected sample of documents instead of classifying every keyword over the whole document set (keywords at or above a frequency threshold versus stop words below it). MapReduce is applied to reduce the complexity and to find Topic-specific stop words such as "acm" (Association for Computing Machinery), which we find in IEEE data mining papers. We then create a Topic-specific stop word list and use it to reduce the processing time of the SSE.
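    As a rough sketch of the sampling idea: a term can be treated as a topic-specific stop word when its document frequency in a random sample exceeds a threshold, since a term appearing in nearly every document of a topical corpus carries no discriminative value. The 0.9 threshold, sample size, and function name below are illustrative assumptions, and the MapReduce step is approximated by a single-machine counter:

```python
# Hypothetical sketch: mine topic-specific stop words from a random
# sample of documents by document frequency (share of documents a
# term appears in), rather than scanning the whole collection.

import random
from collections import Counter

def topic_stopwords(documents, sample_size=100, df_threshold=0.9):
    sample = random.sample(documents, min(sample_size, len(documents)))
    df = Counter()
    for doc in sample:
        df.update(set(doc.lower().split()))  # document frequency, not term frequency
    cutoff = df_threshold * len(sample)
    return {term for term, count in df.items() if count >= cutoff}

papers = [
    "acm sigkdd clustering algorithms",
    "acm transactions on frequent itemset mining",
    "acm paper on outlier detection",
]
print(topic_stopwords(papers, sample_size=3))  # {'acm'}
```

    In a MapReduce setting, the per-document term sets would be emitted by mappers and the document-frequency counts aggregated by reducers; the logic above is the same computation on one machine.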

    Twitter Malware Collection System: An Automated URL Extraction and Examination Platform

    As the world becomes more interconnected through various technological services, the threat of malware looms ever larger. One avenue examined in this research is the social networking service Twitter. This research develops the Twitter Malware Collection System (TMCS), which gathers Uniform Resource Locators (URLs) posted on Twitter and scans them to determine whether any are hosting malware. The scanning is performed by a cluster of Virtual Machines (VMs) running a specified software configuration and the execution prevention system known as ESCAPE, which detects malicious code. When a TMCS VM instance detects that a URL is hosting malware, a dump of the web browser is created to determine what kind of malicious activity took place and how it was allowed. Over a collection period of 40 days, processing a total of 466,237 URLs twice in two different configurations, one a vulnerable Windows XP SP2 setup and the other a fully patched and updated Windows Vista setup, TMCS created a total of 2,989 dumps based on the results generated by ESCAPE.
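    The URL-harvesting front end can be pictured as a simple extract-and-deduplicate step over tweet text; the regex, trailing-punctuation handling, and function name below are illustrative assumptions, and the VM/ESCAPE scanning stage is out of scope here:

```python
# Hypothetical sketch of the collection front end: pull URLs out of
# tweet text and queue each unique one for scanning. The scanning
# stage (the VM cluster running ESCAPE) would consume this stream.

import re

URL_PATTERN = re.compile(r"https?://\S+")

def extract_urls(tweets):
    seen = set()
    for tweet in tweets:
        for url in URL_PATTERN.findall(tweet):
            url = url.rstrip(".,!?)")  # drop trailing punctuation caught by the regex
            if url not in seen:
                seen.add(url)
                yield url

tweets = [
    "check this out http://example.com/free-stuff!",
    "again: http://example.com/free-stuff and http://evil.example/x",
]
for url in extract_urls(tweets):
    print(url)  # each unique URL would be handed to a scanning VM
```

    Deduplicating before scanning matters at this scale: running each of the 466,237 URLs through a VM twice is far cheaper when repeats are filtered out up front.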