
    BlogForever: D2.5 Weblog Spam Filtering Report and Associated Methodology

    This report is a first attempt to define the BlogForever spam detection strategy. It comprises a survey of weblog spam technology and approaches to its detection. While the report was written to help identify possible approaches to spam detection as a component within the BlogForever software, the discussion has been extended to include observations on the historical, social and practical value of spam, and proposals for other ways of dealing with spam within the repository without necessarily removing it. It contains a general overview of spam types, ready-made anti-spam APIs available for weblogs, possible methods that have been suggested for preventing the introduction of spam into a blog, and research on spam, focusing on spam that appears in the weblog context. It concludes with a proposal for a spam detection workflow that might form the basis for the spam detection component of the BlogForever software.

    A novel defense mechanism against web crawler intrusion

    Web robots, also known as crawlers or spiders, are used by search engines, hackers and spammers to gather information about web pages. Timely detection and prevention of unwanted crawlers increases the privacy and security of websites. In this research, a novel method to identify web crawlers is proposed to prevent unwanted crawlers from accessing websites. The proposed method suggests a five-factor identification process to detect unwanted crawlers. This study provides pretest and posttest results along with a systematic evaluation of web pages with the proposed identification technique versus web pages without it. An experiment was performed with repeated measures for two groups, each containing ninety web pages. The outputs of the logistic regression analysis of the treatment and control groups confirm the novel five-factor identification process as an effective mechanism to prevent unwanted web crawlers. The study concludes that the proposed five-factor identification process is a very effective technique, as demonstrated by a successful outcome.

    Forming Within-site Topical Information Space to Facilitate Online Free-Choice Learning

    Locating specific and structured information on the World Wide Web (WWW) is becoming increasingly difficult because of the rapid growth of the Web and the distributed nature of information. Although existing search engines do a good job of ranking web pages based on topical relevance, they provide limited assistance for free-choice learners to leverage the nonlinear nature of information spaces for knowledge acquisition. We hypothesize that free-choice learners would benefit more from structured topical information spaces than from a list of individual pages across multiple websites. We conceptualize a within-site topical information space as a sphere formed by linked pages centering on a web page. In this paper, we investigate techniques and heuristics to form the space. In particular, we propose a hybrid method that relies not only on content-based characteristics and user queries, but also on a site's global structure. Experimental results show that taking website topology into account improves page relevance estimation, indicating the clustering tendency of relevant pages.

    Implementation Steps to Optimize Search Engine Marketing (SEM) Results for Small and Medium Sized E-Commerce Companies

    In Internet marketing, search engines are the channel of choice for most advertisers in today's online market. Given the growth of search engines as an online advertising medium, online marketers exploit this medium through Search Engine Marketing (SEM), together with its strategies and implementation steps. This paper suggests some implementation steps for SEM to help startup websites become visible and competitive in this medium. The intricate search engine mechanisms of indexing and web crawling shape the implementation steps, which include short-term and long-term marketing strategies. These mechanisms, together with the author’s experience in organizing SEM, inform the conceptualization of the steps and their implications. In light of the implementation steps, marketers are able to continually improve their websites with the direction

    The big five: Discovering linguistic characteristics that typify distinct personality traits across Yahoo! answers members

    Indexing: Scopus. This work was partially supported by the project FONDECYT “Bridging the Gap between Askers and Answers in Community Question Answering Services” (11130094) funded by the Chilean Government. In psychology, it is widely believed that there are five big factors that determine the different personality traits: Extraversion, Agreeableness, Conscientiousness and Neuroticism as well as Openness. In recent years, researchers have started to examine how these factors are manifested across several social networks like Facebook and Twitter. However, to the best of our knowledge, other kinds of social networks such as social/informational question-answering communities (e.g., Yahoo! Answers) have been left unexplored. Therefore, this work explores several predictive models to automatically recognize these factors across Yahoo! Answers members. As a means of devising powerful generalizations, these models were combined with assorted linguistic features. Since we could not ask community members to volunteer to take the personality test, we built a study corpus by conducting a discourse analysis based on deconstructing the test into 112 adjectives. Our results reveal that it is plausible to lessen the dependency upon answered tests and that effective models across distinct factors are sharply different. Also, sentiment analysis and dependency parsing proved to be fundamental for dealing with extraversion, agreeableness and conscientiousness. Furthermore, medium and low levels of neuroticism were found to be related to initial stages of depression and anxiety disorders. © 2018 Lithuanian Institute of Philosophy and Sociology. All rights reserved. https://www.cys.cic.ipn.mx/ojs/index.php/CyS/article/view/275

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report then proposes a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.