    Web Mining Evolution & Comparative Study with Data Mining

    Web technology is evolving very fast, and the number of Internet users is growing much faster than estimated. Users visit a wide range of websites, leaving behind a variety of information. Website administrators should use this information to adapt their websites to their users. The aim of research in web mining is to develop new techniques for extracting and mining useful information or knowledge from web pages. The automated discovery of targeted or unexpected knowledge is thus a challenging task due to the heterogeneity and lack of structure of web data. In this paper we discuss the evolution of web mining and give a detailed description of its main categories. The paper also analyses data mining and compares data mining with web mining on the basis of various parameters.

    Automatically Extract Information from Web Documents

    The Internet can be considered a reservoir of useful information in textual form: product catalogs, airline schedules, stock market quotations, weather forecasts, etc. There has been much interest in building systems that gather such information on a user's behalf. But because these information resources are formatted differently, mechanically extracting their content is difficult. Systems using such resources typically rely on hand-coded wrappers, customized procedures for information extraction. Structured data objects are a very important type of information on the Web. Such data objects are often records from underlying databases, displayed in Web pages with fixed templates. Mining data records in Web pages is useful because they typically present their host pages' essential information, such as lists of products and services. Extracting these structured data objects enables one to integrate data from multiple Web pages to provide value-added services, e.g., comparative shopping, meta-querying, and search. Web content mining has thus become an area of interest for many researchers because of the phenomenal growth of Web content and the economic benefits associated with it. However, due to the heterogeneity of Web pages, the automated discovery of targeted information remains a challenging problem.
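
    To make the contrast with hand-coded wrappers concrete, the following Python sketch shows a minimal wrapper for a hypothetical templated page. The tag and class names (li.product, .name, .price) are illustrative assumptions, not taken from the paper; the point of the research is to discover such repeated template structures automatically rather than hard-coding them per site as done here.

        # A minimal hand-coded wrapper, assuming a hypothetical page that
        # renders one product record per <li class="product"> element.
        from bs4 import BeautifulSoup

        HTML = """
        <ul>
          <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
          <li class="product"><span class="name">Gadget</span><span class="price">19.99</span></li>
        </ul>
        """

        def extract_records(html):
            """Pull one dict per repeated record node."""
            soup = BeautifulSoup(html, "html.parser")
            records = []
            for item in soup.select("li.product"):
                records.append({
                    "name": item.select_one(".name").get_text(strip=True),
                    "price": float(item.select_one(".price").get_text(strip=True)),
                })
            return records

        print(extract_records(HTML))
        # [{'name': 'Widget', 'price': 9.99}, {'name': 'Gadget', 'price': 19.99}]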

    Concept Based Search Engine: Concept Creation

    Data on the Internet is increasing exponentially every second. There are billions of documents on the World Wide Web. Each document contains multiple concepts (a concept being an abstract or general idea inferred from specific instances). In this paper, we show how we created and implemented an algorithm for extracting concepts from a set of documents. These concepts can be used by a search engine to generate search results that cater to the needs of the user, making the results more targeted than those of the usual keyword search. The main problem was to extract concepts from a set of documents. Each page could have thousands of combinations that are potential concepts, and an average document could contain millions of them. Combined with the vast amount of data on the web, this amounts to an enormous number of samples. As a result, the main areas of concern are main-memory constraints and the time complexity of the algorithm. This paper introduces an algorithm that is scalable, independent of main memory, and linear in time complexity.
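
    A minimal Python sketch of the underlying idea follows, assuming that candidate concepts are frequent word n-grams counted in a single linear pass over the corpus. This does not reproduce the paper's actual algorithm, and a truly memory-independent version would replace the in-memory counter with disk-backed or sketch-based counting.

        # One-pass candidate-concept extraction, assuming a "concept"
        # is a frequent word n-gram (n <= 3). Illustrative only.
        import re
        from collections import Counter
        from itertools import islice

        def ngrams(tokens, n):
            return zip(*(islice(tokens, i, None) for i in range(n)))

        def candidate_concepts(doc_stream, max_n=3, min_freq=2):
            counts = Counter()
            for doc in doc_stream:  # single linear pass over the corpus
                tokens = re.findall(r"[a-z]+", doc.lower())
                for n in range(1, max_n + 1):
                    counts.update(" ".join(g) for g in ngrams(tokens, n))
            return {c for c, f in counts.items() if f >= min_freq}

        docs = ["data mining on the web", "web data mining techniques"]
        print(sorted(candidate_concepts(docs)))
        # ['data', 'data mining', 'mining', 'web']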

    A TWITTER-INTEGRATED WEB SYSTEM TO AGGREGATE AND PROCESS EMERGENCY-RELATED DATA

    A major challenge when responding to time-sensitive, information-critical emergencies is to source raw volunteered data from on-site public sources and extract information that can enhance awareness of the emergency itself from a geographical context. This research explores the use of Twitter in the emergency domain by developing a Twitter-integrated web system capable of aggregating and processing emergency-related tweet data. The objectives of the project are to collect volunteered tweet data on emergencies from public citizen sources via the Twitter API, process the data based on geo-location information and syntax into organized informational entities relevant to an emergency, and subsequently deliver the information on a map-like interface. The web system framework is targeted at organizations that seek to transform volunteered emergency-related data available on the Twitter platform into timely, useful emergency alerts that can enhance situational awareness, and it is intended to be accessible to the public through a user-friendly web interface. Rapid Application Development (RAD) was the methodology of choice for project development. The developed system achieved a System Usability Scale score of 84.25, tabulated from a usability survey of 20 respondents. The system is best suited to emergencies where the transmission of timely, quantitative data is of paramount importance, and it offers a useful framework for extracting and displaying emergency alerts with a geographical perspective based on volunteered citizen tweets. It is hoped that the project can ultimately contribute to the existing domain of knowledge on social media-assisted emergency applications.
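
    The aggregation step might resemble the following Python sketch, which assumes the Twitter API v2 recent-search endpoint. The bearer token is a placeholder, the keywords are illustrative rather than the paper's actual filters, and the has:geo operator may require elevated API access.

        import requests

        BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder credential
        SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"

        def fetch_emergency_tweets(keywords):
            # Restrict to geo-tagged tweets; has:geo may need elevated access.
            params = {
                "query": "(" + " OR ".join(keywords) + ") has:geo",
                "tweet.fields": "created_at,geo,text",
                "max_results": 100,
            }
            headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
            resp = requests.get(SEARCH_URL, headers=headers, params=params)
            resp.raise_for_status()
            return resp.json().get("data", [])

        # Illustrative keywords, not the paper's actual emergency filters.
        for tweet in fetch_emergency_tweets(["flood", "earthquake"]):
            print(tweet["created_at"], tweet.get("geo"), tweet["text"][:80])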

    Abusive Language Detection in Online Conversations by Combining Content- and Graph-based Features

    In recent years, online social networks have allowed worldwide users to meet and discuss. As guarantors of these communities, the administrators of these platforms must prevent users from adopting inappropriate behaviors. This verification task, mainly done by humans, is increasingly difficult due to the ever-growing number of messages to check. Methods have been proposed to automate this moderation process, mainly approaches based on the textual content of the exchanged messages. Recent work has also shown that features derived from the structure of conversations, in the form of conversational graphs, can help detect these abusive messages. In this paper, we propose to take advantage of both sources of information through fusion methods integrating content- and graph-based features. Our experiments on raw chat logs show that the content of the messages and their dynamics within a conversation contain partially complementary information, allowing performance improvements on an abusive message classification task, with a final F-measure of 93.26%.
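
    A minimal Python sketch of early fusion on toy data follows, assuming the graph features reduce to an author's degree in a who-replies-to-whom graph; the paper's feature set and fusion strategies are considerably richer than this illustration.

        import networkx as nx
        from scipy.sparse import csr_matrix, hstack
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression

        # Toy data; real chat logs are far larger and noisier.
        messages = ["you are an idiot", "nice play, well done",
                    "shut up loser", "good game everyone"]
        authors = ["u1", "u2", "u1", "u3"]
        labels = [1, 0, 1, 0]  # 1 = abusive

        # Content-based features: TF-IDF over the message text.
        X_text = TfidfVectorizer().fit_transform(messages)

        # Graph-based feature: author degree in a toy reply graph.
        G = nx.DiGraph([("u1", "u2"), ("u2", "u1"), ("u3", "u1")])
        X_graph = csr_matrix([[float(G.degree(a))] for a in authors])

        # Early fusion: concatenate both feature blocks, then classify.
        X = hstack([X_text, X_graph])
        clf = LogisticRegression().fit(X, labels)
        print(clf.predict(X))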

    Extracting Web Information using Representation Patterns

    Feeding decision support systems with Web information typically requires sifting through an unwieldy amount of information that is available in human-friendly formats only. Our focus is on a proposal to extract information from semi-structured documents in a structured format, with an emphasis on it being scalable and open. By semi-structured we mean that it must focus on information that is rendered using regular formats, not free text; by scalable, we mean that the system must require a minimum amount of human intervention and must not be targeted at extracting information from a particular domain or web site; by open, we mean that it must extract as much useful information as possible and not be subject to any pre-defined data model. In the literature, there is only one open but not scalable proposal, since it requires human supervision on a per-domain basis. In this paper, we present a new proposal that relies on a number of heuristics to identify patterns that are typically used to represent the information in a web document. Our experimental results confirm that our proposal is very competitive in terms of effectiveness and efficiency.
    Ministerio de Economía y Competitividad TIN2016-75394-R
    Ministerio de Economía y Competitividad TIN2013-40848-
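
    One such heuristic could resemble the following Python sketch, which assumes that text nodes sharing a repeated root-to-node tag path are rendered by a template and therefore likely carry data. The paper combines several heuristics; this example shows only the general idea on a hypothetical document.

        from collections import Counter
        from bs4 import BeautifulSoup

        HTML = """
        <html><body>
          <div class="item"><b>Alice</b><i>admin</i></div>
          <div class="item"><b>Bob</b><i>user</i></div>
          <p>About this site</p>
        </body></html>
        """

        soup = BeautifulSoup(HTML, "html.parser")
        paths, texts = Counter(), {}
        for node in soup.find_all(string=True):
            text = node.strip()
            if not text:
                continue
            # Root-to-node tag path, e.g. "html/body/div/b".
            path = "/".join(p.name for p in reversed(list(node.parents))
                            if p.name and p.name != "[document]")
            paths[path] += 1
            texts.setdefault(path, []).append(text)

        # Paths that repeat are taken as representation patterns.
        for path, count in paths.items():
            if count > 1:
                print(path, "->", texts[path])
        # html/body/div/b -> ['Alice', 'Bob']
        # html/body/div/i -> ['admin', 'user']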