    Web Mining Evolution & Comparative Study with Data Mining

    Web technology is evolving very fast, and the number of Internet users is growing much faster than estimated. Users visit a wide range of websites, leaving behind a variety of information. Website administrators should use this information to adapt their websites to their users. The aim of research in web mining is to develop new techniques for extracting and mining useful information or knowledge from web pages. The automated discovery of targeted or unexpected knowledge is thus a challenging task due to the heterogeneity and lack of structure of web data. In this paper we discuss the evolution of web mining and give a detailed description of its main categories. The paper also analyses data mining and compares data mining with web mining on the basis of various parameters.

    Automatically Extract Information from Web Documents

    The Internet can be considered a reservoir of useful information in textual form: product catalogs, airline schedules, stock market quotations, weather forecasts, etc. There has been much interest in building systems that gather such information on a user's behalf. But because these information resources are formatted differently, mechanically extracting their content is difficult. Systems using such resources typically rely on hand-coded wrappers, customized procedures for information extraction. Structured data objects are a very important type of information on the Web. Such data objects are often records from underlying databases, displayed in Web pages with fixed templates. Mining data records in Web pages is useful because they typically present their host pages' essential information, such as lists of products and services. Extracting these structured data objects enables one to integrate data from multiple Web pages to provide value-added services, e.g., comparative shopping, meta-querying, and search. Web content mining has thus become an area of interest for many researchers because of the phenomenal growth of Web content and the economic benefits associated with it. However, due to the heterogeneity of Web pages, the automated discovery of targeted information remains a challenging problem.
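
    To make the contrast with hand-coded wrappers concrete, the following Python sketch shows a minimal wrapper for a hypothetical templated page. The tag and class names (li.product, .name, .price) are illustrative assumptions, not taken from the paper; the point of the research is to discover such repeated template structures automatically rather than hard-coding them per site as done here.

        # A minimal hand-coded wrapper, assuming a hypothetical page that
        # renders one product record per <li class="product"> element.
        from bs4 import BeautifulSoup

        HTML = """
        <ul>
          <li class="product"><span class="name">Widget</span><span class="price">9.99</span></li>
          <li class="product"><span class="name">Gadget</span><span class="price">19.99</span></li>
        </ul>
        """

        def extract_records(html):
            """Pull one dict per repeated record node."""
            soup = BeautifulSoup(html, "html.parser")
            records = []
            for item in soup.select("li.product"):
                records.append({
                    "name": item.select_one(".name").get_text(strip=True),
                    "price": float(item.select_one(".price").get_text(strip=True)),
                })
            return records

        print(extract_records(HTML))
        # [{'name': 'Widget', 'price': 9.99}, {'name': 'Gadget', 'price': 19.99}]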

    Concept Based Search Engine: Concept Creation

    Data on the Internet is increasing exponentially every second. There are billions of documents on the World Wide Web. Each document contains multiple concepts (a concept being an abstract or general idea inferred from specific instances). In this paper, we show how we created and implemented an algorithm for extracting concepts from a set of documents. These concepts can be used by a search engine to generate search results that cater to the needs of the user, making the results more targeted than those of the usual keyword search. The main problem was to extract concepts from a set of documents. Each page could have thousands of combinations that are potential concepts, and an average document could contain millions of them. Combined with the vast amount of data on the web, this amounts to an enormous number of samples. As a result, the main areas of concern are main-memory constraints and the time complexity of the algorithm. This paper introduces an algorithm that is scalable, independent of main memory, and linear in time complexity.
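
    A minimal Python sketch of the underlying idea follows, assuming that candidate concepts are frequent word n-grams counted in a single linear pass over the corpus. This does not reproduce the paper's actual algorithm, and a truly memory-independent version would replace the in-memory counter with disk-backed or sketch-based counting.

        # One-pass candidate-concept extraction, assuming a "concept"
        # is a frequent word n-gram (n <= 3). Illustrative only.
        import re
        from collections import Counter
        from itertools import islice

        def ngrams(tokens, n):
            return zip(*(islice(tokens, i, None) for i in range(n)))

        def candidate_concepts(doc_stream, max_n=3, min_freq=2):
            counts = Counter()
            for doc in doc_stream:  # single linear pass over the corpus
                tokens = re.findall(r"[a-z]+", doc.lower())
                for n in range(1, max_n + 1):
                    counts.update(" ".join(g) for g in ngrams(tokens, n))
            return {c for c, f in counts.items() if f >= min_freq}

        docs = ["data mining on the web", "web data mining techniques"]
        print(sorted(candidate_concepts(docs)))
        # ['data', 'data mining', 'mining', 'web']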

    A TWITTER-INTEGRATED WEB SYSTEM TO AGGREGATE AND PROCESS EMERGENCY-RELATED DATA

    A major challenge when responding to time-sensitive, information-critical emergencies is to source raw volunteered data from on-site public sources and extract information that can enhance awareness of the emergency itself from a geographical context. This research explores the use of Twitter in the emergency domain by developing a Twitter-integrated web system capable of aggregating and processing emergency-related tweet data. The objectives of the project are to collect volunteered tweet data on emergencies from public citizen sources via the Twitter API, process the data based on geo-location information and syntax into organized informational entities relevant to an emergency, and subsequently deliver the information on a map-like interface. The web system framework is targeted at organizations that seek to transform volunteered emergency-related data available on the Twitter platform into timely, useful emergency alerts that can enhance situational awareness, and it is intended to be accessible to the public through a user-friendly web interface. Rapid Application Development (RAD) was the methodology of choice for project development. The developed system achieved a System Usability Scale score of 84.25, tabulated from a usability survey of 20 respondents. The system is best suited to emergencies where the transmission of timely, quantitative data is of paramount importance, and it offers a useful framework for extracting and displaying emergency alerts with a geographical perspective based on volunteered citizen tweets. It is hoped that the project can ultimately contribute to the existing domain of knowledge on social media-assisted emergency applications.
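
    The aggregation step might resemble the following Python sketch, which assumes the Twitter API v2 recent-search endpoint. The bearer token is a placeholder, the keywords are illustrative rather than the paper's actual filters, and the has:geo operator may require elevated API access.

        import requests

        BEARER_TOKEN = "YOUR_BEARER_TOKEN"  # placeholder credential
        SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"

        def fetch_emergency_tweets(keywords):
            # Restrict to geo-tagged tweets; has:geo may need elevated access.
            params = {
                "query": "(" + " OR ".join(keywords) + ") has:geo",
                "tweet.fields": "created_at,geo,text",
                "max_results": 100,
            }
            headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
            resp = requests.get(SEARCH_URL, headers=headers, params=params)
            resp.raise_for_status()
            return resp.json().get("data", [])

        # Illustrative keywords, not the paper's actual emergency filters.
        for tweet in fetch_emergency_tweets(["flood", "earthquake"]):
            print(tweet["created_at"], tweet.get("geo"), tweet["text"][:80])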

    Abusive Language Detection in Online Conversations by Combining Content- and Graph-based Features

    In recent years, online social networks have allowed worldwide users to meet and discuss. As guarantors of these communities, the administrators of these platforms must prevent users from adopting inappropriate behaviors. This verification task, mainly done by humans, is increasingly difficult due to the ever-growing number of messages to check. Methods have been proposed to automate this moderation process, mainly approaches based on the textual content of the exchanged messages. Recent work has also shown that features derived from the structure of conversations, in the form of conversational graphs, can help detect these abusive messages. In this paper, we propose to take advantage of both sources of information through fusion methods integrating content- and graph-based features. Our experiments on raw chat logs show that the content of the messages and their dynamics within a conversation contain partially complementary information, allowing performance improvements on an abusive message classification task, with a final F-measure of 93.26%.
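
    A minimal Python sketch of early fusion on toy data follows, assuming the graph features reduce to an author's degree in a who-replies-to-whom graph; the paper's feature set and fusion strategies are considerably richer than this illustration.

        import networkx as nx
        from scipy.sparse import csr_matrix, hstack
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression

        # Toy data; real chat logs are far larger and noisier.
        messages = ["you are an idiot", "nice play, well done",
                    "shut up loser", "good game everyone"]
        authors = ["u1", "u2", "u1", "u3"]
        labels = [1, 0, 1, 0]  # 1 = abusive

        # Content-based features: TF-IDF over the message text.
        X_text = TfidfVectorizer().fit_transform(messages)

        # Graph-based feature: author degree in a toy reply graph.
        G = nx.DiGraph([("u1", "u2"), ("u2", "u1"), ("u3", "u1")])
        X_graph = csr_matrix([[float(G.degree(a))] for a in authors])

        # Early fusion: concatenate both feature blocks, then classify.
        X = hstack([X_text, X_graph])
        clf = LogisticRegression().fit(X, labels)
        print(clf.predict(X))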

    Extracting Web Information using Representation Patterns

    Feeding decision support systems with Web information typically requires sifting through an unwieldy amount of information that is available in human-friendly formats only. Our focus is on a proposal to extract information from semi-structured documents in a structured format, with an emphasis on it being scalable and open. By semi-structured we mean that it must focus on information that is rendered using regular formats, not free text; by scalable, we mean that the system must require a minimum amount of human intervention and must not be targeted at extracting information from a particular domain or web site; by open, we mean that it must extract as much useful information as possible and not be subject to any pre-defined data model. In the literature, there is only one open but not scalable proposal, since it requires human supervision on a per-domain basis. In this paper, we present a new proposal that relies on a number of heuristics to identify patterns that are typically used to represent the information in a web document. Our experimental results confirm that our proposal is very competitive in terms of effectiveness and efficiency.
    Ministerio de Economía y Competitividad TIN2016-75394-R
    Ministerio de Economía y Competitividad TIN2013-40848-
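
    One such heuristic could resemble the following Python sketch, which assumes that text nodes sharing a repeated root-to-node tag path are rendered by a template and therefore likely carry data. The paper combines several heuristics; this example shows only the general idea on a hypothetical document.

        from collections import Counter
        from bs4 import BeautifulSoup

        HTML = """
        <html><body>
          <div class="item"><b>Alice</b><i>admin</i></div>
          <div class="item"><b>Bob</b><i>user</i></div>
          <p>About this site</p>
        </body></html>
        """

        soup = BeautifulSoup(HTML, "html.parser")
        paths, texts = Counter(), {}
        for node in soup.find_all(string=True):
            text = node.strip()
            if not text:
                continue
            # Root-to-node tag path, e.g. "html/body/div/b".
            path = "/".join(p.name for p in reversed(list(node.parents))
                            if p.name and p.name != "[document]")
            paths[path] += 1
            texts.setdefault(path, []).append(text)

        # Paths that repeat are taken as representation patterns.
        for path, count in paths.items():
            if count > 1:
                print(path, "->", texts[path])
        # html/body/div/b -> ['Alice', 'Bob']
        # html/body/div/i -> ['admin', 'user']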