Web Mining Evolution & Comparative Study with Data Mining
Web technology is evolving very fast, and the number of Internet users is growing much faster than estimated. Users visit a wide range of websites, leaving behind a variety of information. Website administrators can use this information to adapt their websites to their users. The aim of research in web mining is to develop new techniques for extracting and mining useful information or knowledge from web pages. Automated discovery of targeted or unexpected knowledge is thus a challenging task, owing to the heterogeneity and lack of structure of web data. In this paper we discuss the evolution of web mining and give a detailed description of its constituent areas. The paper also analyses data mining and compares data mining with web mining on the basis of various parameters
Automatically Extract Information from Web Documents
The Internet can be considered a reservoir of useful information in textual form — product catalogs, airline schedules, stock market quotations, weather forecasts, etc. There has been much interest in building systems that gather such information on a user's behalf. But because these information resources are formatted differently, mechanically extracting their content is difficult. Systems using such resources typically rely on hand-coded wrappers: customized procedures for information extraction. Structured data objects are a very important type of information on the Web. Such data objects are often records from underlying databases, displayed in Web pages with fixed templates. Mining data records in Web pages is useful because they typically present their host pages' essential information, such as lists of products and services. Extracting these structured data objects enables one to integrate data from multiple Web pages to provide value-added services, e.g., comparative shopping, meta-querying, and search. Web content mining has thus become an area of interest for many researchers because of the phenomenal growth of Web content and the economic benefits associated with it. However, due to the heterogeneity of Web pages, automated discovery of targeted information remains a challenging problem
Concept Based Search Engine: Concept Creation
Data on the internet is increasing exponentially every single second. There are billions and billions of documents on the World Wide Web (The Internet). Each document on the internet contains multiple concepts (an abstract or general idea inferred from specific instances).
In this paper, we show how we created and implemented an algorithm for extracting concepts from a set of documents. These concepts can be used by a search engine to generate search results that cater to the needs of the user. The search results will then be more targeted than those of the usual keyword search.
The main problem was to extract concepts from a set of documents. Each page could have thousands of word combinations that are potential concepts, and an average document could contain millions of them. Combine that with the vast amount of data on the web, and we are dealing with an enormous number of samples. As a result, the main areas of concern are main-memory constraints and the time complexity of the algorithm.
This paper introduces an algorithm that is scalable, is not constrained by main-memory size, and runs in linear time
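The abstract does not detail the algorithm itself. As a rough illustration of the general idea, the following minimal sketch counts frequent word n-grams as candidate concepts in a single streaming pass, so running time is linear in the corpus size; the function name and thresholds are assumptions for illustration, not the paper's method:

```python
from collections import Counter

def candidate_concepts(documents, max_len=3, min_count=2):
    """Extract frequent word n-grams as candidate concepts.

    Streams over documents one at a time, so peak memory depends on
    the number of distinct n-grams kept, not on the corpus size.
    """
    counts = Counter()
    for doc in documents:
        words = doc.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    # Keep only n-grams seen often enough to be plausible concepts.
    return {gram: c for gram, c in counts.items() if c >= min_count}

docs = [
    "web mining extracts knowledge from web pages",
    "concept mining extracts concepts from web pages",
]
concepts = candidate_concepts(docs)  # e.g. "web pages" appears twice
```

A production system would add stop-word filtering and spill counts to disk once they exceed a memory budget, but the single-pass counting structure is what keeps the time complexity linear.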
A TWITTER-INTEGRATED WEB SYSTEM TO AGGREGATE AND PROCESS EMERGENCY-RELATED DATA
A major challenge when encountering time-sensitive, information-critical
emergencies is to source raw volunteered data from on-site public sources and
extract information which can enhance awareness of the emergency itself from a
geographical context. This research explores the use of Twitter in the emergency
domain by developing a Twitter-integrated web system capable of aggregating and
processing emergency-related tweet data. The objectives of the project are to collect
volunteered tweet data on emergencies by public citizen sources via the Twitter API,
process the data based on geo-location information and syntax into organized
informational entities relevant to an emergency, and subsequently deliver the
information on a map-like interface. The web system framework is targeted for use
by organizations which seek to transform volunteered emergency-related data
available on the Twitter platform into timely, useful emergency alerts which can
enhance situational awareness, and is intended to be accessible to the public through
a user-friendly web interface. Rapid Application Development (RAD) is the
methodology of choice for project development. The developed system has a system
usability scale score of 84.25, after results were tabulated from a usability survey on
20 respondents. The system is best suited to emergencies where the transmission
of timely, quantitative data is of paramount importance, and it provides a useful
framework for extracting and displaying emergency alerts with a geographical perspective
based on volunteered citizen Tweets. It is hoped that the project can ultimately
contribute to the existing domain of knowledge on social media-assisted emergency
applications
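The geo-location processing step described above can be illustrated with a toy sketch that buckets geo-tagged tweets into grid cells so nearby reports can be aggregated into a single map marker. The tweet dictionaries and field names below are assumptions for illustration, not the actual Twitter API schema:

```python
from collections import defaultdict

def group_by_location(tweets, cell_size=0.1):
    """Bucket geo-tagged tweets into lat/lon grid cells so that
    nearby reports about the same emergency can be aggregated."""
    cells = defaultdict(list)
    for t in tweets:
        coords = t.get("coordinates")
        if not coords:
            continue  # skip tweets that carry no geo-location
        lat, lon = coords
        key = (round(lat / cell_size), round(lon / cell_size))
        cells[key].append(t["text"])
    return dict(cells)

# Hypothetical tweet records (not real API output).
tweets = [
    {"text": "flooding on main street", "coordinates": (3.14, 101.69)},
    {"text": "water rising near market", "coordinates": (3.15, 101.70)},
    {"text": "no location info"},
]
clusters = group_by_location(tweets)
```

Each resulting cell could then be rendered as one marker on the map-like interface, with the collected texts shown as the supporting evidence for an alert.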
Abusive Language Detection in Online Conversations by Combining Content- and Graph-Based Features
In recent years, online social networks have allowed worldwide users to meet
and discuss. As guarantors of these communities, the administrators of these
platforms must prevent users from adopting inappropriate behaviors. This
verification task, mainly done by humans, is increasingly difficult due to the
ever-growing amount of messages to check. Methods have been proposed to
automate this moderation process, mainly through approaches based on the
textual content of the exchanged messages. Recent work has also shown that
characteristics derived from the structure of conversations, in the form of
conversational graphs, can help detect these abusive messages. In this
paper, we take advantage of both sources of information by proposing
fusion methods integrating content- and graph-based features. Our experiments on
raw chat logs show that both the content of the messages and their dynamics
within a conversation carry partially complementary information, allowing
performance improvements on an abusive message classification task with a final
F-measure of 93.26%
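The abstract does not specify which fusion methods are used. Two common options, early fusion (concatenating feature vectors before classification) and late fusion (combining the scores of separate classifiers), can be sketched as follows; all feature names and values are illustrative assumptions, not the paper's actual feature set:

```python
def early_fusion(content_vec, graph_vec):
    """Early fusion: concatenate content- and graph-based feature
    vectors into a single representation per message, which is then
    fed to any standard classifier."""
    return content_vec + graph_vec

def late_fusion(content_score, graph_score, alpha=0.5):
    """Late fusion: combine the abuse probabilities produced by two
    separately trained classifiers with a weighted average."""
    return alpha * content_score + (1 - alpha) * graph_score

# Illustrative values only.
msg_content = [0.9, 3.0]  # e.g. toxicity score, flagged-word count
msg_graph = [5.0, 0.2]    # e.g. author degree, clustering coefficient
vec = early_fusion(msg_content, msg_graph)   # one 4-dim fused vector
score = late_fusion(0.8, 0.6)                # blended abuse probability
```

The choice between the two is an empirical one: early fusion lets the classifier learn interactions between the views, while late fusion keeps the two models independent and easy to retrain separately.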
Extracting Web Information using Representation Patterns
Feeding decision support systems with Web information typically
requires sifting through an unwieldy amount of information that is
available in human-friendly formats only. Our focus is on a
proposal to extract information from semi-structured documents
in a structured format, with an emphasis on it being scalable and
open. By semi-structured, we mean that it must focus on information that is rendered using regular formats, not free text; by scalable, we mean that the system must require a minimum amount of human intervention and must not be targeted at extracting information from a particular domain or web site; by open, we mean that it must extract as much useful information as possible and not be subject to any pre-defined data model. In the literature, there is only one open but not scalable proposal, since it requires human supervision on a per-domain basis. In this paper, we present a new proposal that relies on a number of heuristics to identify the patterns that are typically used to represent information in a web document. Our experimental results confirm that our proposal is very competitive in terms of effectiveness and efficiency.
Ministerio de Economía y Competitividad TIN2016-75394-R. Ministerio de Economía y Competitividad TIN2013-40848-
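The heuristics themselves are not spelled out in the abstract. As a toy illustration of the underlying intuition, that repeated markup structure often signals a list of data records, one might look for the most frequently repeated tag and extract its contents. This regex-based sketch is a deliberately simplified assumption, not the authors' system:

```python
import re
from collections import Counter

def repeated_record_pattern(html):
    """Find the most frequently repeated opening tag, a crude proxy
    for the representation pattern of a list of data records, and
    return the text content of each occurrence."""
    tags = re.findall(r"<(\w+)[^>]*>", html)  # opening tags only
    tag, _ = Counter(tags).most_common(1)[0]
    records = re.findall(rf"<{tag}[^>]*>(.*?)</{tag}>", html, re.S)
    return tag, records

page = """
<ul>
  <li>Widget A - $10</li>
  <li>Widget B - $12</li>
  <li>Widget C - $15</li>
</ul>
"""
tag, records = repeated_record_pattern(page)  # tag == "li"
```

A real system would parse the DOM tree and compare subtree shapes rather than raw tag counts, but the principle is the same: regularity in the rendering is evidence of an underlying record template.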