The contribution of data mining to information science
The information explosion is a serious challenge for current information institutions. Data mining, the search for valuable information in large volumes of data, is one way to meet this challenge. In the past several years, data mining has made a significant contribution to the field of information science. This paper examines the impact of data mining by reviewing existing applications, including personalized environments, electronic commerce, and search engines. For each of these three types of application, we discuss how data mining can enhance its functions. The reader should come away with an overview of the state-of-the-art research associated with these applications. Furthermore, we identify the limitations of current work and raise several directions for future research.
FilteredWeb: A Framework for the Automated Search-Based Discovery of Blocked URLs
Various methods have been proposed for creating and maintaining lists of
potentially filtered URLs to allow for measurement of ongoing internet
censorship around the world. Whilst testing a known resource for evidence of
filtering can be relatively simple, given appropriate vantage points,
discovering previously unknown filtered web resources remains an open
challenge.
We present a new framework for automating the process of discovering filtered
resources through the use of adaptive queries to well-known search engines. Our
system applies information retrieval algorithms to isolate characteristic
linguistic patterns in known filtered web pages; these are then used as the
basis for web search queries. The results of these queries are then checked for
evidence of filtering, and newly discovered filtered resources are fed back
into the system to detect further filtered content.
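
The feedback loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract says only that "information retrieval algorithms" are used, so the TF-IDF-style scoring here is an assumption, and `search`, `is_filtered`, and every other name are hypothetical stand-ins for a real search-engine client and a real filtering test.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, Iterable, List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def characteristic_terms(filtered_docs: List[str],
                         background_docs: List[str],
                         k: int = 3) -> List[str]:
    """Rank terms by frequency in filtered pages, weighted by rarity overall.

    Assumption: a TF-IDF-style score stands in for the unspecified
    information retrieval algorithms mentioned in the abstract.
    """
    doc_sets = [set(tokenize(d)) for d in filtered_docs + background_docs]
    tf = Counter(t for d in filtered_docs for t in tokenize(d))
    n = len(doc_sets)

    def score(term: str) -> float:
        df = sum(1 for d in doc_sets if term in d)
        return tf[term] * math.log((1 + n) / (1 + df))

    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(tf, key=lambda t: (-score(t), t))[:k]


def discover(seed_pages: Dict[str, str],
             background_docs: List[str],
             search: Callable[[str], Iterable[Tuple[str, str]]],
             is_filtered: Callable[[str], bool],
             rounds: int = 2,
             k: int = 3) -> Set[str]:
    """Return newly discovered filtered URLs.

    seed_pages maps known filtered URLs to their text. `search` (hypothetical)
    yields (url, text) pairs for a query term; `is_filtered` (hypothetical)
    tests a URL for evidence of filtering.
    """
    known = dict(seed_pages)
    for _ in range(rounds):
        terms = characteristic_terms(list(known.values()), background_docs, k)
        found: Dict[str, str] = {}
        for term in terms:
            for url, text in search(term):
                if url not in known and url not in found and is_filtered(url):
                    found[url] = text
        if not found:
            break  # nothing new this round, so further rounds would repeat
        known.update(found)  # feed discoveries back into the next round
    return set(known) - set(seed_pages)
```

Passing `search` and `is_filtered` in as callables keeps the sketch free of network access; a real deployment would need vantage points inside the filtered network to implement the filtering test.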
Our implementation of this framework, applied to China as a case study, shows
that this approach is demonstrably effective at detecting significant numbers
of previously unknown filtered web pages, making a significant contribution to
the ongoing detection of internet filtering as it develops.
Our tool is currently deployed and has been used to discover 1355 domains
that are poisoned within China as of Feb 2017 - 30 times more than are
contained in the most widely-used public filter list. Of these, 759 are outside
of the Alexa Top 1000 domains list, demonstrating the capability of this
framework to find more obscure filtered content. Further, our initial analysis
of filtered URLs, and the search terms that were used to discover them, gives
further insight into the nature of the content currently being blocked in
China. Comment: To appear in "Network Traffic Measurement and Analysis Conference
2017" (TMA2017)
Open issues in semantic query optimization in relational DBMS
After two decades of research into Semantic Query Optimization (SQO) there is clear agreement as to the efficacy of SQO. However, although there are some experimental implementations, there are still no commercial implementations. We first present a thorough analysis of research into SQO. We identify three problems which inhibit the effective use of SQO in Relational Database Management Systems (RDBMS). We then propose solutions to these problems and describe first steps towards the implementation of an effective semantic query optimizer for relational databases.
Enterprise application reuse: Semantic discovery of business grid services
Web services have emerged as a prominent paradigm for the development of distributed software systems as they provide the potential for software to be modularized such that functionality can be described, discovered and deployed in a platform-independent manner over a network (e.g., intranets, extranets and the Internet). This paper examines an extension of this paradigm to encompass ‘Grid Services’, which enables software capabilities to be recast with an operational focus and support a heterogeneous mix of business software and data, termed a Business Grid - "the grid of semantic services". The current industrial representation of services is predominantly syntactic, however, lacking the fundamental semantic underpinnings required to fulfill the goals of any semantically-oriented Grid. Consequently, the use of semantic technology in support of business software heterogeneity is investigated as a likely tool to support a diverse and distributed software inventory and user. A service discovery architecture is therefore developed that (1) is distributed in form, (2) supports distributed service knowledge and (3) automatically extends service knowledge (as greater descriptive precision is inferred from the operating application system). This discovery engine is used to execute several real-world scenarios in order to develop and test a framework for engineering such grid service knowledge. The examples presented comprise software components taken from a group of Investment Banking systems. Resulting from the research is a framework for engineering servic
Digital Creativity Support for Original Journalism
The decline in circulations and revenues resulting from the digitalization of news production and consumption has led to a crisis in journalism. Journalists have less time to research, investigate and write original stories, leading to problems for our democratic processes and for holding the powerful to account. This paper reports the architecture, features and rationale for new digital creativity support designed to help journalists discover more original angles on stories. It also summarises the evaluation of the tool’s use in 3 newsrooms.
How Much of the Web Is Archived?
Although the Internet Archive's Wayback Machine is the largest and most
well-known web archive, there have been a number of public web archives that
have emerged in the last several years. With varying resources, audiences and
collection development policies, these archives have varying levels of overlap
with each other. While individual archives can be measured in terms of number
of URIs, number of copies per URI, and intersection with other archives, to
date there has been no answer to the question "How much of the Web is
archived?" We study the question by approximating the Web using sample URIs
from DMOZ, Delicious, Bitly, and search engine indexes, and counting the
number of copies of the sample URIs that exist in various public web archives. Each
sample set provides its own bias. The results from our sample sets indicate
that 35%-90% of the Web has at least one archived copy, 17%-49% has
between 2-5 copies, 1%-8% has 6-10 copies, and 8%-63% has more than 10 copies
in public web archives. The number of URI copies varies as a function of time,
but no more than 31.3% of URIs are archived more than once per month. Comment: This is the long version of the short paper by the same title
published at JCDL'11. 10 pages, 5 figures, 7 tables. Version 2 includes minor
typographical correction
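
The bucketed percentages reported in this abstract can be computed from per-URI copy counts along the following lines. This is a hedged sketch only: it assumes the number of archived copies for each sample URI has already been collected (how the authors queried the archives is not stated here), and the function and key names are hypothetical.

```python
from typing import Callable, Dict, Iterable


def coverage_buckets(copy_counts: Iterable[int]) -> Dict[str, float]:
    """Return the percentage of sample URIs falling in each copy-count bucket.

    copy_counts holds one integer per sample URI: the number of archived
    copies found for it across the archives (assumed precomputed).
    """
    counts = list(copy_counts)
    if not counts:
        raise ValueError("need at least one sample URI")
    total = len(counts)

    def share(pred: Callable[[int], bool]) -> float:
        return 100.0 * sum(1 for c in counts if pred(c)) / total

    # Bucket boundaries follow the ranges quoted in the abstract.
    return {
        "at_least_1": share(lambda c: c >= 1),
        "2_to_5": share(lambda c: 2 <= c <= 5),
        "6_to_10": share(lambda c: 6 <= c <= 10),
        "more_than_10": share(lambda c: c > 10),
    }
```

Running this once per sample set (DMOZ, Delicious, Bitly, search engine indexes) would give one set of percentages per source, which is how a range such as 35%-90% can arise.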