10,950 research outputs found
Ontology Based Approach for Services Information Discovery using Hybrid Self Adaptive Semantic Focused Crawler
Focused crawling aims specifically at searching out pages that are relevant to a predefined set of topics. Since an ontology is a well-formed knowledge representation, ontology-based focused crawling methodologies have come under investigation. Crawling is one of the essential techniques for building knowledge repositories. The purpose of a semantic focused crawler is to automatically discover, annotate, and classify service information using Semantic Web technologies. Here, a framework for a hybrid self-adaptive semantic focused crawler (HSASF crawler) is presented, with the motivation of effectively discovering and organizing service information over the Internet while considering three essential issues. A semi-supervised framework has been designed to automatically select the optimal threshold values for each concept, achieving optimal performance without being constrained by the size of the training data set.
DOI: 10.17762/ijritcc2321-8169.15072
iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling
Researchers in the Digital Humanities and journalists need to monitor,
collect and analyze fresh online content regarding current events such as the
Ebola outbreak or the Ukraine crisis on demand. However, existing focused
crawling approaches only consider topical aspects while ignoring temporal
aspects and therefore cannot achieve thematically coherent and fresh Web
collections. Especially Social Media provide a rich source of fresh content,
which is not used by state-of-the-art focused crawlers. In this paper we
address the issues of enabling the collection of fresh and relevant Web and
Social Web content for a topic of interest through seamless integration of Web
and Social Media in a novel integrated focused crawler. The crawler collects
Web and Social Media content in a single system and exploits the stream of
fresh Social Media content for guiding the crawler.
Comment: Published in the Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries 201
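The integration described above, using fresh social media posts as seeds that guide a topical Web crawl, can be sketched as a relevance-ordered priority-queue crawler. This is a minimal illustration, not the iCrawl implementation: the toy link graph, the term-overlap relevance score, and the URL names are all hypothetical stand-ins for live fetching and real scoring.

```python
import heapq

# Toy link graph standing in for live fetches; in a real crawler these
# would be HTTP requests to social APIs and Web pages.
PAGES = {
    "social://post1": {"text": "ebola outbreak update", "links": ["web://news1"]},
    "web://news1": {"text": "ebola virus spread report", "links": ["web://sports1"]},
    "web://sports1": {"text": "basketball scores", "links": []},
}
TOPIC = {"ebola", "outbreak", "virus"}

def relevance(text):
    """Fraction of topic terms present in the page text."""
    return len(TOPIC & set(text.split())) / len(TOPIC)

def focused_crawl(seeds, budget=10):
    # Max-heap keyed by topical relevance (negated for heapq's min-heap).
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen, collected = set(seeds), []
    while frontier and len(collected) < budget:
        _, url = heapq.heappop(frontier)
        page = PAGES.get(url)
        if page is None:
            continue
        score = relevance(page["text"])
        collected.append((url, score))
        for link in page["links"]:
            if link not in seen:
                seen.add(link)
                # A link inherits the relevance of the page that cites it.
                heapq.heappush(frontier, (-score, link))
    return collected

# Fresh social media posts act as seeds that guide the Web crawl.
result = focused_crawl(["social://post1"])
```

The social stream keeps the frontier seeded with current, on-topic entry points, so the crawler reaches fresh Web content before it drifts off-topic.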
Focused browsing: Providing topical feedback for link selection in hypertext browsing
When making decisions about whether to navigate to a linked page, users of standard browsers of hypertextual documents returned by an information retrieval search engine are entirely reliant on the content of the anchor text associated with links and the surrounding text. This information is often insufficient for them to make reliable decisions about whether to open a linked page, and they can find themselves following many links to pages which are not helpful, with subsequent return to the previous page. We describe a prototype focused browsing application which provides feedback on the likely usefulness of each page linked from the current one, and a term cloud preview of the contents of each linked page. Results from an exploratory experiment suggest that users can find this useful in improving their search efficiency.
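The term-cloud preview mentioned above can be approximated by weighting the most frequent content words of the linked page. A minimal sketch follows; the stopword list and weighting scheme are illustrative assumptions, not the prototype's actual method.

```python
from collections import Counter
import re

# A tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "of", "to", "and", "in", "is"}

def term_cloud(page_text, top_n=5):
    """Build a small term cloud (term -> relative weight) previewing a page."""
    words = re.findall(r"[a-z]+", page_text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    total = sum(counts.values()) or 1
    return {term: round(n / total, 2) for term, n in counts.most_common(top_n)}

preview = term_cloud(
    "The crawler visits the page and the crawler indexes the page content"
)
```

Rendering these weights as font sizes next to each link gives the user a glance at the linked page's topic before committing to the navigation.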
Building a domain-specific document collection for evaluating metadata effects on information retrieval
This paper describes the development of a structured document collection containing user-generated text and numerical metadata for exploring the exploitation of metadata in information retrieval (IR). The collection consists of more than 61,000 documents extracted from YouTube video pages on basketball in general and NBA (National Basketball Association) in particular, together with a set of 40 topics and their relevance judgements. In addition, a collection of nearly 250,000 user profiles related to the NBA collection is available. Several baseline IR experiments report the effect of using video-associated metadata on retrieval effectiveness. The results
surprisingly show that searching the videos' titles only performs significantly better than searching additional metadata text fields of the videos, such as the tags or the description.
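The title-only finding above can be explored with a toy fielded-retrieval sketch: restricting which fields are searched changes which documents match. The documents and term-overlap scoring here are hypothetical, but they show how noisy tags can pull in off-topic results.

```python
# Two toy "video" documents; the second has misleading tags.
DOCS = [
    {"title": "NBA finals highlights", "tags": "music remix", "description": "fan video"},
    {"title": "guitar lesson", "tags": "nba basketball", "description": "chords"},
]

def search(query, fields):
    """Rank documents by query-term overlap within the chosen fields."""
    terms = set(query.lower().split())
    scored = []
    for i, doc in enumerate(DOCS):
        text = " ".join(doc[f] for f in fields).lower().split()
        score = sum(1 for w in text if w in terms)
        if score:
            scored.append((score, i))
    return [i for _, i in sorted(scored, reverse=True)]

title_only = search("NBA finals", ["title"])
all_fields = search("NBA finals", ["title", "tags", "description"])
```

Title-only search returns just the relevant video, while adding tags and descriptions also retrieves the guitar lesson via its tag spam, which is one plausible mechanism behind the reported drop in effectiveness.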
LiveRank: How to Refresh Old Datasets
This paper considers the problem of refreshing a dataset. More precisely,
given a collection of nodes gathered at some time (Web pages, users from an
online social network) along with some structure (hyperlinks, social
relationships), we want to identify a significant fraction of the nodes that
still exist at present time. The liveness of an old node can be tested through
an online query at present time. We call LiveRank a ranking of the old pages so
that active nodes are more likely to appear first. The quality of a LiveRank is
measured by the number of queries necessary to identify a given fraction of the
active nodes when using the LiveRank order. We study different scenarios from a
static setting where the LiveRank is computed before any query is made, to
dynamic settings where the LiveRank can be updated as queries are processed.
Our results show that building on the PageRank can lead to efficient LiveRanks,
for Web graphs as well as for online social networks.
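The PageRank-based approach described above can be sketched end to end: rank the old nodes by PageRank, then count how many liveness queries are spent before a target fraction of the live nodes is found. The graph, the set of live nodes, and the parameters below are toy stand-ins, not the paper's datasets.

```python
# Toy old snapshot: adjacency lists, plus a hidden ground-truth live set
# that is only revealed one node at a time through "online queries".
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ALIVE = {"a", "c"}

def pagerank(graph, damping=0.85, iters=50):
    """Plain power-iteration PageRank over the old snapshot."""
    nodes = list(graph)
    rank = {n: 1 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in graph.items():
            for m in outs:
                new[m] += damping * rank[n] / len(outs)
        rank = new
    return rank

def queries_needed(order, target_fraction=1.0):
    """Liveness queries spent to find the target fraction of live nodes."""
    goal = target_fraction * len(ALIVE)
    found = 0
    for i, node in enumerate(order, start=1):
        if node in ALIVE:  # one online query per visited node
            found += 1
        if found >= goal:
            return i
    return len(order)

pr = pagerank(GRAPH)
order = sorted(GRAPH, key=pr.get, reverse=True)  # static LiveRank
cost = queries_needed(order)
```

Here the two live nodes happen to be the best-ranked ones, so the static LiveRank finds all of them in two queries; the dynamic settings in the paper would additionally reorder the remaining nodes as query results come in.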
Applying digital content management to support localisation
The retrieval and presentation of digital content such as that on the World Wide Web (WWW) is a substantial area of research. While recent years have seen huge expansion in the size of web-based archives that can be searched efficiently by commercial search engines, the presentation of potentially relevant content is still limited to ranked document lists represented by simple text snippets or image keyframe surrogates. There is expanding interest in techniques to personalise the presentation of content to improve the richness and effectiveness of the user experience. One of the most significant challenges to achieving this is the increasingly multilingual nature of this data, and the need to provide suitably localised responses to users based on this content. The Digital Content Management (DCM) track of the Centre for Next Generation Localisation (CNGL) is seeking to develop technologies to support advanced personalised access and presentation of information by combining elements from the existing research areas of Adaptive Hypermedia and Information Retrieval. The combination of these technologies is intended to produce significant improvements in the way users access information. We review key features of these technologies and introduce early ideas for how these technologies can support localisation and localised content before concluding with some impressions of future directions in DCM
Fame for sale: efficient detection of fake Twitter followers
Fake followers are those Twitter accounts specifically created to
inflate the number of followers of a target account. Fake followers are
dangerous for the social platform and beyond, since they may alter concepts
like popularity and influence in the Twittersphere - hence impacting on
economy, politics, and society. In this paper, we contribute along different
dimensions. First, we review some of the most relevant existing features and
rules (proposed by Academia and Media) for anomalous Twitter accounts
detection. Second, we create a baseline dataset of verified human and fake
follower accounts. Such baseline dataset is publicly available to the
scientific community. Then, we exploit the baseline dataset to train a set of
machine-learning classifiers built over the reviewed rules and features. Our
results show that most of the rules proposed by Media provide unsatisfactory
performance in revealing fake followers, while features proposed in the past by
Academia for spam detection provide good results. Building on the most
promising features, we revise the classifiers both in terms of reduction of
overfitting and cost for gathering the data needed to compute the features. The
final result is a novel classifier, general enough to thwart
overfitting, lightweight thanks to the usage of the less costly features, and
still able to correctly classify more than 95% of the accounts of the original
training set. We ultimately perform an information fusion-based sensitivity
analysis, to assess the global sensitivity of each of the features employed by
the classifier. The findings reported in this paper, other than being supported
by a thorough experimental methodology and interesting on their own, also pave
the way for further investigation on the novel issue of fake Twitter followers.
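The rule-and-feature approach reviewed above can be sketched as a simple red-flag scorer. The feature names, thresholds, and the two sample accounts below are hypothetical illustrations of the kinds of profile features discussed, not the paper's actual classifier or data.

```python
def extract_features(account):
    """Derive simple profile features; names and choices are illustrative."""
    return {
        "ff_ratio": account["friends"] / max(account["followers"], 1),
        "has_bio": bool(account["bio"]),
        "tweets": account["tweets"],
    }

def score_fake(account):
    """Count how many simple red-flag rules an account trips."""
    f = extract_features(account)
    flags = 0
    if f["ff_ratio"] > 50:   # follows many accounts, followed by few
        flags += 1
    if not f["has_bio"]:     # empty profile description
        flags += 1
    if f["tweets"] < 3:      # almost no posting activity
        flags += 1
    return flags

def is_fake(account, threshold=2):
    return score_fake(account) >= threshold

human = {"friends": 200, "followers": 180, "bio": "NBA fan", "tweets": 540}
bot = {"friends": 2000, "followers": 10, "bio": "", "tweets": 0}
```

In the paper's terms, each rule corresponds to a cheap-to-compute feature; the trained classifiers replace these hand-set thresholds with ones learned from the labeled baseline dataset.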
A Brief History of Web Crawlers
Web crawlers visit internet applications, collect data, and learn about new
web pages from visited pages. Web crawlers have a long and interesting history.
Early web crawlers collected statistics about the web. In addition to
collecting statistics about the web and indexing the applications for search
engines, modern crawlers can be used to perform accessibility and vulnerability
checks on the application. Quick expansion of the web, and the complexity added
to web applications have made the process of crawling a very challenging one.
Throughout the history of web crawling many researchers and industrial groups
addressed different issues and challenges that web crawlers face. Different
solutions have been proposed to reduce the time and cost of crawling.
Performing an exhaustive crawl is a challenging task. Additionally,
capturing the model of a modern web application and extracting data from it
automatically is another open question. What follows is a brief history of
the different techniques and algorithms used from the early days of crawling up to
the present day. We introduce criteria to evaluate the relative performance of
web crawlers. Based on these criteria, we plot the evolution of web crawlers and
compare their performance.
- …