45,084 research outputs found
Combination of Multiple Bipartite Ranking for Web Content Quality Evaluation
Web content quality estimation is crucial to various web content processing
applications. Our previous work applied Bagging + C4.5 to achive the best
results on the ECML/PKDD Discovery Challenge 2010, which is the comibination of
many point-wise rankinig models. In this paper, we combine multiple pair-wise
bipartite ranking learner to solve the multi-partite ranking problems for the
web quality estimation. In encoding stage, we present the ternary encoding and
the binary coding extending each rank value to (L is the number of the
different ranking value). For the decoding, we discuss the combination of
multiple ranking results from multiple bipartite ranking models with the
predefined weighting and the adaptive weighting. The experiments on ECML/PKDD
2010 Discovery Challenge datasets show that \textit{binary coding} +
\textit{predefined weighting} yields the highest performance in all four
combinations and furthermore it is better than the best results reported in
ECML/PKDD 2010 Discovery Challenge competition.Comment: 17 pages, 8 figures, 2 table
Technique for proficiently yielding Deep-Web Interfaces Using Smart Crawler
Now a days, world web has most famous because of web as well as internet increased development and its effect is that there are more requirements of the techniques that are used to improve the effectiveness of locating the deep-web interface. A technique called as a web crawler that surfs the World Wide Web in automatic manner. This is also called as Web crawling or spidering. In proposed system, initial phase is Smart Crawler works upon site-based scanning for mediatory pages by implementing search engines. It prevents the traffic that colliding with huge amount of pages. Accurate outcomes are taken due to focus upon crawl. Ranking of websites is done on the basis of arrangements on the basis of the priority valuable individuals and quick in-site finding through designing most suitable links with an adaptive link-ranking. There is always trying to search the deep web databases that doesnât connected with any of the web search tools. They are continuous insignificantly distributed as well as they are constantly modifying. This issue is overcome by implementing two crawlers such as generic crawlers and focused crawlers. Generic crawlers aggregate every frame that may be found as well as it not concentrate over a particular subject. Focused crawlers such as Form-Focused Crawler (FFC) and Adaptive Crawler for Hidden-web Entries (ACHE) may continuous to find for online databases on a specific subject. FFC is designed to work with connections, pages as well as from classifiers for focused crawling of web forms and it is extended through adding ACHE with more components for filtering and adaptive link learner. This system implements Naive Bayes classifier instead of SVM for searchable structure classifier (SFC) and a domain-specific form classifier (DSFC). Naive Bayes classifiers in machine learning are a bunch of clear probabilistic classifiers determine by implementing Bayes theorem with solid (gullible) freedom assumptions from the components. In proposed system we contribute a novel module user login for selection of authorized user who may surf the particular domain on the basis of provided data the client and that is also used for filtering the results. In this system additionally implemented the concept of pre-query as well as post-query. Pre-query works only with the form and with the pages that included it and Post-query is utilizes data collected outcomes from form submissions
Enhance Crawler For Efficiently Harvesting Deep Web Interfaces
Scenario in web is varying quickly and size of web resources is rising, efficiency has become a challenging problem for crawling such data. The hidden web content is the data that cannot be indexed by search engines as they always stay behind searchable web interfaces. The proposed system purposes to develop a framework for focused crawler for efficient gathering hidden web interfaces. Firstly Crawler performs site-based searching for getting center pages with the help of web search tools to avoid from visiting additional number of pages. To get more specific results for a focused crawler, projected crawler ranks websites by giving high priority to more related ones for a given search. Crawler accomplishes fast in-site searching via watching for more relevant links with an adaptive link ranking. Here we have incorporated spell checker for giving correct input and apply reverse searching with incremental site prioritizing for wide-ranging coverage of hidden web sites
LiveRank: How to Refresh Old Datasets
This paper considers the problem of refreshing a dataset. More precisely ,
given a collection of nodes gathered at some time (Web pages, users from an
online social network) along with some structure (hyperlinks, social
relationships), we want to identify a significant fraction of the nodes that
still exist at present time. The liveness of an old node can be tested through
an online query at present time. We call LiveRank a ranking of the old pages so
that active nodes are more likely to appear first. The quality of a LiveRank is
measured by the number of queries necessary to identify a given fraction of the
active nodes when using the LiveRank order. We study different scenarios from a
static setting where the Liv-eRank is computed before any query is made, to
dynamic settings where the LiveRank can be updated as queries are processed.
Our results show that building on the PageRank can lead to efficient LiveRanks,
for Web graphs as well as for online social networks
Ranking with Submodular Valuations
We study the problem of ranking with submodular valuations. An instance of
this problem consists of a ground set , and a collection of monotone
submodular set functions , where each .
An additional ingredient of the input is a weight vector . The
objective is to find a linear ordering of the ground set elements that
minimizes the weighted cover time of the functions. The cover time of a
function is the minimal number of elements in the prefix of the linear ordering
that form a set whose corresponding function value is greater than a unit
threshold value.
Our main contribution is an -approximation algorithm
for the problem, where is the smallest non-zero marginal value that
any function may gain from some element. Our algorithm orders the elements
using an adaptive residual updates scheme, which may be of independent
interest. We also prove that the problem is -hard to
approximate, unless P = NP. This implies that the outcome of our algorithm is
optimal up to constant factors.Comment: 16 pages, 3 figure
Integrated content presentation for multilingual and multimedia information access
For multilingual and multimedia information retrieval from
multiple potentially distributed collections generating the
output in the form of standard ranked lists may often mean
that a user has to explore the contents of many lists before
finding sufficient relevant or linguistically accessible material to satisfy their information need. In some situations delivering an integrated multilingual multimedia presentation could enable the user to explore a topic allowing them to select from among a range of available content based on suitably chosen displayed metadata. A presentation of this type has similarities with the outputs of existing adaptive hypermedia systems. However, such systems are generated based on âclosedâ content with sophisticated user and domain models. Extending them to âopenâ domain information retrieval applications would raise many issues. We present an outline exploration of what will form a challenging new direction for research in multilingual information access
BlogForever: D2.5 Weblog Spam Filtering Report and Associated Methodology
This report is written as a first attempt to define the BlogForever spam detection strategy. It comprises a survey of weblog spam technology and approaches to their detection. While the report was written to help identify possible approaches to spam detection as a component within the BlogForver software, the discussion has been extended to include observations related to the historical, social and practical value of spam, and proposals of other ways of dealing with spam within the repository without necessarily removing them. It contains a general overview of spam types, ready-made anti-spam APIs available for weblogs, possible methods that have been suggested for preventing the introduction of spam into a blog, and research related to spam focusing on those that appear in the weblog context, concluding in a proposal for a spam detection workflow that might form the basis for the spam detection component of the BlogForever software
- âŠ