548 research outputs found
An Improved PageRank Method based on Genetic Algorithm for Web Search
The Web search engine has become a very important tool for finding information efficiently in massive Web data. Based on the PageRank algorithm, a genetic PageRank algorithm (GPRA) is proposed. While preserving the advantages of PageRank, GPRA exploits a genetic algorithm to improve Web search. Experimental results show that GPRA outperforms both the PageRank algorithm and the genetic algorithm.
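The abstract builds on standard PageRank. A minimal power-iteration sketch of that baseline is below; the genetic-algorithm layer of GPRA is not shown, and the dict-based graph representation is an illustrative assumption:

```python
def pagerank(links, damping=0.85, iters=50):
    """Basic PageRank power iteration over an adjacency dict {page: [outlinks]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # every page receives the teleport share first
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # dangling node: spread its mass uniformly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank
```

The scores form a probability distribution, so they sum to one; a page with many incoming links from well-linked pages ends up with a higher score.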
A Word Embedding Based Approach for Focused Web Crawling Using the Recurrent Neural Network
Learning-based focused crawlers download relevant uniform resource locators (URLs) from the web for a specific topic. Several studies have used the term frequency-inverse document frequency (TF-IDF) weighted cosine vector as an input feature vector for learning algorithms. TF-IDF-based crawlers calculate the relevance of a web page only if a topic word co-occurs on the said page; failing that, the page is considered irrelevant. Similarity is not detected even when a synonym of a term co-occurs on a web page. To resolve this challenge, this paper proposes a new methodology that integrates Adagrad-optimized Skip-Gram Negative Sampling (A-SGNS)-based word embedding and a Recurrent Neural Network (RNN). The cosine similarity is calculated from the word embedding matrix to form a feature vector that is given as input to the RNN to predict the relevance of the website. The performance of the proposed method is evaluated using the harvest rate (hr) and irrelevance ratio (ir). The proposed methodology outperforms existing methodologies with an average harvest rate of 0.42 and an irrelevance ratio of 0.58.
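The TF-IDF cosine baseline that the paper contrasts with can be sketched as follows; the tokenization and the toy corpus are illustrative assumptions, not the paper's setup:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dict term -> weight) for a list of token lists."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))            # document frequency per term
    n = len(docs)
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

As the abstract points out, this representation yields zero similarity when a page uses only synonyms of the topic terms, which is exactly the gap the embedding-based feature vector is meant to close.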
Enhancing E-learning platforms with social networks mining
Social networks appeared as Internet applications offering tools to create a personal virtual profile, add other users as friends, and interact with them through messages. These networks evolved quickly and gained particular importance in people's lives. Now, every day, people use social networks to share news and interests and to discuss topics that are in some way important to them.
Together with social networks, e-learning platforms and related technologies have evolved in recent years. Both kinds of platform (social networks and e-learning) provide access to specific information and can redirect specific content to an individual person.
This dissertation is motivated by mining social-network data for e-learning platforms. It considers four social networks: Facebook, Twitter, Google Plus, and Delicious. To acquire and analyze the data and apply them correctly and precisely, two different approaches were followed: enhancing a current e-learning platform and improving search engines. The first approach proposes and elaborates a recommendation tool for Web documents that uses social information as its main criterion to support a custom Learning Management System (LMS). To build the proposed system, three distinct applications (the Crawler, the SocialRank, and the Recommender) were developed. The extracted data are then incorporated into an LMS such as the Personal Learning Environment Box (PLEBOX). PLEBOX is a custom platform with an operating-system-like layout that also provides a software development kit (SDK), a group of tools to create and manage modules. Results of the recommendation tool for ten course units are presented.
The second part presents an approach to improving a search engine based on social-network content. Subsequently, an in-depth analysis justifying the abovementioned procedures in the design of the SocialRank is presented. Finally, the results are presented and validated together with a custom search engine. A solution to integrate and improve the ordering of Web content in a search engine was thus proposed, built, demonstrated, and validated, and it is ready for use.
Look back, look around: A systematic analysis of effective predictors for new outlinks in focused Web crawling
Small and medium enterprises rely on detailed Web analytics to be informed
about their market and competition. Focused crawlers meet this demand by
crawling and indexing specific parts of the Web. Critically, a focused crawler
must quickly find new pages that have not yet been indexed. Since a new page
can be discovered only by following a new outlink, predicting new outlinks is
very relevant in practice. In the literature, many feature designs have been
proposed for predicting changes in the Web. In this work we provide a
structured analysis of this problem, using new outlinks as our running
prediction target. Specifically, we unify earlier feature designs in a
taxonomic arrangement of features along two dimensions: static versus dynamic
features, and features of a page versus features of the network around it.
Within this taxonomy, complemented by our new (mainly, dynamic network)
features, we identify the best predictors for new outlinks. Our main conclusion is
that the most informative features are the recent history of new outlinks on a page
itself and on its content-related pages. Hence, we propose a new 'look back,
look around' (LBLA) model that uses only these features. With the obtained
predictions, we design a number of scoring functions to guide a focused crawler
to pages with the most new outlinks, and compare their performance. The LBLA
approach proved extremely effective, outperforming other models, including those
that use the most complete set of features. One of the learners we use is the
recent NGBoost method, which assumes a Poisson distribution for the number of new
outlinks on a page and learns its parameters. This connects two so-far unrelated
avenues in the literature: predictions based on features of a page, and those
based on probabilistic modelling. All experiments were carried out on an original
dataset made available by a commercial focused crawler.
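The Poisson assumption mentioned above lends itself to a simple crawl-scoring rule: if a model predicts a mean rate of new outlinks per page, the probability of finding at least one new outlink orders the frontier. The sketch below is an illustration of that idea, not the paper's scoring functions, and the rate values used are made up:

```python
import math

def prob_at_least_one(mu):
    """P(N >= 1) for N ~ Poisson(mu), i.e. 1 - e^{-mu}."""
    return 1.0 - math.exp(-mu)

def crawl_order(predicted_rates):
    """Order pages by probability of containing at least one new outlink.
    `predicted_rates` maps page -> predicted Poisson mean of new outlinks."""
    return sorted(predicted_rates,
                  key=lambda p: prob_at_least_one(predicted_rates[p]),
                  reverse=True)
```

Because 1 - e^{-mu} is monotone in mu, ranking by this probability coincides with ranking by the predicted mean; the probability form becomes useful when scores are combined with other signals.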
Computing Approximate Customized Ranking
As the amount of information grows and as users become more
sophisticated, ranking techniques become important building blocks
to meet user needs when answering queries. PageRank is one of the
most successful link-based ranking methods, which iteratively
computes the importance scores for web pages based on the importance scores of incoming pages. Due to its success, PageRank has been applied in a number of applications that require customization.
We address the scalability challenges for two types of customized
ranking. The first challenge is to compute the ranking of a
subgraph. Various Web applications focus on identifying a
subgraph, such as focused crawlers and localized search engines.
The second challenge is to compute online personalized ranking.
Personalized search improves the quality of search results for each
user. The user needs are represented by a personalized set of pages
or personalized link importance in an entity relationship graph.
This requires an efficient online computation.
To solve the subgraph ranking problem efficiently, we estimate the
ranking scores for a subgraph. We propose a framework of an exact
solution (IdealRank) and an approximate solution (ApproxRank) for
computing ranking on a subgraph. Both IdealRank and ApproxRank represent the set of external pages with a single external node and modify the PageRank-style transition matrix accordingly. The IdealRank algorithm assumes that the scores of external pages are known. We prove that the IdealRank scores for pages in the subgraph converge to the true PageRank scores. Since the PageRank-style scores of external pages are typically not available, we propose the ApproxRank algorithm to estimate scores for the subgraph. We analyze the distance between the IdealRank and ApproxRank scores of the subgraph and show that it is within a constant factor of the distance of the external pages. We demonstrate with real and synthetic data that ApproxRank provides a good approximation to PageRank for a variety of subgraphs.
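The external-node construction can be sketched as a graph reduction: every page outside the subgraph is merged into one supernode, and links crossing the boundary are rerouted through it. The node name "EXT" and the decision to drop external-to-external links are assumptions of this sketch, not the paper's exact transition-matrix modification:

```python
def reduced_graph(links, subgraph):
    """Collapse all pages outside `subgraph` into a single external node "EXT".

    `links` is an adjacency dict {page: [outlinks]}. Links between two external
    pages are dropped; links crossing the subgraph boundary are rerouted
    through "EXT". The result can be fed to any PageRank-style iteration.
    """
    sub = set(subgraph)
    reduced = {p: [] for p in sub}
    reduced["EXT"] = []
    for p, outs in links.items():
        src = p if p in sub else "EXT"
        for q in outs:
            dst = q if q in sub else "EXT"
            if not (src == "EXT" and dst == "EXT"):
                reduced[src].append(dst)
    return reduced
```

Running PageRank on the reduced graph then estimates subgraph scores without touching the full Web graph, which is the scalability point of ApproxRank.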
We consider online personalization using ObjectRank, an authority-flow-based ranking for entity-relationship graphs. We formalize the concept of an aggregate surfer on a data graph; the surfer's behavior is controlled by multiple personalized rankings. We prove a linearity theorem over these rankings that can be used as a tool to scale this type of personalization. DataApprox uses a repository of precomputed rankings for a given set of link-weight assignments. We define DataApprox as an optimization problem: it selects a subset of the precomputed rankings from the repository and produces a weighted combination of them. We analyze the distance between the DataApprox scores and the true authority-flow ranking scores and show that DataApprox has a theoretical error bound. Our experiments on the DBLP data graph show that DataApprox performs well in practice and enables fast and accurate personalized authority-flow ranking.
Exploring the Relevance of Search Engines: An Overview of Google as a Case Study
The huge amount of data on the Internet, and the diverse strategies used to link this information to relevant searches through Linked Data, have generated a revolution in data treatment and representation. Nevertheless, conventional search engines such as Google remain well-received strategies for performing searches. This article presents a study of the development and evolution of search engines and, more specifically, analyzes the relevance of findings based on the number of results displayed in paging systems, with Google as a case study. Finally, it aims to contribute to indexing criteria in search results, based on an approach to the Semantic Web as a stage in the evolution of the Web.
Replicating web structure in small-scale test collections
Linkage analysis as an aid to Web search has been assumed to be of significant benefit, and we know that it is implemented by many major search engines. Why, then, have few TREC participants been able to scientifically demonstrate the benefits of linkage analysis in recent years? In this paper we put forward reasons why many disappointing results have been found in TREC experiments, and we identify the linkage-density requirements a dataset must meet to faithfully support experiments in linkage-based retrieval, by examining the linkage structure of the WWW. Based on these requirements, we report on methodologies for synthesising such a test collection.