
    An Improved PageRank Method based on Genetic Algorithm for Web Search

    Web search engines have become a very important tool for finding information efficiently in the massive data of the Web. Based on the PageRank algorithm, a genetic PageRank algorithm (GPRA) is proposed. While preserving the advantages of the PageRank algorithm, GPRA takes advantage of the genetic algorithm to improve web search. Experimental results show that GPRA outperforms both the PageRank algorithm and the genetic algorithm.
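    The abstract does not give GPRA's details, but the underlying PageRank iteration it extends can be sketched as a power iteration. The tiny graph, damping factor, and tolerance below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def pagerank(adj, d=0.85, tol=1e-9, max_iter=100):
    """Basic PageRank power iteration on a dense adjacency matrix."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    # Row-normalize outlinks, then transpose to a column-stochastic matrix;
    # dangling pages (no outlinks) jump uniformly to all pages.
    M = np.where(out_deg[:, None] > 0,
                 adj / np.maximum(out_deg, 1)[:, None],
                 1.0 / n).T
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = (1 - d) / n + d * M @ r
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r

# Tiny 3-page cycle: 0 -> 1, 1 -> 2, 2 -> 0
A = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], float)
print(pagerank(A))  # uniform scores, since the cycle is symmetric
```

GPRA would layer a genetic search over this base computation; the abstract does not specify the encoding or fitness function, so only the shared PageRank core is shown.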

    A Word Embedding Based Approach for Focused Web Crawling Using the Recurrent Neural Network

    Learning-based focused crawlers download relevant uniform resource locators (URLs) from the web for a specific topic. Several studies have used the term frequency-inverse document frequency (TF-IDF) weighted cosine vector as an input feature vector for learning algorithms. TF-IDF-based crawlers calculate the relevance of a web page only if a topic word co-occurs on the said page, failing which it is considered irrelevant. Similarity is not considered even if a synonym of a term co-occurs on a web page. To resolve this challenge, this paper proposes a new methodology that integrates the Adagrad-optimized Skip Gram Negative Sampling (A-SGNS)-based word embedding and the Recurrent Neural Network (RNN). The cosine similarity is calculated from the word-embedding matrix to form a feature vector that is given as input to the RNN to predict the relevance of the website. The performance of the proposed method is evaluated using the harvest rate (hr) and irrelevance ratio (ir). The proposed methodology outperforms existing methodologies with an average harvest rate of 0.42 and irrelevance ratio of 0.58.
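    The feature construction described above can be sketched as averaging word embeddings into a page vector and scoring it by cosine similarity against a topic vector, so that synonym-like terms still score high even without exact co-occurrence. The three-dimensional embedding table below is a made-up stand-in for vectors a trained A-SGNS model would produce:

```python
import numpy as np

# Hypothetical tiny embedding table (in the paper these come from A-SGNS training).
emb = {
    "sport":    np.array([0.9, 0.1, 0.0]),
    "football": np.array([0.8, 0.2, 0.1]),
    "finance":  np.array([0.0, 0.9, 0.4]),
}

def page_vector(tokens):
    """Average the embeddings of known tokens into one page vector."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0

topic = emb["sport"]
print(cosine(page_vector(["football"]), topic))  # high: related term, no exact match needed
print(cosine(page_vector(["finance"]), topic))   # low: unrelated term
```

In the full system this similarity becomes one input feature to the RNN relevance classifier rather than the final score.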

    Enhancing E-learning platforms with social networks mining

    Social networks appeared as an Internet application offering tools to create a personal virtual profile, add other users as friends, and interact with them through messages. These networks evolved quickly and gained particular importance in people's lives. Now, every day, people use social networks to share news and interests and to discuss topics that are in some way important to them. Together with social networks, e-learning platforms and related technologies have evolved in recent years. Both kinds of platform (social networks and e-learning) enable access to specific information and can redirect specific content to an individual person. This dissertation is motivated by mining social network data for e-learning platforms. It considers four social networks: Facebook, Twitter, Google Plus, and Delicious. In order to acquire, analyse, and apply the data correctly and precisely, two different approaches were followed: enhancing a current e-learning platform and improving search engines. The first approach proposes and elaborates a recommendation tool for Web documents that uses social information as its main criterion to support a custom Learning Management System (LMS). To create the proposed system, three distinct applications (the Crawler, the SocialRank, and the Recommender) were built. The extracted data is then incorporated into an LMS, the Personal Learning Environment Box (PLEBOX). PLEBOX is a custom platform based on an operating-system-style layout that also provides a software development kit (SDK), a group of tools to create and manage modules. Results of the recommendation tool for ten course units are presented. The second part presents an approach to improving a search engine based on social network content. Subsequently, an in-depth analysis justifying the abovementioned procedures used to create the SocialRank is presented.
Finally, the results are presented and validated together with a custom search engine. A solution to integrate and improve the ordering of Web content in a search engine was thus proposed, created, demonstrated, and validated, and it is ready for use.
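    The abstract does not spell out the SocialRank formula, so the following is only a hypothetical sketch of ranking Web documents by a weighted combination of social-network signals; the signal names, weights, and URLs are all assumptions for illustration:

```python
# Hypothetical weights for combining per-network signals into one score.
WEIGHTS = {"facebook_shares": 0.4, "twitter_mentions": 0.3, "delicious_bookmarks": 0.3}

def social_score(signals):
    """Weighted sum of whatever social signals were collected for a document."""
    return sum(WEIGHTS[k] * signals.get(k, 0) for k in WEIGHTS)

docs = {
    "http://example.org/a": {"facebook_shares": 10, "twitter_mentions": 5},
    "http://example.org/b": {"delicious_bookmarks": 2},
}
ranked = sorted(docs, key=lambda u: social_score(docs[u]), reverse=True)
print(ranked[0])  # the document with the strongest social signals comes first
```

A recommender like the one described would feed such a score, alongside course context, into the LMS module that surfaces documents to learners.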

    Look back, look around: A systematic analysis of effective predictors for new outlinks in focused Web crawling

    Small and medium enterprises rely on detailed Web analytics to stay informed about their market and competition. Focused crawlers meet this demand by crawling and indexing specific parts of the Web. Critically, a focused crawler must quickly find new pages that have not yet been indexed. Since a new page can be discovered only by following a new outlink, predicting new outlinks is highly relevant in practice. In the literature, many feature designs have been proposed for predicting changes on the Web. In this work we provide a structured analysis of this problem, using new outlinks as our running prediction target. Specifically, we unify earlier feature designs in a taxonomic arrangement of features along two dimensions: static versus dynamic features, and features of a page versus features of the network around it. Within this taxonomy, complemented by our new (mainly dynamic network) features, we identify the best predictors for new outlinks. Our main conclusion is that the most informative features are the recent history of new outlinks on a page itself and on its content-related pages. Hence, we propose a new 'look back, look around' (LBLA) model that uses only these features. With the obtained predictions, we design a number of scoring functions to guide a focused crawler to the pages with the most new outlinks, and compare their performance. The LBLA approach proved extremely effective, outperforming other models, including those that use the most complete set of features. One of the learners we use is the recent NGBoost method, which assumes a Poisson distribution for the number of new outlinks on a page and learns its parameters. This connects two so-far-unrelated avenues in the literature: predictions based on features of a page, and those based on probabilistic modelling.
All experiments were carried out on an original dataset, made available by a commercial focused crawler.
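The 'look back, look around' idea can be sketched as estimating a Poisson rate for new outlinks from a page's own recent history and that of its content-related pages. The paper learns this rate with NGBoost; the plain averaging and the 0.7/0.3 blend below are illustrative simplifications, not the paper's learned model:

```python
from statistics import mean

def predict_new_outlinks(page_history, neighbour_histories):
    """Estimate a Poisson rate for new outlinks on a page.

    page_history: recent counts of new outlinks on the page itself ('look back').
    neighbour_histories: the same counts for content-related pages ('look around').
    """
    look_back = mean(page_history) if page_history else 0.0
    around = [mean(h) for h in neighbour_histories if h]
    look_around = mean(around) if around else 0.0
    # Blend the two signals; the 0.7/0.3 weights are an assumption, not learned.
    return 0.7 * look_back + 0.3 * look_around

lam = predict_new_outlinks([3, 1, 2], [[0, 1], [2, 2]])
print(lam)  # estimated Poisson rate of new outlinks for the next crawl interval
```

A crawl scheduler would then prioritise pages by this rate, visiting pages with the highest expected number of undiscovered outlinks first.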

    COMPUTING APPROXIMATE CUSTOMIZED RANKING

    As the amount of information grows and as users become more sophisticated, ranking techniques become important building blocks to meet user needs when answering queries. PageRank is one of the most successful link-based ranking methods, which iteratively computes the importance scores for web pages based on the importance scores of incoming pages. Due to its success, PageRank has been applied in a number of applications that require customization. We address the scalability challenges for two types of customized ranking. The first challenge is to compute the ranking of a subgraph. Various Web applications focus on identifying a subgraph, such as focused crawlers and localized search engines. The second challenge is to compute online personalized ranking. Personalized search improves the quality of search results for each user. The user needs are represented by a personalized set of pages or personalized link importance in an entity relationship graph. This requires an efficient online computation. To solve the subgraph ranking problem efficiently, we estimate the ranking scores for a subgraph. We propose a framework of an exact solution (IdealRank) and an approximate solution (ApproxRank) for computing ranking on a subgraph. Both IdealRank and ApproxRank represent the set of external pages with an external node Λ and modify the PageRank-style transition matrix with respect to Λ. The IdealRank algorithm assumes that the scores of external pages are known. We prove that the IdealRank scores for pages in the subgraph converge to the true PageRank scores. Since the PageRank-style scores of external pages may not typically be available, we propose the ApproxRank algorithm to estimate scores for the subgraph. We analyze the L1 distance between IdealRank scores and ApproxRank scores of the subgraph and show that it is within a constant factor of the L1 distance of the external pages.
We demonstrate with real and synthetic data that ApproxRank provides a good approximation to PageRank for a variety of subgraphs. We consider online personalization using ObjectRank, an authority-flow-based ranking for entity relationship graphs. We formalize the concept of an aggregate surfer on a data graph, whose behavior is controlled by multiple personalized rankings. We prove a linearity theorem over these rankings which can be used as a tool to scale this type of personalization. DataApprox uses a repository of precomputed rankings for a given set of link-weight assignments. We define DataApprox as an optimization problem: it selects a subset of the precomputed rankings from the repository and produces a weighted combination of them. We analyze the L1 distance between the DataApprox scores and the real authority-flow ranking scores and show that DataApprox has a theoretical bound. Our experiments on the DBLP data graph show that DataApprox performs well in practice and allows fast and accurate personalized authority-flow ranking.
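The Λ construction can be sketched as augmenting the subgraph's transition matrix with one aggregate external node. The in/out probability vectors below stand in for the estimates ApproxRank would compute; their values, and the fixed iteration count, are illustrative assumptions:

```python
import numpy as np

def approx_subgraph_rank(sub_adj, p_ext_in, p_ext_out, d=0.85, iters=200):
    """Rank a k-page subgraph augmented with one aggregate external node Λ.

    sub_adj: k x k adjacency matrix of links inside the subgraph.
    p_ext_in[i]: assumed share of Λ's outgoing mass that flows into page i
                 (must sum to 1).
    p_ext_out[i]: probability that a surfer on page i follows a link
                  leaving the subgraph (i.e. moves to Λ).
    """
    k = sub_adj.shape[0]
    n = k + 1  # index k is the aggregate node Λ
    M = np.zeros((n, n))
    out_deg = sub_adj.sum(axis=1)
    for i in range(k):
        stay = 1.0 - p_ext_out[i]
        if out_deg[i] > 0:
            M[:k, i] = stay * sub_adj[i] / out_deg[i]  # internal links
        else:
            M[:k, i] = stay / k                        # dangling: spread inside
        M[k, i] = p_ext_out[i]                         # mass leaving to Λ
    M[:k, k] = p_ext_in                                # Λ redistributes inward
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * M @ r
    return r[:k], r[k]

# Two mutually linked pages; 20% of each page's clicks leave the subgraph.
A = np.array([[0, 1], [1, 0]], float)
scores, ext = approx_subgraph_rank(A, np.array([0.5, 0.5]), np.array([0.2, 0.2]))
print(scores, ext)
```

Because every column of the augmented matrix still sums to one, the scores remain a proper distribution over the subgraph plus Λ, which is what makes the distance analysis in the abstract possible.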

    The structure of broad topics on the web


    Exploring the Relevance of Search Engines: An Overview of Google as a Case Study

    The huge amount of data on the Internet, and the diverse list of strategies used to link this information with relevant searches through Linked Data, have generated a revolution in data treatment and its representation. Nevertheless, conventional search engines like Google remain well-received tools for carrying out searches. This article presents a study of the development and evolution of search engines and, more specifically, analyzes the relevance of findings based on the number of results displayed in paging systems, with Google as a case study. Finally, it is intended to contribute to indexing criteria for search results, based on an approach to the Semantic Web as a stage in the evolution of the Web.

    Replicating web structure in small-scale test collections

    Linkage analysis as an aid to web search has been assumed to be of significant benefit, and we know that it is implemented by many major search engines. Why, then, have few TREC participants been able to scientifically demonstrate the benefits of linkage analysis in recent years? In this paper we put forward reasons why many disappointing results have been found in TREC experiments, and we identify the linkage density requirements a dataset must meet to faithfully support experiments into linkage-based retrieval, by examining the linkage structure of the WWW. Based on these requirements, we report on methodologies for synthesising such a test collection.
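    A minimal sketch of checking a candidate collection's linkage density, assuming links are given as (source site, target site) pairs; the on-site/off-site split is one plausible density measure such an analysis might use, not the paper's exact methodology:

```python
def linkage_density(links, num_docs):
    """Summarize link density of a test collection.

    links: iterable of (source_site, target_site) pairs between collection docs.
    num_docs: number of documents in the collection.
    """
    links = list(links)
    # Off-site links (crossing site boundaries) are the ones linkage-based
    # retrieval experiments typically care about.
    offsite = sum(1 for src, tgt in links if src != tgt)
    return {
        "links_per_doc": len(links) / num_docs,
        "offsite_fraction": offsite / len(links) if links else 0.0,
    }

stats = linkage_density(
    [("a.com", "b.com"), ("a.com", "a.com"), ("b.com", "a.com")], 2)
print(stats)  # 1.5 links per doc, roughly two thirds off-site
```

Comparing such statistics against those measured on the WWW indicates whether a synthesized collection is dense enough to support linkage-based retrieval experiments.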