16 research outputs found

    Arquivo e Medição da Web Portuguesa

    Nacional. The project aims to preserve information published on the Web for future generations, as is already done for national printed publications. Providing efficient services for searching and analyzing the archived information is essential if the Archive is to become a tool used by all citizens. The first crawl of the Portuguese Web took place in February 2008, and quantitative measurements were taken. According to the results, the Portuguese Web comprises at least 56 million contents, corresponding to 2.8 TB of information.

    An Updated Portrait of the Portuguese Web

    This study presents an updated characterization of the Portuguese Web derived from a crawl of 48 million contents belonging to all media types (2.5 TB of data), performed in March 2008. The resulting data was analyzed to characterize contents, sites and domains. This study was performed within the scope of the Portuguese Web Archive. POSC/EU, UMI

    Characterizing the Portuguese blogosphere

    Integrated master's thesis. Informatics and Computing Engineering. Faculdade de Engenharia, Universidade do Porto. 200

    Modelos y algoritmos de búsqueda + redes sociales para aplicaciones verticales de recuperación de información

    The web is not only a huge repository of information of every kind; it is also a platform that supports global services of a diverse nature. The exponential growth of content and users (for example, in social networks), together with the constant emergence of new applications, goes far beyond the view of the web as a mere content repository. In all cases, the common denominator is the need to perform "searches" of different types and with equally diverse goals. Social networks are currently among the most popular applications, and they have even changed the way users connect, relate, interact and exchange information. Implicitly, they generate social structures with emergent properties that arise from global behavior and that, it is believed, can help improve search processes. This document presents a new research project that proposes to address some of the problems related to searching on the Internet. To that end, techniques for information retrieval and search engine construction will be integrated with information from social networks to make searching more efficient, covering multiple scenarios such as specific portions of the web, scientific and/or geographic information, and search on mobile devices, among others. Track: Databases and Data Mining. Red de Universidades con Carreras en Informática (RedUNCI)

    Caracterización del espacio Web de Perú

    The WWW is a public space used by different users with diverse objectives. Originally, it was a distributed repository for sharing information and, although this goal has not been forgotten, nowadays it is a means of publication and service for several kinds of uses such as commerce, advertising, education, entertainment and social contact, among others. While the web keeps growing ceaselessly, the study of its characteristics and trends yields valuable information, both for understanding its structure and for developing tools that make its resources easier to use. This paper aims mainly to characterize the web space of Peru in the context of these new trends and their evolution. It presents features of the sites, URLs, technologies used and descriptive elements.

    On URL and content persistence

    This report presents a study of URL and content persistence among 51 million pages from a national web harvested 8 times over almost 3 years. This study differs from previous ones because it describes the evolution of a large set of web pages over several years, studying in depth the characteristics of persistent data. We found that the persistence of URLs and contents follows a logarithmic distribution. We characterized persistent URLs and contents, and identified reasons for URL death. We found that lasting contents tend to be referenced by different URLs during their lifetime. On the other hand, half of the contents referenced by persistent URLs did not change

    The Viuva Negra crawler

    This report discusses architectural aspects of web crawlers and details the design, implementation and evaluation of the Viuva Negra (VN) crawler. VN has been used for 4 years, feeding a search engine and an archive of the Portuguese web. In our experiments it crawled over 2 million documents per day, corresponding to 63 GB of data. We describe situations found on the web that are hazardous to crawling and the solutions adopted to mitigate their effects. The gathered information was integrated in a web warehouse that provides support for its automatic processing by text mining applications.