    The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives

    The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs to protect user privacy and vital business data. Of the few query logs publicly available, none combines size, scope, and diversity. The AQL is the first to do so, enabling research on new retrieval models and (diachronic) search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry. Comment: SIGIR 2023 resource paper, 13 pages
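
    A corpus of this shape invites simple diachronic analyses. The sketch below is a minimal example that assumes a hypothetical JSON-lines export with query, provider, and timestamp fields (the actual AQL schema may differ); it counts queries per search provider and per year.

    ```python
    import json
    from collections import Counter
    from datetime import datetime

    def provider_and_year_counts(path):
        """Count queries per search provider and per year in a JSON-lines
        query-log export (hypothetical schema: query, provider, timestamp)."""
        by_provider, by_year = Counter(), Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                by_provider[record["provider"]] += 1
                by_year[datetime.fromisoformat(record["timestamp"]).year] += 1
        return by_provider, by_year

    # e.g. providers, years = provider_and_year_counts("aql_sample.jsonl")
    ```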

    Inter-relation of the Term Extraction and Query Expansion techniques applied to the retrieval of textual documents

    Doctoral thesis - Universidade Federal de Santa Catarina, Centro Tecnológico, Graduate Program in Engineering and Knowledge Management. According to Singhal (2006), people recognise the importance of storing and searching for information and, with the advent of computers, it became possible to store large amounts of it in databases. Consequently, cataloguing the information in these databases became indispensable. In this context, the field of Information Retrieval emerged in the 1950s with the purpose of building computational tools that allow users to make more efficient use of these databases. The main objective of this research is to develop a computational model that retrieves textual documents ranked by semantic similarity, based on the intersection of the Term Extraction and Query Expansion techniques.
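
    As an illustration of how the two techniques can be coupled, the sketch below extracts the highest-weighted TF-IDF terms from the top-ranked documents, expands the query with them, and re-ranks by cosine similarity. It is a generic pseudo-relevance-feedback simplification under assumed defaults, not the model proposed in the thesis.

    ```python
    # Illustrative coupling of term extraction and query expansion for ranking.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def expanded_search(query, documents, feedback_docs=2, expansion_terms=3):
        vectorizer = TfidfVectorizer(stop_words="english")
        doc_matrix = vectorizer.fit_transform(documents)

        # Initial ranking by cosine similarity to the original query.
        scores = cosine_similarity(vectorizer.transform([query]), doc_matrix).ravel()
        top = scores.argsort()[::-1][:feedback_docs]

        # Term extraction: highest-weighted TF-IDF terms in the top documents.
        centroid = np.asarray(doc_matrix[top].mean(axis=0)).ravel()
        terms = np.array(vectorizer.get_feature_names_out())
        expansion = terms[centroid.argsort()[::-1][:expansion_terms]]

        # Query expansion: append the extracted terms and re-rank all documents.
        expanded = query + " " + " ".join(expansion)
        new_scores = cosine_similarity(vectorizer.transform([expanded]), doc_matrix).ravel()
        return expanded, new_scores.argsort()[::-1]
    ```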

    How important is computing technology for library and information science research?

    © 2015 Elsevier Inc. Computers in library and information science (LIS) research have been an object of study or a tool for research for at least fifty years, but how central are computers to the discipline now? This research analyses the titles, abstracts, and keywords of forty years of articles in LIS-classified journals for trends related to computing technologies. The proportion of Scopus LIS articles mentioning some aspect of computing in their title, abstract, or keywords increased steadily from 1986 to 2000, then stabilised at about two thirds, indicating a continuing dominance of computers in most LIS research. Within this general trend, many computer-related terms have peaked and then declined in popularity. For example, the proportion of Scopus LIS article titles, abstracts, or keywords that included the terms "computer" or "computing" decreased fairly steadily from about 20% in 1975 to 5% in 2013, and the proportion explicitly mentioning the web peaked at 18% in 2002. Parallel analyses suggest that computing is substantially less important in two related disciplines, education and communication, and so it should be seen as a key aspect of the LIS identity. Published version
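
    The core measurement behind such trend figures is the yearly share of articles whose title, abstract, or keywords mention a given set of terms. A minimal sketch, assuming hypothetical record dictionaries with year, title, abstract, and keywords fields:

    ```python
    import re
    from collections import defaultdict

    def term_share_by_year(records, terms):
        """Proportion of records per year whose title, abstract, or keywords
        mention any of the given terms (hypothetical record schema)."""
        pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE)
        totals, hits = defaultdict(int), defaultdict(int)
        for rec in records:
            text = " ".join([rec["title"], rec["abstract"], " ".join(rec["keywords"])])
            totals[rec["year"]] += 1
            if pattern.search(text):
                hits[rec["year"]] += 1
        return {year: hits[year] / totals[year] for year in sorted(totals)}

    # e.g. term_share_by_year(scopus_lis_records, ["computer", "computing"])
    ```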

    The Latent Relation Mapping Engine: Algorithm and Experiments

    Many AI researchers and cognitive scientists have argued that analogy is the core of cognition. The most influential work on computational modeling of analogy-making is Structure Mapping Theory (SMT) and its implementation in the Structure Mapping Engine (SME). A limitation of SME is the requirement for complex hand-coded representations. We introduce the Latent Relation Mapping Engine (LRME), which combines ideas from SME and Latent Relational Analysis (LRA) in order to remove the requirement for hand-coded representations. LRME builds analogical mappings between lists of words, using a large corpus of raw text to automatically discover the semantic relations among the words. We evaluate LRME on a set of twenty analogical mapping problems, ten based on scientific analogies and ten based on common metaphors. LRME achieves human-level performance on the twenty problems. We compare LRME with a variety of alternative approaches and find that they are not able to reach the same level of performance. Comment: related work available at http://purl.org/peter.turney
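
    The mapping step can be pictured as a search over one-to-one correspondences between the two word lists that maximises total relational similarity. The sketch below assumes a pre-computed relational-similarity function over word pairs (in LRME this would come from corpus statistics) and brute-forces the best assignment; it is an illustrative simplification, not the published implementation.

    ```python
    # Brute-force search for the word mapping that maximises summed relational
    # similarity; rel_sim((a, b), (x, y)) is assumed to score how similar the
    # relation a:b is to the relation x:y (e.g. derived from corpus statistics).
    from itertools import permutations

    def best_mapping(source, target, rel_sim):
        best_score, best_map = float("-inf"), None
        for perm in permutations(range(len(target)), len(source)):
            score = sum(
                rel_sim((source[i], source[j]), (target[perm[i]], target[perm[j]]))
                for i in range(len(source))
                for j in range(len(source))
                if i != j
            )
            if score > best_score:
                best_score = score
                best_map = {source[i]: target[perm[i]] for i in range(len(source))}
        return best_map, best_score
    ```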

    Eighth Biennial Report: April 2005 – March 2007


    Novelty detection in video retrieval: finding new news in TV news stories

    Novelty detection is defined as the detection of documents that provide "new" or previously unseen information. "New information" in a search result list is the incremental information found in a document, given what the user has already learned from reviewing previous documents in the ranked list. It is assumed that, as a user views a list of documents, their information need changes or evolves and their state of knowledge increases as they gain new information from the documents they see. The automatic detection of "novelty", or newness, as part of an information retrieval system could greatly improve a searcher's experience by presenting documents in order of how much extra information they add to what is already known, instead of how similar they are to the user's query. This could be particularly useful in applications such as broadcast news search and automatic summary generation. Of the many different aspects of information management, this thesis presents research into novelty detection within the content-based video domain. It explores the benefits of integrating the many multimodal resources associated with video content, namely low-level feature detection evidence such as colour and edge, automatic concept detections such as face, commercials, and anchorperson, automatic speech recognition transcripts, and manually annotated MPEG-7 concepts, into a novelty detection model. The effectiveness of this novelty detection model is evaluated on a collection of TV news data.
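
    The underlying scoring idea, ranking each document by how little it resembles the documents the user has already seen, can be sketched as follows. Plain TF-IDF text similarity stands in here for the multimodal evidence combined in the thesis.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def novelty_scores(ranked_docs):
        """Score each document in a ranked list by one minus its maximum
        similarity to any earlier (already seen) document in the list."""
        vectors = TfidfVectorizer(stop_words="english").fit_transform(ranked_docs)
        scores = [1.0]  # the first document is entirely new to the user
        for i in range(1, len(ranked_docs)):
            seen_sim = cosine_similarity(vectors[i], vectors[:i]).max()
            scores.append(1.0 - float(seen_sim))
        return scores
    ```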

    Information search in web archives

    Doctoral thesis, Informatics (Informatics Engineering), Universidade de Lisboa, Faculdade de Ciências, 2014. Web archives preserve information that was published on the web or digitized from printed publications. Much of that information is unique and historically valuable. However, users do not have dedicated tools to find the desired information, which hampers the usefulness of web archives. This dissertation investigates solutions towards the advance of web archive information retrieval (WAIR) and contributes to the increase of knowledge about its technology and users. The thesis underlying this work is that search results can be improved by exploiting temporal information intrinsic to web archives. This temporal information was leveraged from two different angles. First, the long-term persistence of web documents was analyzed and modeled to better estimate their relevance to a query. Second, a temporal-dependent ranking framework that learns and combines ranking models specific to each period was devised. This approach contrasts with the typical single-model approach, which ignores the variance of web characteristics over time. The proposed approach was empirically validated through various controlled experiments that demonstrated its superiority over the state-of-the-art in WAIR.
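
    The temporal-dependent ranking idea, training one ranking model per time period and scoring each document with the model for its period, can be sketched as follows. The period definition, features, and learner are placeholders, not the dissertation's actual configuration.

    ```python
    # Sketch of a temporal-dependent ranking framework: one model per period,
    # each (query, document) pair scored by the model of the document's period.
    from sklearn.linear_model import LinearRegression

    class TemporalRanker:
        def __init__(self, period_of):
            self.period_of = period_of  # maps a document timestamp (or year) to a period label
            self.models = {}

        def fit(self, features, relevance, timestamps):
            for period in set(map(self.period_of, timestamps)):
                idx = [i for i, t in enumerate(timestamps) if self.period_of(t) == period]
                model = LinearRegression()
                model.fit([features[i] for i in idx], [relevance[i] for i in idx])
                self.models[period] = model

        def score(self, feature_vector, timestamp):
            return float(self.models[self.period_of(timestamp)].predict([feature_vector])[0])

    # e.g. ranker = TemporalRanker(period_of=lambda year: year // 5)  # five-year periods
    ```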