
    Collecte orientée sur le Web pour la recherche d'information spécialisée [Focused Web Gathering for Specialized Information Retrieval]

    Vertical search engines, which focus on a specific segment of the Web, are becoming more and more present in the Internet landscape. Topical search engines, notably, can obtain a significant performance boost by limiting their index to a specific topic. By doing so, language ambiguities are reduced, and both the algorithms and the user interface can take advantage of domain knowledge, such as domain objects or characteristics, to satisfy user information needs. In this thesis, we tackle the first, inevitable step of any topical search engine: focused document gathering from the Web. A thorough study of the state of the art leads us to consider two strategies for gathering topical documents from the Web: either relying on an existing search engine index (focused search) or directly crawling the Web (focused crawling). The first part of our research is dedicated to focused search. In this context, a standard approach consists in combining domain-specific terms into queries, submitting those queries to a search engine and downloading the top-ranked documents. After empirically evaluating this approach over 340 topics drawn from the Open Directory, we propose to enhance it in two ways. Upstream of the search engine, we aim at formulating queries that are more relevant to the topic in order to increase the precision of the top retrieved documents. To do so, we define a metric based on a co-occurrence graph and a random walk algorithm that aims at predicting the topical relevance of a query. Downstream of the search engine, we filter the retrieved documents in order to improve the quality of the resulting collection. We do so by modeling our gathering process as a tripartite graph and applying a random walk with restart algorithm so as to simultaneously order by relevance the documents and the terms appearing in them. In the second part of this thesis, we turn to focused crawling. We describe our focused crawler implementation, which was designed to scale horizontally, and then consider the problem of crawl frontier ordering, which lies at the very heart of a focused crawler. This ordering strategy allows the crawler to prioritize its fetches, maximizing the number of in-domain documents retrieved while minimizing the number of non-relevant ones. We propose to apply learning-to-rank algorithms to order the crawl frontier efficiently, and define a method to learn a topic-independent ranking function from existing, automatically annotated crawl data.
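
    A minimal sketch of the random-walk-with-restart idea mentioned above, applied to a toy query-document-term graph (the graph layout, node roles, restart vector, and damping value are illustrative assumptions, not details taken from the thesis):

```python
# Hedged sketch: random walk with restart (personalized-PageRank style) over an
# adjacency matrix, as one might use to jointly rank documents and terms in a
# query-document-term graph. All concrete values below are illustrative.
import numpy as np

def random_walk_with_restart(A, restart, alpha=0.15, tol=1e-8, max_iter=200):
    """Follow edges of A with probability (1 - alpha); jump back to `restart` otherwise."""
    col_sums = A.sum(axis=0)
    col_sums[col_sums == 0] = 1.0             # guard against dangling nodes
    P = A / col_sums                           # column-stochastic transition matrix
    scores = restart.copy()
    for _ in range(max_iter):
        new_scores = (1 - alpha) * (P @ scores) + alpha * restart
        if np.abs(new_scores - scores).sum() < tol:
            break
        scores = new_scores
    return scores

# Toy tripartite graph: nodes 0-1 are queries, 2-4 documents, 5-7 terms.
A = np.zeros((8, 8))
A[2, 0] = A[3, 0] = 1.0                        # query 0 retrieved documents 0 and 1
A[3, 1] = A[4, 1] = 1.0                        # query 1 retrieved documents 1 and 2
A[5, 2] = A[6, 3] = A[7, 4] = 1.0              # each document contains one term
A = A + A.T                                    # walk both directions along every edge
restart = np.zeros(8)
restart[:2] = 0.5                              # restart mass on the two seed queries
print(random_walk_with_restart(A, restart))    # higher score = more topically relevant
```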

    A Predictive Model for the Parallel Processing of Digital Libraries

    The computing world faces a seemingly exponential increase in the amount of raw digital data, and the speed at which it is being collected is eclipsing our ability to manage it manually. Combine this with the increasing expectations of a growing number of experienced computer users, including real-time access and a demand for expensive-to-process file types such as multimedia, and it is not hard to understand why managing data at this scale and providing timely access to useful information requires specialized algorithms, techniques, and software. Digital libraries are being used to help address these challenges. Drawing upon knowledge learned through traditional library science, digital libraries excel at providing structured user access to a wide variety of documents. They increasingly include tools for managing, moderating, and marking up these documents. Furthermore, they often feature phases in which documents are processed independently and so can benefit from the application of parallel processing techniques, the focus of this thesis. Whether a digital library collection can benefit from parallel processing depends on considerations such as document type, processing cost per document, number of documents, and file-system input/output. To aid in deciding when to apply parallel processing techniques to digital libraries, this thesis explores the creation of a model for predicting key outcomes of leveraging such techniques. It does so by implementing parallel processing in three distinct open-source digital library tools, undertaking experiments designed to measure key processing features (such as processing time versus number of compute nodes), and applying machine learning techniques to these features in order to derive a predictive model. The resulting model predicts parallel processing performance at 96% accuracy (adjusted R-squared) for a number of exemplar collection types. The result is a generally applicable tool for estimating the benefits of applying parallel processing to a wide range of digital collections.
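
    As a rough illustration of the kind of predictive model the abstract describes, the sketch below fits an ordinary least-squares model to toy measurements of processing time against collection size and number of compute nodes, and reports the adjusted R-squared used above as the accuracy measure. The feature choice, sample values, and linear form are assumptions for illustration only:

```python
# Hedged sketch: derive a simple predictive model from measured processing
# features and score it with adjusted R-squared. All data values are invented.
import numpy as np

# Toy measurements: columns are [documents processed, compute nodes]; target is seconds.
X = np.array([[1000, 1], [1000, 2], [1000, 4],
              [5000, 1], [5000, 2], [5000, 4]], dtype=float)
y = np.array([120.0, 66.0, 40.0, 610.0, 330.0, 190.0])

# Ordinary least squares with an intercept column.
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
y_hat = X1 @ coef

# R-squared, then the adjusted variant that penalizes for the p predictors.
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
n, p = X.shape
adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
print(f"coefficients: {coef}, R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```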

    The Future(s) of Web Archive Research Across Ireland.

    The central aim of this thesis is to investigate the current state of web archive research in Ireland in light of international developments. Integrating desk research, survey studies, and case studies, and combining qualitative and quantitative methods drawn from disciplines across the humanities and information sciences, the thesis focuses on bridging the gaps between the creation of web archives and the use of archived web materials for current and future research in an Irish context. It frames web archive research in terms of the web archiving life cycle model (Bragg & Hanna, 2013), which encompasses appraisal, selection, capture, storage, quality assurance, preservation and maintenance, replay/playback, access, use, and reuse. Through a synthesis of relevant literature, the thesis examines the causes of the loss of digital heritage and how this relates to Ireland, and explores the challenges to participation in web archive research from creation to end use. A survey study explores the challenges for the creation and use of web archives, and the overlaps and intersections of such challenges across communities of practice within web archive research. A qualitative survey provides an overview of the availability and accessibility of web archives based in Ireland and their usefulness as resources for conducting research on Irish topics; it further discusses the influence of copyright and legal deposit legislation, or the lack thereof, on their ability to preserve digital heritage for future generations. An online survey investigates awareness of, and engagement or non-engagement with, web archives as resources for research in Irish academic institutions. Overall, the findings show that, owing to advances in internet, web, and software technologies, there is a need for continual evaluation of the skills, tools, and methods associated with the full web archiving life cycle. As technologies keep evolving, so too will the challenges. The findings also highlight the need for creators and users/researchers to keep moving forward as collaborators to guide the next generation of web archive research. At the same time, there is a need for continual evaluation of legal deposit legislation in line with the fragility of born-digital heritage and technological advances in publishing and communication technologies.

    An Extension Interface Concept for Multilayered Applications

    Extensibility is an important feature of modern software applications. In the context of business applications, it is one of the major selection criteria from the customer's perspective. Software extensions enable developers to integrate new features into a software system to support new requirements. However, many open challenges remain for both the software provider and the extension developer. A software provider must offer extension interfaces that define which software artifacts of the base application are allowed to be extended, where and when the extension code will run, and what resources of the base application an extension is allowed to access. While concepts for such interfaces are still a challenging research topic for "traditional" software constructed in a single programming language, they are completely missing for complex systems consisting of several abstraction layers. In addition, state-of-the-art approaches do not support providing different extension interfaces for different stakeholders. To develop an extension for a given software system, the extension developer has to understand what extension possibilities exist, which software artifacts provide them, the constraints and dependencies between the extensible software artifacts, and how to correctly implement an extension. For example, a simple user interface extension in a business application can require a developer to consider extensible artifacts from underlying business processes, database tables, and business objects. In commercial applications, extension developers can rely on classical means such as application programming interfaces, frameworks, documentation, tutorials, and example code provided by the software provider to understand the extension possibilities and how to successfully implement, deploy, and run an extension. For complex multilayered applications, relying on such classical means can be very hard and time-consuming for extension developers. In integrated development environments, various program comprehension tools and approaches have helped developers carry out development tasks. However, most of these tools focus on the code level, lack support for multilayered applications, and do not particularly focus on extensibility. In this dissertation I aim to provide better means for defining, implementing, and consuming extension interfaces for multilayered applications. I claim that explicit extension interfaces are required for multilayered applications: they simplify the implementation (i.e., the concrete realization) and maintainability of extension interfaces on the side of the software provider, as well as the consumption of these interfaces by extension developers. To support this thesis, I first analyze problems with extension interfaces from the software provider's perspective, through an example business application and an analysis of a corpus of software systems. I then analyze the problems with the consumption of extension interfaces (i.e., extension development) through a user study in which extension developers perform extension development tasks for a complex business application. Next, I present XPoints, an approach and a language for the specification of extension possibilities for multilayered applications. I develop an instantiation of XPoints and evaluate it against the current state of the art, assessing its usability through a user study. I finally show how XPoints can be applied to simplify extension development through the implementation of a recommender system for extension possibilities in multilayered applications. The advantages of the recommender system are illustrated through an example as well as through a comparison with current state-of-the-art tools for program comprehension. Topics such as extension validation, monitoring, and conflict detection are left for future work.
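
    As a generic illustration of what an explicit extension interface can look like in code (a deliberately simplified sketch, not the XPoints language or any particular vendor API; all names and the registration mechanism are hypothetical):

```python
# Hedged sketch: a base application declares an explicit extension point that
# states what may be extended and when the extension code runs; an extension
# registers a handler against it without modifying base-application code.
from typing import Callable, Dict, List

class ExtensionPoint:
    """A named hook of the base application that extensions may implement."""
    def __init__(self, name: str, description: str):
        self.name = name
        self.description = description
        self._handlers: List[Callable[[Dict], Dict]] = []

    def register(self, handler: Callable[[Dict], Dict]) -> None:
        self._handlers.append(handler)

    def invoke(self, payload: Dict) -> Dict:
        # The base application decides when this fires and what data is exposed,
        # which is one way of making the extension interface explicit.
        for handler in self._handlers:
            payload = handler(payload)
        return payload

# Base application: an extension point for enriching a sales-order form.
order_ui_hook = ExtensionPoint(
    "sales_order.ui.extra_fields",
    "Runs after the order form is built; may add read-only display fields.",
)

# Extension developer: contribute a field through the published interface only.
def add_credit_score_field(form: Dict) -> Dict:
    form.setdefault("extra_fields", []).append({"label": "Credit score", "value": "A+"})
    return form

order_ui_hook.register(add_credit_score_field)
print(order_ui_hook.invoke({"order_id": 42}))
```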

    Atti del IX Convegno Annuale dell'Associazione per l'Informatica Umanistica e la Cultura Digitale (AIUCD). La svolta inevitabile: sfide e prospettive per l'Informatica Umanistica [Proceedings of the 9th Annual Conference of the Associazione per l'Informatica Umanistica e la Cultura Digitale (AIUCD). The Inevitable Turn: Challenges and Prospects for Humanities Computing]

    Proceedings of the ninth edition of the annual AIUCD conference.

    Atti del IX Convegno Annuale AIUCD. La svolta inevitabile: sfide e prospettive per l'Informatica Umanistica. [Proceedings of the 9th Annual AIUCD Conference. The Inevitable Turn: Challenges and Prospects for Humanities Computing.]

    The ninth edition of the annual conference of the Associazione per l'Informatica Umanistica e la Cultura Digitale (AIUCD 2020; Milan, 15-17 January 2020) has as its theme "The Inevitable Turn: Challenges and Prospects for Humanities Computing", with the specific aim of providing an opportunity to reflect on the consequences of the growing spread of computational approaches to the treatment of data in the humanities. This volume collects the papers whose contents were presented at the conference. In different ways, they address the proposed theme from a more theoretical-methodological or a more empirical-practical point of view, presenting the results of works and projects (completed or in progress) in which the computational treatment of data plays a central role.