
    Web-crawling clustering in a distance-based multidimensional space and its application to opinion mining

    The multimedia explosion and the revolution brought about by Web 2.0, in which consumers of information are also producers of content, reflect a paradigm shift in communication. This shift leaves traditional opinion-polling tools such as surveys, focus groups and telephone polls limited in scope, imprecise in their results and biased by their methods, rendering them practically obsolete. The media have taken note of this, and the most prominent outlets now let readers engage with the news through social networks. Tools are therefore needed to explore and discover these new channels of expression continuously and systematically, to analyse the content expressed in them, and to extract knowledge from these information flows at massive scale. This work starts from a new conception of data mining: it analyses a new strategy for discovering channels through intelligent web crawling, proposes new ways of modelling concepts and opinions so that they can be synthesised and quantified for later analysis, presents a method for analysing the resulting perceptions, and finally demonstrates how the information obtained can be clustered.
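
    The abstract does not describe the clustering machinery itself; purely as a hedged illustration of distance-based clustering of opinion vectors in a multidimensional space, the sketch below runs a plain k-means loop over invented feature vectors (the feature dimensions, data and choice of k are assumptions for the example, not the thesis method).

    # Hedged sketch (assumption): opinions mapped to numeric feature vectors,
    # then grouped by Euclidean distance with a basic k-means loop.
    import numpy as np

    def kmeans(points, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        # Pick k initial centroids at random from the data.
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iters):
            # Assign each point to its nearest centroid (Euclidean distance).
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute each centroid as the mean of its assigned points.
            new_centroids = np.array([
                points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                for j in range(k)
            ])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels

    # Hypothetical opinion vectors, e.g. (polarity, intensity, topic weight).
    opinions = np.array([[0.9, 0.8, 0.1], [0.8, 0.7, 0.2],
                         [-0.7, 0.6, 0.9], [-0.8, 0.5, 0.8]])
    print(kmeans(opinions, k=2))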

    Federating Heterogeneous Digital Libraries by Metadata Harvesting

    This dissertation studies the challenges and issues faced in federating heterogeneous digital libraries (DLs) by metadata harvesting. The objective of federation is to provide high-level services (e.g. transparent search across all DLs) over the collective metadata from different digital libraries. There are two main approaches to federating DLs: the distributed searching approach and the harvesting approach. Because the distributed searching approach relies on executing queries against digital libraries in real time, it has problems with scalability. The difficulty of creating a distributed searching service for a large federation is the motivation behind the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). OAI-PMH supports both data providers (repositories, archives) and service providers. Service providers develop value-added services based on the information collected from data providers; data providers are simply collections of harvestable metadata. This dissertation examines the application of the metadata harvesting approach in DL federations. It addresses the following problems: (1) whether metadata harvesting provides a realistic and scalable solution for DL federation; (2) the status of, and problems with, current data provider implementations, and how to solve these problems; (3) how to synchronize data providers and service providers; (4) how to build different types of federation services over harvested metadata; (5) how to create a scalable and reliable infrastructure to support federation services. The work done in this dissertation is based on OAI-PMH, and the results have influenced the evolution of OAI-PMH; however, the results are not limited to the scope of OAI-PMH. Our approach is to design and build key services for metadata harvesting and to deploy them on the Web. Implementing publicly available services allows us to demonstrate that these approaches are practical, and the problems posed above are evaluated by performing experiments over these services. To summarize the results of this thesis, we conclude that metadata harvesting is a realistic and scalable approach to federating heterogeneous DLs. We present two models for building federation services: a centralized model and a replicated model. Our experiments also demonstrate that the repository synchronization problem can be addressed by push, pull, and hybrid push/pull models; each model has its strengths and weaknesses and fits a specific scenario. Finally, we present a scalable and reliable infrastructure to support the applications of metadata harvesting.
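
    The dissertation's own harvesting services are not reproduced here; as a minimal sketch of how a service provider pulls metadata from an OAI-PMH data provider, the Python below issues ListRecords requests and follows resumption tokens (the endpoint URL is a placeholder and error handling is omitted).

    # Minimal sketch of an OAI-PMH harvest: page through ListRecords responses,
    # following resumption tokens until the repository reports no more records.
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI = "{http://www.openarchives.org/OAI/2.0/}"   # OAI-PMH XML namespace

    def harvest(base_url, metadata_prefix="oai_dc"):
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        while True:
            url = base_url + "?" + urllib.parse.urlencode(params)
            with urllib.request.urlopen(url) as resp:
                root = ET.fromstring(resp.read())
            for record in root.iter(OAI + "record"):
                yield record   # each record holds a header plus Dublin Core metadata
            token = root.find(OAI + "ListRecords/" + OAI + "resumptionToken")
            if token is None or not (token.text or "").strip():
                break          # last page reached
            # Follow-up requests carry only the verb and the resumption token.
            params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

    # Usage against a placeholder endpoint (not a real repository URL):
    # for rec in harvest("https://example.org/oai"):
    #     print(rec.find(OAI + "header/" + OAI + "identifier").text)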

    Intelligent content acquisition in Web archiving

    Web sites are dynamic by nature, with content and structure changing over time; many pages on the Web are produced by content management systems (CMSs). The tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding the CMS the site is based on and whatever structured content the pages contain. We first present an application-aware helper (AAH) that fits into an archiving crawl processing chain to perform intelligent and adaptive crawling of Web applications, given a knowledge base of common CMSs. The AAH has been integrated into two Web crawlers in the framework of the ARCOMEM project: the proprietary crawler of the Internet Memory Foundation and a customized version of Heritrix. We then propose an efficient unsupervised Web crawling system, ACEBot (Adaptive Crawler Bot for data Extraction), a structure-driven crawler that exploits the inner structure of pages and guides the crawling process based on the importance of their content. ACEBot works in two phases: in the offline phase, it constructs a dynamic site map (limiting the number of URLs retrieved) and learns a traversal strategy based on the importance of navigation patterns (selecting those leading to valuable content); in the online phase, ACEBot performs massive downloading following the chosen navigation patterns. The AAH and ACEBot make 7 and 5 times fewer HTTP requests, respectively, than a generic crawler, without compromising effectiveness. We finally propose OWET (Open Web Extraction Toolkit), a free platform for semi-supervised data extraction; OWET allows a user to extract the data hidden behind Web forms.
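
    ACEBot's internals are not given in the abstract; purely as a hedged illustration of the two-phase idea (score simplified navigation patterns offline, then follow only valuable ones online), the Python sketch below invents its own pattern abstraction, scoring scheme, URLs and threshold.

    # Hedged sketch of a two-phase, structure-driven crawl policy.
    # Phase 1 (offline): score coarse navigation patterns from a small sample.
    # Phase 2 (online): only follow links whose pattern scored above a threshold.
    from collections import defaultdict
    from urllib.parse import urlparse

    def pattern(url):
        """Abstract a URL to a coarse navigation pattern, e.g. '/forum/<x>/<x>'."""
        path = urlparse(url).path
        parts = [p if not any(c.isdigit() for c in p) else "<x>"
                 for p in path.split("/") if p]
        return "/" + "/".join(parts)

    def learn_patterns(sampled_pages):
        """sampled_pages maps a URL to an (assumed) content-value score."""
        totals, counts = defaultdict(float), defaultdict(int)
        for url, value in sampled_pages.items():
            p = pattern(url)
            totals[p] += value
            counts[p] += 1
        return {p: totals[p] / counts[p] for p in totals}

    def should_follow(url, scores, threshold=0.5):
        return scores.get(pattern(url), 0.0) >= threshold

    # Toy example with made-up URLs and content-value scores:
    sample = {"https://example.org/forum/123/456": 0.9,
              "https://example.org/forum/124/457": 0.8,
              "https://example.org/tag/foo": 0.1}
    scores = learn_patterns(sample)
    print(should_follow("https://example.org/forum/999/111", scores))  # True
    print(should_follow("https://example.org/tag/bar", scores))        # False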

    Architecture and implementation of online communities

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999. Includes bibliographical references. By Philip Greenspun.

    Towards an eGovernment: the case of the Emirate of Dubai

    This thesis examines and assesses the transformation and implementation of the Dubai Government's operations, governance and delivery of public services through its use of ICT. The research design includes a critical examination of the evolution of ICT and its role in changing public services and government operations worldwide as an early move towards E-Government. Three recognised theories are used to examine the E-Government transformation and its effects on governments, namely the Technology Acceptance Model (TAM), the Diffusion of Innovation theory (DOI) and the lens of Max Weber's theory of bureaucracy. Broadly, the study seeks to determine which factors were important for Dubai to achieve its strategic plan. The research addressed six questions that define the scope of the work undertaken: first, to measure the status of eGovernment initiatives in terms of usefulness and ease of use; second, to assess the extent of eGovernment application in terms of Government-to-Customer, Government-to-Business, Government-to-Government and Government-to-Employee services; third, to determine the level of acceptance of eGovernment initiatives; fourth, to explore the factors and challenges in a successful eTransformation of Dubai; fifth, to assess the impacts and opportunities of eGovernment initiatives in the development of Dubai; and sixth, to formulate a model for achieving a successful implementation of eGovernment. A purposive sampling method was used to select citizens/customers, business employees and government employees, totalling 1,500 equally distributed respondents. The researcher prepared, administered and empirically tested three questionnaires, and also prepared and conducted structured interviews with officials of the eGovernment programme. The data obtained are presented and analysed. The study also examines the catalytic role of eGovernment in the development of society, commerce and government, and shows fundamental changes from traditional systems and bureaucratic paradigms to eGovernment paradigms. Comparisons are made with eGovernment applications in other countries according to rankings by the Economist Intelligence Unit (EIU); the researcher selected the top-ranked states to examine best practices in e-Government. Most importantly, this research presents a unique and original contribution to knowledge of the subject in its programme for achieving successful eGovernment through the proposed rocket-ship-shaped Al Bakr eGovernment Model of implementation and adoption, along with the conclusions and findings of the study.

    High-quality Web information provisioning and quality-based data pricing

    Today, information can be considered a production factor, a status attributable to the technological innovations the Internet and the Web have brought about. A plethora of information is now available, making it hard to find the most relevant items, and so the issue of finding and purchasing high-quality data arises. Addressing these challenges, this work first examines how high-quality information provisioning can be achieved with an approach called WiPo that exploits the idea of curation, i.e., the selection, organisation and provisioning of information with human involvement. The second part of this work investigates the issue that there is little understanding of what the value of data is and how it can be priced, despite the fact that data is already being traded on data marketplaces. To overcome this, a pricing approach based on the Multiple-Choice Knapsack Problem is proposed that allows for utility maximisation for customers and profit maximisation for vendors.
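
    The abstract does not spell out the optimisation model; as a hedged sketch of how a Multiple-Choice Knapsack formulation can capture the customer's side of quality-based pricing, the Python below picks exactly one quality level per data product to maximise utility within a budget (all products, prices and utilities are invented for the example, and the vendor's profit-maximisation side is not shown).

    # Hedged sketch: Multiple-Choice Knapsack by dynamic programming over the budget.
    def mckp(products, budget):
        """products: list of classes; each class lists (price, utility) options.
        Exactly one option is chosen per class. Returns the best utility within
        the budget, or -inf if no feasible selection exists."""
        NEG = float("-inf")
        dp = [0.0] * (budget + 1)          # zero classes: utility 0 at any budget
        for options in products:
            new_dp = [NEG] * (budget + 1)  # dp[b] = best utility with total price at most b
            for b in range(budget + 1):
                for price, utility in options:
                    if price <= b and dp[b - price] != NEG:
                        new_dp[b] = max(new_dp[b], dp[b - price] + utility)
            dp = new_dp
        return dp[budget]

    # Toy example: two data products, each offered at three quality levels.
    catalogue = [
        [(10, 1.0), (25, 2.5), (40, 3.0)],   # product A: (price, customer utility)
        [(5, 0.5), (15, 1.8), (30, 2.2)],    # product B
    ]
    print(mckp(catalogue, budget=40))         # best feasible combination (here 25 + 15)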

    Expanding perspective on open science: communities, cultures and diversity in concepts and practices

    Twenty-one years ago, the term ‘electronic publishing’ promised all manner of potential that the Web and network technologies could bring to scholarly communication, scientific research and technical innovation. Over the last two decades, tremendous developments have indeed taken place across all of these domains. One of the most important of these has been Open Science, perhaps the most widely discussed topic in research communications today. This book presents the proceedings of Elpub 2017, the 21st edition of the International Conference on Electronic Publishing, held in Limassol, Cyprus, in June 2017. Continuing the tradition of bringing together academics, publishers, lecturers, librarians, developers, entrepreneurs, users and all other stakeholders interested in the issues surrounding electronic publishing, this edition of the conference focuses on Open Science, and the 27 research and practitioner papers and 1 poster included here reflect the results and ideas of researchers and practitioners with diverse backgrounds from all around the world with regard to this important subject. Intended to generate discussion and debate on the potential and limitations of openness, the book addresses the current challenges and opportunities in the ecosystem of Open Science and explores how to move forward in developing an inclusive system that will work for a much broader range of participants. It will be of interest to all those concerned with electronic publishing, and with Open Science in particular.