9 research outputs found
Distance-based multidimensional Webcrawling clustering and its application to Opinion Mining
The multimedia explosion and the revolution that emerged with Web 2.0, in which consumers of information are at the same time producers of content, reflect a paradigm shift in communication. This shift leaves traditional opinion-polling tools such as surveys, focus groups and telephone polls limited in scope, imprecise in their results and biased by their methods, rendering them practically obsolete. The media have taken note of this, and the most representative outlets now allow readers to engage with the news through social networks. Tools are needed to explore and discover these new channels of expression in a continuous and systematic way, to analyse the content expressed through them, and to extract knowledge from these information flows at massive scale. This work starts from a new conception of data mining: it analyses a new strategy for discovering new channels through intelligent Webcrawling, proposes new ways of modelling concepts and opinions so that they can be synthesised and quantified for later analysis, presents a method for analysing the resulting perceptions, and finally demonstrates the possibilities of clustering the information obtained.
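The distance-based clustering the thesis demonstrates can be illustrated with a minimal k-means sketch over a multidimensional opinion feature space; the vectors, the deterministic initialisation and the cluster count below are illustrative assumptions, not the thesis's actual model:

```python
import math

def kmeans(points, k, iters=20):
    """Naive k-means over points in a multidimensional space: assign each
    point to its nearest centroid under Euclidean distance, then recompute
    centroids, repeating for a fixed number of iterations."""
    centroids = points[:k]  # naive deterministic initialisation (illustrative)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters
```

Opinion vectors that lie close together under the chosen distance end up in the same cluster, which is what allows groups of similar perceptions to be summarised and quantified.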
Federating Heterogeneous Digital Libraries by Metadata Harvesting
This dissertation studies the challenges and issues faced in federating heterogeneous digital libraries (DLs) by metadata harvesting. The objective of federation is to provide high-level services (e.g. transparent search across all DLs) on the collective metadata from different digital libraries. There are two main approaches to federating DLs: the distributed searching approach and the harvesting approach. As the distributed searching approach relies on executing queries against digital libraries in real time, it has problems with scalability. The difficulty of creating a distributed searching service for a large federation is the motivation behind the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). OAI-PMH supports both data providers (repositories, archives) and service providers. Service providers develop value-added services based on the information collected from data providers. Data providers are simply collections of harvestable metadata. This dissertation examines the application of the metadata harvesting approach in DL federations. It addresses the following problems: (1) whether or not metadata harvesting provides a realistic and scalable solution for DL federation; (2) what the status of and problems with current data provider implementations are, and how to solve those problems; (3) how to synchronize data providers and service providers; (4) how to build different types of federation services over harvested metadata; (5) how to create a scalable and reliable infrastructure to support federation services. The work done in this dissertation is based on OAI-PMH, and the results have influenced the evolution of OAI-PMH. However, the results are not limited to the scope of OAI-PMH. Our approach is to design and build key services for metadata harvesting and to deploy them on the Web. Implementing publicly available services allows us to demonstrate that these approaches are practical.
The problems posed above are evaluated by performing experiments over these services.
To summarize the results of this thesis, we conclude that metadata harvesting is a realistic and scalable approach to federating heterogeneous DLs. We present two models of building federation services: a centralized model and a replicated model. Our experiments also demonstrate that the repository synchronization problem can be addressed by push, pull, and hybrid push/pull models; each model has its strengths and weaknesses and fits a specific scenario. Finally, we present a scalable and reliable infrastructure to support the applications of metadata harvesting.
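Concretely, a harvester in this setting issues plain HTTP requests carrying OAI-PMH verbs such as `ListRecords` and parses the Dublin Core records in the XML response. The sketch below builds such a request URL and extracts titles from a response; the base URL and the sample XML are illustrative, not taken from the dissertation's services:

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Dublin Core element namespace used in oai_dc metadata records.
DC = "{http://purl.org/dc/elements/1.1/}"

def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request for a data provider.
    A resumption token, if present, continues a partial harvest."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urllib.parse.urlencode(params)

def harvested_titles(response_xml):
    """Extract dc:title values from a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [t.text for t in root.iter(DC + "title")]
```

A service provider would loop on `list_records_url`, following resumption tokens until the provider signals the end of the list, and feed the parsed records into its own index.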
Intelligent content acquisition in Web archiving
Web sites are dynamic by nature, with content and structure changing over time; many pages on the Web are produced by content management systems (CMSs). Tools currently used by Web archivists to preserve the content of the Web blindly crawl and store Web pages, disregarding the CMS the site is based on and whatever structured content Web pages contain. We first present an application-aware helper (AAH) that fits into an archiving crawl processing chain to perform intelligent and adaptive crawling of Web applications, given a knowledge base of common CMSs. The AAH has been integrated into two Web crawlers in the framework of the ARCOMEM project: the proprietary crawler of the Internet Memory Foundation and a customized version of Heritrix. We then propose ACEBot (Adaptive Crawler Bot for data Extraction), an efficient unsupervised, structure-driven crawler that exploits the inner structure of pages and guides the crawling process based on the importance of their content. ACEBot works in two phases: in the offline phase, it constructs a dynamic site map (limiting the number of URLs retrieved) and learns a traversal strategy based on the importance of navigation patterns (selecting those leading to valuable content); in the online phase, ACEBot performs massive downloading following the chosen navigation patterns. The AAH and ACEBot make respectively 7 and 5 times fewer HTTP requests than a generic crawler, without compromising effectiveness. We finally propose OWET (Open Web Extraction Toolkit), a free platform for semi-supervised data extraction. OWET allows a user to extract the data hidden behind Web forms.
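The offline phase's idea of ranking navigation patterns by the value of the content they lead to can be sketched as follows; the pattern representation (first URL path segment) and the importance score (average amount of text per page) are simplifying assumptions for illustration, not ACEBot's actual model:

```python
from collections import defaultdict
from urllib.parse import urlparse

def rank_navigation_patterns(pages, top_k=2):
    """Group sampled pages by a crude navigation pattern (their first
    path segment) and rank patterns by average content size per page,
    so the online phase can follow only the most valuable patterns."""
    totals = defaultdict(lambda: [0, 0])        # pattern -> [chars, pages]
    for url, text in pages:
        segments = urlparse(url).path.strip("/").split("/")
        pattern = "/" + segments[0]
        totals[pattern][0] += len(text)
        totals[pattern][1] += 1
    scored = {p: chars / n for p, (chars, n) in totals.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]
```

On a site where article pages carry far more text than tag or login pages, this heuristic would steer the bulk download toward the article pattern and skip the low-value ones, which is the intuition behind reducing HTTP requests without losing content.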
Architecture and implementation of online communities
Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1999. Includes bibliographical references. By Philip Greenspun.
HealthCyberMap: Mapping the Health Cyberspace Using Hypermedia GIS and Clinical Codes
HealthCyberMap is a Semantic Web service for healthcare professionals and librarians, patients and the public in general that aims at mapping parts of the medical/health information resources in cyberspace in novel ways to improve their retrieval and navigation. The Semantic Web aims to be the next-generation World Wide Web, giving machine-readable semantics and context to the currently presentation-based Web pages. HealthCyberMap features an unconventional use of GIS (Geographic Information Systems) to map conceptual spaces occupied by collections of medical/health information resources. Besides mapping the semantic and non-geographical aspects of these resources using suitable spatial metaphors, HealthCyberMap also collects and maps the geographical provenance of these resources. Some of HealthCyberMap's Web interfaces are visual (maps for browsing resources by clinical/health topic, by provenance and by type), while others are textual (multilingual interfaces for browsing resources by language, and a directory of topical resource categories, besides the HealthCyberMap Semantic Subject Search Engine, which goes beyond conventional free-text and keyword-based search engines and supports synonyms, disease variants and subtypes, as well as some semantic relationships between terms).
HealthCyberMap adopts a clinical metadata framework built upon a clinical coding scheme (vocabulary or ontology—ICD-9-CM* clinical classification in the current pilot service). Clinical coding schemes serve as a reliable common backbone for topical resource indexing, automated topical classification, topical visualisation and navigation of coded resource pools (using suitable metaphors), and enhanced information retrieval and linking. A resource metadata base based on Dublin Core metadata set with HealthCyberMap’s own extensions holds information about selected high-quality resources. HealthCyberMap then uses GIS spatialisation methods to generate interactive navigational cybermaps from the metadata base. These visual cybermaps are based on familiar metaphors for image-word association to give users a broad overview and understanding of what is available in this complex conceptual space of medical/ health Internet resources and help them navigate it more efficiently and effectively.
HealthCyberMap cybermaps can be considered semantically spatialised, ontology-based browsing views of the underlying resource metadata base. Using a clinical coding scheme as a metric for spatialisation (“semantic distance”) is unique to HealthCyberMap and is well suited to the semantic categorisation and navigation of medical/health Internet information resources. HealthCyberMap also introduces a useful form of cyberspatial analysis for the detection of topical coverage gaps in its resource pool using choropleth (shaded) maps of human body systems. The project features a cost-effective method for serving Web hypermaps with dynamic metadata base drill-down functionality. It also demonstrates the feasibility of problem-to-knowledge linking from Electronic Patient Records to online information services (such as HealthCyberMap), using clinical codes as crisp problem-knowledge linkers or knowledge hooks.
The Semantic Subject Search Engine queries the same HealthCyberMap resource metadata base. Explicit concepts in resource metadata map onto a brokering domain ontology (ICD-9-CM) allowing the search engine to infer implicit meanings (synonyms and semantic relationships) not directly mentioned in either the resource or its metadata. Similarly, user queries would map to the same ontology allowing the search engine to infer the implicit semantics of user queries and use them to optimise retrieval.
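The kind of ontology-backed query expansion described here can be sketched with a toy broker ontology mapping concept codes to synonym sets; the codes and synonym lists below are invented for illustration and are not taken from ICD-9-CM:

```python
# Toy broker ontology: each concept code maps to its synonyms/variants.
# Codes and terms here are illustrative, not actual ICD-9-CM content.
ONTOLOGY = {
    "C1": {"diabetes mellitus", "diabetes", "dm"},
    "C2": {"essential hypertension", "hypertension", "high blood pressure"},
}

def expand_query(query):
    """Map a free-text query onto ontology concepts, then return the full
    synonym set so retrieval also matches resources indexed under any
    other variant of the same concept."""
    q = query.lower()
    expanded = {q}
    for code, synonyms in ONTOLOGY.items():
        if q in synonyms:
            expanded |= synonyms
    return expanded
```

The same mapping applied to resource metadata at indexing time is what lets the engine infer meanings that are never literally mentioned in either the query or the resource.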
A formative evaluation study of the HealthCyberMap pilot service, using an online user evaluation questionnaire in addition to an analysis of the HealthCyberMap server transaction log, was conducted from 18 April 2002 to 1 June 2002, with very encouraging results. This two-method evaluation approach was guided by methodologies described in the NIH Web Site Evaluation and Performance Measures Toolkit, among other resources.
Many exciting future possibilities have also been investigated by the author, including the further development of HealthCyberMap as a customisable, location-based medical/health information service.
Towards an eGovernment: the case of the Emirate of Dubai
This thesis examines and assesses the transformation and implementation of the Dubai Government’s operation, governance and delivery of public services through its use of ICT. The research design includes a critical examination of the evolution of ICT and its role in changing public services and government operations worldwide as an early move towards eGovernment. Three recognised theories are used to examine the eGovernment transformation and its effects on governments, namely: the Technology Acceptance Model (TAM), the Diffusion of Innovation Theory (DOI) and the lens of Max Weber’s Theory of Bureaucracy.
Overall, the study seeks to determine which factors were important for Dubai to achieve its strategic plan. The research addressed six questions that state the scope of the work undertaken: first, to measure the status of eGovernment initiatives in terms of usefulness and ease of use; second, to assess the extent of eGovernment application in terms of Government-to-Customer, Government-to-Business, Government-to-Government and Government-to-Employees; third, to determine the level of acceptance of eGovernment initiatives; fourth, to explore the factors and challenges in a successful eTransformation of Dubai; fifth, to assess the impacts and opportunities of eGovernment initiatives in the development of Dubai; and sixth, to formulate a model for achieving a successful implementation of eGovernment.
A purposive sampling method was used to select citizens/customers, business employees and government employees, totalling 1,500 equally distributed respondents. The researcher prepared, administered and empirically tested three questionnaires, and also conducted structured interviews with eGovernment officials. The data obtained are presented and analysed. The study also examines the catalytic role of eGovernment in the development of society, commerce and government, and shows fundamental changes from traditional systems and bureaucratic paradigms to eGovernment paradigms. Comparisons are made with eGovernment applications in other countries, as per rankings by the Economist Intelligence Unit (EIU); the researcher selected the top-ranked states to examine best practices in eGovernment.
Most importantly, this research presents a unique and original contribution to knowledge of the subject through its programme for achieving successful eGovernment: the proposed rocket-ship Al Bakr eGovernment Model of implementation and adoption, together with the conclusions and findings of the study.
High-quality Web information provisioning and quality-based data pricing
Today, information can be considered a production factor. This is attributed to the technological innovations the Internet and the Web have brought about. A plethora of information is now available, making it hard to find the most relevant information; subsequently, the issue of finding and purchasing high-quality data arises. Addressing these challenges, this work first examines how high-quality information provisioning can be achieved with an approach called WiPo that exploits the idea of curation, i.e., the selection, organisation and provisioning of information with human involvement. The second part of this work investigates the issue that there is little understanding of what the value of data is and how it can be priced, despite the fact that data is already being traded on data marketplaces. To overcome this, a pricing approach based on the Multiple-Choice Knapsack Problem is proposed that allows for utility maximisation for customers and profit maximisation for vendors.
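The Multiple-Choice Knapsack framing can be sketched as follows: each data product is offered in several quality versions (a class), the customer picks exactly one version per class, and total utility is maximised under a budget. The brute-force solver and the example utilities and prices below are illustrative, not the pricing scheme proposed in the work:

```python
from itertools import product

def mckp_best_bundle(classes, budget):
    """Brute-force Multiple-Choice Knapsack: pick exactly one
    (utility, price) version from each class, maximising total
    utility subject to the budget. Fine for small toy instances;
    real instances need dynamic programming."""
    best, best_utility = None, -1
    for choice in product(*classes):
        cost = sum(price for _, price in choice)
        utility = sum(u for u, _ in choice)
        if cost <= budget and utility > best_utility:
            best, best_utility = choice, utility
    return best, best_utility
```

Under a tight budget, the optimum can mix quality levels across products, which is exactly the trade-off that lets customers maximise utility while vendors price each quality version separately.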
Expanding perspective on open science: communities, cultures and diversity in concepts and practices
Twenty-one years ago, the term ‘electronic publishing’ promised all manner of potential that the Web and network technologies could bring to scholarly communication, scientific research and technical innovation. Over the last two decades, tremendous developments have indeed taken place across all of these domains. One of the most important of these has been Open Science; perhaps the most widely discussed topic in research communications today.
This book presents the proceedings of Elpub 2017, the 21st edition of the International Conference on Electronic Publishing, held in Limassol, Cyprus, in June 2017. Continuing the tradition of bringing together academics, publishers, lecturers, librarians, developers, entrepreneurs, users and all other stakeholders interested in the issues surrounding electronic publishing, this edition of the conference focuses on Open Science, and the 27 research and practitioner papers and 1 poster included here reflect the results and ideas of researchers and practitioners with diverse backgrounds from all around the world with regard to this important subject.
Intended to generate discussion and debate on the potential and limitations of openness, the book addresses the current challenges and opportunities in the ecosystem of Open Science, and explores how to move forward in developing an inclusive system that will work for a much broader range of participants. It will be of interest to all those concerned with electronic publishing, and Open Science in particular.