5,018 research outputs found

    Self-supervised automated wrapper generation for weblog data extraction

    Data extraction from the web is notoriously hard. Of the types of resources available on the web, weblogs are becoming increasingly important due to the continued growth of the blogosphere, yet they remain poorly explored. Past approaches to data extraction from weblogs have often involved manual intervention and suffer from low scalability. This paper proposes a fully automated information extraction methodology based on the use of web feeds and processing of HTML. The approach includes a model for generating a wrapper that exploits web feeds to derive a set of extraction rules automatically. Instead of performing a pairwise comparison between posts, the model matches the values of the web feeds against their corresponding HTML elements retrieved from multiple weblog posts, and adopts a probabilistic approach to derive the extraction rules and automate wrapper generation. An evaluation of the model on a dataset of 2,393 posts shows 92% accuracy, indicating that the proposed technique enables robust extraction of weblog properties and can be applied across the blogosphere for applications such as improved information retrieval and more robust web preservation initiatives.
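    The rule-derivation idea lends itself to a compact illustration. The following Python sketch is not the paper's implementation; under invented sample data, it shows the core move: each feed value is matched against the HTML elements of several posts, every matching element path receives a vote, and the majority path becomes the extraction rule.

        # A minimal sketch of feed-driven wrapper induction (invented data,
        # not the paper's implementation): feed titles are matched against
        # post HTML, matching element paths are counted across posts, and
        # the most frequent path becomes the extraction rule.
        from collections import Counter
        from html.parser import HTMLParser

        class PathCollector(HTMLParser):
            """Records the tag path of every element whose text matches a target."""
            def __init__(self, target):
                super().__init__()
                self.target = target
                self.stack = []      # current path of open tags
                self.matches = []    # paths whose text equalled the target
            def handle_starttag(self, tag, attrs):
                self.stack.append(tag)
            def handle_endtag(self, tag):
                if self.stack and self.stack[-1] == tag:
                    self.stack.pop()
            def handle_data(self, data):
                if data.strip() == self.target:
                    self.matches.append("/".join(self.stack))

        # Toy corpus: (feed title, HTML of the corresponding post page).
        posts = [
            ("First post",  "<div><h2 class='t'>First post</h2><p>First post body</p></div>"),
            ("Second post", "<div><h2 class='t'>Second post</h2><p>Other text</p></div>"),
            ("Third post",  "<div><h2 class='t'>Third post</h2><p>Third post again</p></div>"),
        ]

        votes = Counter()
        for feed_title, html in posts:
            collector = PathCollector(feed_title)
            collector.feed(html)
            votes.update(collector.matches)   # each matching path gets one vote

        rule, count = votes.most_common(1)[0]
        print(f"title rule: {rule} (supported by {count}/{len(posts)} posts)")
        # -> title rule: div/h2 (supported by 3/3 posts)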

    EasyFJP: Providing Hybrid Parallelism as a Concern for Divide and Conquer Java Applications

    Because of the increasing availability of multi-core machines, clusters, Grids, and combinations of these, there is now plenty of computational power, but today's programmers are not fully prepared to exploit parallelism. Java, in particular, has helped in handling the heterogeneity of such environments. However, there is a lot of ground to cover regarding facilities for easily and elegantly parallelizing applications. One path to this end seems to be the synthesis of semi-automatic parallelism and Parallelism as a Concern (PaaC). The former allows users to remain mostly unaware of parallel exploitation problems while still manually optimizing parallelized applications whenever necessary; the latter allows applications to be kept separate from parallelism-related code. In this paper, we present EasyFJP, an approach that implicitly exploits parallelism in Java applications based on the fork-join synchronization pattern, a simple but effective abstraction for creating and coordinating parallel tasks. In addition, EasyFJP lets users explicitly optimize applications through policies, i.e., user-provided rules that dynamically regulate task granularity. Finally, EasyFJP relies on PaaC by means of source code generation techniques to wire applications and parallel-specific code together. Experiments with real-world applications on an emulated Grid and a cluster show that EasyFJP delivers competitive performance compared with state-of-the-art Java parallel programming tools.
    Affiliations: Mateos Diaz, Cristian Maximiliano; Zunino Suarez, Alejandro Octavio; Hirsch Jofré, Matías Eberardo. Consejo Nacional de Investigaciones Científicas y Técnicas, Centro Científico Tecnológico CONICET Tandil, Instituto Superior de Ingenieria del Software, Argentina.
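    EasyFJP itself targets Java, but the two ingredients it combines, a fork-join divide-and-conquer skeleton and a separate user-supplied policy that bounds task granularity, can be sketched briefly. The Python sketch below is only an illustration with invented names and a toy workload, not EasyFJP's API.

        # Fork-join divide and conquer with a user-supplied granularity
        # policy (invented names; illustration only). Threads are used for
        # brevity; the GIL limits real speedup for CPU-bound work.
        from concurrent.futures import ThreadPoolExecutor

        def granularity_policy(lo, hi):
            """User-tunable rule: fork only while the subproblem is large."""
            return hi - lo > 250_000

        def parallel_sum(data, lo, hi, pool):
            if not granularity_policy(lo, hi):   # small task: run sequentially
                return sum(data[lo:hi])
            mid = (lo + hi) // 2
            left = pool.submit(parallel_sum, data, lo, mid, pool)  # fork
            right = parallel_sum(data, mid, hi, pool)
            return left.result() + right                           # join

        numbers = list(range(1_000_000))
        with ThreadPoolExecutor(max_workers=4) as pool:
            print(parallel_sum(numbers, 0, len(numbers), pool))  # 499999500000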

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities for extracting the semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.
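    One of the report's points, that blogs already expose machine-readable semantics through standards such as microdata, can be illustrated with a few lines of standard-library Python. The sketch below uses invented sample markup; it simply collects schema.org itemprop values from a post.

        # A minimal sketch of the semantics blogs expose via microdata
        # (invented sample markup): schema.org itemprop attributes are
        # pulled straight from post HTML with the standard library.
        from html.parser import HTMLParser

        class MicrodataParser(HTMLParser):
            """Collects itemprop name/value pairs from microdata-annotated HTML."""
            def __init__(self):
                super().__init__()
                self.current = None   # itemprop whose text content we are inside
                self.items = {}
            def handle_starttag(self, tag, attrs):
                attrs = dict(attrs)
                if "itemprop" in attrs:
                    if "datetime" in attrs:    # value carried in an attribute
                        self.items[attrs["itemprop"]] = attrs["datetime"]
                    else:                      # value is the element's text
                        self.current = attrs["itemprop"]
            def handle_data(self, data):
                if self.current:
                    self.items[self.current] = data.strip()
                    self.current = None

        html = """
        <article itemscope itemtype="https://schema.org/BlogPosting">
          <h1 itemprop="headline">Preserving the blogosphere</h1>
          <time itemprop="datePublished" datetime="2012-05-04">4 May 2012</time>
          <span itemprop="author">A. Blogger</span>
        </article>
        """

        parser = MicrodataParser()
        parser.feed(html)
        print(parser.items)
        # {'headline': 'Preserving the blogosphere',
        #  'datePublished': '2012-05-04', 'author': 'A. Blogger'}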

    The Foundational Model of Anatomy Ontology

    Anatomy is the structure of biological organisms. The term also denotes the scientific discipline devoted to the study of anatomical entities and the structural and developmental relations that obtain among these entities during the lifespan of an organism. Anatomical entities are the independent continuants of biomedical reality on which physiological and disease processes depend, and which, in response to etiological agents, can transform themselves into pathological entities. For these reasons, hard-copy and in silico information resources in virtually all fields of biology and medicine, as a rule, make extensive reference to anatomical entities. Because of the lack of a generalizable, computable representation of anatomy, developers of computable terminologies and ontologies in clinical medicine and biomedical research have represented anatomy from their own more or less divergent viewpoints. The resulting heterogeneity presents a formidable impediment to correlating human anatomy not only across computational resources but also with the anatomy of model organisms used in biomedical experimentation. The Foundational Model of Anatomy (FMA) is being developed to fill the need for a generalizable anatomy ontology, which can be used and adapted by any computer-based application that requires anatomical information. Moreover, it is evolving into a standard reference for divergent views of anatomy and a template for representing the anatomy of animals. A distinction is made between the FMA ontology as a theory of anatomy and the implementation of this theory as the FMA artifact. In either sense of the term, the FMA is a spatial-structural ontology of the entities and relations which together form the phenotypic structure of the human organism at all biologically salient levels of granularity. Making use of explicit ontological principles and sound methods, it is designed to be understandable by human beings and navigable by computers. The FMA’s ontological structure provides for machine-based inference, enabling powerful computational tools of the future to reason with biomedical data.
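    The claim that an explicit ontological structure enables machine-based inference can be made concrete with a toy example. The fragment below is an invented simplification, not the FMA itself: once part-of is stored as an explicit relation, a transitive query derives facts that no single record states directly.

        # A toy illustration (not the FMA) of machine inference over an
        # explicit anatomical relation: a transitive part-of query recovers
        # facts no single record states directly. Entities are invented.
        PART_OF = {
            "left ventricle": "heart",
            "heart": "thoracic cavity",
            "thoracic cavity": "thorax",
            "thorax": "human body",
        }

        def all_wholes(entity):
            """Transitive closure of part-of: everything the entity is part of."""
            wholes = []
            while entity in PART_OF:
                entity = PART_OF[entity]
                wholes.append(entity)
            return wholes

        print(all_wholes("left ventricle"))
        # ['heart', 'thoracic cavity', 'thorax', 'human body']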

    Proceedings of the COLING 2004 Post Conference Workshop on Multilingual Linguistic Resources MLR2004

    In an ever-expanding information society, most information systems are now facing the "multilingual challenge". Multilingual language resources play an essential role in modern information systems. Such resources need to provide information on many languages in a common framework and should be (re)usable in many applications (for automatic or human use). Many centres have been involved in national and international projects dedicated to building harmonised language resources and creating expertise in the maintenance and further development of standardised linguistic data. These resources include dictionaries, lexicons, thesauri, wordnets, and annotated corpora developed along the lines of best practices and recommendations. However, since the late 1990s, most efforts in scaling up these resources have remained the responsibility of the local authorities, usually with very low funding (if any) and few opportunities for academic recognition of this work. Hence, it is not surprising that many of the resource holders and developers have become reluctant to give free access to the latest versions of their resources, and their actual status is therefore currently rather unclear. The goal of this workshop is to study problems involved in the development, management and reuse of lexical resources in a multilingual context. Moreover, this workshop provides a forum for reviewing the present state of language resources. The workshop is meant to bring to the international community qualitative and quantitative information about the most recent developments in the area of linguistic resources and their use in applications. The impressive number of submissions (38) to this workshop, and to other workshops and conferences dedicated to similar topics, shows that dealing with multilingual linguistic resources has become a pressing problem in the Natural Language Processing community. To cope with the number of submissions, the workshop organising committee decided to accept 16 papers from 10 countries based on the reviewers' recommendations. Six of these papers will be presented in a poster session. The papers constitute a representative selection of current trends in research on Multilingual Language Resources, such as multilingual aligned corpora, bilingual and multilingual lexicons, and multilingual speech resources. The papers also represent a characteristic set of approaches to the development of multilingual language resources, such as automatic extraction of information from corpora, combination and re-use of existing resources, online collaborative development of multilingual lexicons, and use of the Web as a multilingual language resource. The development and management of multilingual language resources is a long-term activity in which collaboration among researchers is essential. We hope that this workshop will gather many researchers involved in such developments and will give them the opportunity to discuss, exchange, and compare their approaches and strengthen their collaborations in the field. The organisation of this workshop would have been impossible without the hard work of the program committee, who managed to provide accurate reviews on time, on a rather tight schedule. We would also like to thank the COLING 2004 organising committee that made this workshop possible. Finally, we hope that this workshop will yield fruitful results for all participants.

    Forum Session at the First International Conference on Service Oriented Computing (ICSOC03)

    The First International Conference on Service Oriented Computing (ICSOC) was held in Trento, December 15-18, 2003. The focus of the conference, Service Oriented Computing (SOC), is the emerging paradigm for distributed computing and e-business processing that has evolved from object-oriented and component computing to enable building agile networks of collaborating business applications distributed within and across organizational boundaries. Of the 181 papers submitted to the ICSOC conference, 10 were selected for the forum session, which took place on December 16, 2003. The papers were chosen for their technical quality, originality, relevance to SOC, and suitability for a poster presentation or demonstration. This technical report contains the 10 papers presented during the forum session at the ICSOC conference. In particular, the last two papers in the report were submitted as industrial papers.