10,664 research outputs found

    Exploiting the social and semantic web for guided web archiving

    The constantly growing amount of Web content and the success of the Social Web lead to increasing needs for Web archiving. These needs go beyond the pure preservation of Web pages. Web archives are turning into "community memories" that aim at building a better understanding of the public view on, e.g., celebrities, court decisions, and other events. In this paper we present the ARCOMEM architecture that uses semantic information such as entities, topics, and events complemented with information from the social Web to guide a novel Web crawler. The resulting archives are automatically enriched with semantic meta-information to ease the access and allow retrieval based on conditions that involve high-level concepts. The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-33290-6_47.
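    A semantically guided crawler of the kind the abstract describes prioritises its frontier by how relevant a candidate URL's context is to the archiving campaign. The sketch below is a hypothetical illustration, not the ARCOMEM implementation; the entity set and scoring are assumed for the example.

```python
import heapq

# Assumed campaign terms for illustration only.
TARGET_ENTITIES = {"court", "ruling", "appeal"}

def relevance(context_text):
    """Score a candidate URL by how many target entities its context mentions."""
    words = set(context_text.lower().split())
    return len(TARGET_ENTITIES & words)

class GuidedFrontier:
    """Crawl frontier that yields the most campaign-relevant URL first."""
    def __init__(self):
        self._heap = []  # entries: (negated score, insertion order, url)
        self._n = 0

    def push(self, url, context_text):
        heapq.heappush(self._heap, (-relevance(context_text), self._n, url))
        self._n += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

frontier = GuidedFrontier()
frontier.push("http://example.org/weather", "weather forecast")
frontier.push("http://example.org/case", "court ruling on appeal")
print(frontier.pop())  # the court-ruling page is crawled first
```

    In a real guided crawler the score would come from entity extraction and social-Web signals rather than plain word overlap, but the frontier mechanics are the same.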

    BlogForever: D3.1 Preservation Strategy Report

    This report describes preservation planning approaches and strategies recommended by the BlogForever project as a core component of a weblog repository design. More specifically, we start by discussing why we would want to preserve weblogs in the first place and what it is exactly that we are trying to preserve. We further present a review of past and present work and highlight why current practices in web archiving do not address the needs of weblog preservation adequately. We make three distinctive contributions in this volume: a) we propose transferable practical workflows for applying a combination of established metadata and repository standards in developing a weblog repository, b) we provide an automated approach to identifying significant properties of weblog content, drawing on the notion of communities, and discuss how this affects previous strategies, and c) we propose a sustainability plan that draws upon community knowledge through innovative repository design.

    Interlinking documents based on semantic graphs

    Connectivity and relatedness of Web resources are two concepts that define to what extent different parts are connected or related to one another. Measuring connectivity and relatedness between Web resources is a growing field of research, often the starting point of recommender systems. Although relatedness is liable to subjective interpretations, connectivity is not. Given the Semantic Web's ability of linking Web resources, connectivity can be measured by exploiting the links between entities. Further, these connections can be exploited to uncover relationships between Web resources. In this paper, we apply and expand a relationship assessment methodology from social network theory to measure the connectivity between documents. The connectivity measures are used to identify connected and related Web resources. Our approach is able to expose relations that traditional text-based approaches fail to identify. We validate and assess our proposed approaches through an evaluation on a real-world dataset, where results show that the proposed techniques outperform state-of-the-art approaches.
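    The core idea of measuring connectivity through shared entity links can be sketched very simply. The example below is an illustrative toy, not the paper's methodology: documents are reduced to the sets of entities they link to (the entity URIs are assumed), and connectivity is the Jaccard overlap of those sets.

```python
# Assumed entity annotations for three hypothetical documents.
doc_entities = {
    "doc_a": {"dbpedia:Berlin", "dbpedia:Germany", "dbpedia:EU"},
    "doc_b": {"dbpedia:Berlin", "dbpedia:EU", "dbpedia:France"},
    "doc_c": {"dbpedia:Jazz"},
}

def connectivity(d1, d2):
    """Jaccard overlap of the entity sets two documents link to."""
    e1, e2 = doc_entities[d1], doc_entities[d2]
    union = e1 | e2
    return len(e1 & e2) / len(union) if union else 0.0

print(connectivity("doc_a", "doc_b"))  # 2 shared of 4 distinct entities -> 0.5
print(connectivity("doc_a", "doc_c"))  # no shared entities -> 0.0
```

    A set-overlap measure like this already exposes relations that pure text similarity misses, e.g. two documents that mention the same entity under different surface forms.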

    BlogForever D2.6: Data Extraction Methodology

    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform
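    A key observation behind RSS-driven blog extraction is that the feed already supplies clean per-post metadata that can anchor extraction from the messier HTML. A minimal sketch using only the standard library (the feed snippet is assumed; this is not the BlogForever pipeline):

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 fragment standing in for a real blog feed.
RSS = """<rss version="2.0"><channel>
  <title>Example Blog</title>
  <item><title>First post</title><link>http://blog.example/1</link>
        <pubDate>Mon, 01 Oct 2012 10:00:00 GMT</pubDate></item>
</channel></rss>"""

root = ET.fromstring(RSS)
posts = [
    {"title": item.findtext("title"),
     "link": item.findtext("link"),
     "published": item.findtext("pubDate")}
    for item in root.iter("item")
]
print(posts[0]["title"], posts[0]["link"])
```

    In the unsupervised setting the report describes, these feed records would then be aligned with the corresponding HTML pages to learn where the post content sits in each blog's template.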

    DARIAH and the Benelux


    CHORUS Deliverable 2.2: Second report - identification of multi-disciplinary key issues for gap analysis toward EU multimedia search engines roadmap

    After addressing the state of the art during the first year of CHORUS and establishing the existing landscape in multimedia search engines, we identified and analysed gaps within the European research effort during our second year. In this period we focused on three directions, notably technological issues, user-centred issues and use-cases, and socio-economic and legal aspects. These were assessed through two central studies: firstly, a concerted vision of the functional breakdown of a generic multimedia search engine, and secondly, representative use-case descriptions with the related discussion of requirements for technological challenges. Both studies were carried out in cooperation and consultation with the community at large through EC concertation meetings (multimedia search engines cluster), several meetings with our Think-Tank, presentations at international conferences, and surveys addressed to EU project coordinators as well as to coordinators of national initiatives. Based on the feedback obtained, we identified two types of gaps: core technological gaps that involve research challenges, and “enablers”, which are not necessarily technical research challenges but have an impact on innovation progress. New socio-economic trends are presented, as well as emerging legal challenges.

    Summaries on the fly: Query-based extraction of structured knowledge from web documents

    A large part of Web resources consists of unstructured textual content. Processing and retrieving relevant content for a particular information need is challenging for both machines and humans. While information retrieval techniques provide methods for detecting suitable resources for a particular query, information extraction techniques enable the extraction of structured data and text summarization allows the detection of important sentences. However, these techniques usually do not consider particular user interests and information needs. In this paper, we present a novel method to automatically generate structured summaries from user queries that uses POS patterns to identify relevant statements and entities in a certain context. Finally, we evaluate our work using the publicly available New York Times corpus, which shows the applicability of our method and the advantages over previous works. The final publication is available at Springer via https://doi.org/10.1007/978-3-642-39200-9_2
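    Extraction over part-of-speech patterns can be illustrated with a toy example. The tags and the adjective-noun pattern below are assumptions for the sketch; the paper's actual patterns and tagger are not reproduced here.

```python
# Hypothetical POS-tagged sentence (tags in the Penn Treebank style).
tagged = [("archived", "JJ"), ("pages", "NNS"), ("require", "VB"),
          ("semantic", "JJ"), ("metadata", "NN")]

def jj_nn_phrases(tagged):
    """Extract runs of adjectives followed by a noun (a common NP pattern)."""
    phrases, buf = [], []
    for word, tag in tagged:
        if tag == "JJ":
            buf.append(word)           # accumulate adjectives
        elif tag.startswith("NN") and buf:
            phrases.append(" ".join(buf + [word]))  # close the pattern
            buf = []
        else:
            buf = []                   # pattern broken, reset
    return phrases

print(jj_nn_phrases(tagged))  # ['archived pages', 'semantic metadata']
```

    In a query-based summarizer, phrases matched this way would be kept only when they occur in sentences relevant to the user's query, yielding a structured summary rather than a generic one.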

    Reframing Agribusiness: Moving from Farm to Market Centric

    Agribusiness is moving from farm to market centric, where effective activities anticipate and respond to customers, markets, and the systems in which they function. This evolution requires a broader conceptualization and a more accurate definition to convey a more dynamic, systemic, and integrative discipline, one that is increasingly committed to value creation and the sustainable orchestration of food, fiber, and renewable resources. We discuss the forces driving this shift to the market, offer a new and more representative definition of agribusiness, provide models to illustrate some of the most compelling trends, and articulate key elements and implications of those models.
    Keywords: agribusiness definition, conceptual models, market centric, market systems, Agribusiness, Marketing, Production Economics

    Towards a core ontology for information integration

    In this paper, we argue that a core ontology is one of the key building blocks necessary to enable the scalable assimilation of information from diverse sources. A complete and extensible ontology that expresses the basic concepts common across a variety of domains, and that can provide the basis for specialization into domain-specific concepts and vocabularies, is essential for well-defined mappings between domain-specific knowledge representations (i.e., metadata vocabularies) and the subsequent building of a variety of services such as cross-domain searching, browsing, data mining and knowledge extraction. This paper describes the results of a series of three workshops held in 2001 and 2002 which brought together representatives from the cultural heritage and digital library communities with the goal of harmonizing their knowledge perspectives and producing a core ontology. The knowledge perspectives of these two communities were represented by the CIDOC/CRM [31], an ontology for information exchange in the cultural heritage and museum community, and the ABC ontology [33], a model for the exchange and integration of digital library information. This paper describes the mediation process between these two different knowledge biases and the results of this mediation: the harmonization of the ABC and CIDOC/CRM ontologies, which we believe may provide a useful basis for information integration in the wider scope of the involved communities.

    Archiving digital knowledge – a theoretical and practical study based on the National Archives of Estonia

    The electronic version of this dissertation does not include the publications. The constantly accelerating growth of digital information has also underscored the need to preserve important information. Preservation here does not mean mere physical backup, but also ensuring that the information remains usable and understandable. In practice, this means we must also make sure that the hardware and software needed to use the archived material continue to exist. Where they do not, emulators that mimic a specific obsolete system can sometimes be used to open old files. On the other hand, when technological obsolescence can be foreseen, it is sensible to convert files to a more durable format in advance, or to replace the storage medium with a more modern one. Emulation, migration, and combinations of the two all help preserve the usability of information, but they do not necessarily guarantee authentic understandability, because the presentation of digital information always depends on how the preserved bits are interpreted. For example, if we create a document with WordPad and open the same document in Hex Editor Neo, we see the file in binary form; Notepad++ shows the RTF encoding; and in the Microsoft Word 2010 and LibreOffice Writer presentations we may already notice several differences. All of these representations are technologically correct. No error messages appear when the file is opened, because from the software's point of view these are exactly the presentations it should produce. It is important to stress here that even a correct presentation may remain incomprehensible to the user: the fact that the data has survived and can be read and displayed does not, unfortunately, guarantee that it will be understood correctly. To ensure understandability, the designated end users must always be taken into account.
This thesis therefore investigates how the digital archiving of knowledge (understandable information) can be supported, drawing above all on best practice, on practical experiments at the National Archives of Estonia, and on interdisciplinary approaches (e.g. combining information technology with archival science). Digital preservation of knowledge is a very broad and complex research area, and many aspects are still open for research. According to the literature, the accessibility and usability of digital information have been investigated more than the long-term comprehensibility of important digital information. Although there are remedies (e.g. emulation and migration) for mitigating the risks to accessibility and usability, how to guarantee the understandability of archived information is still an open research question. Understanding digital information first requires a representation of the archived information, which a user can then interpret and understand. However, it is a not-so-well-known fact that digital information has no fixed representation until some software is involved. For example, if we create a document in WordPad and open the same file in the Hex Editor Neo software, we will see the binary representation, which is also correct but not suitable for human users, as humans are not used to interpreting binary codes. When we open that file in Notepad++, we can see the structure of the RTF coding. Again, this is a correct interpretation of the file, but not understandable to the ordinary user, as it shows the technical view of the file format structure. When we open that file in Microsoft Word 2010 or LibreOffice Writer, we will notice some changes, although the original bits are the same and no errors are displayed by the software. Thus, all representations are technologically correct, and no errors are displayed to the user when opening the file.
It is important to emphasise that in some cases even the original representation may not be understandable to the users. Therefore, it is important to know who the main users of the archives are and to ensure that the archived objects are independently understandable to that community over the long term. This dissertation therefore researches the meaningful use of digital objects, taking into account the designated users' knowledge and the Open Archival Information System (OAIS) model. The research also includes several practical experimental projects at the National Archives of Estonia, which test some important parts of the theoretical work.
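The representation argument above can be demonstrated directly: the same stored bytes yield different, equally "correct" views depending on the interpreting software. A minimal sketch (the RTF-like fragment is illustrative, not a file actually produced by WordPad):

```python
# A tiny RTF-like byte sequence; the content is assumed for illustration.
data = b"{\\rtf1 Hello}"

# "Hex editor" view of the bits: one interpretation of the same bytes.
print(data.hex(" "))

# "Text editor" view of the very same bytes: another correct interpretation.
print(data.decode("ascii"))
```

Both printed lines come from identical bits; neither view is wrong, which is exactly why preserved bits alone do not guarantee that the archived information will be understood.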