33 research outputs found

    Automation of Flexible Migration Workflows

    Get PDF
    Many digital preservation scenarios are based on the migration strategy, which itself is heavily tool-dependent. For popular, well-defined and often open file formats – e.g., digital images, such as PNG, GIF, JPEG – a wide range of tools exist. Migration workflows become more difficult with proprietary formats, as used by the several text processing applications becoming available in the last two decades. If a certain file format can not be rendered with actual software, emulation of the original environment remains a valid option. For instance, with the original Lotus AmiPro or Word Perfect, it is not a problem to save an object of this type in ASCII text or Rich Text Format. In specific environments, it is even possible to send the file to a virtual printer, thereby producing a PDF as a migration output. Such manual migration tasks typically involve human interaction, which may be feasible for a small number of objects, but not for larger batches of files.We propose a novel approach using a software-operated VNC abstraction layer in order to replace humans with machine interaction. Emulators or virtualization tools equipped with a VNC interface are very well suited for this approach. But screen, keyboard and mouse interaction is just part of the setup. Furthermore, digital objects need to be transferred into the original environment in order to be extracted after processing. Nevertheless, the complexity of the new generation of migration services is quickly rising; a preservation workflow is now comprised not only of the migration tool itself, but of a complete software and virtual hardware stack with recorded workflows linked to every supported migration scenario. Thus the requirements of OAIS management must include proper software archiving, emulator selection, system image and recording handling. The concept of view-paths could help either to automatically determine the proper pre-configured virtual environment or to set up system images for certain migration workflows. View-paths may rise in demand, as the generation of PDF output files from Word Perfect input could be cached as pre-fabricated emulator system images. The current groundwork provides several possible optimizations, such as using the automation features of the original environments

    Automation of Flexible Migration Workflows

    Full text link

    Proceedings of the 12th International Conference on Digital Preservation

    Get PDF
    The 12th International Conference on Digital Preservation (iPRES) was held on November 2-6, 2015 in Chapel Hill, North Carolina, USA. There were 327 delegates from 22 countries. The program included 12 long papers, 15 short papers, 33 posters, 3 demos, 6 workshops, 3 tutorials and 5 panels, as well as several interactive sessions and a Digital Preservation Showcase

    Proceedings of the 12th International Conference on Digital Preservation

    Get PDF
    The 12th International Conference on Digital Preservation (iPRES) was held on November 2-6, 2015 in Chapel Hill, North Carolina, USA. There were 327 delegates from 22 countries. The program included 12 long papers, 15 short papers, 33 posters, 3 demos, 6 workshops, 3 tutorials and 5 panels, as well as several interactive sessions and a Digital Preservation Showcase

    Growth and Duplication of Public Source Code over Time: Provenance Tracking at Scale

    Get PDF
    We study the evolution of the largest known corpus of publicly available source code, i.e., the Software Heritage archive (4B unique source code files, 1B commits capturing their development histories across 50M software projects). On such corpus we quantify the growth rate of original, never-seen-before source code files and commits. We find the growth rates to be exponential over a period of more than 40 years.We then estimate the multiplication factor, i.e., how much the same artifacts (e.g., files or commits) appear in different contexts (e.g., commits or source code distribution places). We observe a combinatorial explosion in the multiplication of identical source code files across different commits.We discuss the implication of these findings for the problem of tracking the provenance of source code artifacts (e.g., where and when a given source code file or commit has been observed in the wild) for the entire body of publicly available source code. To that end we benchmark different data models for capturing software provenance information at this scale and growth rate. We identify a viable solution that is deployable on commodity hardware and appears to be maintainable for the foreseeable future

    File formats for long-term preservation of electronic publications in the e-deposit system in the Czech Republic

    Get PDF
    This bachelor's thesis focuses on selection of archival formats for the purposes of long term preservation of electronic pulications within the Czech e-deposit system. It analyzes the current situation in the Czech Republic and, on the basis of a sample of seven foreign institutions, compares it with foreign approaches and experiences. The first part of the thesis deals with the electronic formats of electronic books on the theoretical side. Based on the market analysis, the formats relevant to the Czech market for electronic publications are identified and the risks of the individual formats are described with regard to the possibilities of their long-term preservation. The second part is devoted to case studies in which the format policies of foreign libraries dealing with long-term archiving of electronic publications are documented. An analysis of electronic publications, voluntarily deposited by the publisher to the NL CR during the pilot operation of the e-deposit system, is also performed. The conclusion of the thesis summarizes the findings in the form of recommendations for archival formats for the further development of the Czech e-deposit system.Bakalářská práce se zabývá výběrem archivačních formátů pro potřeby dlouhodobého uchovávání elektronických publikací v rámci českého systému pro e-deposit. Analyzuje současný stav v ČR a na základě vzorku sedmi zahraničních institucí jej porovnává se zahraničními přístupy a zkušenostmi. První část práce se zabývá elektronickými formáty elektronických knih po teoretické stránce. Na základě analýzy trhu jsou určeny formáty relevantní pro český trh s elektronickými publikacemi a jsou popsána rizika jednotlivých formátů s ohledem na možnosti jejich dlouhodobého uchování. Druhá část práce je věnována případovým studiím, ve kterých jsou zdokumentovány formátové politiky zahraničních knihoven, zabývajících se dlouhodobou archivací elektronických publikací. Je také provedena analýza dat elektronických publikací, dobrovolně odevzdaných vydavateli do NK ČR v rámci pilotního provozu systému pro e-deposit. Závěr práce shrnuje zjištěné poznatky formou doporučení archivačních formátů pro další rozvoj českého systému pro e-deposit.Institute of Information Studies and LibrarianshipÚstav informačních studií a knihovnictvíFilozofická fakultaFaculty of Art

    Building information modeling – A game changer for interoperability and a chance for digital preservation of architectural data?

    Get PDF
    Digital data associated with the architectural design-andconstruction process is an essential resource alongside -and even past- the lifecycle of the construction object it describes. Despite this, digital architectural data remains to be largely neglected in digital preservation research – and vice versa, digital preservation is so far neglected in the design-and-construction process. In the last 5 years, Building Information Modeling (BIM) has seen a growing adoption in the architecture and construction domains, marking a large step towards much needed interoperability. The open standard IFC (Industry Foundation Classes) is one way in which data is exchanged in BIM processes. This paper presents a first digital preservation based look at BIM processes, highlighting the history and adoption of the methods as well as the open file format standard IFC (Industry Foundation Classes) as one way to store and preserve BIM data

    API Usage Recommendation via Multi-View Heterogeneous Graph Representation Learning

    Full text link
    Developers often need to decide which APIs to use for the functions being implemented. With the ever-growing number of APIs and libraries, it becomes increasingly difficult for developers to find appropriate APIs, indicating the necessity of automatic API usage recommendation. Previous studies adopt statistical models or collaborative filtering methods to mine the implicit API usage patterns for recommendation. However, they rely on the occurrence frequencies of APIs for mining usage patterns, thus prone to fail for the low-frequency APIs. Besides, prior studies generally regard the API call interaction graph as homogeneous graph, ignoring the rich information (e.g., edge types) in the structure graph. In this work, we propose a novel method named MEGA for improving the recommendation accuracy especially for the low-frequency APIs. Specifically, besides call interaction graph, MEGA considers another two new heterogeneous graphs: global API co-occurrence graph enriched with the API frequency information and hierarchical structure graph enriched with the project component information. With the three multi-view heterogeneous graphs, MEGA can capture the API usage patterns more accurately. Experiments on three Java benchmark datasets demonstrate that MEGA significantly outperforms the baseline models by at least 19% with respect to the Success Rate@1 metric. Especially, for the low-frequency APIs, MEGA also increases the baselines by at least 55% regarding the Success Rate@1

    Growth and Duplication of Public Source Code over Time: Provenance Tracking at Scale

    Get PDF
    We study the evolution of the largest known corpus of publicly available source code, i.e., the Software Heritage archive (4B unique source code files, 1B commits capturing their development histories across 50M software projects). On such corpus we quantify the growth rate of original, never-seen-before source code files and commits. We find the growth rates to be exponential over a period of more than 40 years.We then estimate the multiplication factor, i.e., how much the same artifacts (e.g., files or commits) appear in different contexts (e.g., commits or source code distribution places). We observe a combinatorial explosion in the multiplication of identical source code files across different commits.We discuss the implication of these findings for the problem of tracking the provenance of source code artifacts (e.g., where and when a given source code file or commit has been observed in the wild) for the entire body of publicly available source code. To that end we benchmark different data models for capturing software provenance information at this scale and growth rate. We identify a viable solution that is deployable on commodity hardware and appears to be maintainable for the foreseeable future
    corecore