Reconstructing human-generated provenance through similarity-based clustering
In this paper, we revisit our method for reconstructing the primary sources of documents, which make up an important part of their provenance. Our method is based on the assumption that if two documents are semantically similar, there is a high chance that they also share a common source. We previously evaluated this assumption on an excerpt from a news archive, achieving 68.2% precision and 73% recall when reconstructing the primary sources of all articles. However, because we could not release this dataset to the public, our results were hard to compare with those of other approaches. In this work, we extend the flexibility of our method by adding a new parameter, and re-evaluate it on the human-generated dataset created for the 2014 Provenance Reconstruction Challenge. The extended method achieves up to 86% precision and 59% recall, and is now directly comparable to any approach that uses the same dataset.
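To make the underlying assumption concrete, the following is a minimal sketch of similarity-based clustering, not the authors' implementation. It assumes TF-IDF vectors, cosine similarity, and single-link grouping with a cut-off; the function names and the `threshold` parameter are hypothetical and are not the parameter introduced in the paper. Documents that end up in the same cluster are treated as candidates for sharing a common source.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per document."""
    tokenized = [Counter(tokenize(d)) for d in docs]
    n = len(docs)
    df = Counter()
    for counts in tokenized:
        df.update(counts.keys())
    # Smoothed IDF so that terms present in every document keep a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in counts.items()} for counts in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def cluster_by_similarity(docs, threshold=0.3):
    """Group documents whose pairwise similarity reaches the (assumed) threshold.

    Returns a sorted list of clusters, each a list of document indices.
    """
    vectors = tfidf_vectors(docs)
    parent = list(range(len(docs)))

    def find(i):
        # Union-find lookup with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(docs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

In this sketch, raising `threshold` yields smaller, stricter clusters and lowering it yields larger, looser ones. This illustrates the kind of trade-off between precision and recall that a similarity cut-off introduces.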
The normalized Freebase distance
In this paper, we propose the Normalized Freebase Distance (NFD), a new measure for determining semantic concept relatedness that is based on principles similar to those of the Normalized Web Distance (NWD). We illustrate that the NFD is more effective when comparing ambiguous concepts.
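The abstract does not define the NFD itself, so the sketch below shows only the Normalized Web Distance that it builds on, in its standard count-based form. The function name `nwd` and its argument names are illustrative; how the counts would be obtained from Freebase is not stated here and is left open.

```python
import math


def nwd(fx, fy, fxy, n):
    """Normalized Web Distance from occurrence counts.

    fx, fy: number of items containing term x, respectively term y
    fxy:    number of items containing both terms
    n:      total number of items (normalization constant)
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("both terms must occur at least once")
    if fxy <= 0:
        # Terms that never co-occur are treated as maximally distant.
        return math.inf
    log_fx, log_fy = math.log(fx), math.log(fy)
    numerator = max(log_fx, log_fy) - math.log(fxy)
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator
```

A value of 0 means the two terms always occur together; larger values indicate weaker relatedness.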
Using EPUB 3 and the open web platform for enhanced presentation and machine-understandable metadata for digital comics
Various methods are needed to extract information from current (digital) comics. Furthermore, the use of different (proprietary) formats by comic distribution platforms creates overhead for authors. To overcome these issues, we propose a solution that makes use of the EPUB 3 specification, additionally leveraging the Open Web Platform to support animations, reading assistance, audio and multiple languages in a single format, by using our JavaScript library comicreader.js. We also provide administrative and descriptive metadata in the same format by introducing a new ontology: Dicera. Our solution is complementary to the current extraction methods, on the one hand because they can help with metadata creation, and on the other hand because the machine-understandable metadata reduces the need for them. While the reading system support for our solution is currently limited, it can offer all features needed by current comic distribution platforms. When comparing comics generated by our solution to EPUB 3 textbooks, we observed an increase in file size, mainly due to the use of images. In future work, our solution can be further improved by extending the presentation features, investigating different types of comics, studying the use of new EPUB 3 extensions, and by incorporating it in digital book authoring environments.