On Using Active Learning and Self-Training when Mining Performance Discussions on Stack Overflow
Abundant data is the key to successful machine learning. However, supervised
learning requires annotated data that are often hard to obtain. In a
classification task with limited resources, Active Learning (AL) promises to
guide annotators to examples that bring the most value for a classifier. AL can
be successfully combined with self-training, i.e., extending a training set
with the unlabelled examples for which a classifier is the most certain. We
report our experiences on using AL in a systematic manner to train an SVM
classifier for Stack Overflow posts discussing performance of software
components. We show that the training examples deemed as the most valuable to
the classifier are also the most difficult for humans to annotate. Despite
carefully evolved annotation criteria, we report low inter-rater agreement, but
we also propose mitigation strategies. Finally, based on one annotator's work,
we show that self-training can improve the classification accuracy. We conclude
the paper by discussing implications for future text miners aspiring to use AL
and self-training.
Comment: Preprint of paper accepted for the Proc. of the 21st International Conference on Evaluation and Assessment in Software Engineering, 201
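The self-training idea described above — extending the training set with the unlabelled examples the classifier is most certain about — can be sketched as follows. This is a minimal illustration using scikit-learn on synthetic data, not the paper's actual Stack Overflow setup; the 0.95 confidence threshold and the number of rounds are arbitrary assumptions.

```python
# Minimal self-training sketch: an SVM is retrained after absorbing the
# unlabelled examples it classifies with highest confidence.
# Synthetic data stands in for annotated Stack Overflow posts.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_lab, y_lab = X[:50], y[:50]      # small labelled seed set
X_unlab = X[50:]                   # pool of unlabelled examples

clf = SVC(probability=True, random_state=0).fit(X_lab, y_lab)
for _ in range(3):                 # a few self-training rounds
    proba = clf.predict_proba(X_unlab)
    confident = proba.max(axis=1) > 0.95   # assumed threshold
    if not confident.any():
        break
    # Pseudo-label the confident examples and fold them into the training set.
    X_lab = np.vstack([X_lab, X_unlab[confident]])
    y_lab = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
    X_unlab = X_unlab[~confident]
    clf = SVC(probability=True, random_state=0).fit(X_lab, y_lab)
```

In an AL setting, the examples the classifier is *least* certain about would instead be routed to human annotators, so the two strategies consume opposite ends of the same confidence ranking.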
KULTUR: showcasing art through institutional repositories
Showcasing work has always been at the heart of the arts community, whether it be through an exhibition, site-specific installation or performance. Representation of the original work has also been important, and the use of print-based options like exhibition catalogues is now complemented by websites and multimedia-friendly services like Flickr, YouTube, and Vimeo. These services also provide options for sharing born-digital material. For those working in higher education there is a need to profile both the personal and the institutional aspects of creative outputs. The KULTUR project created a model for arts-based institutional repositories, and it is hoped that this approach will be useful for other arts institutions.
LODE: Linking Digital Humanities Content to the Web of Data
Numerous digital humanities projects maintain their data collections in the
form of text, images, and metadata. While data may be stored in many formats,
from plain text to XML to relational databases, the use of the resource
description framework (RDF) as a standardized representation has gained
considerable traction during the last five years. Almost every digital
humanities meeting has at least one session concerned with the topic of digital
humanities, RDF, and linked data. While most existing work in linked data has
focused on improving algorithms for entity matching, the aim of the
LinkedHumanities project is to build digital humanities tools that work "out of
the box," enabling their use by humanities scholars, computer scientists,
librarians, and information scientists alike. With this paper, we report on the
Linked Open Data Enhancer (LODE) framework developed as part of the
LinkedHumanities project. With LODE, we enable non-technical users to enrich a
local RDF repository with high-quality data from the Linked Open Data cloud.
LODE links and enhances the local RDF repository without compromising the
quality of the data. In particular, LODE supports the user in the enhancement
and linking process by providing intuitive user-interfaces and by suggesting
high-quality linking candidates using tailored matching algorithms. We hope
that the LODE framework will be useful to digital humanities scholars
complementing other digital humanities tools.
Joining up health and bioinformatics: e-science meets e-health
CLEF (Co-operative Clinical e-Science Framework) is an MRC-sponsored project in the e-Science programme that aims to establish methodologies and a technical infrastructure for the next generation of integrated clinical and bioscience research. It is developing methods for managing and using pseudonymised repositories of long-term patient histories, which can be linked to genetic and genomic information or used to support patient care. CLEF concentrates on removing key barriers to managing such repositories: ethical issues, information capture, integration of disparate sources into coherent "chronicles" of events, user-oriented mechanisms for querying and displaying the information, and compiling the required knowledge resources. This paper describes the overall information flow and technical approach designed to meet these aims within a Grid framework.
Assigning Creative Commons Licenses to Research Metadata: Issues and Cases
This paper discusses the problem of lack of clear licensing and transparency
of usage terms and conditions for research metadata. Making research data
connected, discoverable and reusable are the key enablers of the new data
revolution in research. We discuss how the lack of transparency hinders
discovery of research data and makes it disconnected from publications and
other trusted research outcomes. In addition, we discuss the application of
Creative Commons licenses for research metadata, and provide some examples of
the applicability of this approach to internationally known data
infrastructures.
Comment: 9 pages. Submitted to the 29th International Conference on Legal Knowledge and Information Systems (JURIX 2016), Nice (France), 14-16 December 201
Mining Threat Intelligence about Open-Source Projects and Libraries from Code Repository Issues and Bug Reports
Open-source projects and libraries are widely used in software development while also bearing multiple security vulnerabilities. This use of third-party ecosystems creates a new kind of attack surface for a product in development. An intelligent attacker can attack a product by exploiting one of the vulnerabilities present in linked projects and libraries.
In this paper, we mine threat intelligence about open source projects and
libraries from bugs and issues reported on public code repositories. We also
track library and project dependencies for installed software on a client
machine. We represent and store this threat intelligence, along with the
software dependencies in a security knowledge graph. Security analysts and
developers can then query and receive alerts from the knowledge graph if any
threat intelligence is found about linked libraries and projects, utilized in
their products.
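The alerting step described above — following a product's dependency edges in the knowledge graph and flagging any library with associated threat intelligence — can be sketched as a toy in-memory graph. The product, library, and issue names below are hypothetical, and a real deployment would query an actual knowledge-graph store rather than Python dicts.

```python
# Toy sketch of the knowledge-graph alert query: dependency edges are
# followed from a product, and any linked library carrying threat
# intelligence mined from repository issues produces an alert.
depends_on = {"my-product": ["libfoo", "libbar"]}   # hypothetical dependencies
threat_intel = {                                    # hypothetical mined reports
    "libfoo": ["issue #123: buffer overflow reported in parser"],
}

def alerts_for(product):
    """Return (library, report) pairs for the product's dependencies."""
    found = []
    for lib in depends_on.get(product, []):
        for report in threat_intel.get(lib, []):
            found.append((lib, report))
    return found
```

In the paper's setting, both dicts correspond to edges in a single security knowledge graph, so the same traversal also works transitively over dependencies of dependencies.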
The role of institutional repositories in addressing higher education challenges
Over the last decade, Higher Education around the world has faced a number of challenges. Challenges such as adopting new technologies, improving the quality of learning and teaching, widening participation, student retention, curriculum design/alignment, student employability, funding, and the need to improve governance are discussed extensively in the literature. To operate effectively and to survive in this era of globalization, Higher Education institutions need to respond to those challenges efficiently. This paper proposes ways in which institutional data repositories can be utilized to address the challenges identified in the literature. We also discuss which repositories can be shared across institutions and which need not be shared in order to address those challenges. Finally, the paper discusses the barriers to sharing Higher Education repositories and how those barriers can be addressed.
Economics and Engineering for Preserving Digital Content
Progress towards practical long-term preservation seems to be stalled. Preservationists cannot afford specially developed technology, but must exploit what is created for the marketplace.
Economic and technical facts suggest that most preservation work should be shifted from repository institutions to information producers and consumers. Prior publications describe solutions for all known conceptual challenges of preserving a single digital object, but do not deal with software development or scaling to large collections. Much of the document handling software needed is available. It has, however, not yet been selected, adapted, integrated, or deployed for digital preservation. The daily tools of both information producers and information consumers can be extended to embed preservation packaging without unduly burdening these users.
We describe a practical strategy for detailed design and implementation. Document handling is intrinsically complicated because of human sensitivity to communication nuances. Our engineering section therefore starts by discussing how project managers can master the many pertinent details.
Characterizing the Landscape of Musical Data on the Web: State of the Art and Challenges
Musical data can be analysed, combined, transformed and exploited for diverse purposes. However, despite the proliferation of digital libraries and repositories for music, and of supporting infrastructures and tools, such uses of musical data remain scarce. As an initial step to help fill this gap, we present a survey of the landscape of musical data on the Web, available as a Linked Open Dataset: the musoW dataset of catalogued musical resources. We present the dataset and the methodology and criteria for its creation and assessment. We map the identified dimensions and parameters to existing Linked Data vocabularies, present insights gained from SPARQL queries, and identify significant relations between resource features. We present a thematic analysis of the original research questions associated with surveyed resources and identify the extent to which the collected resources are Linked Data-ready.
Semantic Integration of Cervical Cancer Data Repositories to Facilitate Multicenter Association Studies: The ASSIST Approach
The current work addresses the unification of Electronic Health Records related to cervical cancer into a single medical knowledge source, in the context of the EU-funded ASSIST research project. The project aims to facilitate research on cervical precancer and cancer through a system that virtually unifies multiple patient record repositories, physically located in different medical centers/hospitals, thus increasing flexibility by allowing the formation of study groups “on demand” and by recycling patient records in new studies. To this end, ASSIST uses semantic technologies to translate all medical entities (such as patient examination results, history, habits, genetic profile) and represent them in a common form, encoded in the ASSIST Cervical Cancer Ontology. The current paper presents the knowledge elicitation approach followed towards the definition and representation of the disease’s medical concepts and rules that constitute the basis for the ASSIST Cervical Cancer Ontology. The proposed approach constitutes a paradigm for semantic integration of heterogeneous clinical data that may be applicable to other biomedical application domains.