
    Assessing and refining mappings to RDF to improve dataset quality

    RDF dataset quality assessment is currently performed primarily after data is published. However, there is no systematic way to incorporate its results into the dataset, nor the assessment into the publishing workflow. Adjustments are applied manually, and rarely. Moreover, the root of the violations, which often derives from the mappings that specify how the RDF dataset will be generated, is not identified. We suggest an incremental, iterative and uniform validation workflow for RDF datasets stemming originally from (semi-)structured data (e.g., CSV, XML, JSON). In this work, we focus on assessing and improving their mappings. We incorporate (i) a test-driven approach for assessing the mappings instead of the RDF dataset itself, as mappings reflect how the dataset will be formed when generated; and (ii) semi-automatic mapping refinements based on the results of the quality assessment. The proposed workflow is applied to diverse cases, e.g., large, crowdsourced datasets such as DBpedia, and newly generated ones, such as iLastic. Our evaluation indicates the efficiency of our workflow, as it significantly improves the overall quality of an RDF dataset in the observed cases.
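    As a rough illustration of the test-driven idea, the sketch below (in Python, with an invented rule format rather than the paper's actual mapping language) runs a quality test against the mapping rules directly, so a single fix at the mapping level repairs every triple the rule would generate:

    ```python
    # Minimal sketch of test-driven mapping assessment (hypothetical rule
    # format, not the paper's test suite): quality tests run against the
    # mapping rules themselves, before any RDF is generated.

    # Each rule says which source column maps to which predicate and the
    # datatype the generated literal will carry.
    MAPPING_RULES = [
        {"column": "birth_date", "predicate": "dbo:birthDate", "datatype": "xsd:date"},
        {"column": "population", "predicate": "dbo:populationTotal", "datatype": "xsd:string"},
    ]

    # Expected literal datatypes per predicate, e.g. taken from the
    # ontology's range declarations (values here are illustrative).
    ONTOLOGY_RANGES = {
        "dbo:birthDate": "xsd:date",
        "dbo:populationTotal": "xsd:nonNegativeInteger",
    }

    def assess_mappings(rules, ranges):
        """Report each rule whose declared datatype conflicts with the
        ontology range: the root of the violation lives in the mapping,
        so fixing it here fixes every triple it would generate."""
        violations = []
        for rule in rules:
            expected = ranges.get(rule["predicate"])
            if expected and rule["datatype"] != expected:
                violations.append(
                    f"{rule['predicate']}: mapping declares {rule['datatype']}, "
                    f"ontology expects {expected}"
                )
        return violations

    for v in assess_mappings(MAPPING_RULES, ONTOLOGY_RANGES):
        print(v)
    ```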

    Quality Assessment of Linked Datasets using Probabilistic Approximation

    With the increasing application of Linked Open Data, assessing the quality of datasets by computing quality metrics becomes an issue of crucial importance. For large and evolving datasets, an exact, deterministic computation of the quality metrics is too time-consuming or expensive. We employ probabilistic techniques such as Reservoir Sampling, Bloom Filters and Clustering Coefficient estimation for implementing a broad set of data quality metrics in an approximate but sufficiently accurate way. Our implementation is integrated into the comprehensive data quality assessment framework Luzzu. We evaluated its performance and accuracy on Linked Open Datasets of broad relevance. (15 pages, 2 figures; to appear in the ESWC 2015 proceedings.)
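    Reservoir Sampling is the simplest of the techniques named above; a minimal Python implementation (Algorithm R, with synthetic numbers standing in for a real dataset's per-subject statistics) looks like this:

    ```python
    import random

    def reservoir_sample(stream, k, rng=random):
        """Algorithm R: keep a uniform sample of k items from a stream of
        unknown length using O(k) memory -- the kind of probabilistic
        primitive used to approximate quality metrics cheaply."""
        sample = []
        for i, item in enumerate(stream):
            if i < k:
                sample.append(item)
            else:
                j = rng.randint(0, i)  # inclusive bounds
                if j < k:
                    sample[j] = item
        return sample

    # Approximate a metric (here: a mean) from the sample instead of
    # scanning the full dataset. Values are synthetic, for illustration.
    values = (len(str(n)) for n in range(1_000_000))
    sample = reservoir_sample(values, k=1000)
    print(sum(sample) / len(sample))
    ```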

    More is more in language learning: reconsidering the less-is-more hypothesis

    The Less-is-More hypothesis was proposed to explain age-of-acquisition effects in first language (L1) acquisition and second language (L2) attainment. We scrutinize different renditions of the hypothesis by examining how learning outcomes are affected by (1) limited cognitive capacity, (2) reduced interference resulting from less prior knowledge, and (3) simplified language input. While there is little-to-no evidence of benefits of limited cognitive capacity, there is ample support for a More-is-More account linking enhanced capacity with better L1- and L2-learning outcomes, and reduced capacity with childhood language disorders. Instead, reduced prior knowledge (relative to adults) may afford children greater flexibility in inductive inference; this contradicts the idea that children benefit from a more constrained hypothesis space. Finally, studies of child-directed speech (CDS) confirm benefits from less complex input at early stages, but also emphasize how greater lexical and syntactic complexity of the input confers benefits in L1-attainment.

    Mapping solar array location, size, and capacity using deep learning and overhead imagery

    The effective integration of distributed solar photovoltaic (PV) arrays into existing power grids will require access to high quality data: the location, power capacity, and energy generation of individual solar PV installations. Unfortunately, existing methods for obtaining this data are limited in their spatial resolution and completeness. We propose a general framework for accurately and cheaply mapping individual PV arrays, and their capacities, over large geographic areas. At the core of this approach is a deep learning algorithm called SolarMapper (which we make publicly available) that can automatically map PV arrays in high resolution overhead imagery. We estimate the performance of SolarMapper on a large dataset of overhead imagery across three cities in California. We also describe a procedure for deploying SolarMapper to new geographic regions, so that it can be utilized by others. We demonstrate the effectiveness of the proposed deployment procedure by using it to map solar arrays across the entire US state of Connecticut (CT). Using these results, we demonstrate that we achieve highly accurate estimates of total installed PV capacity within each of CT's 168 municipal regions.
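    The capacity-estimation step reduces to simple area arithmetic once panels have been segmented. The sketch below illustrates this with invented constants (the panel power density and image resolution are assumptions, not the paper's calibrated values):

    ```python
    import numpy as np

    # Back-of-envelope sketch: once a segmentation model has produced a
    # binary PV mask, installed capacity can be estimated from panel area.
    WATTS_PER_M2 = 150.0   # assumed panel power density (illustrative)
    GSD_M = 0.3            # assumed ground sample distance (m/pixel)

    def estimate_capacity_kw(mask: np.ndarray, gsd_m: float = GSD_M) -> float:
        """Estimate PV capacity (kW) from a binary mask of panel pixels."""
        panel_area_m2 = mask.sum() * gsd_m ** 2  # pixels -> square meters
        return panel_area_m2 * WATTS_PER_M2 / 1000.0

    # Synthetic 1000x1000 tile with a 40x80-pixel array detected.
    mask = np.zeros((1000, 1000), dtype=np.uint8)
    mask[100:140, 200:280] = 1
    print(f"{estimate_capacity_kw(mask):.1f} kW")
    ```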

    Information sharing performance management: a semantic interoperability assessment in the maritime surveillance domain

    Information Sharing (IS) is essential for organizations to obtain information in a cost-effective way. If the existing information is not shared among the organizations that hold it, the alternative is to develop the necessary capabilities to acquire, store, process and manage it, which leads to duplicated costs, especially unwanted when governmental organizations are concerned. The European Commission has elected IS among public administrations as a priority, has launched several IS initiatives, such as the EUCISE2020 project within the roadmap for developing the maritime Common Information Sharing Environment (CISE), and has defined the levels of interoperability essential for IS, which entail Semantic Interoperability (SI). An open question is how IS performance can be managed: specifically, how can IS as-is and to-be states and targets be defined, and how can organizations' progress be monitored and controlled? In this paper, we propose 11 indicators for assessing SI that contribute to answering these questions. They have been demonstrated and evaluated with data collected through a questionnaire, based on the CISE information model proposed during the CoopP project, which was answered by five public authorities that require maritime surveillance information and are committed to sharing information with each other.
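    As a hypothetical example of what one such indicator could look like (the concept names below are invented stand-ins for the CISE information model, not the paper's actual indicators), coverage of a common model's concepts can be computed as-is and tracked over time against a to-be target:

    ```python
    # Toy semantic-interoperability indicator: the share of a reference
    # information model's concepts covered by an organization's own model.
    CISE_CONCEPTS = {"Vessel", "Position", "Cargo", "Incident", "RiskProfile"}

    def concept_coverage(org_concepts: set, reference: set = CISE_CONCEPTS) -> float:
        """Fraction of the reference model's concepts present in the
        organization's model -- an as-is measurement that can be re-run
        periodically to monitor progress toward a to-be target."""
        return len(org_concepts & reference) / len(reference)

    authority_a = {"Vessel", "Position", "Incident"}
    print(f"coverage: {concept_coverage(authority_a):.0%}")  # 60%
    ```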

    Towards a Universal Wordnet by Learning from Combined Evidence

    Lexical databases are invaluable sources of knowledge about words and their meanings, with numerous applications in areas like NLP, IR, and AI. We propose a methodology for the automatic construction of a large-scale multilingual lexical database where words of many languages are hierarchically organized in terms of their meanings and their semantic relations to other words. This resource is bootstrapped from WordNet, a well-known English-language resource. Our approach extends WordNet with around 1.5 million meaning links for 800,000 words in over 200 languages, drawing on evidence extracted from a variety of resources including existing (monolingual) wordnets, (mostly bilingual) translation dictionaries, and parallel corpora. Graph-based scoring functions and statistical learning techniques are used to iteratively integrate this information and build an output graph. Experiments show that this wordnet has a high level of precision and coverage, and that it can be useful in applied tasks such as cross-lingual text classification.
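    A toy sketch of the evidence-combination step (the data, weights, and the noisy-or scoring below are illustrative simplifications, not the paper's actual graph-based scoring functions):

    ```python
    # Score candidate (word, sense) links by combining independent
    # evidence weights, e.g. from a bilingual dictionary and a parallel
    # corpus. All values are invented for illustration.
    from math import prod

    EVIDENCE = {
        ("Hund", "dog.n.01"): [0.9, 0.7],  # two strong, agreeing sources
        ("Hund", "frank.n.02"): [0.1],      # one weak, ambiguous entry
    }

    def link_score(weights):
        """Noisy-or: probability that at least one source is correct."""
        return 1.0 - prod(1.0 - w for w in weights)

    for (word, sense), weights in EVIDENCE.items():
        print(word, sense, round(link_score(weights), 3))
    ```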

    An intelligent linked data quality dashboard

    This paper describes a new intelligent, data-driven dashboard for linked data quality assessment. The development goal was to assist data quality engineers in interpreting the data quality problems found when evaluating a dataset using a metrics-based data quality assessment. This required constructing a graph linking the problematic things identified in the data, the assessment metrics, and the source data. This context and the supporting user interfaces help the user to understand data quality problems. An analysis widget also helps the user identify the root cause of multiple problems. This supports the user in identifying and prioritizing the problems that need to be fixed in order to improve data quality. The dashboard was shown to be useful for users cleaning data. A user evaluation was performed with both expert and novice data quality engineers.
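    A minimal sketch of the root-cause idea (the records are invented for illustration, not the dashboard's actual data model): grouping reported problems by the source element that produced them surfaces the fixes with the largest payoff:

    ```python
    # Group individual quality problems by the source element that
    # produced them, so one fix can address many reported violations.
    from collections import Counter

    problems = [
        {"triple": "ex:p1 ex:age 'abc'", "metric": "datatype", "source": "column:age"},
        {"triple": "ex:p2 ex:age 'xyz'", "metric": "datatype", "source": "column:age"},
        {"triple": "ex:p3 ex:mail ''", "metric": "completeness", "source": "column:mail"},
    ]

    by_source = Counter(p["source"] for p in problems)
    for source, n in by_source.most_common():
        print(f"{source}: {n} problem(s)")  # column:age is the top root cause
    ```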

    An analytical model for Loc/ID mappings caches

    Concerns regarding the scalability of interdomain routing have encouraged researchers to start elaborating a more robust Internet architecture. While consensus on the exact form of the solution is yet to be found, the need for a semantic decoupling of a node's location and identity is generally accepted as a promising way forward. However, this typically requires caches that store temporal bindings between the two namespaces, to avoid hampering router packet forwarding speeds. In this article, we propose a methodology for the analytical modeling of cache performance that relies on working-set theory. We first identify the conditions that network traffic must comply with for the theory to be applicable and then develop a model that predicts average cache miss rates from easily measurable traffic parameters. We validate the result by emulation, using real packet traces collected at the egress points of a campus network and an academic network. To prove its versatility, we extend the model to consider cache-polluting user traffic and observe that simple, low-intensity attacks drastically reduce performance, whereby manufacturers should either overprovision router memory or implement more complex cache eviction policies.
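    Denning's working-set estimate is easy to state in code: with window size T, the predicted miss rate is the fraction of lookups whose destination was not referenced within the previous T lookups. The sketch below uses a synthetic heavy-tailed trace in place of the paper's real packet traces:

    ```python
    import random

    def working_set_miss_rate(trace, T):
        """Fraction of references not seen in the preceding window of T."""
        last_seen = {}
        misses = 0
        for t, dest in enumerate(trace):
            if dest not in last_seen or t - last_seen[dest] > T:
                misses += 1
            last_seen[dest] = t
        return misses / len(trace)

    random.seed(0)
    # Zipf-like synthetic destination trace: a few popular prefixes dominate.
    trace = [int(random.paretovariate(1.2)) for _ in range(100_000)]
    for T in (100, 1000, 10_000):
        print(T, round(working_set_miss_rate(trace, T), 4))
    ```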

    RegenBase: a knowledge base of spinal cord injury biology for translational research.

    Spinal cord injury (SCI) research is a data-rich field that aims to identify the biological mechanisms resulting in loss of function and mobility after SCI, as well as to develop therapies that promote recovery after injury. SCI experimental methods, data and domain knowledge are locked in the largely unstructured text of scientific publications, making large-scale integration with existing bioinformatics resources and subsequent analysis infeasible. The lack of standard reporting for experiment variables and results also makes experiment replicability a significant challenge. To address these challenges, we have developed RegenBase, a knowledge base of SCI biology. RegenBase integrates curated literature-sourced facts and experimental details, raw assay data profiling the effect of compounds on enzyme activity and cell growth, and structured SCI domain knowledge in the form of the first ontology for SCI, using Semantic Web representation languages and frameworks. RegenBase uses consistent identifier schemes and data representations that enable automated linking among RegenBase statements and to other biological databases and electronic resources. By querying RegenBase, we have identified novel biological hypotheses linking the effects of perturbagens to observed behavioral outcomes after SCI. RegenBase is publicly available for browsing, querying and download. Database URL: http://regenbase.org
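    A hedged sketch of the kind of hypothesis-generating query such a knowledge base supports, using rdflib with invented placeholder predicates (not RegenBase's actual schema):

    ```python
    from rdflib import Graph, Literal, Namespace

    # Tiny in-memory stand-in for a curated SCI knowledge base.
    EX = Namespace("http://example.org/sci#")
    g = Graph()
    g.add((EX.compoundA, EX.inhibits, EX.enzymeX))
    g.add((EX.enzymeX, EX.associatedWithOutcome, Literal("improved locomotion")))

    # Chain curated facts to surface a candidate hypothesis:
    # compound -> enzyme -> behavioral outcome.
    q = """
    PREFIX ex: <http://example.org/sci#>
    SELECT ?compound ?outcome WHERE {
        ?compound ex:inhibits ?enzyme .
        ?enzyme ex:associatedWithOutcome ?outcome .
    }
    """
    for compound, outcome in g.query(q):
        print(compound, "->", outcome)
    ```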
