55 research outputs found

    A network aware resource discovery service (a performance evaluation study)

    The Internet has in recent years become a huge set of channels for content distribution, highlighting the limits and inefficiencies of the current protocol suite, which was originally designed for host-to-host communication. In this paper we exploit recent advances in Information Centric Networking in an attempt to reshape the Internet infrastructure from a host-centric to a name-centric paradigm, where the focus is on named data rather than on the machines hosting those data. In particular, we propose a Content Name System (CNS) Service that provides a new network-aware content discovery service. The CNS behavior and architecture use BGP inter-domain routing information: the service registers and discovers resource names in each Autonomous System (AS), and contents are discovered by searching through an augmented AS graph representation that classifies ASes into customers, providers, and peers, as the BGP protocol does. The performance of CNS can be characterized by the fraction of ASes that successfully locate a requested content and by the average number of CNS Servers explored during the search phase. A C-based simulator of CNS is developed and run over real AS topologies provided by the Center for Applied Internet Data Analysis to estimate both performance indices. Preliminary performance and sensitivity results show that the CNS approach is promising and can be implemented efficiently by incrementally deploying CNS Servers.
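    The discovery walk over the classified AS graph can be sketched as a breadth-first search that escalates from a querying AS to its providers and peers. This is an illustrative reconstruction, not the paper's implementation; the `ASNode` structure, the traversal order, and the example AS numbers are all assumptions.

    ```python
    from collections import deque

    class ASNode:
        """One Autonomous System holding a local name registry and
        BGP-classified links (providers and peers), as in the abstract."""
        def __init__(self, asn):
            self.asn = asn
            self.registry = set()   # content names registered at this AS
            self.providers = []
            self.peers = []

    def discover(start, name):
        """Search the AS graph for `name` starting at `start`.
        Returns (asn_where_found, servers_explored), or (None, explored).
        `servers_explored` mirrors the paper's second performance index."""
        seen, explored = set(), 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node.asn in seen:
                continue
            seen.add(node.asn)
            explored += 1
            if name in node.registry:
                return node.asn, explored
            # Escalate to providers, then peers, following the
            # customer/provider/peer classification.
            queue.extend(node.providers)
            queue.extend(node.peers)
        return None, explored
    ```

    In this toy form the two performance indexes reported in the paper fall out directly: whether the lookup succeeds, and how many CNS Servers it touched.
    
    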

    Improving Record Linkage Accuracy with Hierarchical Feature Level Information and Parsed Data

    Probabilistic record linkage is a well-established topic in the literature. Fellegi-Sunter probabilistic record linkage and its enhanced versions are commonly used methods, which calculate match and non-match weights for each pair of records. Bayesian network classifiers, namely the naive Bayes classifier and TAN, have also been used successfully here. Recently, an extended version of TAN (called ETAN) has been developed and shown to be superior in classification accuracy to conventional TAN. However, no previous work has applied ETAN to record linkage or investigated the benefits of using the naturally occurring hierarchical feature level information and parsed fields of the datasets. In this work, we extend the naive Bayes classifier with such hierarchical feature level information. Finally, we illustrate the benefits of our method over previously proposed methods on four datasets in terms of linkage performance (F1 score). We also show that the results can be further improved by additionally parsing the fields of these datasets.
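    The classic Fellegi-Sunter match/non-match weights mentioned above can be sketched in a few lines. This is a minimal illustration of the baseline weighting scheme only, not the paper's hierarchical extension; the field names and the `m`/`u` probabilities are made-up examples.

    ```python
    import math

    def fs_score(rec_a, rec_b, params):
        """Fellegi-Sunter log2 weight for a record pair.
        params maps each field to (m, u), where
          m = P(field agrees | records are a true match)
          u = P(field agrees | records are a non-match).
        Agreement on a field adds log2(m/u); disagreement adds
        log2((1-m)/(1-u)), which is negative when m > u."""
        total = 0.0
        for field, (m, u) in params.items():
            if rec_a.get(field) == rec_b.get(field):
                total += math.log2(m / u)
            else:
                total += math.log2((1 - m) / (1 - u))
        return total
    ```

    Pairs scoring above an upper threshold are declared matches, those below a lower threshold non-matches, with the band in between left for clerical review.
    
    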

    On providing semantic alignment and unified access to music library metadata

    A variety of digital data sources, including institutional and formal digital libraries, crowd-sourced community resources, and data feeds provided by media organisations such as the BBC, expose information of musicological interest, describing works, composers, performers, and wider historical and cultural contexts. Aggregated access across such datasets is desirable, as these sources provide complementary information on shared real-world entities. Where datasets do not share identifiers, an alignment process is required, but this process is fraught with ambiguity and difficult to automate, while manual alignment is time-consuming and error-prone. We address this problem through the application of a Linked Data model and framework to assist domain experts in this process. Candidate alignment suggestions are generated automatically based on textual and contextual similarity; the latter is determined according to user-configurable weighted graph traversals. Match decisions confirming or disputing the candidate suggestions are obtained in conjunction with user insight and expertise. These decisions are integrated into the knowledge base, enabling further iterative alignment and simplifying the creation of unified viewing interfaces. Provenance of the musicologist's judgement is captured and published, supporting scholarly discourse and counter-proposals. We present our implementation and evaluation of this framework, conducting a user study with eight musicologists. We further demonstrate the value of our approach through a case study providing aligned access to catalogue metadata and digitised score images from the British Library and other sources, and to broadcast data from the BBC Radio 3 Early Music Show.
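    The "user-configurable weighted graph traversals" behind contextual similarity can be illustrated with a one-hop version: score two entities by the overlap of their typed graph neighbourhoods, with a user-supplied weight per edge type. The graph encoding, the edge-type names, and the normalisation are all assumptions for the sketch; the paper's framework traverses Linked Data graphs more generally.

    ```python
    def contextual_similarity(graph, a, b, weights):
        """One-hop contextual similarity between entities a and b.
        graph maps each node to a set of (edge_type, neighbour) pairs;
        weights maps each edge type to a user-configurable weight.
        Shared typed neighbours count toward the score, normalised by
        the weighted size of the combined neighbourhood."""
        na = graph.get(a, set())
        nb = graph.get(b, set())
        shared = na & nb
        score = sum(weights.get(etype, 0.0) for etype, _ in shared)
        norm = sum(weights.get(etype, 0.0) for etype, _ in na | nb)
        return score / norm if norm else 0.0
    ```

    Raising the weight of, say, composer edges relative to period edges lets a musicologist tune which kinds of shared context count most toward a candidate alignment.
    
    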

    Unsupervised record matching with noisy and incomplete data

    We consider the problem of duplicate detection in noisy and incomplete data: given a large data set in which each record has multiple entries (attributes), detect which distinct records refer to the same real-world entity. This task is complicated by noise (such as misspellings) and missing data, which can lead to records being different despite referring to the same entity. Our method consists of three main steps: creating a similarity score between records, grouping records together into "unique entities", and refining the groups. We compare various methods for creating similarity scores between noisy records, considering different combinations of string matching, term frequency-inverse document frequency methods, and n-gram techniques. In particular, we introduce a vectorized soft term frequency-inverse document frequency method with an optional refinement step. We also discuss two methods to deal with missing data in computing similarity scores. We test our method on the Los Angeles Police Department Field Interview Card data set, the Cora Citation Matching data set, and two sets of restaurant review data. The results show that the methods that use words as the basic units are preferable to those that use 3-grams. Moreover, in some (but certainly not all) parameter ranges, soft term frequency-inverse document frequency methods can outperform the standard term frequency-inverse document frequency method. The results also confirm that our method for automatically determining the number of groups typically works well and allows for accurate results in the absence of a priori knowledge of the number of unique entities in the data set.
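    The first step, scoring record pairs, can be sketched with the plain TF-IDF cosine baseline that the paper's "soft" variant refines (the soft version additionally credits near-identical tokens such as misspellings; that refinement is omitted here). Records are represented as token lists; the example data is invented.

    ```python
    import math
    from collections import Counter

    def tfidf_vectors(records):
        """Build a TF-IDF vector (dict token -> weight) per record.
        idf = ln(N / document frequency), computed over the given records."""
        n = len(records)
        df = Counter(tok for rec in records for tok in set(rec))
        idf = {tok: math.log(n / c) for tok, c in df.items()}
        return [{tok: tf * idf[tok] for tok, tf in Counter(rec).items()}
                for rec in records]

    def cosine(u, v):
        """Cosine similarity between two sparse vectors stored as dicts."""
        dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0
    ```

    Thresholding these pairwise scores then drives the grouping step, with records above the threshold merged into the same candidate entity.
    
    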

    Active learning with optimal instance subset selection

    Active learning (AL) traditionally relies on instance-based utility measures (such as uncertainty) to assess individual instances and label the ones with the maximum values for training. In this paper, we argue that such approaches cannot produce good labeling subsets, mainly because instances are evaluated independently without considering their interactions, and individuals with maximal utility do not necessarily form an optimal instance subset for learning. Alternatively, we propose to achieve AL with optimal subset selection (ALOSS), where the key is to find an instance subset with a maximum utility value. To achieve this goal, ALOSS simultaneously considers 1) the importance of individual instances and 2) the disparity between instances, to build an instance-correlation matrix. As a result, AL is transformed into a semidefinite programming problem to select a k-instance subset with a maximum utility value. Experimental results demonstrate that ALOSS outperforms state-of-the-art approaches for AL. © 2012 IEEE
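    The subset-utility idea can be illustrated with a matrix Q that carries per-instance importance on its diagonal and pairwise disparity off the diagonal, so that redundant instances drag a subset's score down. The greedy loop below is a simple stand-in for the paper's semidefinite programming formulation, and all the numbers in the example are illustrative.

    ```python
    def subset_utility(Q, subset):
        """Utility of a subset: the sum of entries Q[i][j] over all pairs
        (i, j) drawn from the subset, i.e. importance plus disparity."""
        return sum(Q[i][j] for i in subset for j in subset)

    def greedy_select(Q, k):
        """Greedily grow a k-instance subset, at each step adding the
        instance that maximises the resulting subset utility. A stand-in
        for the SDP relaxation, not an exact solver."""
        n = len(Q)
        chosen = []
        while len(chosen) < k:
            best = max((i for i in range(n) if i not in chosen),
                       key=lambda i: subset_utility(Q, chosen + [i]))
            chosen.append(best)
        return sorted(chosen)
    ```

    With negative off-diagonal entries for near-duplicate instances, the selection prefers a diverse pair over two individually strong but mutually redundant ones, which is exactly the interaction effect the abstract argues instance-wise AL misses.
    
    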

    Domain Centered Design Support

    No full text

    InsightVideo: Toward hierarchical video content organization for efficient browsing, summarization and retrieval

    Hierarchical video browsing and feature-based video retrieval are two standard methods for accessing video content. Very little research, however, has addressed the benefits of integrating these two methods for more effective and efficient video content access. In this paper, we introduce InsightVideo, a video analysis and retrieval system, which joins video content hierarchy, hierarchical browsing and retrieval for efficient video access. We propose several video processing techniques to organize the content hierarchy of the video. We first apply a camera motion classification and key-frame extraction strategy that operates in the compressed domain to extract video features. Then, shot grouping, scene detection and pairwise scene clustering strategies are applied to construct the video content hierarchy. We introduce a video similarity evaluation scheme at different levels (key-frame, shot, group, scene, and video). By integrating the video content hierarchy and the video similarity evaluation scheme, hierarchical video browsing and retrieval are seamlessly integrated for efficient content access. We construct a progressive video retrieval scheme to refine user queries through the interactions of browsing and retrieval. Experimental results and comparisons of camera motion classification, key-frame extraction, scene detection, and video retrieval are presented to validate the effectiveness and efficiency of the proposed algorithms and the performance of the system. © 2005 IEEE
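    The key-frame extraction step at the bottom of such a content hierarchy can be illustrated with a threshold rule: keep a frame as a key frame whenever it differs enough from the previous key frame. The paper works in the compressed domain with camera-motion cues; here frames are plain feature vectors (e.g. color histograms) and the distance is L1, purely for illustration.

    ```python
    def keyframes(frames, threshold):
        """Threshold-based key-frame selection.
        frames: list of equal-length numeric feature vectors.
        A frame becomes a key frame when its L1 distance from the most
        recent key frame exceeds `threshold`. Returns key-frame indices;
        the first frame is always kept."""
        if not frames:
            return []
        keys = [0]
        for i, frame in enumerate(frames[1:], start=1):
            last = frames[keys[-1]]
            dist = sum(abs(a - b) for a, b in zip(frame, last))
            if dist > threshold:
                keys.append(i)
        return keys
    ```

    Grouping the selected key frames by similarity would then yield the shot and scene levels of the hierarchy that browsing and retrieval both operate on.
    
    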