
    Identifying Web Tables - Supporting a Neglected Type of Content on the Web

    The abundance of data on the Internet facilitates the improvement of extraction and processing tools. The trend in open data publishing encourages the adoption of structured formats such as CSV and RDF. However, there is still a plethora of unstructured data on the Web which we assume contains semantics. For this reason, we propose an approach to derive semantics from web tables, which remain the most popular publishing tool on the Web. The paper also discusses methods and services for unstructured data extraction and processing, as well as machine learning techniques to enhance such a workflow. The eventual result is a framework to process, publish and visualize linked open data. The software enables table extraction from various open data sources in HTML format and automatic export to RDF, making the data linked. The paper also evaluates machine learning techniques in conjunction with string similarity functions applied to the table recognition task. Comment: 9 pages, 4 figures
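
    A minimal sketch of how string similarity functions could feed such a table recognition step; the vocabulary, the threshold, and the use of difflib as the similarity function are illustrative assumptions, not the paper's implementation:

    ```python
    # Illustrative sketch: score whether an HTML table looks like a genuine data
    # table by comparing its header cells against a small vocabulary of expected
    # labels with a normalized string similarity. Vocabulary and cut-off are assumed.
    from difflib import SequenceMatcher

    def string_similarity(a: str, b: str) -> float:
        """Similarity in [0, 1]; difflib's ratio stands in for Levenshtein-style measures."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def table_score(header_cells, vocabulary):
        """Average best-match similarity of each header cell to the vocabulary."""
        if not header_cells:
            return 0.0
        best = [max(string_similarity(cell, term) for term in vocabulary)
                for cell in header_cells]
        return sum(best) / len(best)

    vocabulary = ["name", "country", "population", "year", "value"]   # assumed
    headers = ["Country", "Population (2020)", "Yr"]
    print(table_score(headers, vocabulary))  # e.g. keep the table if the score > 0.6
    ```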

    The Semantic Web MIDI Tape: An Interface for Interlinking MIDI and Context Metadata

    The Linked Data paradigm has been used to publish a large number of musical datasets and ontologies on the Semantic Web, such as MusicBrainz, AcousticBrainz, and the Music Ontology. Recently, the MIDI Linked Data Cloud has been added to these datasets, representing more than 300,000 pieces in MIDI format as Linked Data and opening up the possibility of linking fine-grained symbolic music representations to existing music metadata databases. Although the dataset makes MIDI resources available in Web data standard formats such as RDF and SPARQL, the important issue of finding meaningful links between these MIDI resources and relevant contextual metadata in other datasets remains. A fundamental barrier to the provision and generation of such links is the difficulty users have in adding new MIDI performance data and metadata to the platform. In this paper, we propose the Semantic Web MIDI Tape, a set of tools and an associated interface for interacting with the MIDI Linked Data Cloud that enables users to record, enrich, and retrieve MIDI performance data and related metadata in native Web data standards. The goal of such interactions is to find meaningful links between published MIDI resources and their relevant contextual metadata. We evaluate the Semantic Web MIDI Tape in various use cases involving user-contributed content, MIDI similarity querying, and entity recognition methods, and discuss their potential for finding links between MIDI resources and metadata.
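
    As a rough illustration of what publishing one recorded MIDI event as Linked Data could look like, the sketch below uses rdflib with a placeholder namespace; the property names are hypothetical and are not the vocabulary actually used by the MIDI Linked Data Cloud:

    ```python
    # Illustrative only: serialize a single MIDI note-on event as RDF triples.
    # The namespace and property names below are hypothetical placeholders,
    # not the actual MIDI Linked Data Cloud vocabulary.
    from rdflib import Graph, Namespace, Literal, URIRef
    from rdflib.namespace import RDF, XSD

    MIDI = Namespace("http://example.org/midi#")  # assumed namespace
    g = Graph()
    g.bind("midi", MIDI)

    event = URIRef("http://example.org/performance/1/event/0")
    g.add((event, RDF.type, MIDI.NoteOnEvent))
    g.add((event, MIDI.channel, Literal(0, datatype=XSD.integer)))
    g.add((event, MIDI.note, Literal(60, datatype=XSD.integer)))      # middle C
    g.add((event, MIDI.velocity, Literal(100, datatype=XSD.integer)))
    g.add((event, MIDI.tick, Literal(480, datatype=XSD.integer)))

    print(g.serialize(format="turtle"))
    ```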

    The ethics of interpretation: The signifying chain from field to analysis

    This paper attempts to describe the relationship between the embodied practice of fieldwork and the written articulation of this experience. Starting from Valerie Hey's conceptualisation of 'rapport' as a form of 'intersubjective synergy', a moment of recognition of similarity within difference – similar in structure to Laclau and Mouffe's conceptualisation of hegemony – the paper explores how we can understand these moments of recognition as positioned within a complex web of signifying chains that interlink social, psychic and linguistic means of representation. Laclau and Mouffe's logics of equivalence and difference and Lacan's account of the production of meaning through metaphor and metonymy provide a theoretical language through which to explore chains of meaning in two fragments of data drawn from a study comparing disciplines and institutions in higher education. My argument is that an awareness of these processes of production of meaning is necessary to the development of an ethical mode of interpretation.

    Similarity Measures for Automatic Defect Detection on Patterned Textures

    Similarity measures are widely used in applications such as information retrieval, image and object recognition, text retrieval, and web data search. In this paper, we propose similarity-based methods for defect detection on patterned textures using five different similarity measures, viz., the Normalized Histogram Intersection Coefficient, Bhattacharyya Coefficient, Pearson Product-moment Correlation Coefficient, Jaccard Coefficient and Cosine-angle Coefficient. Periodic blocks are extracted from each input defective image, and a similarity matrix is obtained from the similarity coefficient of the histogram of each periodic block with respect to itself and all other periodic blocks. Each similarity matrix is transformed into a dissimilarity matrix containing true-distance metrics, and Ward's hierarchical clustering is performed to discern between defective and defect-free blocks. Performance of the proposed method is evaluated for each similarity measure based on precision, recall and accuracy for various real fabric images with defects such as broken end, hole, thin bar, thick bar, netting multiple, knot, and missing pick.
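
    A condensed sketch of the pipeline described above, using the Bhattacharyya coefficient as the example measure; the histogram size, the distance transform and the two-cluster cut are illustrative assumptions rather than the paper's exact settings:

    ```python
    # Illustrative sketch: histogram similarity between periodic blocks followed by
    # Ward's hierarchical clustering to separate defective from defect-free blocks.
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def histogram(block, bins=32):
        h, _ = np.histogram(block, bins=bins, range=(0, 255))
        return h / max(h.sum(), 1)

    def bhattacharyya(p, q):
        return float(np.sum(np.sqrt(p * q)))          # similarity in [0, 1]

    def cluster_blocks(blocks):
        hists = [histogram(b) for b in blocks]
        n = len(hists)
        sim = np.array([[bhattacharyya(hists[i], hists[j]) for j in range(n)]
                        for i in range(n)])
        dist = np.sqrt(np.clip(1.0 - sim, 0.0, None))  # dissimilarity transform
        np.fill_diagonal(dist, 0.0)
        Z = linkage(squareform(dist, checks=False), method="ward")
        return fcluster(Z, t=2, criterion="maxclust")  # two groups of blocks

    # blocks = [np.ndarray, ...]  # periodic blocks extracted from the image
    # labels = cluster_blocks(blocks)
    ```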

    Automatic Meaning Discovery Using Google

    We survey a new area of parameter-free similarity distance measures useful in data mining, pattern recognition, learning and automatic semantics extraction. Given a family of distances on a set of objects, a distance is universal up to a certain precision for that family if it minorizes every distance in the family between every two objects in the set, up to the stated precision (we do not require the universal distance to be an element of the family). We consider similarity distances for two types of objects: literal objects that as such contain all of their meaning, like genomes or books, and names for objects. The latter may have literal embodiments like the first type, but may also be abstract, like 'red' or 'christianity'. For the first type we consider a family of computable distance measures corresponding to parameters expressing similarity according to particular features between pairs of literal objects. For the second type we consider similarity distances generated by web users corresponding to particular semantic relations between the (names for the) designated objects. For both families we give universal similarity distance measures, incorporating all particular distance measures in the family. In the first case the universal distance is based on compression and in the second case it is based on Google page counts related to search terms. In both cases experiments on a massive scale give evidence of the viability of the approaches.
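
    The web-based universal distance mentioned here is usually written as the Normalized Google Distance (NGD). A direct translation of that formula into code is sketched below; the page counts in the example are made up, and obtaining real counts from a search API is left as an assumption:

    ```python
    # Normalized Google Distance between two search terms x and y, computed from
    # page counts f(x), f(y), f(x, y) and index size N. In practice the counts
    # would come from a search engine; the values in the example are invented.
    import math

    def ngd(f_x: float, f_y: float, f_xy: float, n: float) -> float:
        """NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y))
                       / (log N - min(log f(x), log f(y)))"""
        lx, ly, lxy, ln = math.log(f_x), math.log(f_y), math.log(f_xy), math.log(n)
        return (max(lx, ly) - lxy) / (ln - min(lx, ly))

    # Terms that frequently co-occur get a small distance (close to 0);
    # unrelated terms get a distance near 1 or above.
    print(ngd(f_x=1e6, f_y=2e6, f_xy=5e5, n=1e11))
    ```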

    T-MARS: Improving Visual Representations by Circumventing Text Feature Learning

    Large web-sourced multimodal datasets have powered a slew of new methods for learning general-purpose visual representations, advancing the state of the art in computer vision and revolutionizing zero- and few-shot recognition. One crucial decision facing practitioners is how, if at all, to curate these ever-larger datasets. For example, the creators of the LAION-5B dataset chose to retain only image-caption pairs whose CLIP similarity score exceeded a designated threshold. In this paper, we propose a new state-of-the-art data filtering approach motivated by our observation that nearly 40% of LAION's images contain text that overlaps significantly with the caption. Intuitively, such data could be wasteful as it incentivizes models to perform optical character recognition rather than learning visual features. However, naively removing all such data could also be wasteful, as it throws away images that contain visual features (in addition to overlapping text). Our simple and scalable approach, T-MARS (Text Masking and Re-Scoring), filters out only those pairs where the text dominates the remaining visual features -- by first masking out the text and then filtering out those with a low CLIP similarity score of the masked image. Experimentally, T-MARS outperforms the top-ranked method on the "medium scale" of DataComp (a data filtering benchmark) by a margin of 6.5% on ImageNet and 4.7% on VTAB. Additionally, our systematic evaluation on various data pool sizes from 2M to 64M shows that the accuracy gains enjoyed by T-MARS linearly increase as data and compute are scaled exponentially. Code is available at https://github.com/locuslab/T-MARS
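
    A schematic sketch of the mask-then-rescore idea follows; the text detector and the CLIP scorer are passed in as hypothetical placeholders, and this is not the released T-MARS code (see the linked repository for the actual implementation):

    ```python
    # Schematic sketch of text masking + CLIP re-scoring for data filtering.
    # detect_text_boxes() and clip_similarity() are hypothetical placeholders for
    # an OCR-style text detector and an image-text similarity model.
    from PIL import Image, ImageDraw

    def mask_text(image: Image.Image, boxes) -> Image.Image:
        """Paint detected text regions over with a flat fill (assumes an RGB image)."""
        masked = image.copy()
        draw = ImageDraw.Draw(masked)
        for (x0, y0, x1, y1) in boxes:
            draw.rectangle([x0, y0, x1, y1], fill=(127, 127, 127))
        return masked

    def keep_pair(image, caption, detect_text_boxes, clip_similarity, threshold=0.3):
        """Keep an image-caption pair only if, after masking the text, the
        remaining visual content still matches the caption well enough."""
        masked = mask_text(image, detect_text_boxes(image))
        return clip_similarity(masked, caption) >= threshold
    ```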

    New Similarity Measures for Capturing Browsing Interests of Users into Web Usage Profiles

    The essence of web personalization is the adaptability of a website to the needs and interests of individual users. The recognition of user preferences and interests can be based on the knowledge gained from previous interactions of users with the site. Typically, a set of usage profiles is mined from web log data (records of website usage), where each profile models the common browsing interests of a group of like-minded users. These profiles are later utilized to provide personalized recommendations. Clearly, the quality of usage profiles is critical to the performance of a personalization system. When using clustering for web mining, successful clustering of users is a major factor in deriving effective usage profiles. Clustering, in turn, depends on the discriminatory capabilities of the similarity measure used. In this thesis, we first present a new weighted session similarity measure to capture the browsing interests of users in web usage profiles. We base our similarity measure on the reasonable assumption that when users spend longer times on pages or revisit pages in the same session, such pages are very likely of greater interest to the user. The proposed similarity measure combines structural similarity with session-wise page significance. The latter, representing the degree of user interest, is computed using page-access frequency and page-access duration. Web usage profiles are generated by applying a fuzzy clustering algorithm using this measure. To evaluate the effectiveness of the proposed measure, we adapt two model-based collaborative filtering algorithms for recommending pages. Experimental results show considerable improvement in the overall performance of recommender systems as compared to other known similarity measures. Lastly, we propose a modification that replaces structural similarity with concept (content) similarity, which we expect would further enhance recommendation performance.
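
    A toy sketch of one way such a weighted session similarity could be assembled, combining shared-page structure with a per-page significance weight derived from access frequency and duration; the weighting rule below is an illustrative choice, not necessarily the thesis's exact formula:

    ```python
    # Toy sketch: weighted similarity between two user sessions. Each session maps
    # page -> (access_count, total_seconds). Page significance combines normalized
    # frequency and duration; the combination rule is an illustrative assumption.
    import math

    def significance(session):
        max_count = max(c for c, _ in session.values())
        max_time = max(t for _, t in session.values())
        return {page: (c / max_count) * (t / max_time)
                for page, (c, t) in session.items()}

    def session_similarity(s1, s2):
        w1, w2 = significance(s1), significance(s2)
        shared = set(w1) & set(w2)
        if not shared:
            return 0.0
        dot = sum(w1[p] * w2[p] for p in shared)
        norm = math.sqrt(sum(v * v for v in w1.values())) * \
               math.sqrt(sum(v * v for v in w2.values()))
        return dot / norm  # cosine of significance-weighted page vectors

    a = {"/home": (3, 40.0), "/products": (1, 120.0)}
    b = {"/home": (1, 10.0), "/products": (2, 90.0), "/contact": (1, 5.0)}
    print(session_similarity(a, b))
    ```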