Identifying Web Tables - Supporting a Neglected Type of Content on the Web
The abundance of data on the Internet facilitates the improvement of
extraction and processing tools. The trend in open data publishing
encourages the adoption of structured formats like CSV and RDF. However, there
is still a plethora of unstructured data on the Web which we assume to carry
semantics. For this reason, we propose an approach to derive semantics from web
tables, which are still the most popular publishing tool on the Web. The paper
also discusses methods and services for unstructured data extraction and
processing, as well as machine learning techniques to enhance such a workflow.
The eventual result is a framework to process, publish, and visualize linked
open data. The software enables table extraction from various open data
sources in HTML format and automatic export to RDF, making the data linked.
The paper also presents an evaluation of machine learning techniques in
conjunction with string similarity functions applied to a table recognition
task. Comment: 9 pages, 4 figures
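The HTML-table-to-RDF step described above can be sketched with standard-library Python. This is a minimal illustration, not the paper's actual framework: the `http://example.org/` namespace, the row-indexed subject URIs, and the predicate naming are all assumptions made for the example.

```python
# Minimal sketch: extract cells from an HTML table and emit RDF N-Triples.
# The namespace and triple layout are illustrative, not the paper's schema.
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects rows of cell text from <tr>/<td>/<th> elements."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def table_to_ntriples(html, base="http://example.org/"):
    """Treat the first row as a header and each later row as one resource."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    triples = []
    for i, row in enumerate(body):
        subj = f"<{base}row/{i}>"
        for col, value in zip(header, row):
            pred = f"<{base}{col.lower().replace(' ', '_')}>"
            triples.append(f'{subj} {pred} "{value}" .')
    return triples


html = """<table>
<tr><th>City</th><th>Population</th></tr>
<tr><td>Oslo</td><td>709000</td></tr>
</table>"""
for t in table_to_ntriples(html):
    print(t)
```

A production pipeline would typically use a library such as rdflib and attach real vocabulary terms rather than strings derived from header text.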
The Semantic Web MIDI Tape: An Interface for Interlinking MIDI and Context Metadata
The Linked Data paradigm has been used to publish a large number of musical datasets and ontologies on the Semantic Web, such as MusicBrainz, AcousticBrainz, and the Music Ontology. Recently, the MIDI Linked Data Cloud has been added to these datasets, representing more than 300,000 pieces in MIDI format as Linked Data, opening up the possibility of linking fine-grained symbolic music representations to existing music metadata databases. Despite the dataset making MIDI resources available through Web data standards such as RDF and SPARQL, the important issue of finding meaningful links between these MIDI resources and relevant contextual metadata in other datasets remains. A fundamental barrier to the provision and generation of such links is the difficulty users have in adding new MIDI performance data and metadata to the platform. In this paper, we propose the Semantic Web MIDI Tape, a set of tools and an associated interface for interacting with the MIDI Linked Data Cloud by enabling users to record, enrich, and retrieve MIDI performance data and related metadata in native Web data standards. The goal of such interactions is to find meaningful links between published MIDI resources and their relevant contextual metadata. We evaluate the Semantic Web MIDI Tape in various use cases involving user-contributed content, MIDI similarity querying, and entity recognition methods, and discuss their potential for finding links between MIDI resources and metadata.
The ethics of interpretation: The signifying chain from field to analysis
This paper attempts to describe the relationship between the embodied practice of fieldwork and the written articulation of this experience. Starting from Valerie Hey's conceptualisation of 'rapport' as a form of 'intersubjective synergy', a moment of recognition of similarity within difference (similar in structure to Laclau and Mouffe's conceptualisation of hegemony), the paper explores how we can understand these moments of recognition as positioned within a complex web of signifying chains that interlink social, psychic and linguistic means of representation. Laclau and Mouffe's logics of equivalence and difference and Lacan's account of the production of meaning through metaphor and metonymy provide a theoretical language through which to explore chains of meaning in two fragments of data drawn from a study comparing disciplines and institutions in higher education. My argument is that an awareness of these processes of production of meaning is necessary to the development of an ethical mode of interpretation.
Similarity Measures for Automatic Defect Detection on Patterned Textures
Similarity measures are widely used in various applications such as information retrieval, image and object recognition, text retrieval, and web data search. In this paper, we propose similarity-based methods for defect detection on patterned textures using five different similarity measures, viz., Normalized Histogram Intersection Coefficient, Bhattacharyya Coefficient, Pearson Product-moment Correlation Coefficient, Jaccard Coefficient and Cosine-angle Coefficient. Periodic blocks are extracted from each input defective image, and a similarity matrix is obtained based on the similarity coefficient of the histogram of each periodic block with respect to itself and all other periodic blocks. Each similarity matrix is transformed into a dissimilarity matrix containing true-distance metrics, and Ward's hierarchical clustering is performed to discern between defective and defect-free blocks. Performance of the proposed method is evaluated for each similarity measure based on precision, recall and accuracy for various real fabric images with defects such as broken end, hole, thin bar, thick bar, netting multiple, knot, and missing pick.
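Four of the five histogram similarity measures named above have simple closed forms and can be sketched directly (Pearson correlation is omitted for brevity). The toy histograms below are illustrative; in the paper these would be histograms of periodic image blocks, and each similarity s would then be converted to a dissimilarity such as 1 - s before clustering.

```python
# Sketch of four of the five similarity measures, applied to normalized
# histograms p and q (each summing to 1). Toy inputs, not fabric-image data.
import math


def histogram_intersection(p, q):
    # Overlap of the two distributions; 1.0 iff identical.
    return sum(min(a, b) for a, b in zip(p, q))


def bhattacharyya(p, q):
    # Sum of sqrt(p_i * q_i); 1.0 iff identical normalized histograms.
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def jaccard(p, q):
    # Ratio of elementwise minima to elementwise maxima.
    return sum(min(a, b) for a, b in zip(p, q)) / sum(max(a, b) for a, b in zip(p, q))


def cosine_angle(p, q):
    # Cosine of the angle between the histograms viewed as vectors.
    dot = sum(a * b for a, b in zip(p, q))
    np_ = math.sqrt(sum(a * a for a in p))
    nq = math.sqrt(sum(b * b for b in q))
    return dot / (np_ * nq)


# Two toy 4-bin histograms, already normalized.
p = [0.25, 0.25, 0.25, 0.25]
q = [0.40, 0.30, 0.20, 0.10]
for f in (histogram_intersection, bhattacharyya, jaccard, cosine_angle):
    print(f.__name__, round(f(p, q), 3))
```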
Automatic Meaning Discovery Using Google
We survey a new area of parameter-free similarity distance measures
useful in data-mining,
pattern recognition, learning and automatic semantics extraction.
Given a family of distances on a set of objects,
a distance is universal up to a certain precision for that family if it
minorizes every distance in the family between every two objects
in the set, up to the stated precision (we do not require the universal
distance to be an element of the family).
We consider similarity distances
for two types of objects: literal objects that as such contain all of their
meaning, like genomes or books, and names for objects.
The latter may have
literal embodiments like the first type, but may also
be abstract, like "red" or "christianity." For the first type
we consider
a family of computable distance measures
corresponding to parameters expressing similarity according to
particular features
between
pairs of literal objects. For the second type we consider similarity
distances generated by web users corresponding to particular semantic
relations between the (names for) the designated objects.
For both families we give universal similarity
distance measures, incorporating all particular distance measures
in the family. In the first case the universal
distance is based on compression and in the second
case it is based on Google page counts related to search terms.
In both cases experiments on a massive scale give evidence of the
viability of the approaches.
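The two universal distances described above can be sketched concretely. For literal objects, the compression-based distance is the Normalized Compression Distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)); for names, the Normalized Google Distance is computed from page counts. In this sketch zlib stands in for an ideal compressor, and the page counts fed to NGD are hand-made numbers, not live search results.

```python
# Sketch of NCD (compression-based) and NGD (page-count-based) distances.
# zlib approximates an ideal compressor; the NGD counts are hand-made.
import math
import zlib


def C(x: bytes) -> int:
    """Compressed length of x under zlib at maximum compression level."""
    return len(zlib.compress(x, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two literal objects."""
    cx, cy, cxy = C(x), C(y), C(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def ngd(fx: int, fy: int, fxy: int, n: int) -> float:
    """Normalized Google Distance from page counts.

    fx, fy: counts of pages containing each term; fxy: pages containing
    both; n: total number of indexed pages.
    """
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


a = b"the quick brown fox jumps over the lazy dog" * 20
b_ = b"the quick brown fox jumps over the lazy cat" * 20
c = b"colorless green ideas sleep furiously tonight" * 20
# Near-duplicate texts should be closer under NCD than unrelated texts.
print(ncd(a, b_), ncd(a, c))
```

Note that zlib's fixed overhead keeps NCD(x, x) slightly above zero; the theory assumes a compressor closer to the (uncomputable) Kolmogorov ideal.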
T-MARS: Improving Visual Representations by Circumventing Text Feature Learning
Large web-sourced multimodal datasets have powered a slew of new methods for
learning general-purpose visual representations, advancing the state of the art
in computer vision and revolutionizing zero- and few-shot recognition. One
crucial decision facing practitioners is how, if at all, to curate these
ever-larger datasets. For example, the creators of the LAION-5B dataset chose
to retain only image-caption pairs whose CLIP similarity score exceeded a
designated threshold. In this paper, we propose a new state-of-the-art data
filtering approach motivated by our observation that nearly 40% of LAION's
images contain text that overlaps significantly with the caption. Intuitively,
such data could be wasteful as it incentivizes models to perform optical
character recognition rather than learning visual features. However, naively
removing all such data could also be wasteful, as it throws away images that
contain visual features (in addition to overlapping text). Our simple and
scalable approach, T-MARS (Text Masking and Re-Scoring), filters out only those
pairs where the text dominates the remaining visual features -- by first
masking out the text and then filtering out those with a low CLIP similarity
score of the masked image. Experimentally, T-MARS outperforms the top-ranked
method on the "medium scale" of DataComp (a data filtering benchmark) by a
margin of 6.5% on ImageNet and 4.7% on VTAB. Additionally, our systematic
evaluation on various data pool sizes from 2M to 64M shows that the accuracy
gains enjoyed by T-MARS linearly increase as data and compute are scaled
exponentially. Code is available at https://github.com/locuslab/T-MARS
New Similarity Measures for Capturing Browsing Interests of Users into Web Usage Profiles
The essence of web personalization is the adaptability of a website to the needs and interests of individual users. The recognition of user preferences and interests can be based on the knowledge gained from previous interactions of users
with the site. Typically, a set of usage profiles is mined from web log data (records of website usage), where each profile models common browsing interests of a group of like-minded users. These profiles are later utilized to provide
personalized recommendations. Clearly, the quality of usage profiles is critical to the performance of a personalization system. When using clustering for web mining, successful clustering of users is a major factor in deriving effective usage profiles. Clustering depends on the discriminatory capabilities of the similarity measure used. In this thesis, we first present a new weighted session similarity measure to capture the browsing interests of users into web usage profiles. We base our similarity measure on the reasonable assumption that when users spend longer times on pages or revisit pages in the same session, then very likely, such pages are of greater interest to the user. The proposed similarity measure combines structural similarity with session-wise page significance. The latter, representing the degree of user interest, is computed using page-access frequency and page-access duration. Web usage profiles are generated by applying a fuzzy clustering algorithm using this measure. For evaluating the effectiveness of the
proposed measure, we adapt two model-based collaborative filtering algorithms for recommending pages. Experimental results show considerable improvement in the overall performance of recommender systems as compared to other known similarity measures. Lastly, we propose a modification that replaces structural similarity with concept (content) similarity, which we expect to further enhance recommendation performance.
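One plausible reading of the weighted session similarity described above can be sketched as follows: each page's significance within a session is its access frequency times its total access duration, and two sessions are compared by the cosine similarity of their significance vectors. This is an illustration of the idea, not the thesis's exact formulation, which additionally combines in structural similarity.

```python
# Sketch: session similarity from page significance (frequency * duration).
# An illustrative reading of the abstract, not the thesis's exact measure.
import math


def significance(session):
    """session: list of (page, duration_seconds) visits within one session."""
    freq, dur = {}, {}
    for page, seconds in session:
        freq[page] = freq.get(page, 0) + 1
        dur[page] = dur.get(page, 0) + seconds
    # Revisited or long-dwelled pages get proportionally larger weight.
    return {p: freq[p] * dur[p] for p in freq}


def session_similarity(s1, s2):
    """Cosine similarity between the two sessions' significance vectors."""
    v1, v2 = significance(s1), significance(s2)
    pages = set(v1) | set(v2)
    dot = sum(v1.get(p, 0) * v2.get(p, 0) for p in pages)
    n1 = math.sqrt(sum(x * x for x in v1.values()))
    n2 = math.sqrt(sum(x * x for x in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


a = [("home", 5), ("products", 40), ("products", 60)]  # revisit -> higher weight
b = [("home", 4), ("products", 55)]
c = [("about", 30), ("careers", 20)]
print(session_similarity(a, b), session_similarity(a, c))
```

Sessions sharing the same heavily weighted pages (a and b) score high; sessions with disjoint pages (a and c) score zero, which is the discriminatory behavior the clustering step depends on.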
- …