Extracting semantic entities and events from sports tweets
Large volumes of user-generated content on practically every major issue and event are being created on the microblogging site Twitter. This content can be combined and processed to detect events, entities and popular moods to feed various knowledge-intensive practical applications. On the downside, these content items are very noisy and highly informal, making it difficult to make sense of the stream. In this paper, we exploit various approaches to detect the named entities and significant micro-events from users’ tweets during a live sports event. Here we describe how combining linguistic features with background knowledge and the use of Twitter-specific features can achieve high detection performance (F-measure = 87%) across different datasets. A study was conducted on tweets from cricket matches in the ICC World Cup in order to augment the event-related non-textual media with collective intelligence.
Crowdsourcing Linked Data on listening experiences through reuse and enhancement of library data
Research has approached the practice of musical reception in a multitude of ways, such as the analysis of professional critique, sales figures and psychological processes activated by the act of listening. Studies in the Humanities, on the other hand, have been hindered by the lack of structured evidence of actual experiences of listening as reported by the listeners themselves, a concern that has been voiced since the early Web era. It was, however, assumed that such evidence existed, albeit in purely textual form, but could not be leveraged until it was digitised and aggregated. The Listening Experience Database (LED) responds to this research need by providing a centralised hub for evidence of listening in the literature. Not only does LED support search and reuse across nearly 10,000 records, but it also provides machine-readable structured data of the knowledge around the contexts of listening. To take advantage of the mass of formal knowledge that already exists on the Web concerning these contexts, the entire framework adopts Linked Data principles and technologies. This also allows LED to directly reuse open data from the British Library for the source documentation that is already published. Reused data are re-published as open data with enhancements obtained by extending the model of the original data, such as the partitioning of published books and collections into individual stand-alone documents. The database was populated through crowdsourcing and seamlessly incorporates data reuse from the very early data entry phases. As the sources of the evidence often contain vague, fragmentary or uncertain information, facilities were put in place to generate structured data out of such fuzziness.
Alongside elaborating on these functionalities, this article provides insights into the most recent features of the latest instalment of the dataset and portal, such as the interlinking with the MusicBrainz database, the relaxation of geographical input constraints through text mining, and the plotting of key locations in an interactive geographical browser.
A crowdsourcing recommendation model for image annotations in cultural heritage platforms
Cultural heritage is one of many fields that have seen a significant digital transformation in the form of digitization and asset annotations for heritage preservation, inheritance, and dissemination. However, a lack of accurate and descriptive metadata in this field reduces the usability and discoverability of digital content, leaving cultural heritage platform visitors with an unsatisfactory user experience and limiting the processing capabilities needed to add new functionalities. Traditionally, cultural heritage institutions have been responsible for providing metadata for their collection items with the help of professionals, which is expensive and requires significant effort and time. In this sense, crowdsourcing can play a significant role in digital transformation or massive data processing, which can be useful for leveraging the crowd and enriching the metadata quality of digital cultural content. This paper focuses on a key challenge faced by cultural heritage crowdsourcing platforms: how to attract users and make such activities enjoyable for them in order to achieve higher-quality annotations. One way to address this is to offer personalized, interesting items based on each user's preferences, rather than making the user experience random and demanding. Thus, we present an image annotation recommendation system for users of cultural heritage platforms. The recommendation system design incorporates various technologies intended to help users select the best matching images for annotation based on their interests and characteristics. Different classification methods were implemented to validate the accuracy of our work on Egyptian heritage.
Agencia Estatal de Investigación | Ref. TIN2017-87604-R
Xunta de Galicia | Ref. ED431B 2020/3
A distributional and syntactic approach to fine-grained opinion mining
This thesis contributes to a larger social science research program of
analyzing the diffusion of IT innovations. We show how to
automatically discriminate portions of text dealing with opinions
about innovations by finding {source, target, opinion} triples in text.
In this context, we can discern a list of innovations as targets from
the domain itself. We can then use this list as an anchor for finding
the other two members of the triple at a "fine-grained"
level, that is, paragraph contexts or less.
We first demonstrate a vector space model for finding opinionated
contexts in which the innovation targets are mentioned. We can find
paragraph-level contexts by searching for an
"expresses-an-opinion-about" relation between sources and targets
using a supervised model with an SVM that uses features derived from a
general-purpose subjectivity lexicon and a corpus indexing tool. We
show that our algorithm correctly filters the domain relevant subset
of subjectivity terms so that they are more highly valued.
We then turn to identifying the opinion. Typically, opinions in
opinion mining are taken to be positive or negative. We discuss a
crowd sourcing technique developed to create the seed data describing
human perception of opinion bearing language needed for our supervised
learning algorithm. Our user interface successfully limited the
meta-subjectivity inherent in the task ("What is an opinion?") while
reliably retrieving relevant opinionated words using workers who are
not experts in the domain.
Finally, we developed a new data structure and modeling technique for
connecting targets with the correct within-sentence opinionated
language. Syntactic relatedness tries (SRTs) contain all paths from a
dependency graph of a sentence that connect a target expression to a
candidate opinionated word. We use factor graphs to model how far a
path through the SRT must be followed in order to connect the right
targets to the right words. It turns out that we can correctly label
significant portions of these tries with very rudimentary features
such as part-of-speech tags and dependency labels with minimal
processing. This technique uses the data from the crowdsourcing
technique we developed as training data.
We conclude by placing our work in the context of a larger sentiment
classification pipeline and by describing a model for learning from
the data structures produced by our work. This work contributes to
computational linguistics by proposing and verifying new data
gathering techniques and applying recent developments in machine
learning to inference over grammatical structures for highly
subjective purposes. It applies a suffix tree-based data structure to
model opinion in a specific domain by imposing a restriction on the
order in which the data is stored in the structure.
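The syntactic relatedness trie described in this abstract can be pictured with a small sketch. The Python below is only an illustration under stated assumptions: the example sentence, its dependency edges, the labels, and the helper names `find_path` and `build_trie` are all hypothetical, words are treated as unique node identifiers, and the factor-graph labelling step from the thesis is not modelled. It shows how paths from one target word to several candidate words can share prefixes in a trie.

```python
from collections import deque


def find_path(edges, start, goal):
    """Breadth-first search for the path of (label, direction, word) steps
    from start to goal, treating dependency edges as traversable both ways."""
    adjacency = {}
    for head, label, dependent in edges:
        adjacency.setdefault(head, []).append((label, "down", dependent))
        adjacency.setdefault(dependent, []).append((label, "up", head))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for label, direction, neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, path + [(label, direction, neighbour)]))
    return None  # goal is not connected to start


def build_trie(edges, target, candidates):
    """Insert the path from the target to each candidate word into a trie,
    so that paths with a common prefix share nodes."""
    root = {"children": {}, "candidate": False}
    for word in candidates:
        path = find_path(edges, target, word)
        if path is None:
            continue
        node = root
        for step in path:
            node = node["children"].setdefault(
                step, {"children": {}, "candidate": False}
            )
        node["candidate"] = True  # this node ends at a candidate word
    return root


# Hypothetical parse of "Analysts praised the new cloud platform",
# written as (head, dependency label, dependent) triples.
edges = [
    ("praised", "nsubj", "Analysts"),
    ("praised", "dobj", "platform"),
    ("platform", "det", "the"),
    ("platform", "amod", "new"),
    ("platform", "compound", "cloud"),
]
trie = build_trie(edges, "platform", ["praised", "new", "Analysts"])
```

In this toy example the paths to "praised" and "Analysts" share their first step, so they occupy one branch of the trie, while "new" sits on a separate branch.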
EXP-Crowd: A Gamified Crowdsourcing Framework for Explainability
The spread of AI and black-box machine learning models has made it necessary to explain their behavior. Consequently, the research field of Explainable AI was born. The main objective of an Explainable AI system is to be understood by a human as the final beneficiary of the model. In our research, we frame the explainability problem from the crowd's point of view and engage both users and AI researchers through a gamified crowdsourcing framework named EXP-Crowd. We investigate whether it is possible to improve the crowd's understanding of black-box models and the quality of the crowdsourced content by engaging users in a set of gamified activities. While users engage in such activities, AI researchers organize and share AI- and explainability-related knowledge to educate users. We present the preliminary design of a game with a purpose (G.W.A.P.) to collect features describing real-world entities which can be used for explainability purposes. Future work will concretise and improve the current design of the framework to cover specific explainability-related needs.
Empirical Methodology for Crowdsourcing Ground Truth
The process of gathering ground truth data through human annotation is a
major bottleneck in the use of information extraction methods for populating
the Semantic Web. Crowdsourcing-based approaches are gaining popularity in the
attempt to solve the issues related to volume of data and lack of annotators.
Typically these practices use inter-annotator agreement as a measure of
quality. However, in many domains, such as event detection, there is ambiguity
in the data, as well as a multitude of perspectives of the information
examples. We present an empirically derived methodology for efficiently
gathering ground truth data in a diverse set of use cases covering a variety
of domains and annotation tasks. Central to our approach is the use of
CrowdTruth metrics that capture inter-annotator disagreement. We show that
measuring disagreement is essential for acquiring a high quality ground truth.
We achieve this by comparing the quality of the data aggregated with CrowdTruth
metrics with majority vote, over a set of diverse crowdsourcing tasks: Medical
Relation Extraction, Twitter Event Identification, News Event Extraction and
Sound Interpretation. We also show that an increased number of crowd workers
leads to growth and stabilization in the quality of annotations, going against
the usual practice of employing a small number of annotators.
Comment: in publication at the Semantic Web Journal
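To make the contrast between majority vote and a disagreement-aware score concrete, here is a minimal Python sketch. It is not the CrowdTruth metrics as published: the cosine-style score, the function names and the toy labels are assumptions chosen for illustration. Each label is scored by its count divided by the Euclidean norm of the count vector, so the score drops as workers spread across labels.

```python
import math
from collections import Counter


def majority_vote(labels):
    """Return the single most frequent label, discarding all disagreement."""
    return Counter(labels).most_common(1)[0][0]


def annotation_scores(labels):
    """Score each label by the cosine between the worker count vector and
    that label's axis: count / sqrt(sum of squared counts)."""
    counts = Counter(labels)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {label: c / norm for label, c in counts.items()}


# Two hypothetical units, each annotated by ten workers.
clear = ["event"] * 9 + ["no_event"]
ambiguous = ["event"] * 6 + ["no_event"] * 4

clear_scores = annotation_scores(clear)
ambiguous_scores = annotation_scores(ambiguous)
```

Majority vote returns "event" for both units, whereas the score for "event" is lower on the second unit, which keeps the ambiguity visible instead of hiding it.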
Entity Recognition at First Sight: Improving NER with Eye Movement Information
Previous research shows that eye-tracking data contains information about the
lexical and syntactic properties of text, which can be used to improve natural
language processing models. In this work, we leverage eye movement features
from three corpora with recorded gaze information to augment a state-of-the-art
neural model for named entity recognition (NER) with gaze embeddings. These
corpora were manually annotated with named entity labels. Moreover, we show how
gaze features, generalized on word type level, eliminate the need for recorded
eye-tracking data at test time. The gaze-augmented models for NER using
token-level and type-level features outperform the baselines. We present the
benefits of eye-tracking features by evaluating the NER models both on
individual datasets and in cross-domain settings.
Comment: Accepted at NAACL-HLT 201
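One plausible reading of "generalized on word type level" is that token-level gaze measurements are averaged per word type into a lexicon, which can then be consulted at test time without an eye tracker. The Python sketch below illustrates that idea only; the feature names (first fixation duration in milliseconds, number of fixations), the values and the helper names are hypothetical, and the paper's actual aggregation may differ.

```python
from collections import defaultdict


def type_level_features(tokens):
    """Average token-level gaze feature vectors per word type.

    tokens: iterable of (word, feature_vector) pairs from a gaze corpus.
    """
    sums = {}
    counts = defaultdict(int)
    for word, features in tokens:
        if word not in sums:
            sums[word] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[word][i] += value
        counts[word] += 1
    return {
        word: [total / counts[word] for total in vector]
        for word, vector in sums.items()
    }


def lookup(lexicon, word, dim):
    """Return the averaged features for a word, or zeros if it was unseen."""
    return lexicon.get(word, [0.0] * dim)


# Hypothetical recordings: (first fixation duration in ms, fixation count).
recorded = [
    ("London", [200.0, 1.0]),
    ("London", [300.0, 2.0]),
    ("the", [100.0, 1.0]),
]
lexicon = type_level_features(recorded)
```

At test time, `lookup(lexicon, "London", 2)` returns the averaged vector, and a word absent from the gaze corpora falls back to a zero vector.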