3,124 research outputs found
Event-based Access to Historical Italian War Memoirs
The progressive digitization of historical archives provides new, often
domain specific, textual resources that report on facts and events which have
happened in the past; among these, memoirs are a very common type of primary
source. In this paper, we present an approach for extracting information from
Italian historical war memoirs and turning it into structured knowledge. This
is based on the semantic notions of events, participants and roles. We evaluate
quantitatively each of the key-steps of our approach and provide a graph-based
representation of the extracted knowledge, which allows to move between a Close
and a Distant Reading of the collection.Comment: 23 pages, 6 figure
An examination of automatic video retrieval technology on access to the contents of an historical video archive
Purpose – This paper aims to provide an initial understanding of the constraints that historical video collections pose to video retrieval technology and the potential that online access offers to both archive and users.
Design/methodology/approach – A small and unique collection of videos on customs and folklore was used as a case study. Multiple methods were employed to investigate the effectiveness of technology and the modality of user access. Automatic keyframe extraction was tested on the visual content while the audio stream was used for automatic classification of speech and music clips. The user access (search vs browse) was assessed in a controlled user evaluation. A focus group and a survey provided insight on the actual use of the analogue archive. The results of these multiple studies were then compared and integrated (triangulation).
Findings – The amateur material challenged automatic techniques for video and audio indexing, thus suggesting that the technology must be tested against the material before deciding on a digitisation strategy. Two user interaction modalities, browsing vs searching, were tested in a user evaluation. Results show users preferred searching, but browsing becomes essential when the search engine fails in matching query and indexed words. Browsing was also valued for serendipitous discovery; however the organisation of the archive was judged cryptic and therefore of limited use. This indicates that the categorisation of an online archive should be thought of in terms of users who might not understand the current classification. The focus group and the survey showed clearly the advantage of online access even when the quality of the video surrogate is poor. The evidence gathered suggests that the creation of a digital version of a video archive requires a rethinking of the collection in terms of the new medium: a new archive should be specially designed to exploit the potential that the digital medium offers. Similarly, users' needs have to be considered before designing the digital library interface, as needs are likely to be different from those imagined.
Originality/value – This paper is the first attempt to understand the advantages offered and limitations held by video retrieval technology for small video archives like those often found in special collections
Analyzing collaborative learning processes automatically
In this article we describe the emerging area of text classification research focused on the problem of collaborative learning process analysis both from a broad perspective and more specifically in terms of a publicly available tool set called TagHelper tools. Analyzing the variety of pedagogically valuable facets of learners’ interactions is a time consuming and effortful process. Improving automated analyses of such highly valued processes of collaborative learning by adapting and applying recent text classification technologies would make it a less arduous task to obtain insights from corpus data. This endeavor also holds the potential for enabling substantially improved on-line instruction both by providing teachers and facilitators with reports about the groups they are moderating and by triggering context sensitive collaborative learning support on an as-needed basis. In this article, we report on an interdisciplinary research project, which has been investigating the effectiveness of applying text classification technology to a large CSCL corpus that has been analyzed by human coders using a theory-based multidimensional coding scheme. We report promising results and include an in-depth discussion of important issues such as reliability, validity, and efficiency that should be considered when deciding on the appropriateness of adopting a new technology such as TagHelper tools. One major technical contribution of this work is a demonstration that an important piece of the work towards making text classification technology effective for this purpose is designing and building linguistic pattern detectors, otherwise known as features, that can be extracted reliably from texts and that have high predictive power for the categories of discourse actions that the CSCL community is interested in
Current Limitations in Cyberbullying Detection: on Evaluation Criteria, Reproducibility, and Data Scarcity
The detection of online cyberbullying has seen an increase in societal
importance, popularity in research, and available open data. Nevertheless,
while computational power and affordability of resources continue to increase,
the access restrictions on high-quality data limit the applicability of
state-of-the-art techniques. Consequently, much of the recent research uses
small, heterogeneous datasets, without a thorough evaluation of applicability.
In this paper, we further illustrate these issues, as we (i) evaluate many
publicly available resources for this task and demonstrate difficulties with
data collection. These predominantly yield small datasets that fail to capture
the required complex social dynamics and impede direct comparison of progress.
We (ii) conduct an extensive set of experiments that indicate a general lack of
cross-domain generalization of classifiers trained on these sources, and openly
provide this framework to replicate and extend our evaluation criteria.
Finally, we (iii) present an effective crowdsourcing method: simulating
real-life bullying scenarios in a lab setting generates plausible data that can
be effectively used to enrich real data. This largely circumvents the
restrictions on data that can be collected, and increases classifier
performance. We believe these contributions can aid in improving the empirical
practices of future research in the field
Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers
The massive amounts of digitized historical documents acquired over the last
decades naturally lend themselves to automatic processing and exploration.
Research work seeking to automatically process facsimiles and extract
information thereby are multiplying with, as a first essential step, document
layout analysis. If the identification and categorization of segments of
interest in document images have seen significant progress over the last years
thanks to deep learning techniques, many challenges remain with, among others,
the use of finer-grained segmentation typologies and the consideration of
complex, heterogeneous documents such as historical newspapers. Besides, most
approaches consider visual features only, ignoring textual signal. In this
context, we introduce a multimodal approach for the semantic segmentation of
historical newspapers that combines visual and textual features. Based on a
series of experiments on diachronic Swiss and Luxembourgish newspapers, we
investigate, among others, the predictive power of visual and textual features
and their capacity to generalize across time and sources. Results show
consistent improvement of multimodal models in comparison to a strong visual
baseline, as well as better robustness to high material variance
A Large-Scale Comparative Study of Accurate COVID-19 Information versus Misinformation
The COVID-19 pandemic led to an infodemic where an overwhelming amount of
COVID-19 related content was being disseminated at high velocity through social
media. This made it challenging for citizens to differentiate between accurate
and inaccurate information about COVID-19. This motivated us to carry out a
comparative study of the characteristics of COVID-19 misinformation versus
those of accurate COVID-19 information through a large-scale computational
analysis of over 242 million tweets. The study makes comparisons alongside four
key aspects: 1) the distribution of topics, 2) the live status of tweets, 3)
language analysis and 4) the spreading power over time. An added contribution
of this study is the creation of a COVID-19 misinformation classification
dataset. Finally, we demonstrate that this new dataset helps improve
misinformation classification by more than 9% based on average F1 measure
Categorical relevance judgment
In this study we aim to explore users' behaviour when assessing search results relevance based on the hypothesis of categorical thinking. In order to investigate how users categorise search engine results, we perform several experiments where users are asked to group a list of 20 search results into a number of categories, while attaching a relevance judgment to each formed category. Moreover, to determine how users change their minds over time, each experiment was repeated three times under the same conditions, with a gap of one month between rounds. The results show that on average users form 4-5 categories. Within each round the size of a category decreases with the relevance of a category. To measure the agreement between the search engine’s ranking and the users’ relevance judgments, we defined two novel similarity measures, the average concordance and the MinMax swap ratio. Similarity is shown to be the highest for the third round as the users' opinion stabilises. Qualitative analysis uncovered some interesting points, in particular, that users tended to categorise results by type and reliability of their source, and particularly, found commercial sites less trustworthy, and attached high relevance to Wikipedia when their prior domain knowledge was limited
- …