567 research outputs found
When Automated Assessment Meets Automated Content Generation: Examining Text Quality in the Era of GPTs
The use of machine learning (ML) models to assess and score textual data has
become increasingly pervasive in an array of contexts including natural
language processing, information retrieval, search and recommendation, and
credibility assessment of online content. A significant disruption at the
intersection of ML and text are text-generating large-language models such as
generative pre-trained transformers (GPTs). We empirically assess the
differences in how ML-based scoring models trained on human content assess the
quality of content generated by humans versus GPTs. To do so, we propose an
analysis framework that encompasses essay scoring ML-models, human and
ML-generated essays, and a statistical model that parsimoniously considers the
impact of respondent type, prompt genre, and the ML model used for
assessment. We utilize a rich testbed encompassing 18,460
human-generated and GPT-based essays. Results of our benchmark analysis reveal
that transformer pretrained language models (PLMs) more accurately score human
essay quality as compared to CNN/RNN and feature-based ML methods.
Interestingly, we find that the transformer PLMs tend to score GPT-generated
text 10-15% higher on average, relative to human-authored documents.
Conversely, traditional deep learning and feature-based ML models score human
text considerably higher. Further analysis reveals that although the
transformer PLMs are exclusively fine-tuned on human text, they more
prominently attend to certain tokens appearing only in GPT-generated text,
possibly due to familiarity/overlap in pre-training. Our framework and results
have implications for text classification settings where automated scoring of
text is likely to be disrupted by generative AI.
Comment: Data available at: https://github.com/nd-hal/automated-ML-scoring-versus-generatio
Tainted love: a systematic literature review of online romance scam research
Romance scams involve cybercriminals engineering a romantic relationship on online dating platforms for monetary gain. It is a cruel form of cybercrime whereby victims are left heartbroken, often facing financial ruin. We characterize the literary landscape on romance scams, advancing the understanding of researchers and practitioners by systematically reviewing and synthesizing contemporary qualitative and quantitative evidence. The systematic review establishes influencing factors of victimhood and explores countermeasures for mitigating romance scams. We searched 10 scholarly databases and websites using terms related to romance scams. The methodology followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines: a total of 279 papers were screened, 107 were assessed for eligibility, and 53 were included in the final analysis. Three main contributions were identified: common profile features and techniques used by romance scammers, countermeasures for mitigating romance scams, and factors predisposing an individual to become a scammer or a victim. Despite a growing corpus of literature, the total number of empirical or experimental examinations remains limited. The paper concludes with avenues for future research and victimhood intervention strategies for practitioners, law enforcement, and industry.
An Evolutionary Approach to Bibliographic Classification
This dissertation is research in the domain of information science and, specifically, the organization and representation of information. The research has implications for the classification of scientific books, especially as dissemination of information becomes more rapid and science becomes more diverse due to increases in multi-, inter-, and trans-disciplinary research, which focuses on phenomena, in contrast to traditional library classification schemes based on disciplines. The literature review indicates 1) human socio-cultural groups have many of the same properties as biological species, 2) output from human socio-cultural groups can be and has been the subject of evolutionary relationship analyses (i.e., phylogenetics), 3) library and information science theorists believe the most favorable and scientific classification for information packages is one based on common origin, but 4) library and information science classification researchers have not demonstrated a book classification based on evolutionary relationships of common origin. The research project supports the assertion that a sensible book classification method can be developed using a contemporary biological classification approach based on common origin, which has not been applied to a collection of books until now. Using a sample from a collection of earth-science digitized books, the method developed includes a text-mining step to extract important terms, which were converted into a dataset for input into the second step, the phylogenetic analysis. Three classification trees were produced and are discussed. Parsimony analysis, in contrast to distance and likelihood analyses, produced a sensible book classification tree.
Also included is a comparison with a classification tree based on a well-known contemporary library classification scheme (the Library of Congress Classification). Final discussions connect this research with knowledge organization and information retrieval, information needs beyond science, and this type of research in the context of a unified science of cultural evolution.
An explainable recommender system based on semantically-aware matrix factorization.
Collaborative filtering techniques provide the ability to handle big and sparse data to predict the ratings for unseen items with high accuracy. Matrix factorization is an accurate collaborative filtering method used to predict user preferences. However, it is a black box system that recommends items to users without being able to explain why. This is due to the type of information these systems use to build models. Although rich in information, user ratings do not adequately satisfy the need for explanation in certain domains. White box systems, in contrast, can, by nature, easily generate explanations. However, their predictions are less accurate than those of sophisticated black box models. Recent research has demonstrated that explanations are an essential component in bringing the powerful predictions of big data and machine learning methods to a mass audience without a compromise in trust. Explanations can take a variety of formats, depending on the recommendation domain and the machine learning model used to make predictions. Semantic Web (SW) technologies have been exploited increasingly in recommender systems in recent years. The SW consists of knowledge graphs (KGs) providing valuable information that can help improve the performance of recommender systems. Yet KGs have not been used to explain recommendations in black box systems. In this dissertation, we exploit the power of the SW to build new explainable recommender systems. We use the SW's rich expressive power of linked data, along with structured information search and understanding tools, to explain predictions. More specifically, we take advantage of semantic data to learn a semantically aware latent space of users and items in the matrix factorization model-learning process to build richer, explainable recommendation models.
Our off-line and on-line evaluation experiments show that our approach achieves accurate predictions with the additional ability to explain recommendations, in comparison to baseline approaches. By fostering explainability, we hope that our work contributes to more transparent, ethical machine learning without sacrificing accuracy.
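At the core of such systems is the matrix factorization step itself. The following is a minimal, illustrative sketch of the plain (non-semantic) model: a toy rating matrix is factored into user and item latent factors by gradient descent on the observed entries. The matrix values, learning rate, regularization weight, and latent dimension are invented for the example and are not the dissertation's settings.

```python
import numpy as np

# Toy ratings matrix (0 = unobserved): 4 users x 3 items.
R = np.array([[5, 3, 0],
              [4, 0, 1],
              [1, 1, 5],
              [0, 1, 4]], dtype=float)
mask = R > 0

rng = np.random.default_rng(0)
k = 2                                           # latent dimensions
P = 0.1 * rng.standard_normal((R.shape[0], k))  # user factors
Q = 0.1 * rng.standard_normal((R.shape[1], k))  # item factors

lr, reg = 0.01, 0.02
for _ in range(5000):
    E = mask * (R - P @ Q.T)        # error on observed entries only
    P += lr * (E @ Q - reg * P)     # gradient step for user factors
    Q += lr * (E.T @ P - reg * Q)   # gradient step for item factors

pred = P @ Q.T                      # filled-in matrix; zeros become predictions
```

The predicted values at the originally unobserved positions are the recommendation scores; a semantically-aware variant would additionally shape `P` and `Q` using linked-data features, which is what enables the explanations described above.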
The anatomy of a search and mining system for digital humanities: Search And Mining Tools for Language Archives (SAMTLA)
Humanities researchers are faced with an overwhelming volume of digitised
primary source material, and "born digital" information, of relevance to their
research as a result of large-scale digitisation projects. The current digital tools
do not provide consistent support for analysing the content of digital archives
that are potentially large in scale, multilingual, and come in a range of data
formats. The current language-dependent or project-specific approach to tool
development often puts the tools out of reach for many research disciplines in
the humanities. In addition, the tools can be incompatible with the way
researchers locate and compare the relevant sources. For instance, researchers
are interested in shared structural text patterns, known as "parallel passages",
that describe a specific cultural, social, or historical context relevant to their
research topic. Identifying these shared structural text patterns is challenging
due to their repeated yet highly variable nature, as a result of differences in
the domain, author, language, time period, and orthography.
The contribution of the thesis is a novel infrastructure that directly addresses
the need for generic, flexible, extendable, and sustainable digital tools
that are applicable to a wide range of digital archives and research in the
humanities. The infrastructure adopts a character-level n-gram Statistical
Language Model (SLM), stored in a space-optimised k-truncated suffix tree
data structure as its underlying data model. A character-level n-gram model
is a relatively new approach that is competitive with word-level n-gram models,
but has the added advantage that it is domain and language-independent,
requiring little or no preprocessing of the document text, unlike word-level
models, which require some form of language-dependent tokenisation and stemming.
Character-level n-grams capture word-internal features that are ignored
by word-level n-gram models, which provides greater flexibility in addressing
the information need of the user through tolerant search, and compensation
for erroneous query specification or spelling errors in the document text. Furthermore,
the SLM provides a unified approach to information retrieval and
text mining, where traditional approaches have tended to adopt separate data
models that are often ad-hoc or based on heuristic assumptions. In addition,
the performance of the character-level n-gram SLM was formally evaluated
through crowdsourcing, which demonstrated that the retrieval performance of
the SLM approaches human-level performance.
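As an illustration of the underlying data model, here is a minimal character-level n-gram language model with add-one (Laplace) smoothing, using a plain dictionary of counts rather than the space-optimised k-truncated suffix tree the thesis describes; the class name and corpus are invented for the example.

```python
from collections import Counter

def char_ngrams(text, n):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class CharNgramLM:
    """Character-level n-gram model with add-one (Laplace) smoothing."""

    def __init__(self, corpus, n=3):
        self.n = n
        self.context = Counter(char_ngrams(corpus, n - 1))  # (n-1)-gram counts
        self.grams = Counter(char_ngrams(corpus, n))        # n-gram counts
        self.vocab = len(set(corpus))                       # distinct characters

    def prob(self, gram):
        # P(last char | preceding n-1 chars), smoothed so unseen
        # n-grams still get a small non-zero probability
        return (self.grams[gram] + 1) / (self.context[gram[:-1]] + self.vocab)

    def score(self, query):
        # product of conditional probabilities over the query's n-grams
        p = 1.0
        for g in char_ngrams(query, self.n):
            p *= self.prob(g)
        return p
```

Because the model operates on raw characters, the same code applies unchanged to any language, domain, or orthography, which is the independence property exploited above; smoothing means a query with unseen n-grams still scores non-zero, supporting tolerant search over erroneous queries or noisy document text.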
The proposed infrastructure supports the development of Samtla (Search
And Mining Tools for Language Archives), which provides humanities researchers with
digital tools for search, browsing, and text mining of digital archives
in any domain or language, within a single system. Samtla supersedes many of
the existing tools for humanities researchers, by supporting the same or similar
functionality of the systems, but with a domain-independent and language-independent
approach. The functionality includes a browsing tool constructed
from the metadata and named entities extracted from the document text, a
hybrid recommendation system for recommending related queries and documents.
In contrast, some tools are novel, developed in response to
the specific needs of the researchers, such as the document comparison tool
for visualising shared sequences between groups of related documents. Furthermore,
Samtla is the first practical example of a system with a SLM as
its primary data model that supports the real research needs of several case
studies covering different areas of research in the humanities.
Exploiting the conceptual space in hybrid recommender systems: a semantic-based approach
Unpublished doctoral thesis. Universidad Autónoma de Madrid, Escuela Politécnica Superior, October 200
DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text
This paper describes the development of a multilingual, manually annotated
dataset for three under-resourced Dravidian languages generated from social
media comments. The dataset was annotated for sentiment analysis and offensive
language identification for a total of more than 60,000 YouTube comments. The
dataset consists of around 44,000 comments in Tamil-English, around 7,000
comments in Kannada-English, and around 20,000 comments in Malayalam-English.
The data was manually annotated by volunteer annotators and shows high
inter-annotator agreement as measured by Krippendorff's alpha. The dataset contains all
types of code-mixing phenomena since it comprises user-generated content from a
multilingual country. We also present baseline experiments to establish
benchmarks on the dataset using machine learning methods. The dataset is
available on GitHub
(https://github.com/bharathichezhiyan/DravidianCodeMix-Dataset) and Zenodo
(https://zenodo.org/record/4750858#.YJtw0SYo_0M).
Comment: 36 pages.
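Krippendorff's alpha corrects observed agreement for the agreement expected by chance. As a rough sketch, the following computes alpha for the simplified case of nominal labels, exactly two annotators, and no missing values; the dataset's actual computation may differ (e.g. by handling more annotators or missing codes).

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(pairs):
    """Krippendorff's alpha for nominal data, two coders, no missing values.

    `pairs` is a list of (label_coder1, label_coder2) tuples, one per unit.
    """
    o = Counter()                      # coincidence matrix o[(c, k)]
    for a, b in pairs:
        # each unit contributes both ordered pairs, weight 1/(m-1) = 1 for m = 2
        o[(a, b)] += 1
        o[(b, a)] += 1
    n_c = Counter()                    # marginal counts per category
    for (a, _), w in o.items():
        n_c[a] += w
    n = sum(n_c.values())              # total pairable values
    d_o = sum(w for (a, b), w in o.items() if a != b) / n          # observed disagreement
    d_e = sum(n_c[a] * n_c[b]
              for a, b in permutations(n_c, 2)) / (n * (n - 1))    # expected disagreement
    return 1.0 - d_o / d_e
```

For example, two annotators labelling four comments as `[pos, pos, neg, neg]` and `[pos, neg, neg, neg]` agree on three of four units, giving an alpha of 8/15 (about 0.53), well below the raw 75% agreement because chance agreement is discounted.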
Music information retrieval: conceptual framework, annotation and user behaviour
Understanding music is a process both based on and influenced by the knowledge and experience of the listener. Although content-based music retrieval has been given increasing attention in recent years, much of the research still focuses on bottom-up retrieval techniques. In order to make a music information retrieval system appealing and useful to the user, more effort should be spent on constructing systems that both operate directly on the encoding of the physical energy of music and are flexible with respect to users’ experiences.
This thesis is based on a user-centred approach, taking into account the mutual relationship between music as an acoustic phenomenon and as an expressive phenomenon. The issues it addresses are: the lack of a conceptual framework, the shortage of annotated musical audio databases, the lack of understanding of the behaviour of system users and shortage of user-dependent knowledge with respect to high-level features of music.
In the theoretical part of this thesis, a conceptual framework for content-based music information retrieval is defined. The proposed conceptual framework - the first of its kind - is conceived as a coordinating structure between the automatic description of low-level music content, and the description of high-level content by the system users. A general framework for the manual annotation of musical audio is outlined as well. A new methodology for the manual annotation of musical audio is introduced and tested in case studies. The results from these studies show that manually annotated music files can be of great help in the development of accurate analysis tools for music information retrieval.
Empirical investigation is the foundation on which the aforementioned theoretical framework is built. Two elaborate studies involving different experimental issues are presented. In the first study, elements of signification related to spontaneous user behaviour are clarified. In the second study, a global profile of music information retrieval system users is given and their description of high-level content is discussed. This study has uncovered relationships between the users’ demographical background and their perception of expressive and structural features of music. Such a multi-level approach is exceptional as it included a large sample of the population of real users of interactive music systems. Tests have shown that the findings of this study are representative of the targeted population.
Finally, the multi-purpose material provided by the theoretical background and the results from empirical investigations are put into practice in three music information retrieval applications: a prototype of a user interface based on a taxonomy, an annotated database of experimental findings and a prototype semantic user recommender system.
Results are presented and discussed for all methods used. They show that, if reliably generated, knowledge about users can significantly improve the quality of music content analysis. This thesis demonstrates that an informed knowledge of human approaches to music information retrieval provides valuable insights, which may be of particular assistance in the development of user-friendly, content-based access to digital music collections.