Assessing Fine-Grained Explicitness of Song Lyrics
Music plays a crucial role in our lives, with growing consumption and engagement through streaming services and social media platforms. However, caution is needed for children, who may be exposed to explicit content through songs. Initiatives such as the Parental Advisory Label (PAL) and similar labelling by streaming content providers aim to protect children from harmful content. So far, however, the labelling has been limited to tagging a song as explicit (if it is), without providing any additional information on the reasons for the explicitness (e.g., strong language, sexual references). This paper addresses this issue by developing a system capable of detecting explicit song lyrics and assessing the kind of explicit content detected. The novel contributions of the work include (i) a new dataset of 4,000 song lyrics annotated with five possible reasons for content explicitness and (ii) experiments with machine learning classifiers to predict explicitness and the reasons for it. The results demonstrate the feasibility of automatically detecting both explicit content and the reasons for explicitness in song lyrics. This work is the first to address explicitness at this level of detail and provides a valuable contribution to the music industry, helping to protect children from exposure to inappropriate content.
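The two-level output described above (a binary explicit flag plus the reasons behind it) can be sketched as a multi-label tagging interface. The category names and keyword lists below are invented placeholders, not the paper's taxonomy, and the paper's system uses trained classifiers rather than keyword matching:

```python
# Hypothetical reason categories with toy keyword lists; a real system
# would replace this lookup with per-category trained classifiers.
EXPLICITNESS_REASONS = {
    "strong_language": {"damn", "hell"},
    "sexual_reference": {"lust", "desire"},
}

def tag_lyrics(lyrics: str) -> dict:
    """Return a coarse explicit flag plus the fine-grained reasons."""
    tokens = set(lyrics.lower().split())
    reasons = sorted(r for r, kws in EXPLICITNESS_REASONS.items() if tokens & kws)
    return {"explicit": bool(reasons), "reasons": reasons}
```

The key design point is that the coarse PAL-style flag is derived from the fine-grained reasons, so the two outputs can never contradict each other.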
On Classification with Bags, Groups and Sets
Many classification problems are difficult to formulate directly in the traditional supervised setting, where both training and test samples are individual feature vectors. In some cases samples are better described by sets of feature vectors; in others, labels are only available for sets rather than for individual samples, or, where individual labels are available, they are not independent. To better deal with such problems, several extensions of supervised learning have been proposed in which the training and/or test objects are sets of feature vectors. However, having been proposed largely independently of each other, their mutual similarities and differences have hitherto not been mapped out. In this work, we provide an overview of such learning scenarios, propose a taxonomy to illustrate the relationships between them, and discuss directions for further research in these areas.
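One of the best-known scenarios in this family is multiple-instance learning, where a labelled "bag" is a set of feature vectors and, under the standard assumption, a bag is positive iff at least one of its instances is. A minimal sketch of that aggregation rule, with the instance scorer standing in for any trained per-instance model:

```python
from typing import Callable, Sequence

def classify_bag(bag: Sequence, instance_score: Callable[[object], float],
                 threshold: float = 0.5) -> bool:
    """Max-rule aggregation under the standard MIL assumption:
    the bag's label follows its strongest-scoring instance."""
    return max(instance_score(x) for x in bag) >= threshold
```

Other scenarios in the survey's taxonomy differ precisely in this aggregation step, e.g. averaging instance scores instead of taking the maximum, or dropping the instance-level labels altogether.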
Automating the anonymisation of textual corpora
A huge amount of new textual data is created every day through social media posts, health records, official documents, and so on. However, if such resources contain personal data, they cannot be shared for research or other purposes without first being anonymised. Automating this task is challenging and often requires labelling in-domain data from scratch, since anonymised annotated corpora for the target scenarios are rarely available. This thesis has two main objectives: (i) to compile and provide a new Spanish corpus of annotated, anonymised spontaneous dialogue data, and (ii) to exploit the newly created resource to investigate techniques for automating the sensitive-data identification task in a setting where initially no annotated data from the target domain are available. Following these aims, first, the ES-Port corpus is presented. It is a compilation of 1170 spontaneous spoken human-human dialogues from calls to the technical support service of a telecommunications provider. The corpus has been anonymised using the substitution technique, which means the result is a readable, natural text, and it contains annotations for eleven anonymisation categories, as well as annotations for linguistic and extra-linguistic phenomena such as code-switching, laughter, repetitions, mispronunciations, and so on. Next, the compiled corpus is used to investigate automatic sensitive-data identification within a pool-based Active Learning framework, whose aim is to obtain the best possible classifier while annotating as little data as possible. To improve this setting, Knowledge Transfer techniques from another small available anonymisation-annotated corpus are explored for the seed selection and query selection strategies. Results show that the proposed seed selection methods obtain the best seeds on which to initialise the base learner's training, and that combining the source and target classifiers' uncertainties as a query strategy improves the Active Learning process, yielding steeper learning curves and reaching top classifier performance in fewer iterations.
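The query strategy described above can be sketched as follows: among the unlabelled pool, pick the sample whose combined source-and-target uncertainty is highest. Probabilities near 0.5 are treated as most uncertain, and averaging the two uncertainties is one simple combination rule; the thesis's exact combination may differ.

```python
from typing import Callable, Sequence

def query_next(pool: Sequence, source_prob: Callable[[object], float],
               target_prob: Callable[[object], float]):
    """Pool-based Active Learning query step: return the unlabelled sample
    on which the source and target classifiers are jointly most uncertain."""
    def uncertainty(p: float) -> float:
        return 1.0 - 2.0 * abs(p - 0.5)  # 1.0 at p=0.5, 0.0 at p in {0, 1}
    return max(pool, key=lambda x: (uncertainty(source_prob(x))
                                    + uncertainty(target_prob(x))) / 2.0)
```

In each Active Learning iteration, the selected sample would be sent to a human annotator, added to the labelled set, and the target classifier retrained.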
An Urdu semantic tagger - lexicons, corpora, methods and tools
Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as Natural Language Processing (NLP), corpus linguistics, and data science. An important aspect of such automatic information extraction and analysis is the semantic annotation of language data using a semantic annotation tool (a.k.a. a semantic tagger). Different semantic annotation tools have been designed to carry out various levels of semantic annotation, for instance sentiment analysis, word sense disambiguation, content analysis, and semantic role labelling. These tools identify or tag only part of the core semantic information in language data and, moreover, tend to be applicable only to English and other European languages. A semantic annotation tool that can annotate the semantic senses of all lexical units (words) is still needed for Urdu, based on the USAS (UCREL Semantic Analysis System) semantic taxonomy, in order to provide comprehensive semantic analysis of Urdu text. This research reports on the development of an Urdu semantic tagging tool and discusses the challenging issues faced in this Ph.D. research. Since standard NLP pipeline tools are not widely available for Urdu, a suite of new tools has been created alongside the Urdu semantic tagger: a sentence tokenizer, a word tokenizer, and a part-of-speech tagger. Results for these tools are as follows: the word tokenizer achieves an F-score of 94.01% and an accuracy of 97.21%, the sentence tokenizer an F-score of 92.59% and an accuracy of 93.15%, and the POS tagger an accuracy of 95.14%. The Urdu semantic tagger incorporates semantic resources (a lexicon and corpora) as well as semantic field disambiguation methods. In terms of novelty, the NLP pre-processing tools are developed using rule-based, statistical, or hybrid techniques. Furthermore, all semantic lexicons have been developed using a novel combination of automatic and semi-automatic approaches: mapping, crowdsourcing, statistical machine translation, GIZA++, word embeddings, and named entity recognition. A large multi-target annotated corpus has also been constructed using a semi-automatic approach to test the accuracy of the Urdu semantic tagger; this corpus is also used to train and test supervised multi-target machine learning classifiers. The results show that the Random k-labEL Disjoint Pruned Sets and Classifier Chain multi-target classifiers outperform all other classifiers on the proposed corpus, with a Hamming loss of 0.06 and an accuracy of 0.94. Lexical coverage of 88.59%, 99.63%, 96.71%, and 89.63% is obtained on several test corpora, and the developed Urdu semantic tagger shows an encouraging precision of 79.47% on the proposed test corpus.
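The Hamming loss reported above is the standard multi-label metric: the fraction of individual label slots predicted incorrectly, averaged over all samples and labels, so lower is better. A minimal implementation over binary label vectors:

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label slots that differ between truth and prediction.
    Each element of y_true/y_pred is one sample's binary label vector."""
    total = sum(len(t) for t in y_true)
    wrong = sum(a != b
                for t, p in zip(y_true, y_pred)
                for a, b in zip(t, p))
    return wrong / total
```

Unlike subset accuracy, which only credits an exact match of the whole label vector, Hamming loss gives partial credit, which is why both numbers are usually reported together for multi-target classifiers.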
Analyzing and enhancing music mood classification : an empirical study
In the computer age, managing large data repositories is a common challenge, especially for music data. Categorizing, manipulating, and refining music tracks are among the most complex tasks in Music Information Retrieval (MIR). Classification is one of the core functions in MIR, classifying music data from different perspectives, from genre to instrument to mood. The primary focus of this study is music mood classification. Mood is a subjective phenomenon in MIR that involves different considerations, such as psychology, musicology, culture, and social behavior. One of the most significant prerequisites in music mood classification is answering these questions: what combination of acoustic features helps to improve classification accuracy in this area? What types of classifiers are appropriate for music mood classification? How can the accuracy of music mood classification be increased by using several classifiers?

To answer these questions, we empirically explored different acoustic features and classification schemes for mood classification on music data. We also developed two approaches that use several classifiers simultaneously to classify music tracks by mood automatically. These methods rely on two voting procedures, namely Plurality Voting and Borda Count, and belong to the family of ensemble techniques, which combine a group of classifiers to reach better accuracy. The proposed ensemble methods are implemented and verified through empirical experiments, whose results show that these approaches can improve the accuracy of music mood classification.
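The two voting procedures named above can be sketched directly. In Plurality Voting each classifier casts one vote for its top mood label and the most-voted label wins; in Borda Count each classifier ranks all labels and a label at rank i (0-based) earns n-1-i points. The mood labels below are illustrative:

```python
from collections import Counter

def plurality_vote(top_labels):
    """Each classifier contributes its single top label; most votes wins."""
    return Counter(top_labels).most_common(1)[0][0]

def borda_count(rankings):
    """Each classifier contributes a full ranking of the labels;
    rank i (0-based) in a ranking of length n earns n - 1 - i points."""
    scores = Counter()
    for ranking in rankings:
        for i, label in enumerate(ranking):
            scores[label] += len(ranking) - 1 - i
    return scores.most_common(1)[0][0]
```

Borda Count uses more of each classifier's output than Plurality Voting, so it can break disagreements more gracefully, at the cost of requiring every classifier to rank all labels. Note that ties would need an explicit tie-breaking rule in practice.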