The dynamics of correlated novelties
One new thing often leads to another. Such correlated novelties are a
familiar part of daily life. They are also thought to be fundamental to the
evolution of biological systems, human society, and technology. By opening new
possibilities, one novelty can pave the way for others in a process that
Kauffman has called "expanding the adjacent possible". The dynamics of
correlated novelties, however, have yet to be quantified empirically or modeled
mathematically. Here we propose a simple mathematical model that mimics the
process of exploring a physical, biological or conceptual space that enlarges
whenever a novelty occurs. The model, a generalization of Polya's urn, predicts
statistical laws for the rate at which novelties happen (analogous to Heaps'
law) and for the probability distribution on the space explored (analogous to
Zipf's law), as well as signatures of the hypothesized process by which one
novelty sets the stage for another. We test these predictions on four data sets
of human activity: the edit events of Wikipedia pages, the emergence of tags in
annotation systems, the sequence of words in texts, and listening to new songs
in online music catalogues. By quantifying the dynamics of correlated
novelties, our results provide a starting point for a deeper understanding of
the ever-expanding adjacent possible and its role in biological, linguistic,
cultural, and technological evolution.
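The abstract above describes the model only as a generalization of Polya's urn in which the explored space enlarges whenever a novelty occurs. A minimal sketch of one such urn is given here for illustration. The reinforcement parameter `rho`, the number of new colors `nu + 1` added on each novelty, the initial urn size `n0`, and the function name `urn_with_triggering` are assumptions made for this sketch and are not taken from the abstract.

```python
import random


def urn_with_triggering(steps, rho=4, nu=3, n0=1, seed=0):
    """Simulate a Polya-style urn whose set of colors grows on each novelty.

    Assumed rules (illustrative, not stated in the abstract):
      * draw one ball uniformly at random and record its color;
      * return it together with `rho` extra copies (reinforcement);
      * if the color had never been drawn before, also add `nu + 1`
        balls of brand-new colors (the space "enlarges").
    Returns the drawn sequence and the running count of distinct colors.
    """
    rng = random.Random(seed)
    urn = list(range(n0))          # one ball for each initial color
    next_color = n0                # label for the next unseen color
    seen = set()
    sequence = []
    distinct_counts = []

    for _ in range(steps):
        ball = rng.choice(urn)
        sequence.append(ball)
        urn.extend([ball] * rho)   # reinforcement of the drawn color
        if ball not in seen:       # a novelty occurred
            seen.add(ball)
            urn.extend(range(next_color, next_color + nu + 1))
            next_color += nu + 1
        distinct_counts.append(len(seen))

    return sequence, distinct_counts


if __name__ == "__main__":
    seq, counts = urn_with_triggering(steps=10_000)
    # Growth of distinct colors with sequence length (a Heaps-style curve).
    for t in (10, 100, 1_000, 10_000):
        print(t, counts[t - 1])
```

Plotting `distinct_counts` against the step index on log-log axes gives the kind of novelty-rate curve the abstract compares to Heaps' law, and the frequency ranking of colors in `sequence` gives the analogue of Zipf's law. The exponents depend on the assumed parameters.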
Computational lyricology: Quantitative approaches to understanding song lyrics and their interpretations
Recently, music complexity has drawn attention from researchers in the Music Information Retrieval (MIR) area. In particular, computational methods to measure music complexity have been studied to provide better music services in large-scale music digital libraries. However, the majority of music complexity research has focused on audio-related facets of music, while song lyrics have been rarely considered. Based on the observation that most popular songs contain lyrics, whose different levels of complexity contribute to the overall music complexity, this dissertation research investigates song lyric complexity and how it might be measured computationally.
In a broad sense, lyric complexity comes from two complementary aspects of text complexity: quantitative and qualitative dimensions. For a comprehensive understanding of lyric complexity, this study explores both. First, the quantitative dimensions, such as word frequency and word length, are those that can be measured efficiently using computer programs. Among them, this study examines the concreteness of song lyrics using trend analysis. Second, in contrast to the quantitative dimensions, the qualitative dimensions refer to a deeper level of lyric complexity that requires attentive readers' comprehension and external knowledge. However, it is challenging to collect qualitative analyses of lyric complexity at a large scale due to resource constraints. To this end, this dissertation introduces user-generated interpretations of song lyrics, which are abundant on the web, as a proxy for assessing the qualitative dimensions of lyric complexity. Specifically, this study first examines whether the user-generated data provide quality topic information, and then proposes a Lyric Topic Diversity Score (LTDS), a lyric complexity metric based on the diversity of the topics found in users' interpretations. The assumption behind this approach is that complex song lyrics tend to provoke diverse user interpretations due to properties such as ambiguous meanings, historical context, and the author's intention.
The findings of this study are as follows. First, the concreteness of popular song lyrics fell from the middle of the 1960s until the 1990s and rose after that. The advent of Hip-Hop/Rap and the number of words in song lyrics are highly correlated with the rise in concreteness after the early 1990s. Second, interpretations are a good input source for automatic topic detection algorithms. Third, the interpretation-based lyric complexity metric looks promising because it is correlated with Lexical Novelty Scores (LNS), the only previously developed lyric complexity measure. Overall, this work expands the scope of music complexity by focusing on relatively unexplored data, song lyrics. Moreover, these findings suggest that analyses and applications involving other kinds of objects can benefit from this kind of auxiliary data in the form of user comments.
Expertise and Dynamics within Crowdsourced Musical Knowledge Curation: A Case Study of the Genius Platform
Many platforms collect crowdsourced information primarily from volunteers. As
this type of knowledge curation has become widespread, contribution formats
vary substantially and are driven by diverse processes across differing
platforms. Thus, models for one platform are not necessarily applicable to
others. Here, we study the temporal dynamics of Genius, a platform primarily
designed for user-contributed annotations of song lyrics. A unique aspect of
Genius is that the annotations are extremely local -- an annotated lyric may
just be a few lines of a song -- but also highly related, e.g., by song, album,
artist, or genre. We analyze several dynamical processes associated with lyric
annotations and their edits, which differ substantially from models for other
platforms. For example, expertise on song annotations follows a "U shape"
where experts are both early and late contributors with non-experts
contributing intermediately; we develop a user utility model that captures such
behavior. We also find several contribution traits appearing early in a user's
lifespan of contributions that distinguish (eventual) experts from non-experts.
Combining our findings, we develop a model for early prediction of user
expertise. Comment: 9 pages, 10 figures
User-centric Music Information Retrieval
The rapid growth of the Internet and advances in Web technologies have given users access to large amounts of online music data, including music acoustic signals, lyrics, style/mood labels, and user-assigned tags. This progress has made music listening more fun, but has raised the issue of how to organize this data and, more generally, how computer programs can assist users in their music experience.
An important subject in computer-aided music listening is music retrieval, i.e., the issue of efficiently helping users locate the music they are looking for. Traditionally, songs were organized in a hierarchical structure such as genre->artist->album->track to facilitate users' navigation. However, the intentions of users are often hard to capture in such a simply organized structure. Users may want to listen to music of a particular mood, style or topic, and/or any songs similar to some given music samples. This motivated us to work on a user-centric music retrieval system to improve users' satisfaction with the system.
Traditional music information retrieval research was mainly concerned with classification, clustering, identification, and similarity search of acoustic music data by way of feature extraction algorithms and machine learning techniques. More recently, music information retrieval research has focused on utilizing other types of data, such as lyrics, user access patterns, and user-defined tags, and on targeting non-genre categories for classification, such as mood labels and styles. This dissertation focused on investigating and developing effective data mining techniques for (1) organizing and annotating music data with styles, moods and user-assigned tags; (2) performing effective analysis of music data with features from diverse information sources; and (3) recommending songs to users utilizing both content features and user access patterns.
Do Language Models Plagiarize?
Past literature has illustrated that language models (LMs) often memorize
parts of training instances and reproduce them in natural language generation
(NLG) processes. However, it is unclear to what extent LMs "reuse" a training
corpus. For instance, models can generate paraphrased sentences that are
contextually similar to training samples. In this work, therefore, we study
three types of plagiarism (i.e., verbatim, paraphrase, and idea) among GPT-2
generated texts, in comparison to its training data, and further analyze the
plagiarism patterns of fine-tuned LMs with domain-specific corpora which are
extensively used in practice. Our results suggest that (1) three types of
plagiarism widely exist in LMs beyond memorization, (2) both size and decoding
methods of LMs are strongly associated with the degrees of plagiarism they
exhibit, and (3) fine-tuned LMs' plagiarism patterns vary based on their corpus
similarity and homogeneity. Given that a majority of LMs' training data is
scraped from the Web without informing content owners, their reiteration of
words, phrases, and even core ideas from training sets into generated texts has
ethical implications. These patterns are likely to worsen as both the size
of LMs and their training data increase, raising concerns about
indiscriminately pursuing larger models with larger training corpora.
Plagiarized content can also contain individuals' personal and sensitive
information. These findings overall cast doubt on the practicality of current
LMs in mission-critical writing tasks and urge more discussions around the
observed phenomena. Data and source code are available at
https://github.com/Brit7777/LM-plagiarism. Comment: Accepted to WWW'2
Structured Generation and Exploration of Design Space with Large Language Models for Human-AI Co-Creation
Thanks to their generative capabilities, large language models (LLMs) have
become an invaluable tool for creative processes. These models have the
capacity to produce hundreds and thousands of visual and textual outputs,
offering abundant inspiration for creative endeavors. But are we harnessing
their full potential? We argue that current interaction paradigms fall short,
guiding users towards rapid convergence on a limited set of ideas, rather than
empowering them to explore the vast latent design space in generative models.
To address this limitation, we propose a framework that facilitates the
structured generation of design space in which users can seamlessly explore,
evaluate, and synthesize a multitude of responses. We demonstrate the
feasibility and usefulness of this framework through the design and development
of an interactive system, Luminate, and a user study with 8 professional
writers. Our work advances how we interact with LLMs for creative tasks,
introducing a way to harness the creative potential of LLMs.
Language and Linguistics in a Complex World Data, Interdisciplinarity, Transfer, and the Next Generation. ICAME41 Extended Book of Abstracts
This is a collection of papers, work-in-progress reports, and other contributions that were part of the ICAME41 digital conference.