285 research outputs found

    The dynamics of correlated novelties

    Full text link
    One new thing often leads to another. Such correlated novelties are a familiar part of daily life. They are also thought to be fundamental to the evolution of biological systems, human society, and technology. By opening new possibilities, one novelty can pave the way for others in a process that Kauffman has called "expanding the adjacent possible". The dynamics of correlated novelties, however, have yet to be quantified empirically or modeled mathematically. Here we propose a simple mathematical model that mimics the process of exploring a physical, biological or conceptual space that enlarges whenever a novelty occurs. The model, a generalization of Polya's urn, predicts statistical laws for the rate at which novelties happen (analogous to Heaps' law) and for the probability distribution on the space explored (analogous to Zipf's law), as well as signatures of the hypothesized process by which one novelty sets the stage for another. We test these predictions on four data sets of human activity: the edit events of Wikipedia pages, the emergence of tags in annotation systems, the sequence of words in texts, and listening to new songs in online music catalogues. By quantifying the dynamics of correlated novelties, our results provide a starting point for a deeper understanding of the ever-expanding adjacent possible and its role in biological, linguistic, cultural, and technological evolution.
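
    The mechanism described in this abstract (a Polya urn in which every novelty enlarges the set of possibilities) can be illustrated with a short simulation. The sketch below is a minimal reading of such an urn-with-triggering model, assuming a reinforcement parameter rho and a triggering parameter nu; the exact update rules, initial conditions, and parameter values used in the paper may differ.

        import random
        from collections import Counter

        def urn_with_triggering(steps, rho=4, nu=2, seed=0):
            """Simulate a Polya-style urn in which each novelty enlarges the
            space of possibilities (the 'adjacent possible')."""
            rng = random.Random(seed)
            next_color = 0
            urn = []
            for _ in range(nu + 1):          # start with a handful of distinct colors
                urn.append(next_color)
                next_color += 1
            seen = set()
            sequence = []
            distinct_over_time = []
            for _ in range(steps):
                color = rng.choice(urn)      # draw a ball uniformly at random
                sequence.append(color)
                urn.extend([color] * rho)    # reinforcement: rho extra copies
                if color not in seen:        # a novelty ...
                    seen.add(color)
                    # ... triggers nu + 1 brand-new colors (the space grows)
                    for _ in range(nu + 1):
                        urn.append(next_color)
                        next_color += 1
                distinct_over_time.append(len(seen))
            return sequence, distinct_over_time

        seq, D = urn_with_triggering(steps=20000)
        # Heaps-like behavior: the number of distinct elements grows sublinearly in time
        print(D[999], D[9999], D[19999])
        # Zipf-like behavior: a few elements dominate the frequency ranking
        print(Counter(seq).most_common(5))

    Counting distinct elements over time gives the Heaps-like growth curve, and the frequency ranking gives the Zipf-like distribution, the two statistical laws the abstract refers to.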

    Computational lyricology: Quantitative approaches to understanding song lyrics and their interpretations

    Get PDF
    Recently, music complexity has drawn attention from researchers in the Music Information Retrieval (MIR) area. In particular, computational methods to measure music complexity have been studied to provide better music services in large-scale music digital libraries. However, the majority of music complexity research has focused on audio-related facets of music, while song lyrics have rarely been considered. Based on the observation that most popular songs contain lyrics, whose different levels of complexity contribute to the overall music complexity, this dissertation research investigates song lyric complexity and how it might be measured computationally. In a broad sense, lyric complexity comes from two complementary aspects of text complexity: the quantitative and the qualitative dimensions. For a comprehensive understanding of lyric complexity, this study explores both. First, the quantitative dimensions, such as word frequency and word length, are those that can be measured efficiently by computer programs. Among them, this study examines the concreteness of song lyrics using trend analysis. Second, in contrast, the qualitative dimensions refer to a deeper level of lyric complexity that requires attentive readers' comprehension and external knowledge. However, it is challenging to collect large-scale qualitative analyses of lyric complexity due to resource constraints. To this end, this dissertation introduces user-generated interpretations of song lyrics, which are abundant on the web, as a proxy for assessing the qualitative dimensions of lyric complexity. Specifically, this study first examines whether the user-generated data provide quality topic information, and then proposes the Lyric Topic Diversity Score (LTDS), a lyric complexity metric based on the diversity of the topics found in users' interpretations. The assumption behind this approach is that complex song lyrics tend to provoke diverse user interpretations because of properties such as ambiguous meanings, historical context, and the author's intention. The first finding of this study is that the concreteness of popular song lyrics fell from the mid-1960s until the 1990s and rose after that; the advent of Hip-Hop/Rap and the number of words in song lyrics are highly correlated with the rise in concreteness after the early 1990s. Second, interpretations are a good input source for automatic topic detection algorithms. Third, the interpretation-based lyric complexity metric looks promising because it is correlated with Lexical Novelty Scores (LNS), the only previously developed lyric complexity measure. Overall, this work expands the scope of music complexity by focusing on relatively unexplored data, song lyrics. Moreover, these findings suggest that analyses and applications of other objects can also benefit from this kind of auxiliary data in the form of user comments.
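
    The abstract does not spell out how the Lyric Topic Diversity Score is computed, but the underlying idea (scoring a song by how spread out the topics of its user interpretations are) can be sketched roughly as below. The choice of topic model, the number of topics, and the use of Shannon entropy as the diversity measure are illustrative assumptions, not the dissertation's actual formulation.

        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        def topic_diversity_score(interpretations, n_topics=3, seed=0):
            """Illustrative stand-in for a topic-diversity style metric:
            fit a topic model on one song's user interpretations and return the
            entropy of the aggregate topic distribution (higher = more diverse)."""
            vec = CountVectorizer(stop_words="english")
            X = vec.fit_transform(interpretations)
            lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
            doc_topics = lda.fit_transform(X)     # one topic mixture per interpretation
            agg = doc_topics.mean(axis=0)         # aggregate over all interpretations
            agg = agg / agg.sum()
            return float(-np.sum(agg * np.log(agg + 1e-12)))

        interpretations = [
            "The song is about losing a close friend and learning to move on.",
            "I read it as a protest against the war and the draft.",
            "It describes the narrator's struggle with addiction.",
        ]
        print(topic_diversity_score(interpretations))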

    Expertise and Dynamics within Crowdsourced Musical Knowledge Curation: A Case Study of the Genius Platform

    Full text link
    Many platforms collect crowdsourced information primarily from volunteers. As this type of knowledge curation has become widespread, contribution formats vary substantially and are driven by diverse processes across differing platforms. Thus, models for one platform are not necessarily applicable to others. Here, we study the temporal dynamics of Genius, a platform primarily designed for user-contributed annotations of song lyrics. A unique aspect of Genius is that the annotations are extremely local -- an annotated lyric may just be a few lines of a song -- but also highly related, e.g., by song, album, artist, or genre. We analyze several dynamical processes associated with lyric annotations and their edits, which differ substantially from models for other platforms. For example, expertise on song annotations follows a "U shape" where experts are both early and late contributors, with non-experts contributing in between; we develop a user utility model that captures such behavior. We also find several contribution traits appearing early in a user's lifespan of contributions that distinguish (eventual) experts from non-experts. Combining our findings, we develop a model for early prediction of user expertise.
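
    As a rough illustration of the early-prediction setup described above, the sketch below trains a simple classifier on per-user features computed from a user's earliest contributions. The feature names and the synthetic data are placeholders; the paper's actual feature set and model are not specified in the abstract.

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import roc_auc_score

        # Hypothetical per-user features computed from the first few contributions only,
        # e.g. mean annotation length, fraction of edits to others' annotations, and the
        # time elapsed between the first and the k-th contribution.
        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 3))            # stand-in early-lifespan features
        y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
        clf = LogisticRegression().fit(X_tr, y_tr)      # predict eventual expert vs. non-expert
        print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))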

    Infants' perception of sound patterns in oral language play

    Get PDF

    User-centric Music Information Retrieval

    Get PDF
    The rapid growth of the Internet and the advancements of Web technologies have made it possible for users to have access to large amounts of on-line music data, including music acoustic signals, lyrics, style/mood labels, and user-assigned tags. This progress has made music listening more fun, but has raised the issue of how to organize this data and, more generally, how computer programs can assist users in their music experience. An important subject in computer-aided music listening is music retrieval, i.e., the issue of efficiently helping users locate the music they are looking for. Traditionally, songs were organized in a hierarchical structure such as genre > artist > album > track to facilitate users' navigation. However, the intentions of users are often hard to capture in such a simply organized structure. Users may want to listen to music of a particular mood, style, or topic, and/or to songs similar to some given music samples. This motivated us to work on a user-centric music retrieval system to improve users' satisfaction with the system. Traditional music information retrieval research was mainly concerned with classification, clustering, identification, and similarity search over acoustic music data by way of feature extraction algorithms and machine learning techniques. More recently, music information retrieval research has focused on utilizing other types of data, such as lyrics, user access patterns, and user-defined tags, and on targeting non-genre categories for classification, such as mood labels and styles. This dissertation focused on investigating and developing effective data mining techniques for (1) organizing and annotating music data with styles, moods, and user-assigned tags; (2) performing effective analysis of music data with features from diverse information sources; and (3) recommending songs to users utilizing both content features and user access patterns.
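
    The third thread (recommendation from both content features and user access patterns) can be sketched as a simple hybrid recommender. The blending weight, the co-play score, and the toy matrices below are illustrative assumptions rather than the dissertation's actual method.

        import numpy as np

        def recommend(user_history, content_feats, access_matrix, alpha=0.5, top_k=5):
            """Blend a content-based score (cosine similarity of song features to the
            songs a user played) with a collaborative score (how often other users who
            played the user's songs also played each candidate)."""
            # content score: similarity to the mean feature vector of the user's history
            F = content_feats / (np.linalg.norm(content_feats, axis=1, keepdims=True) + 1e-12)
            content_score = F @ F[user_history].mean(axis=0)

            # collaborative score from access patterns (users x songs, 0/1 plays)
            cooc = access_matrix.T @ access_matrix        # song-song co-play counts
            np.fill_diagonal(cooc, 0)
            collab_score = cooc[:, user_history].sum(axis=1).astype(float)
            if collab_score.max() > 0:
                collab_score /= collab_score.max()

            score = alpha * content_score + (1 - alpha) * collab_score
            score[user_history] = -np.inf                 # do not re-recommend played songs
            return np.argsort(score)[::-1][:top_k]

        # toy example: 4 users, 6 songs, 3 content features per song
        access = np.array([[1,1,0,0,0,0],[1,0,1,0,0,0],[0,1,1,1,0,0],[0,0,0,1,1,1]])
        feats = np.random.default_rng(0).random((6, 3))
        print(recommend(user_history=[0, 1], content_feats=feats, access_matrix=access))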

    Do Language Models Plagiarize?

    Full text link
    Past literature has illustrated that language models (LMs) often memorize parts of training instances and reproduce them in natural language generation (NLG) processes. However, it is unclear to what extent LMs "reuse" a training corpus. For instance, models can generate paraphrased sentences that are contextually similar to training samples. In this work, therefore, we study three types of plagiarism (i.e., verbatim, paraphrase, and idea) among GPT-2 generated texts, in comparison to its training data, and further analyze the plagiarism patterns of fine-tuned LMs with domain-specific corpora which are extensively used in practice. Our results suggest that (1) the three types of plagiarism widely exist in LMs beyond memorization, (2) both the size and decoding methods of LMs are strongly associated with the degrees of plagiarism they exhibit, and (3) fine-tuned LMs' plagiarism patterns vary based on their corpus similarity and homogeneity. Given that a majority of LMs' training data is scraped from the Web without informing content owners, their reiteration of words, phrases, and even core ideas from training sets into generated texts has ethical implications. These patterns are likely to worsen as both the size of LMs and their training data increase, raising concerns about indiscriminately pursuing larger models with larger training corpora. Plagiarized content can also contain individuals' personal and sensitive information. These findings overall cast doubt on the practicality of current LMs in mission-critical writing tasks and urge more discussion around the observed phenomena. Data and source code are available at https://github.com/Brit7777/LM-plagiarism.
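
    The plagiarism types are checked against the training corpus; as a rough illustration, verbatim reuse can be approximated by word n-gram overlap and paraphrase-like reuse by a text-similarity score. The sketch below uses TF-IDF cosine similarity as a crude stand-in; the paper's actual detection pipeline is more sophisticated and is not reproduced here.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        def ngrams(text, n=8):
            toks = text.lower().split()
            return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

        def verbatim_overlap(generated, training_doc, n=8):
            """Fraction of the generated text's word n-grams that appear verbatim in a
            training document (a rough memorization / verbatim-plagiarism signal)."""
            g = ngrams(generated, n)
            return len(g & ngrams(training_doc, n)) / max(len(g), 1)

        def lexical_similarity(generated, training_doc):
            """TF-IDF cosine similarity as a crude proxy for paraphrase-like reuse; real
            pipelines would use sentence embeddings or trained paraphrase detectors."""
            X = TfidfVectorizer().fit_transform([generated, training_doc])
            return float(cosine_similarity(X[0], X[1])[0, 0])

        gen = "the quick brown fox jumps over the lazy dog near the old barn"
        train = "we saw that the quick brown fox jumps over the lazy dog near the river"
        print(verbatim_overlap(gen, train, n=5), lexical_similarity(gen, train))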

    Structured Generation and Exploration of Design Space with Large Language Models for Human-AI Co-Creation

    Full text link
    Thanks to their generative capabilities, large language models (LLMs) have become an invaluable tool for creative processes. These models have the capacity to produce hundreds or thousands of visual and textual outputs, offering abundant inspiration for creative endeavors. But are we harnessing their full potential? We argue that current interaction paradigms fall short, guiding users towards rapid convergence on a limited set of ideas, rather than empowering them to explore the vast latent design space in generative models. To address this limitation, we propose a framework that facilitates the structured generation of a design space in which users can seamlessly explore, evaluate, and synthesize a multitude of responses. We demonstrate the feasibility and usefulness of this framework through the design and development of an interactive system, Luminate, and a user study with 8 professional writers. Our work advances how we interact with LLMs for creative tasks, introducing a way to harness the creative potential of LLMs.
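
    The central idea (generating responses across an explicit, structured design space rather than one prompt at a time) can be sketched as follows. The dimensions, values, and prompt format are made-up examples; in the proposed framework the design space itself would also be generated with the LLM, and each prompt would then be sent to a text-generation model.

        from itertools import product

        # A design space as dimensions -> candidate values. Here the dimensions are
        # hard-coded purely for illustration.
        design_space = {
            "tone":     ["humorous", "melancholic", "hopeful"],
            "audience": ["children", "young adults"],
            "form":     ["haiku", "free verse"],
        }

        def prompts_across_space(task, space):
            """Build one prompt per point of the design space so a user can explore and
            compare many structured alternatives instead of converging on one idea."""
            prompts = {}
            for combo in product(*space.values()):
                settings = dict(zip(space.keys(), combo))
                prompts[combo] = task + " Constraints: " + ", ".join(
                    f"{k}={v}" for k, v in settings.items())
            return prompts

        # Each prompt would then be sent to the LLM; printing a few shows the structure.
        all_prompts = prompts_across_space("Write a short poem about rain.", design_space)
        for combo, p in list(all_prompts.items())[:3]:
            print(combo, "->", p)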

    Language and Linguistics in a Complex World: Data, Interdisciplinarity, Transfer, and the Next Generation. ICAME41 Extended Book of Abstracts

    Get PDF
    This is a collection of papers, work-in-progress reports, and other contributions that were part of the ICAME41 digital conference.