24 research outputs found

    Text Segmentation Using Exponential Models

    This paper introduces a new statistical approach to partitioning text automatically into coherent segments. Our approach enlists both short-range and long-range language models to help it sniff out likely sites of topic changes in text. To aid its search, the system consults a set of simple lexical hints it has learned to associate with the presence of boundaries through inspection of a large corpus of annotated data. We also propose a new probabilistically motivated error metric for use by the natural language processing and information retrieval communities, intended to supersede precision and recall for appraising segmentation algorithms. Qualitative assessment of our algorithm as well as evaluation using this new metric demonstrate the effectiveness of our approach in two very different domains, Wall Street Journal articles and the TDT Corpus, a collection of newswire articles and broadcast news transcripts. Comment: 12 pages, LaTeX source and postscript figures for EMNLP-2 paper

    Suppression of Supergravity Anomalies in Conformal Sequestering

    We show that the anomaly-mediated supersymmetry breaking via the Kähler and sigma-model anomalies is suppressed by conformal dynamics in the supersymmetry breaking sector. Comment: 8 pages

    FeedbackMap: a tool for making sense of open-ended survey responses

    Analyzing open-ended survey responses is a crucial yet challenging task for social scientists, non-profit organizations, and educational institutions, as they often face the trade-off between obtaining rich data and the burden of reading and coding textual responses. This demo introduces FeedbackMap, a web-based tool that uses natural language processing techniques to facilitate the analysis of open-ended survey responses. FeedbackMap lets researchers generate summaries at multiple levels, identify interesting response examples, and visualize the response space through embeddings. We discuss the importance of examining survey results from multiple perspectives and the potential biases introduced by summarization methods, emphasizing the need for critical evaluation of the representation and omission of respondent voices. Comment: Demo at CSCW 202

    RadioTalk: a large-scale corpus of talk radio transcripts

    We introduce RadioTalk, a corpus of speech recognition transcripts sampled from talk radio broadcasts in the United States between October of 2018 and March of 2019. The corpus is intended for use by researchers in the fields of natural language processing, conversational analysis, and the social sciences. The corpus encompasses approximately 2.8 billion words of automatically transcribed speech from 284,000 hours of radio, together with metadata about the speech, such as geographical location, speaker turn boundaries, gender, and radio program information. In this paper we summarize why and how we prepared the corpus, give some descriptive statistics on stations, shows and speakers, and carry out a few high-level analyses. Comment: 5 pages, 4 figures, accepted by Interspeech 201

    Topic Detection and Tracking with Time-Aware Document Embeddings

    The time at which a message is communicated is a vital piece of metadata in many real-world natural language processing tasks such as Topic Detection and Tracking (TDT). TDT systems aim to cluster a corpus of news articles by event, and in that context, stories that describe the same event are likely to have been written at around the same time. Prior work on time modeling for TDT takes this into account, but does not well capture how time interacts with the semantic nature of the event. For example, stories about a tropical storm are likely to be written within a short time interval, while stories about a movie release may appear over weeks or months. In our work, we design a neural method that fuses temporal and textual information into a single representation of news documents for event detection. We fine-tune these time-aware document embeddings with a triplet loss architecture, integrate the model into downstream TDT systems, and evaluate the systems on two benchmark TDT data sets in English. In the retrospective setting, we apply clustering algorithms to the time-aware embeddings and show substantial improvements over baselines on the News2013 data set. In the online streaming setting, we add our document encoder to an existing state-of-the-art TDT pipeline and demonstrate that it can benefit the overall performance. We conduct ablation studies on the time representation and fusion algorithm strategies, showing that our proposed model outperforms alternative strategies. Finally, we probe the model to examine how it handles recurring events more effectively than previous TDT systems.

    All a-board: sharing educational data science research with school districts

    Educational data scientists often conduct research with the hopes of translating findings into lasting change through policy, civil society, or other channels. However, the bridge from research to practice can be fraught with sociopolitical frictions that impede, or altogether block, such translations -- especially when they are contentious or otherwise difficult to achieve. Focusing on one entrenched educational equity issue in US public schools -- racial and ethnic segregation -- we conduct randomized email outreach experiments and surveys to explore how local school districts respond to algorithmically-generated school catchment areas ("attendance boundaries") designed to foster more diverse and integrated schools. Cold email outreach to approximately 4,320 elected school board members across over 800 school districts informing them of potential boundary changes reveals a large average open rate of nearly 40%, but a relatively small click-through rate of 2.5% to an interactive dashboard depicting such changes. Board members, however, appear responsive to different messaging techniques -- particularly those that dovetail issues of racial and ethnic diversity with other top-of-mind issues (like school capacity planning). On the other hand, media coverage of the research drives more dashboard engagement, especially in more segregated districts. A small but rich set of survey responses from school board and community members across several districts identify data and operational bottlenecks to implementing boundary changes to foster more diverse schools, but also share affirmative comments on the potential viability of such changes. Together, our findings may support educational data scientists in more effectively disseminating research that aims to bridge educational inequalities through systems-level change. Comment: In Proceedings of the Tenth ACM Conference on Learning at Scale (L@S '23)

    Agglomerative clustering of a search engine query log


    Lexical Discovery with an Enriched Semantic Network

    The study of lexical semantics has produced a systematic analysis of relationships between content words that has greatly benefited both lexical search tools and natural language processing systems. We describe research toward a common algorithmic core for these two applications. We first introduce a database system called FreeNet that facilitates the description and exploration of finite binary relations. We then describe the design and implementation of Lexical FreeNet, a semantic network that mixes WordNet-derived semantic relations with data-derived and phonetically-derived relations. We discuss how Lexical FreeNet has aided in lexical discovery, the pursuit of linguistic and factual knowledge by the computer-aided exploration of lexical relations, and relate this to ongoing research in how the system can be employed in natural language processing algorithms for segmentation and summarization. Submission Type: Paper for Coling-ACL '98 Workshop, Usage of WordNet in Natural Langua..