24 research outputs found
Text Segmentation Using Exponential Models
This paper introduces a new statistical approach to partitioning text
automatically into coherent segments. Our approach enlists both short-range and
long-range language models to help it sniff out likely sites of topic changes
in text. To aid its search, the system consults a set of simple lexical hints
it has learned to associate with the presence of boundaries through inspection
of a large corpus of annotated data. We also propose a new probabilistically
motivated error metric for use by the natural language processing and
information retrieval communities, intended to supersede precision and recall
for appraising segmentation algorithms. Qualitative assessment of our algorithm
as well as evaluation using this new metric demonstrate the effectiveness of
our approach in two very different domains, Wall Street Journal articles and
the TDT Corpus, a collection of newswire articles and broadcast news
transcripts. Comment: 12 pages, LaTeX source and postscript figures for EMNLP-2 paper
Suppression of Supergravity Anomalies in Conformal Sequestering
We show that the anomaly-mediated supersymmetry breaking via the Kähler and
sigma-model anomalies is suppressed by conformal dynamics in the supersymmetry
breaking sector. Comment: 8 pages
FeedbackMap: a tool for making sense of open-ended survey responses
Analyzing open-ended survey responses is a crucial yet challenging task for
social scientists, non-profit organizations, and educational institutions, as
they often face the trade-off between obtaining rich data and the burden of
reading and coding textual responses. This demo introduces FeedbackMap, a
web-based tool that uses natural language processing techniques to facilitate
the analysis of open-ended survey responses. FeedbackMap lets researchers
generate summaries at multiple levels, identify interesting response examples,
and visualize the response space through embeddings. We discuss the importance
of examining survey results from multiple perspectives and the potential biases
introduced by summarization methods, emphasizing the need for critical
evaluation of the representation and omission of respondent voices. Comment: Demo at CSCW 202
RadioTalk: a large-scale corpus of talk radio transcripts
We introduce RadioTalk, a corpus of speech recognition transcripts sampled
from talk radio broadcasts in the United States between October of 2018 and
March of 2019. The corpus is intended for use by researchers in the fields of
natural language processing, conversational analysis, and the social sciences.
The corpus encompasses approximately 2.8 billion words of automatically
transcribed speech from 284,000 hours of radio, together with metadata about
the speech, such as geographical location, speaker turn boundaries, gender, and
radio program information. In this paper we summarize why and how we prepared
the corpus, give some descriptive statistics on stations, shows and speakers,
and carry out a few high-level analyses. Comment: 5 pages, 4 figures, accepted by Interspeech 201
Topic Detection and Tracking with Time-Aware Document Embeddings
The time at which a message is communicated is a vital piece of metadata in
many real-world natural language processing tasks such as Topic Detection and
Tracking (TDT). TDT systems aim to cluster a corpus of news articles by event,
and in that context, stories that describe the same event are likely to have
been written at around the same time. Prior work on time modeling for TDT takes
this into account, but does not adequately capture how time interacts with the
semantic nature of the event. For example, stories about a tropical storm are
likely to be written within a short time interval, while stories about a movie
release may appear over weeks or months. In our work, we design a neural method
that fuses temporal and textual information into a single representation of
news documents for event detection. We fine-tune these time-aware document
embeddings with a triplet loss architecture, integrate the model into
downstream TDT systems, and evaluate the systems on two benchmark TDT data sets
in English. In the retrospective setting, we apply clustering algorithms to the
time-aware embeddings and show substantial improvements over baselines on the
News2013 data set. In the online streaming setting, we add our document encoder
to an existing state-of-the-art TDT pipeline and demonstrate that it can
benefit the overall performance. We conduct ablation studies on the time
representation and fusion algorithm strategies, showing that our proposed model
outperforms alternative strategies. Finally, we probe the model to examine how
it handles recurring events more effectively than previous TDT systems.
All a-board: sharing educational data science research with school districts
Educational data scientists often conduct research with the hopes of
translating findings into lasting change through policy, civil society, or
other channels. However, the bridge from research to practice can be fraught
with sociopolitical frictions that impede, or altogether block, such
translations -- especially when they are contentious or otherwise difficult to
achieve. Focusing on one entrenched educational equity issue in US public
schools -- racial and ethnic segregation -- we conduct randomized email
outreach experiments and surveys to explore how local school districts respond
to algorithmically-generated school catchment areas ("attendance boundaries")
designed to foster more diverse and integrated schools. Cold email outreach to
approximately 4,320 elected school board members across over 800 school
districts informing them of potential boundary changes reveals a large average
open rate of nearly 40%, but a relatively small click-through rate of 2.5% to
an interactive dashboard depicting such changes. Board members, however, appear
responsive to different messaging techniques -- particularly those that
dovetail issues of racial and ethnic diversity with other top-of-mind issues
(like school capacity planning). On the other hand, media coverage of the
research drives more dashboard engagement, especially in more segregated
districts. A small but rich set of survey responses from school board and
community members across several districts identifies data and operational
bottlenecks to implementing boundary changes to foster more diverse schools,
but also shares affirmative comments on the potential viability of such changes.
Together, our findings may support educational data scientists in more
effectively disseminating research that aims to bridge educational inequalities
through systems-level change. Comment: In Proceedings of the Tenth ACM Conference on Learning at Scale (L@S '23)
Agglomerative clustering of a search engine query log
Lexical Discovery with an Enriched Semantic Network
The study of lexical semantics has produced a systematic analysis of relationships between content words that has greatly benefited both lexical search tools and natural language processing systems. We describe research toward a common algorithmic core for these two applications. We first introduce a database system called FreeNet that facilitates the description and exploration of finite binary relations. We then describe the design and implementation of Lexical FreeNet, a semantic network that mixes WordNet-derived semantic relations with data-derived and phonetically-derived relations. We discuss how Lexical FreeNet has aided in lexical discovery, the pursuit of linguistic and factual knowledge by the computer-aided exploration of lexical relations, and relate this to ongoing research in how the system can be employed in natural language processing algorithms for segmentation and summarization. Submission Type: Paper for Coling-ACL '98 Workshop, Usage of WordNet in Natural Langua..