SeLeCT: a lexical cohesion based news story segmentation system
In this paper we compare the performance of three distinct approaches to lexical cohesion based text segmentation. Most work in this area has focused on the discovery of textual units that discuss subtopic structure within documents. In contrast, our segmentation task requires the discovery of topical units of text, i.e., distinct news stories from broadcast news programmes. Our approach to news story segmentation (the SeLeCT system) is based on an analysis of lexical cohesive strength between textual units using a linguistic technique called lexical chaining. We evaluate the relative performance of SeLeCT with respect to two other cohesion based segmenters: TextTiling and C99. Using a recently introduced evaluation metric, WindowDiff, we contrast the segmentation accuracy of each system on both "spoken" (CNN news transcripts) and "written" (Reuters newswire) news story test sets extracted from the TDT1 corpus.
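The abstract above names WindowDiff as its evaluation metric without defining it. As a rough illustration only, and not the SeLeCT authors' evaluation code, the metric as commonly described slides a window of k gaps over the reference and the hypothesised segmentation and counts the windows in which the two disagree on the number of boundaries. The sketch below assumes each segmentation is encoded as a list of 0/1 flags, one per gap between adjacent text units; the function name `window_diff` and this encoding are hypothetical choices made for the example.

```python
from typing import Sequence


def window_diff(reference: Sequence[int], hypothesis: Sequence[int], k: int) -> float:
    """Fraction of length-k windows whose boundary counts differ.

    Assumption: reference[i] == 1 means a boundary sits in gap i, i.e.
    between text unit i and text unit i + 1; hypothesis uses the same
    encoding. 0.0 means the two segmentations agree in every window.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must cover the same number of gaps")
    n_gaps = len(reference)
    if not 0 < k <= n_gaps:
        raise ValueError("k must be between 1 and the number of gaps")

    n_windows = n_gaps - k + 1
    errors = 0
    for start in range(n_windows):
        # Count boundaries falling inside the current window of k gaps.
        ref_count = sum(reference[start:start + k])
        hyp_count = sum(hypothesis[start:start + k])
        if ref_count != hyp_count:
            errors += 1
    return errors / n_windows


# Hypothetical example: nine text units, hence eight gaps.
reference = [0, 0, 1, 0, 0, 0, 1, 0]
near_miss = [0, 0, 0, 1, 0, 0, 1, 0]  # first boundary placed one gap late
score = window_diff(reference, near_miss, k=2)
```

With k = 2, the near-miss hypothesis disagrees with the reference in two of the seven windows, so `score` is 2/7; an identical hypothesis scores 0.0. The window size k is commonly set to about half the average reference segment length, but that choice is an assumption here and is not stated in the abstract.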
Time and position distributions in large volume spherical scintillation detectors
Large spherical scintillation detectors are playing an increasingly important
role in experimental neutrino physics studies. From the instrumental point of
view the primary signal response of these set-ups is constituted by the time
and amplitude of the anode pulses delivered by each individual phototube
following a particle interaction in the scintillator. In this work, under some
approximate assumptions, we derive a number of analytical formulas able to give
a fairly accurate description of the most important timing features of these
detectors, intended to complement the more complete Monte Carlo studies
normally used for a full modelling approach. The paper is completed with a
mathematical description of the event position distributions which can be
inferred, through some inference algorithm, starting from the primary time
measurements of the photomultiplier tubes.
Comment: 29 pages, 20 figures, accepted for publication in Nucl. Instr. and Meth.
Automated Big Text Security Classification
In recent years, traditional cybersecurity safeguards have proven ineffective
against insider threats. Famous cases of sensitive information leaks caused by
insiders, including the WikiLeaks release of diplomatic cables and the Edward
Snowden incident, have greatly harmed the U.S. government's relationship with
other governments and with its own citizens. Data Leak Prevention (DLP) is a
solution for detecting and preventing information leaks from within an
organization's network. However, state-of-the-art DLP detection models are only
able to detect very limited types of sensitive information, and research in the
field has been hindered due to the lack of available sensitive texts. Many
researchers have focused on document-based detection with artificially labeled
"confidential documents" for which security labels are assigned to the entire
document, when in reality only a portion of the document is sensitive. This
type of whole-document based security labeling increases the chances of
preventing authorized users from accessing non-sensitive information within
sensitive documents. In this paper, we introduce Automated Classification
Enabled by Security Similarity (ACESS), a new and innovative detection model
that penetrates the complexity of big text security classification/detection.
To analyze the ACESS system, we constructed a novel dataset, containing
formerly classified paragraphs from diplomatic cables made public by the
WikiLeaks organization. To our knowledge this paper is the first to analyze a
dataset that contains actual formerly sensitive information annotated at
paragraph granularity.
Comment: Pre-print of Best Paper Award IEEE Intelligence and Security Informatics (ISI) 2016 Manuscript
Integrating Prosodic and Lexical Cues for Automatic Topic Segmentation
We present a probabilistic model that uses both prosodic and lexical cues for
the automatic segmentation of speech into topically coherent units. We propose
two methods for combining lexical and prosodic information using hidden Markov
models and decision trees. Lexical information is obtained from a speech
recognizer, and prosodic features are extracted automatically from speech
waveforms. We evaluate our approach on the Broadcast News corpus, using the
DARPA-TDT evaluation metrics. Results show that the prosodic model alone is
competitive with word-based segmentation methods. Furthermore, we achieve a
significant reduction in error by combining the prosodic and word-based
knowledge sources.
Comment: 27 pages, 8 figures
Automatic Segmentation of Multiparty Dialogue
In this paper, we investigate the problem of automatically predicting segment boundaries in spoken multiparty dialogue. We extend prior work in two ways. We first apply approaches that have been proposed for predicting top-level topic shifts to the problem of identifying subtopic boundaries. We then explore the impact on performance of using ASR output as opposed to human transcription. Examination of the effect of features shows that predicting top-level boundaries and predicting subtopic boundaries are two distinct tasks: (1) for predicting subtopic boundaries, the lexical cohesion-based approach alone can achieve competitive results, (2) for predicting top-level boundaries, the machine learning approach that combines lexical-cohesion and conversational features performs best, and (3) conversational cues, such as cue phrases and overlapping speech, are better indicators for the top-level prediction task. We also find that the transcription errors inevitable in ASR output have a negative impact on models that combine lexical-cohesion and conversational features, but do not change the general preference of approach for the two tasks.
Phase Stability and Segregation in Alloy 22 Base Metal and Weldments
The current design of the waste disposal containers relies heavily on encasement in a multi-layered container, featuring a corrosion barrier of Alloy 22, a Ni-Cr-Mo-W based alloy with excellent corrosion resistance over a wide range of conditions. The fundamental concern from the perspective of the Yucca Mountain Project, however, is the inherent uncertainty in the (very) long-term stability of the base metal and welds. Should the properties of the selected materials change over the long service life of the waste packages, it is conceivable that the desired performance characteristics (such as corrosion resistance) will become compromised, leading to premature failure of the system. To address this, we will study the phase stability and solute segregation characteristics of Alloy 22 base metal and welds. A better understanding of the underlying microstructural evolution tendencies, and their connections with corrosion behavior, will (in turn) produce a higher confidence in the extrapolated behavior of the container materials over time periods that are not feasibly tested in a laboratory. Additionally, the knowledge gained here may potentially lead to cost savings through development of safe and realistic design constraints and model assumptions throughout the entire disposal system.
GumDrop at the DISRPT2019 Shared Task: A Model Stacking Approach to Discourse Unit Segmentation and Connective Detection
In this paper we present GumDrop, Georgetown University's entry at the DISRPT
2019 Shared Task on automatic discourse unit segmentation and connective
detection. Our approach relies on model stacking, creating a heterogeneous
ensemble of classifiers, which feed into a metalearner for each final task. The
system encompasses three trainable component stacks: one for sentence
splitting, one for discourse unit segmentation and one for connective
detection. The flexibility of each ensemble allows the system to generalize
well to datasets of different sizes and with varying levels of homogeneity.
Comment: Proceedings of Discourse Relation Parsing and Treebanking (DISRPT2019)
A Study of Speed of the Boundary Element Method as applied to the Realtime Computational Simulation of Biological Organs
In this work, the possibility of simulating biological organs in realtime using
the Boundary Element Method (BEM) is investigated. Biological organs are
assumed to follow linear elastostatic material behavior, and constant boundary
elements are the element type used. First, a Graphics Processing Unit (GPU) is
used to speed up the BEM computations to achieve the realtime performance.
Next, instead of the GPU, a computer cluster is used. Results indicate that BEM
is fast enough to provide for realtime graphics if biological organs are
assumed to follow linear elastostatic material behavior. Although the present
work does not conduct any simulation using nonlinear material models, results
from using the linear elastostatic material model imply that it would be
difficult to obtain realtime performance if highly nonlinear material models
that properly characterize biological organs are used. Although the use of BEM
for the simulation of biological organs is not new, the results presented in
the present study are not found elsewhere in the literature.
Comment: preprint, draft, 2 tables, 47 references, 7 files, Codes that can
solve three dimensional linear elastostatic problems using constant boundary
elements (of triangular shape) while ignoring body forces are provided as
supplementary files; codes are distributed under the MIT License in three
versions: i) MATLAB version ii) Fortran 90 version (sequential code) iii)
Fortran 90 version (parallel code)