83 research outputs found
Term burstiness: evidence, model and applications
This thesis examines the phenomenon of term burstiness in text. Term burstiness is defined as the multiple re-occurrences in short succession of a particular term after it has occurred once in a certain text. Term burstiness is important as it aids in providing structure and meaning to a document. Various kinds of term burstiness in text are studied and their effect on a dataset explored in a series of homogeneity experiments. A novel model of term burstiness is proposed and evaluations based on the proposed model are performed on three different applications. The “bag-of-words” assumption is often used in statistical Natural Language Processing and Information Retrieval applications. Under this assumption all structure and positional information of terms is lost and only frequency counts of the document are retained. As a result of counting frequencies only, the “bag-of-words” representation of text assumes that the probability of a word occurring remains constant throughout the text. This assumption is often used because of its simplicity and the ease it provides for the application of mathematical and statistical techniques on text. Although this assumption is known to be untrue [CG95b, CG95a, Chu00], applications [SB97, Lew98, MN98, Seb02] based on it appear not to be much hampered. A series of homogeneity-based experiments is carried out to study the presence and extent of term burstiness against the term-independence-based homogeneity assumption on the dataset. A null hypothesis stating the homogeneity of a dataset is formulated and defeated in a series of experiments based on the χ² test, which tests the equality between two partitions of a certain dataset. Various schemes of partitioning a dataset are adopted to illustrate the effect of term burstiness and structure in text.
This provided evidence of term burstiness in the dataset, and fine-grained information about the distribution of terms that might be used for characterizing or profiling a dataset. A model for term burstiness in a dataset is proposed based on the gaps between successive occurrences of a particular term. This model is not merely based on frequency counts like other existing models, but takes into account the structural and positional information about the term’s occurrence in the document. The proposed term burstiness model looks at gaps between successive occurrences of the term. These gaps are modeled using a mixture of exponential distributions. The first exponential distribution provides the overall rate of occurrence of a term in a dataset and the second exponential distribution determines the term’s rate of re-occurrence in a burst or when it has already occurred once previously. Since most terms occur in only a few documents, there are a large number of documents with no occurrences of a particular term. In the proposed model, non-occurrence of a term in a document is accounted for by the method of data censoring. It is not straightforward to obtain parameter estimates for such a complex model. So, Bayesian statistics is used for flexibility and ease of fitting this model, and for obtaining parameter estimates. The model can be used for all kinds of terms, be they rare content words, medium frequency terms or frequent function words. The term re-occurrence model is instantiated and verified against the background of different collections, in the context of three different applications. The applications include studying various terms within a dataset to identify behavioral differences between the terms, studying similar terms across different datasets to detect stylistic features based on the term’s distribution and studying the characteristics of very frequent terms across different datasets. 
The model aids in the identification of term characteristics in a dataset. It helps distinguish between highly bursty content terms and less bursty function words. The model can differentiate between a frequent function word and a scattered one. It can be used to identify stylistic features in a term’s distribution across text of varying genres. The model also aids in understanding the behaviour of very frequent (usually function) words in a dataset.
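The gap model described in this abstract can be illustrated with a minimal sketch. It assumes a simplified form: every gap is drawn from a two-component exponential mixture, and a censored observation contributes only the probability that the gap is at least as long as the observed length. The function name, the parameter names (`p`, `rate_1`, `rate_2`) and this exact likelihood are assumptions made for illustration, not the thesis's formulation, and the Bayesian fitting step is not shown.

```python
import math

def mixture_log_likelihood(gaps, censored, p, rate_1, rate_2):
    """Log-likelihood of gap data under a two-component exponential mixture.

    gaps      -- fully observed gap lengths between successive occurrences
    censored  -- lengths for which we only know the gap was at least this long
    p         -- mixing weight of the first (overall-rate) component
    rate_1    -- rate of the first exponential (overall occurrence)
    rate_2    -- rate of the second exponential (re-occurrence within a burst)
    """
    def density(x):
        # Mixture density for a fully observed gap of length x.
        return (p * rate_1 * math.exp(-rate_1 * x)
                + (1 - p) * rate_2 * math.exp(-rate_2 * x))

    def survival(x):
        # Probability that a gap exceeds x; used for censored observations.
        return p * math.exp(-rate_1 * x) + (1 - p) * math.exp(-rate_2 * x)

    observed_part = sum(math.log(density(x)) for x in gaps)
    censored_part = sum(math.log(survival(c)) for c in censored)
    return observed_part + censored_part
```

With `p = 1` the mixture collapses to a single exponential, which gives a quick sanity check on the function.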
A Bayesian mixture model for term re-occurrence and burstiness
This paper proposes a model for term re-occurrence in a text collection based on the gaps between successive occurrences of a term. These gaps are modeled using a mixture of exponential distributions. Parameter estimation is based on a Bayesian framework that allows us to fit a flexible model. The model provides measures of a term’s re-occurrence rate and within-document burstiness. The model works for all kinds of terms, be they rare content words, medium frequency terms or frequent function words. A measure is proposed to account for the term’s importance based on its distribution pattern in the corpus.
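As a small illustration of the quantity this model is built on, the sketch below extracts the gaps between successive occurrences of a term from a tokenized document. Measuring a gap as the difference between token positions is an assumption made here; the paper may count gaps differently, and `term_gaps` is a hypothetical helper name.

```python
def term_gaps(tokens, term):
    """Return the gaps between successive occurrences of `term`.

    A gap is measured as the difference between the token positions of two
    consecutive occurrences (an assumption of this sketch).
    """
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]

tokens = "the cat saw the other cat and the dog".split()
gaps_for_the = term_gaps(tokens, "the")   # occurrences at positions 0, 3, 7
gaps_for_dog = term_gaps(tokens, "dog")   # a single occurrence yields no gaps
```

A term that occurs once or not at all produces an empty list, which is the situation the abstract's censoring treatment is meant to handle.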
Self-learning Emulators and Eigenvector Continuation
Emulators that can bypass computationally expensive scientific calculations
with high accuracy and speed can enable new studies of fundamental science as
well as more potential applications. In this work we focus on solving a system
of constraint equations efficiently using a new machine learning approach that
we call self-learning emulation. A self-learning emulator is an active learning
protocol that can rapidly solve a system of equations over some range of
control parameters. The key ingredient is a fast estimate of the emulator error
that becomes progressively more accurate as the emulator improves. This
acceleration is possible because the emulator itself is used to estimate the
error, and we illustrate with two examples. The first uses cubic spline
interpolation to find the roots of a polynomial with variable coefficients. The
second example uses eigenvector continuation to find the eigenvectors and
eigenvalues of a large Hamiltonian matrix that depends on several control
parameters. We envision future applications of self-learning emulators for
solving systems of algebraic equations, linear and nonlinear differential
equations, and linear and nonlinear eigenvalue problems.
Comment: 5 + 2 pages (main + supplemental), 5 + 0 figures (main + supplemental)
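The eigenvector continuation step mentioned in this abstract can be sketched as follows. The sketch assumes a Hamiltonian of the form H(c) = H0 + c·H1 with a single control parameter, random symmetric test matrices, and a QR orthonormalization of the training vectors in place of the usual generalized eigenvalue problem (the two are equivalent when the training vectors are linearly independent). It does not include the self-learning loop or its error estimate, and all names are illustrative.

```python
import numpy as np

def ground_state(h):
    """Lowest eigenvalue and its eigenvector for a real symmetric matrix."""
    values, vectors = np.linalg.eigh(h)
    return values[0], vectors[:, 0]

def ec_ground_energy(h0, h1, training_cs, target_c):
    """Estimate the ground-state energy of h0 + target_c * h1 by projecting
    onto the span of ground states computed at the training couplings."""
    snapshots = np.column_stack(
        [ground_state(h0 + c * h1)[1] for c in training_cs]
    )
    basis, _ = np.linalg.qr(snapshots)            # orthonormalize the snapshots
    projected = basis.T @ (h0 + target_c * h1) @ basis
    return np.linalg.eigvalsh(projected)[0]       # small eigenproblem only

# Illustrative test problem: random symmetric 50 x 50 matrices.
rng = np.random.default_rng(0)
a = rng.standard_normal((50, 50))
b = rng.standard_normal((50, 50))
h0, h1 = (a + a.T) / 2, (b + b.T) / 2

estimate = ec_ground_energy(h0, h1, training_cs=[0.0, 0.1, 0.2], target_c=0.3)
exact = ground_state(h0 + 0.3 * h1)[0]
```

Because the estimate comes from a subspace projection, it is an upper bound on the exact ground-state energy, and it reproduces the exact value at any coupling that was used for training.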
tagE: Enabling an Embodied Agent to Understand Human Instructions
Natural language serves as the primary mode of communication when an
intelligent agent with a physical presence engages with human beings. While a
plethora of research focuses on natural language understanding (NLU),
encompassing endeavors such as sentiment analysis, intent prediction, question
answering, and summarization, the scope of NLU directed at situations
necessitating tangible actions by an embodied agent remains limited. The
ambiguity and incompleteness inherent in natural language present
challenges for intelligent agents striving to decipher human intention. To
tackle this predicament head-on, we introduce a novel system known as task and
argument grounding for Embodied agents (tagE). At its core, our system employs
an inventive neural network model designed to extract a series of tasks from
complex task instructions expressed in natural language. Our proposed model
adopts an encoder-decoder framework enriched with nested decoding to
effectively extract tasks and their corresponding arguments from these
intricate instructions. These extracted tasks are then mapped (or grounded) to
the robot's established collection of skills, while the arguments find
grounding in objects present within the environment. To facilitate the training
and evaluation of our system, we have curated a dataset featuring complex
instructions. In our experiments, the approach outperforms robust baseline
models.
Comment: Accepted in EMNLP Findings 202
Time fractals and discrete scale invariance with trapped ions
We show that a one-dimensional chain of trapped ions can be engineered to
produce a quantum mechanical system with discrete scale invariance and
fractal-like time dependence. By discrete scale invariance we mean a system
that replicates itself under a rescaling of distance for some scale factor, and
a time fractal is a signal that is invariant under the rescaling of time. These
features are reminiscent of the Efimov effect, which has been predicted and
observed in bound states of three-body systems. We demonstrate that discrete
scale invariance in the trapped ion system can be controlled with two
independently tunable parameters. We also discuss the extension to n-body
states where the discrete scaling symmetry has an exotic heterogeneous
structure. The results we present can be realized using currently available
technologies developed for trapped ion quantum systems.
Comment: 4 + 5 pages (main + supplemental materials), 2 + 3 figures (main +
supplemental materials), version to appear in Physical Review A Rapid
Communication
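To make the notion of a signal that is invariant under a rescaling of time concrete, here is a toy example: a log-periodic function that repeats whenever t is multiplied by a fixed scale factor. The function and the scale factor of 2 are assumptions chosen for illustration only; this is not the signal produced by the trapped ion system.

```python
import math

def log_periodic_signal(t, scale=2.0):
    """Toy signal that is unchanged when t is multiplied by `scale`.

    cos(2*pi*ln(t)/ln(scale)) is periodic in ln(t) with period ln(scale),
    so rescaling t by any integer power of `scale` leaves it unchanged.
    """
    return math.cos(2 * math.pi * math.log(t) / math.log(scale))

value = log_periodic_signal(1.0)            # cos(0)
rescaled = log_periodic_signal(2.0)         # same signal one scale step later
off_scale = log_periodic_signal(1.5)        # a generic rescaling changes it
```

The invariance holds only for the discrete set of factors 2, 4, 8 and so on, which is what distinguishes discrete scale invariance from continuous scale invariance.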
Breakage Modeling of Needle-Shaped Particles Using The Discrete Element Method
This paper models the breakage of large aspect ratio particles in an attrition cell using the discrete element method (DEM) and population balance (PB) models. The particles are modeled in DEM as sphero-cylinders. The stresses within each particle are calculated along the particle length using beam theory and the particle breaks into two parts if the stress exceeds a critical value. Thus, the size distribution changes with time within the DEM model. The DEM model is validated against previously published experimental data. The simulations demonstrate that particle breakage occurs primarily in front of the attrition cell blades, with the breakage rate decreasing as the particle sizes decrease. Increasing the particle elastic modulus, decreasing the particle yield strength, and increasing the attrition cell lid stress also increase the rate of breakage. Particles break most frequently at their center and the daughter size distribution normalized by the initial particle size is fit well with a Gaussian distribution. Parametric studies in which the initial particle size distribution varies demonstrate that the particle sizes approach a distribution that is independent of the initial state after a sufficient amount of work is done on the particle bed. A correlation for the specific breakage rate is developed from the DEM simulations and used within a PB model along with the daughter size distribution fit. The PB model also clearly shows that the particle size distribution becomes independent of the initial size distribution and, after a sufficiently long time, is fit well with a log-normal distribution.
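The breakage criterion described above (a beam-theory stress compared against a critical value) can be sketched for the simplest case. The sketch assumes a solid circular cross-section under pure bending, for which the peak stress is σ = 32M / (πd³), and takes the bending moment as a given input. How the paper obtains the loads along the particle length is not stated in the abstract, so `max_bending_stress` and `breaks` are hypothetical helpers rather than the authors' implementation.

```python
import math

def max_bending_stress(moment, diameter):
    """Peak bending stress in a solid circular beam: 32 * M / (pi * d**3)."""
    return 32.0 * moment / (math.pi * diameter ** 3)

def breaks(moment, diameter, critical_stress):
    """True if the peak bending stress exceeds the critical value."""
    return max_bending_stress(moment, diameter) > critical_stress

# Consistent units are assumed (e.g. N*m, m and Pa).
stress = max_bending_stress(moment=math.pi, diameter=2.0)
```

Because the stress scales as 1/d³ for a fixed moment, thinner particles reach the critical stress at much smaller loads.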
Projected Cooling Algorithm for Quantum Computation
In the current era of noisy quantum devices, there is a need for quantum
algorithms that are efficient and robust against noise. Towards this end, we
introduce the projected cooling algorithm for quantum computation. The
projected cooling algorithm is able to construct the localized ground state of
any Hamiltonian with a translationally-invariant kinetic energy and
interactions that vanish at large distances. The term "localized" refers to
localization in position space. The method can be viewed as the quantum analog
of evaporative cooling. We start with an initial state with support over a
compact region of a large volume. We then drive the excited quantum states to
disperse and measure the remaining portion of the wave function left behind.
For the nontrivial examples we consider here, the improvement over other
methods is substantial. The only additional resource required is performing the
operations in a volume significantly larger than the size of the localized
state. These characteristics make the projected cooling algorithm a promising
tool for calculations of self-bound systems such as atomic nuclei.
Comment: 12 pages and 3 figures in the main text, 7 pages in the supplemental
materials, final version to appear in Physics Letters
THE SPLICEOSOMAL PROTEIN SnRNP F BINDS TO BOTH U3 AND U14 CLASS OF snoRNA IN Giardia lamblia
Small nuclear ribonucleoprotein F (snRNP F) is a spliceosomal protein that binds U1, U2, U4/U6 and U5 small nuclear RNAs (snRNAs) to form spliceosomal complexes responsible for pre-mRNA processing. This study reports the unusual interaction of giardial snRNP F with small nucleolar RNAs (snoRNAs), which are responsible for pre-rRNA processing. An electrophoretic mobility shift assay was used to demonstrate the interaction of this protein with U3 and U14 class snoRNAs of the early-branching eukaryote Giardia lamblia. Our study also shows that snRNP F in Giardia is evolutionarily distinct from its other eukaryotic orthologues.