83 research outputs found

Term burstiness: evidence, model and applications

Author: Sarkar Avik
Publication date: 16/01/2008

The present thesis looks at the phenomenon of term burstiness in text. Term burstiness is defined as the multiple re-occurrences in short succession of a particular term after it has occurred once in a certain text. Term burstiness is important as it aids in providing structure and meaning to a document. Various kinds of term burstiness in text are studied and their effect on a dataset explored in a series of homogeneity experiments. A novel model of term burstiness is proposed and evaluations based on the proposed model are performed on three different applications. The “bag-of-words” assumption is often used in statistical Natural Language Processing and Information Retrieval applications. Under this assumption all structure and positional information of terms is lost and only frequency counts of the document are retained. As a result of counting frequencies only, the “bag-of-words” representation of text assumes that the probability of a word occurring remains constant throughout the text. This assumption is often used because of its simplicity and the ease it provides for the application of mathematical and statistical techniques on text. Although this assumption is known to be untrue [CG95b, CG95a, Chu00], applications [SB97, Lew98, MN98, Seb02] based on it appear not to be much hampered. A series of homogeneity-based experiments is carried out to study the presence and extent of term burstiness against the term-independence-based homogeneity assumption on the dataset. A null hypothesis stating the homogeneity of a dataset is formulated and rejected in a series of experiments based on the χ² test, which tests the equality between two partitions of a certain dataset. Various schemes of partitioning a dataset are adopted to illustrate the effect of term burstiness and structure in text.
This provided evidence of term burstiness in the dataset, and fine-grained information about the distribution of terms that might be used for characterizing or profiling a dataset. A model for term burstiness in a dataset is proposed based on the gaps between successive occurrences of a particular term. This model is not merely based on frequency counts like other existing models, but takes into account the structural and positional information about the term’s occurrence in the document. The proposed term burstiness model looks at gaps between successive occurrences of the term. These gaps are modeled using a mixture of exponential distributions. The first exponential distribution provides the overall rate of occurrence of a term in a dataset and the second exponential distribution determines the term’s rate of re-occurrence in a burst or when it has already occurred once previously. Since most terms occur in only a few documents, there are a large number of documents with no occurrences of a particular term. In the proposed model, non-occurrence of a term in a document is accounted for by the method of data censoring. It is not straightforward to obtain parameter estimates for such a complex model. So, Bayesian statistics is used for flexibility and ease of fitting this model, and for obtaining parameter estimates. The model can be used for all kinds of terms, be they rare content words, medium frequency terms or frequent function words. The term re-occurrence model is instantiated and verified against the background of different collections, in the context of three different applications. The applications include studying various terms within a dataset to identify behavioral differences between the terms, studying similar terms across different datasets to detect stylistic features based on the term’s distribution and studying the characteristics of very frequent terms across different datasets. 
The model aids in the identification of term characteristics in a dataset. It helps distinguish between highly bursty content terms and less bursty function words. The model can differentiate between a frequent function word and a scattered one. It can be used to identify stylistic features in a term’s distribution across text of varying genres. The model also aids in understanding the behaviour of very frequent (usually function) words in a dataset.
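The gap model summarized in this abstract can be sketched as a likelihood function. This is a minimal illustration only: it assumes a two-component exponential mixture with a mixing weight `p`, a rate `lam1` for the overall occurrence of the term and a rate `lam2` for re-occurrence within a burst, and it handles censored gaps through the survival function. The function names and the exact parameterization are assumptions, and the Bayesian fitting procedure used in the thesis is not reproduced.

```python
import math


def mixture_pdf(d, p, lam1, lam2):
    """Density of an observed gap of length d under a two-exponential mixture."""
    return p * lam1 * math.exp(-lam1 * d) + (1 - p) * lam2 * math.exp(-lam2 * d)


def mixture_survival(d, p, lam1, lam2):
    """Probability that a gap is longer than d; used for censored gaps."""
    return p * math.exp(-lam1 * d) + (1 - p) * math.exp(-lam2 * d)


def log_likelihood(gaps, censored, p, lam1, lam2):
    """Log-likelihood of fully observed gaps plus censored (cut-off) gaps."""
    observed = sum(math.log(mixture_pdf(d, p, lam1, lam2)) for d in gaps)
    cut_off = sum(math.log(mixture_survival(d, p, lam1, lam2)) for d in censored)
    return observed + cut_off
```

A censored gap contributes only the probability that the next occurrence lies beyond the point where observation stopped, which is one common way to account for documents that end before the term re-occurs.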

Open Research Online (The Open University)

A Bayesian mixture model for term re-occurrence and burstiness

Author: De Roeck Anne
Garthwaite Paul
Sarkar Avik
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2005

This paper proposes a model for term re-occurrence in a text collection based on the gaps between successive occurrences of a term. These gaps are modeled using a mixture of exponential distributions. Parameter estimation is based on a Bayesian framework that allows us to fit a flexible model. The model provides measures of a term’s re-occurrence rate and within-document burstiness. The model works for all kinds of terms, be they rare content words, medium-frequency terms or frequent function words. A measure is proposed to account for the term’s importance based on its distribution pattern in the corpus.
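As a rough illustration of the quantity this model is built on, the gaps between successive occurrences of a term can be extracted from a tokenized document as follows. This is a sketch under assumptions: the function name `term_gaps`, whitespace tokenization, and the choice to report the stretch after the last occurrence as a censored tail are illustrative and not taken from the paper.

```python
def term_gaps(tokens, term):
    """Return (gaps, censored_tail) for one term in one tokenized document.

    gaps: distances in tokens between successive occurrences of the term.
    censored_tail: tokens remaining after the last occurrence, or the whole
    document length if the term never occurs.
    """
    positions = [i for i, tok in enumerate(tokens) if tok == term]
    if not positions:
        return [], len(tokens)
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    censored_tail = len(tokens) - 1 - positions[-1]
    return gaps, censored_tail


doc = "the cat sat on the mat near the door".split()
gaps, tail = term_gaps(doc, "the")
```

Short gaps clustered together would indicate a burst, while a long censored tail records only that the term did not re-occur before the document ended.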

Open Research Online (The Open University)

Self-learning Emulators and Eigenvector Continuation

Author: Lee Dean
Sarkar Avik
Publication date: 28/07/2021

Emulators that can bypass computationally expensive scientific calculations with high accuracy and speed can enable new studies of fundamental science as well as more potential applications. In this work we focus on solving a system of constraint equations efficiently using a new machine learning approach that we call self-learning emulation. A self-learning emulator is an active learning protocol that can rapidly solve a system of equations over some range of control parameters. The key ingredient is a fast estimate of the emulator error that becomes progressively more accurate as the emulator improves. This acceleration is possible because the emulator itself is used to estimate the error, and we illustrate with two examples. The first uses cubic spline interpolation to find the roots of a polynomial with variable coefficients. The second example uses eigenvector continuation to find the eigenvectors and eigenvalues of a large Hamiltonian matrix that depends on several control parameters. We envision future applications of self-learning emulators for solving systems of algebraic equations, linear and nonlinear differential equations, and linear and nonlinear eigenvalue problems. Comment: 5 + 2 pages (main + supplemental), 5 + 0 figures (main + supplemental)
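The active-learning loop described in this abstract can be sketched on a toy root-finding problem. This is an illustration under stated assumptions, not the authors' implementation: piecewise-linear interpolation is swapped in for cubic splines to keep the code dependency-free, the test equation x³ + x − c = 0 is hypothetical, and the size of a single Newton step is used as the fast error estimate, which may differ from the estimator in the paper.

```python
from bisect import bisect_right


def p(x, c):
    """Toy constraint equation p(x; c) = x^3 + x - c with control parameter c."""
    return x**3 + x - c


def dp(x):
    """Derivative of p with respect to x (always positive)."""
    return 3 * x * x + 1


def exact_root(c, tol=1e-12):
    """The 'expensive' solver: Newton iteration run to convergence."""
    x = 0.0
    for _ in range(100):
        step = p(x, c) / dp(x)
        x -= step
        if abs(step) < tol:
            break
    return x


def emulate(c, train):
    """Emulator: linear interpolation between sorted training points (c, root)."""
    cs = [ci for ci, _ in train]
    i = min(max(bisect_right(cs, c), 1), len(train) - 1)
    (c0, x0), (c1, x1) = train[i - 1], train[i]
    return x0 + (x1 - x0) * (c - c0) / (c1 - c0)


def error_estimate(c, train):
    """Cheap error estimate built from the emulator output: one Newton step size."""
    x_hat = emulate(c, train)
    return abs(p(x_hat, c) / dp(x_hat))


def true_max_error(grid, train):
    """Worst-case emulator error on the grid (uses the expensive solver)."""
    return max(abs(emulate(c, train) - exact_root(c)) for c in grid)


def self_learn(c_min, c_max, n_candidates=201, n_rounds=8):
    """Greedily add the training point where the estimated error is largest."""
    train = [(c_min, exact_root(c_min)), (c_max, exact_root(c_max))]
    grid = [c_min + (c_max - c_min) * k / (n_candidates - 1)
            for k in range(n_candidates)]
    for _ in range(n_rounds):
        worst = max(grid, key=lambda c: error_estimate(c, train))
        train.append((worst, exact_root(worst)))
        train.sort()
    return train, grid


train, grid = self_learn(0.0, 10.0)
```

The expensive solver is called only once per round, at the point the emulator itself flags as worst; every other candidate is screened with the cheap estimate.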

arXiv.org e-Print Archive

tagE: Enabling an Embodied Agent to Understand Human Instructions

Author: Mitra Avik
Nayak Tapas
Pramanick Pradip
Sarkar Chayan
Publication date: 24/10/2023

Natural language serves as the primary mode of communication when an intelligent agent with a physical presence engages with human beings. While a plethora of research focuses on natural language understanding (NLU), encompassing endeavors such as sentiment analysis, intent prediction, question answering, and summarization, the scope of NLU directed at situations necessitating tangible actions by an embodied agent remains limited. The ambiguity and incompleteness inherent in natural language present challenges for intelligent agents striving to decipher human intention. To tackle this predicament head-on, we introduce a novel system known as task and argument grounding for Embodied agents (tagE). At its core, our system employs an inventive neural network model designed to extract a series of tasks from complex task instructions expressed in natural language. Our proposed model adopts an encoder-decoder framework enriched with nested decoding to effectively extract tasks and their corresponding arguments from these intricate instructions. These extracted tasks are then mapped (or grounded) to the robot's established collection of skills, while the arguments find grounding in objects present within the environment. To facilitate the training and evaluation of our system, we have curated a dataset featuring complex instructions. The results of our experiments underscore the prowess of our approach, as it outperforms robust baseline models. Comment: Accepted in EMNLP Findings 202

arXiv.org e-Print Archive

Time fractals and discrete scale invariance with trapped ions

Author: Frame Dillon
Given Gabriel
He Rongzheng
Lee Dean
Li Ning
Lu Bing-Nan
Sarkar Avik
Watkins Jacob
Publication venue: 'American Physical Society (APS)'
Publication date: 01/01/2019

We show that a one-dimensional chain of trapped ions can be engineered to produce a quantum mechanical system with discrete scale invariance and fractal-like time dependence. By discrete scale invariance we mean a system that replicates itself under a rescaling of distance for some scale factor, and a time fractal is a signal that is invariant under the rescaling of time. These features are reminiscent of the Efimov effect, which has been predicted and observed in bound states of three-body systems. We demonstrate that discrete scale invariance in the trapped ion system can be controlled with two independently tunable parameters. We also discuss the extension to n-body states where the discrete scaling symmetry has an exotic heterogeneous structure. The results we present can be realized using currently available technologies developed for trapped ion quantum systems. Comment: 4 + 5 pages (main + supplemental materials), 2 + 3 figures (main + supplemental materials), version to appear in Physical Review A Rapid Communication

arXiv.org e-Print Archive

Juelich Shared Electronic Resources

Breakage Modeling of Needle-Shaped Particles Using The Discrete Element Method

Author: Curtis Jennifer S
Ketterhagen William
Kumar Rohit
Sarkar Avik
Wassgren Carl
Publication venue: 'Purdue University (bepress)'
Publication date: 18/05/2019

This paper models the breakage of large aspect ratio particles in an attrition cell using the discrete element method (DEM) and population balance (PB) models. The particles are modeled in DEM as sphero-cylinders. The stresses within each particle are calculated along the particle length using beam theory, and the particle breaks into two parts if the stress exceeds a critical value. Thus, the size distribution changes with time within the DEM model. The DEM model is validated against previously published experimental data. The simulations demonstrate that particle breakage occurs primarily in front of the attrition cell blades, with the breakage rate decreasing as the particle sizes decrease. Increasing the particle elastic modulus, decreasing the particle yield strength, and increasing the attrition cell lid stress also increase the rate of breakage. Particles break most frequently at their center, and the daughter size distribution normalized by the initial particle size is fit well with a Gaussian distribution. Parametric studies in which the initial particle size distribution varies demonstrate that the particle sizes approach a distribution that is independent of the initial state after a sufficient amount of work is done on the particle bed. A correlation for the specific breakage rate is developed from the DEM simulations and used within a PB model along with the daughter size distribution fit. The PB model also clearly shows that the particle size distribution becomes independent of the initial size distribution and, after a sufficiently long time, is fit well with a log-normal distribution.
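The breakage criterion described in this abstract can be illustrated with a small sketch. It assumes pure bending of a solid circular cross-section, for which standard beam theory gives a maximum stress of 32M/(πd³); the paper's actual stress measure may combine other load components. The function names and the list of bending moments sampled along the particle length are hypothetical inputs.

```python
import math


def bending_stress(moment, diameter):
    """Maximum bending stress in a solid circular beam: 32 M / (pi d^3)."""
    return 32.0 * abs(moment) / (math.pi * diameter**3)


def find_break_point(moments, diameter, critical_stress):
    """Return the index of the most stressed section if it exceeds the
    critical stress, otherwise None (the particle stays intact)."""
    stresses = [bending_stress(m, diameter) for m in moments]
    i = max(range(len(stresses)), key=stresses.__getitem__)
    return i if stresses[i] > critical_stress else None
```

With a moment profile that peaks mid-length, for example `[0.0, 0.5, 1.0, 0.5, 0.0]`, the break index falls at the center, in line with the abstract's observation that particles break most often at their center.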

Purdue E-Pubs

Projected Cooling Algorithm for Quantum Computation

Author: Bonitati Joey
Given Gabriel
Hicks Caleb
Lee Dean
Li Ning
Lu Bing-Nan
Rai Abudit
Sarkar Avik
Watkins Jacob
Publication venue: 'Elsevier BV'
Publication date: 01/01/2020

In the current era of noisy quantum devices, there is a need for quantum algorithms that are efficient and robust against noise. Towards this end, we introduce the projected cooling algorithm for quantum computation. The projected cooling algorithm is able to construct the localized ground state of any Hamiltonian with a translationally-invariant kinetic energy and interactions that vanish at large distances. The term "localized" refers to localization in position space. The method can be viewed as the quantum analog of evaporative cooling. We start with an initial state with support over a compact region of a large volume. We then drive the excited quantum states to disperse and measure the remaining portion of the wave function left behind. For the nontrivial examples we consider here, the improvement over other methods is substantial. The only additional resource required is performing the operations in a volume significantly larger than the size of the localized state. These characteristics make the projected cooling algorithm a promising tool for calculations of self-bound systems such as atomic nuclei. Comment: 12 pages and 3 figures in the main text, 7 pages in the supplemental materials, final version to appear in Physics Letters

arXiv.org e-Print Archive

Juelich Shared Electronic Resources

THE SPLICEOSOMAL PROTEIN SnRNP F BINDS TO BOTH U3 AND U14 CLASS OF snoRNA IN Giardia lamblia

Author: Das Koushik
Ganguly Sandipan
Ghosh Arjun
Karmakar Sumallya
Mukherjee Avik K.
Nozaki T.
Raj Dibyendu
Sarkar Srimanti
Publication venue: Longdom Publishing
Publication date: 01/01/2013

Small nuclear ribonucleoprotein F (snRNP F) is a spliceosomal protein that binds with U1, U2, U4/U6 and U5 small nuclear RNA (snRNA) to form spliceosomal complexes responsible for pre-mRNA processing. This study reports the unusual interaction of giardial snRNP F with small nucleolar RNAs (snoRNA) that are responsible for pre-rRNA processing. Electrophoretic Mobility Shift Assay was used to demonstrate the interaction of this protein with U3 and U14 class snoRNA of the early-branching eukaryote Giardia lamblia. It was also evident from our study that snRNP F in Giardia is evolutionarily distinct from its other eukaryotic orthologues.

Okayama University Scientific Achievement Repository

Bayesian Treed Multivariate Gaussian Process With Adaptive Design: Application to a Carbon Capture Unit

Author: Avik Sarkar
Banerjee S.
Bledar Konomi
Cohn D.A.
Cressie N.
Georgios Karagiannis
Gidaspow D.
Guang Lin
Hjort N.
Iman R.L.
Mardia K.V.
Seo S.
Xin Sun
Publication venue: 'Informa UK Limited'

Crossref