Improving Term Frequency Normalization for Multi-topical Documents, and Application to Language Modeling Approaches
Term frequency normalization is an important issue because document lengths vary. Generally, documents become long for two different reasons: verbosity and multi-topicality. Verbosity means that the same topic is repeatedly mentioned using related terms, so term frequencies are inflated compared with a well-summarized document. Multi-topicality means that a document discusses a broad range of topics rather than a single one. Although these characteristics should be handled differently, previous term frequency normalization methods have ignored the distinction and used a simplified length-driven approach that discounts term frequency by document length alone, causing unreasonable penalization. To address this problem, we propose a novel TF normalization method, a partially axiomatic approach. We first formulate two formal constraints that a retrieval model should satisfy for verbose and multi-topical documents, respectively. We then modify language modeling approaches to better satisfy these two constraints and derive novel smoothing methods. Experimental results show that the proposed method significantly increases precision for keyword queries and substantially improves MAP (Mean Average Precision) for verbose queries.
Comment: 8 pages, conference paper, published in ECIR '0
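As context for the language modeling baseline that such constraints modify, the sketch below shows standard Dirichlet-smoothed query likelihood (the function name and the default mu are illustrative, not taken from the paper). Note that the denominator |d| + mu discounts every long document in the same way, whether the extra length comes from verbosity or from covering more topics, which is exactly the behavior the constraints above target.

```python
import math
from collections import Counter

def dirichlet_score(query_terms, doc_terms, collection_tf, collection_len, mu=2000.0):
    """Query-likelihood score with Dirichlet smoothing, a standard LM baseline.

    The smoothed probability of term t in document d is
        p(t|d) = (tf(t,d) + mu * p(t|C)) / (|d| + mu),
    where p(t|C) is the term's relative frequency in the whole collection.
    All names and the mu default are illustrative.
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_coll = collection_tf.get(t, 0) / collection_len
        if p_coll == 0.0:
            continue  # term unseen in the collection; skip it here
        p_smoothed = (doc_tf.get(t, 0) + mu * p_coll) / (doc_len + mu)
        score += math.log(p_smoothed)
    return score
```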
Probabilistic models of information retrieval based on measuring the divergence from randomness
We introduce a framework for deriving probabilistic models of Information Retrieval. The models are nonparametric models of IR obtained in the language model approach. We derive term-weighting models by measuring the divergence of the actual term distribution from that obtained under a random process. Among the random processes we study the binomial distribution and Bose-Einstein statistics. We define two types of term frequency normalization for tuning term weights in the document-query matching process. The first normalization assumes that documents have the same length and measures the information gain with the observed term once it has been accepted as a good descriptor of the observed document. The second normalization is related to the document length and to other statistics. These two normalization methods are applied to the basic models in succession to obtain weighting formulae. Results show that our framework produces different nonparametric models forming baseline alternatives to the standard tf-idf model.
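To make the two-stage construction concrete, the sketch below instantiates one well-known member of the DFR family (a geometric Bose-Einstein randomness model with a Laplace after-effect and the length-based second normalization). Treat it as an illustration of how a basic model and the two normalizations compose, with assumed parameter names, rather than as the paper's definitive weighting formula.

```python
import math

def gl2_weight(tf, doc_len, avg_doc_len, term_coll_freq, n_docs, c=1.0):
    """Sketch of one DFR instantiation (geometric Bose-Einstein model,
    Laplace after-effect, length normalization); names are illustrative."""
    # Second normalization: rescale tf as if the document had average length.
    tfn = tf * math.log2(1.0 + c * avg_doc_len / doc_len)
    # Mean term frequency per document under the random process.
    lam = term_coll_freq / n_docs
    # Informative content: divergence of the normalized tf from randomness.
    inf1 = tfn * math.log2((1.0 + lam) / lam) + math.log2(1.0 + lam)
    # First normalization (Laplace after-effect): gain from one more occurrence.
    return inf1 / (tfn + 1.0)
```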
On spontaneous photon emission in collapse models
We reanalyze the problem of spontaneous photon emission in collapse models.
We show that the extra term found by Bassi and Duerr is present for non-white
(colored) noise, but its coefficient is proportional to the zero frequency
Fourier component of the noise. This leads one to suspect that the extra term
is an artifact. When the calculation is repeated with the final electron in a
wave packet and with the noise confined to a bounded region, the extra term
vanishes in the limit of continuum state normalization. The result obtained by
Fu and by Adler and Ramazanoglu from application of the Golden Rule is then
recovered.
Comment: 23 pages, LaTeX. Minor changes with respect to previous versio
Robust sound event detection in bioacoustic sensor networks
Bioacoustic sensors, sometimes known as autonomous recording units (ARUs),
can record sounds of wildlife over long periods of time in scalable and
minimally invasive ways. Deriving per-species abundance estimates from these
sensors requires detection, classification, and quantification of animal
vocalizations as individual acoustic events. Yet, variability in ambient noise,
both over time and across sensors, hinders the reliability of current automated
systems for sound event detection (SED), such as convolutional neural networks
(CNNs) in the time-frequency domain. In this article, we develop, benchmark, and
combine several machine listening techniques to improve the generalizability of
SED models across heterogeneous acoustic environments. As a case study, we
consider the problem of detecting avian flight calls from a ten-hour recording
of nocturnal bird migration, recorded by a network of six ARUs in the presence
of heterogeneous background noise. Starting from a CNN yielding
state-of-the-art accuracy on this task, we introduce two noise adaptation
techniques, respectively integrating short-term (60 milliseconds) and long-term
(30 minutes) context. First, we apply per-channel energy normalization (PCEN)
in the time-frequency domain, which applies short-term automatic gain control
to every subband in the mel-frequency spectrogram. Second, we replace the
last dense layer in the network with a context-adaptive neural network (CA-NN)
layer. Combining them yields state-of-the-art results that are unmatched by
artificial data augmentation alone. We release a pre-trained version of our
best performing system under the name of BirdVoxDetect, a ready-to-use detector
of avian flight calls in field recordings.
Comment: 32 pages, in English. Submitted to PLOS ONE journal in February 2019; revised August 2019; published October 201
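As an aside on the short-term adaptation step, per-channel energy normalization is available off the shelf in librosa; the snippet below is a minimal sketch (file path, sample rate, and parameter values are placeholders, not the paper's settings) of computing a mel spectrogram and applying PCEN to it.

```python
import librosa

# Load a field recording (path and sample rate are placeholders).
y, sr = librosa.load("field_recording.wav", sr=22050)

# Mel-frequency spectrogram of the raw magnitude (no log compression):
# PCEN expects an energy-like input, not a dB-scaled one.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, power=1.0)

# Per-channel energy normalization: a smoothed version of each mel band acts
# as an automatic gain control, then the result is compressed.
# The scaling by 2**31 maps float input to an int32-like range, as suggested
# in librosa's documentation; time_constant here is an illustrative ~60 ms.
S_pcen = librosa.pcen(S * (2**31), sr=sr, hop_length=256, time_constant=0.06)
```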
Classic term weighting technique for mining web content outliers
Outlier analysis has become a popular topic in data mining, but there has been less work on detecting outliers in web content. Mining web content outliers aims to detect irrelevant web content within a web portal. Term Frequency (TF) techniques from Information Retrieval (IR) have been used to measure the relevance of a term in a web document; however, when document length varies, relative frequency is preferred. This study uses maximum frequency normalization together with Inverse Document Frequency (IDF) weighting, a traditional IR term weighting method, to exploit terms that appear in fewer documents and are therefore considered more discriminative than frequent terms. The dataset is the 20 Newsgroups dataset. TF.IDF is used in the dissimilarity measure, and the result achieves up to 91.10% accuracy, about 17.77% higher than the previous technique.
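A minimal sketch of the weighting described above, assuming whitespace-tokenized documents: term frequency is normalized by the most frequent term in each document and combined with inverse document frequency. The exact formulation and constants used in the study may differ.

```python
import math
from collections import Counter

def max_tf_idf_vectors(tokenized_docs):
    """TF.IDF with maximum-frequency normalization, one common IR formulation.

    tf_norm(t, d) = tf(t, d) / max_t' tf(t', d)
    idf(t)        = log(N / df(t))
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))

    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        max_tf = max(tf.values()) if tf else 1
        vectors.append({
            term: (count / max_tf) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        })
    return vectors
```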
Fusion of tf.idf Weighted Bag of Visual Features for Image Classification
Image representation using the bag-of-visual-words approach is commonly used in image classification. Features are extracted from images and clustered into a visual vocabulary. Images can then be represented as normalized histograms of visual words, similar to textual documents represented as weighted vectors of terms. As a result, text categorization techniques are applicable to image classification. In this paper, our contribution is twofold. First, we propose a suitable Term Frequency and Inverse Document Frequency weighting scheme to characterize the importance of visual words. Second, we present a method to fuse different bags of words obtained with different vocabularies. We show that using our tf.idf normalization and the fusion leads to better classification rates than other normalization methods, other fusion schemes, or other approaches evaluated on the SIMPLIcity collection.
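A minimal sketch of the underlying bag-of-visual-words representation with a tf.idf-style weighting (the paper's specific weighting scheme and fusion method are not reproduced here): each image becomes a histogram over visual-word assignments, normalized per image and weighted by inverse document frequency.

```python
import numpy as np

def bovw_tfidf(assignments_per_image, vocab_size):
    """Bag-of-visual-words histograms with a tf.idf-style weighting.

    `assignments_per_image` maps each image to the list of visual-word ids
    assigned to its local descriptors (e.g., by k-means over SIFT features).
    This is the standard tf.idf analogue for visual words, used here only
    for illustration.
    """
    n_images = len(assignments_per_image)
    histograms = np.zeros((n_images, vocab_size))
    for i, words in enumerate(assignments_per_image):
        for w in words:
            histograms[i, w] += 1

    # Document frequency of each visual word: in how many images it appears.
    df = np.count_nonzero(histograms > 0, axis=0)
    idf = np.log(n_images / np.maximum(df, 1))

    # Normalize term frequencies per image, then weight by idf.
    tf = histograms / np.maximum(histograms.sum(axis=1, keepdims=True), 1)
    return tf * idf
```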
Notes on the pervasiveness of injuries in professional wrestling in the Atlantic Canadian circuit, as seen from ringside
The frequency of injuries in professional wrestling is explored through four cases observed at independent wrestling shows in Atlantic Canada during ethnographic research on professional wrestling fans. Both veteran and younger professional wrestlers suffer frequent and chronic injuries and tend to manage them as part of a normalization of pain and injury, with negative impacts on their lives and long-term health.
Pre Processing Techniques for Arabic Documents Clustering
Clustering of text documents is an important technique for document retrieval. It aims to organize documents into meaningful groups or clusters. Text preprocessing plays a major role in enhancing the clustering of Arabic documents. This research examines and compares text preprocessing techniques for Arabic document clustering, studying the effectiveness of term pruning, term weighting using TF-IDF, morphological analysis (root-based stemming, light stemming, and raw text), and normalization. The experiments examine the effect of clustering algorithms, using the most widely used partitional algorithm, K-means, compared with another partitional algorithm, Expectation Maximization (EM). Euclidean distance and Manhattan distance similarity functions are also compared to determine which produces the best results in document clustering. Results are evaluated over many combinations of preprocessing techniques. Experimental results show that document clustering can be enhanced by applying TF-IDF term weighting and term pruning with a small minimum term frequency threshold. In morphological analysis, light stemming is found to be more appropriate than root-based stemming or raw text. Normalization also improves the clustering of Arabic documents and enhances the evaluation scores.
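For illustration, the TF-IDF weighting, term-pruning, and K-means steps of such a pipeline can be sketched with scikit-learn as below; the Arabic-specific stemming and normalization are assumed to happen upstream, and the cluster count and pruning threshold are placeholders rather than values from the study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Preprocessed Arabic documents (stemming/normalization applied upstream).
docs = ["...", "..."]

# Term pruning via min_df (drop very rare terms) plus TF-IDF weighting.
vectorizer = TfidfVectorizer(min_df=2)
X = vectorizer.fit_transform(docs)

# Partitional clustering with K-means (Euclidean distance on TF-IDF vectors).
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
```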