66,853 research outputs found
Topic-based mixture language modelling
This paper describes an approach for constructing a mixture of language models based on simple statistical notions of semantics using probabilistic models developed for information retrieval. The approach encapsulates corpus-derived semantic information and is able to model varying styles of text. Using such information, the corpus texts are clustered in an unsupervised manner and a mixture of topic-specific language models is automatically created. The principal contribution of this work is to characterise the document space resulting from information retrieval techniques and to demonstrate the approach for mixture language modelling.
A comparison is made between manual and automatic clustering in order to elucidate how the global content information is expressed in the space. We also compare (in terms of association with manual clustering and language modelling accuracy) alternative term-weighting schemes and the effect of singular value decomposition dimension reduction (latent semantic analysis). Test set perplexity results using the British National Corpus indicate that the approach can improve the potential of statistical language modelling. Using an adaptive procedure, the conventional model may be tuned to track text data with a slight increase in computational cost.
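The mixture idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes add-alpha smoothed unigram components, hand-picked mixture weights, and two toy "topic clusters" that are hypothetical data. It only shows how shifting weight toward the matching topic model lowers test-set perplexity.

```python
import math
from collections import Counter

def unigram_model(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def mixture_prob(word, models, weights):
    """P(word) under a linear mixture of component models."""
    return sum(lam * m[word] for lam, m in zip(weights, models))

def perplexity(tokens, models, weights):
    """Per-token perplexity of the mixture on a token sequence."""
    log_sum = sum(math.log(mixture_prob(w, models, weights)) for w in tokens)
    return math.exp(-log_sum / len(tokens))

# Toy topic clusters (hypothetical data, not the British National Corpus).
sport = "goal match team goal score team".split()
finance = "bank rate market bank share rate".split()
vocab = sorted(set(sport) | set(finance))
models = [unigram_model(sport, vocab), unigram_model(finance, vocab)]

test = "team goal match".split()
ppl_uniform = perplexity(test, models, [0.5, 0.5])   # equal weights
ppl_adapted = perplexity(test, models, [0.9, 0.1])   # weights shifted toward the sport topic
print(ppl_uniform, ppl_adapted)
```

In an adaptive setting the weights would be re-estimated from recent text rather than fixed by hand as they are here.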
Neural networks and spectra feature selection for retrieval of hot gases temperature profiles
Proceeding of: International Conference on Computational Intelligence for Modelling, Control and Automation, 2005 and International Conference on Intelligent Agents, Web Technologies and Internet Commerce, Vienna, Austria 28-30 Nov. 2005. Neural networks appear to be a promising tool for solving the so-called inverse problems, which aim to retrieve certain physical properties related to the radiative transfer of energy. In this paper the capability of neural networks to retrieve the temperature profile in a combustion environment is explored. The temperature profile is retrieved from the measurement of the spectral distribution of energy radiated by the hot gases (combustion products) at wavelengths in the infrared region. High spectral resolution is usually needed to reach a given accuracy in the retrieval process. However, this large amount of information makes a reduction of the dimensionality of the problem mandatory, so a careful selection of wavelengths in the spectrum must be performed. For this purpose, principal component analysis is used to automatically determine those wavelengths in the spectrum that carry relevant information on the temperature distribution. A multilayer perceptron is then trained with the energies associated with the selected wavelengths. The results presented show that a multilayer perceptron combined with principal component analysis is a suitable alternative in this field.
Web news classification using neural networks based on PCA
In this paper, we propose a news web page classification method (WPCM). The WPCM uses a neural network with inputs obtained from both the principal components and class profile-based features (CPBF). A fixed number of regular words from each class is used as a feature vector, together with the reduced features from the PCA. These feature vectors are then used as the input to the neural networks for classification. The experimental evaluation demonstrates that the WPCM provides acceptable classification accuracy with the sports news datasets.
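The feature construction described here (PCA-reduced features concatenated with a few class-profile word features) can be sketched as follows. This is an illustrative sketch under assumptions, not the WPCM itself: the term-frequency matrix is random, the `class_word_idx` indices are hypothetical, PCA is computed via NumPy's SVD, and the neural network stage is omitted.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components."""
    Xc = X - X.mean(axis=0)  # centre each feature (column)
    # Rows of Vt are principal directions, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
tf = rng.poisson(1.0, size=(20, 50)).astype(float)  # hypothetical 20 documents x 50 terms
class_word_idx = [3, 7, 11]                          # hypothetical class-profile word columns

reduced = pca_reduce(tf, k=5)
# Network input: PCA features followed by raw counts of the class-profile words.
features = np.hstack([reduced, tf[:, class_word_idx]])
print(features.shape)
```

Each row of `features` would then serve as one input vector to the classifier.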
A D.C. Programming Approach to the Sparse Generalized Eigenvalue Problem
In this paper, we consider the sparse eigenvalue problem wherein the goal is
to obtain a sparse solution to the generalized eigenvalue problem. We achieve
this by constraining the cardinality of the solution to the generalized
eigenvalue problem and obtain sparse principal component analysis (PCA), sparse
canonical correlation analysis (CCA) and sparse Fisher discriminant analysis
(FDA) as special cases. Unlike the $\ell_1$-norm approximation to the
cardinality constraint, which previous methods have used in the context of
sparse PCA, we propose a tighter approximation that is related to the negative
log-likelihood of a Student's t-distribution. The problem is then framed as a
d.c. (difference of convex functions) program and is solved as a sequence of
convex programs by invoking the majorization-minimization method. The resulting
algorithm is proved to exhibit \emph{global convergence} behavior, i.e., for
any random initialization, the sequence (subsequence) of iterates generated by
the algorithm converges to a stationary point of the d.c. program. The
performance of the algorithm is empirically demonstrated on both sparse PCA
(finding few relevant genes that explain as much variance as possible in a
high-dimensional gene dataset) and sparse CCA (cross-language document
retrieval and vocabulary selection for music retrieval) applications. Comment: 40 page
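To illustrate why a log-based surrogate can approximate the cardinality constraint more tightly than the $\ell_1$ norm, the sketch below compares both on a small vector. The surrogate form sum_i log(1 + |x_i|/eps) / log(1 + 1/eps) is an assumption chosen for illustration (it approaches the number of nonzeros as eps tends to 0); the abstract does not state the exact function used, and the vector `x` is hypothetical.

```python
import math

def l0_norm(x):
    """Cardinality: the number of nonzero entries."""
    return sum(1 for v in x if v != 0)

def l1_norm(x):
    """Sum of absolute values, the usual convex surrogate."""
    return sum(abs(v) for v in x)

def log_surrogate(x, eps):
    """sum_i log(1 + |x_i|/eps) / log(1 + 1/eps); approaches the cardinality as eps -> 0."""
    return sum(math.log(1 + abs(v) / eps) for v in x) / math.log(1 + 1 / eps)

x = [0.0, 0.3, 0.0, 4.0, 0.0]  # hypothetical vector with two nonzero entries
for eps in (1.0, 1e-2, 1e-6):
    print(eps, log_surrogate(x, eps))
print("l0:", l0_norm(x), "l1:", l1_norm(x))
```

The surrogate is not convex, which is consistent with the abstract's framing of the problem as a difference of convex functions rather than a single convex program.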
Spectral high resolution feature selection for retrieval of combustion temperature profiles
Proceeding of: 7th International Conference on Intelligent Data Engineering and Automated Learning, IDEAL 2006 (Burgos, Spain, September 20-23, 2006). The use of high spectral resolution measurements to retrieve certain physical properties related to the radiative transfer of energy leads a priori to better accuracy. But this improvement in accuracy is not easy to achieve, because the large amount of data and its redundancies make any processing difficult. To solve this problem, a pick selection based on principal component analysis has been adopted to perform the mandatory feature selection over the different channels. In this paper, the capability to retrieve the temperature profile in a combustion environment using neural networks jointly with this spectral high resolution feature selection method is studied.
Multivariate texture discrimination using a principal geodesic classifier
A new texture discrimination method is presented for classification and retrieval of colored textures represented in the wavelet domain. The interband correlation structure is modeled by multivariate probability models which constitute a Riemannian manifold. The presented method considers the shape of the class on the manifold by determining the principal geodesic of each class. The method, which we call principal geodesic classification, then determines the shortest distance from a test texture to the principal geodesic of each class. We use the Rao geodesic distance (GD) for calculating distances on the manifold. We compare the performance of the proposed method with distance-to-centroid and k-nearest neighbor classifiers and of the GD with the Euclidean distance. The principal geodesic classifier coupled with the GD yields better results, indicating the usefulness of effectively and concisely quantifying the variability of the classes in the probabilistic feature space.
Semantic distillation: a method for clustering objects by their contextual specificity
Techniques for data-mining, latent semantic analysis, contextual search of
databases, etc. have long ago been developed by computer scientists working on
information retrieval (IR). Experimental scientists, from all disciplines,
having to analyse large collections of raw experimental data (astronomical,
physical, biological, etc.) have developed powerful methods for their
statistical analysis and for clustering, categorising, and classifying objects.
Finally, physicists have developed a theory of quantum measurement, unifying
the logical, algebraic, and probabilistic aspects of queries into a single
formalism. The purpose of this paper is twofold: first to show that when
formulated at an abstract level, problems from IR, from statistical data
analysis, and from physical measurement theories are very similar and hence can
profitably be cross-fertilised, and, secondly, to propose a novel method of
fuzzy hierarchical clustering, termed \textit{semantic distillation} and
strongly inspired by the theory of quantum measurement, which we developed to
analyse raw data coming from various types of experiments on DNA arrays. We
illustrate the method by analysing DNA array experiments and clustering the
genes of the array according to their specificity. Comment: Accepted for publication in Studies in Computational Intelligence,
Springer-Verla