
60 research outputs found

Kernel Methods for Document Filtering

Author: Cancedda N
Cesa-Bianchi N
Conconi A
Gentile C
Goutte C
Graepel T
Li Y
Renders J-M
Shawe-Taylor J
Vinokourov A
Publication venue
Publication date: 01/01/2002
Field of study

This paper describes the algorithms implemented by the KerMIT consortium for its participation in the TREC 2002 Filtering track. The consortium submitted runs for the routing task using a linear SVM, for the batch task using the same SVM in combination with an innovative threshold-selection mechanism, and for the adaptive task using both a second-order perceptron and a combination of SVM and perceptron with uneven margin. Results seem to indicate that these algorithms performed relatively well on the extensive TREC benchmark.
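The abstract names a perceptron with uneven margin but gives no details. As a rough illustration only, the sketch below assumes the usual idea behind such a learner: a standard perceptron update that fires whenever an example falls inside a class-specific margin, with a larger margin for the (typically rarer) positive class. The function name `train_paum`, the parameters `tau_pos`, `tau_neg` and `eta`, and the simplified bias update are assumptions for illustration, not the consortium's implementation.

```python
def train_paum(samples, labels, tau_pos=1.0, tau_neg=0.0, eta=1.0, max_epochs=100):
    """Perceptron with class-dependent margins (illustrative sketch).

    samples: list of feature vectors (lists of floats)
    labels:  list of +1 / -1 labels
    tau_pos, tau_neg: margins required for positive / negative examples
    (hypothetical parameter names; the bias update is a simplification).
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(max_epochs):
        updated = False
        for x, y in zip(samples, labels):
            tau = tau_pos if y > 0 else tau_neg
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            # Update when the example is misclassified or inside its margin.
            if y * score <= tau:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
                updated = True
        if not updated:
            break  # every example clears its class-specific margin
    return w, b


if __name__ == "__main__":
    X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]]
    y = [1, 1, -1, -1]
    print(train_paum(X, y))
```

Setting `tau_pos` larger than `tau_neg` pushes the decision boundary away from positive examples, which is one common way to trade precision for recall when relevant documents are scarce.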

CiteSeerX

Southampton (e-Prints Soton)

Archivio istituzionale della ricerca - Università dell'Insubria

UCL Discovery

Infinite factorization of multiple non-parametric views

Author: A. Gelman
A. Klami
A. Klami
A. Rodriguez
A. Vinokourov
Arto Klami
C. Archambeau
C. Rasmussen
D. Blackwell
D. Blei
D. Cohn
D. Lee
D. M. Blei
D. M. Roy
G. Englebienne
I. Rivals
I. S. Dhillon
Janne Sinkkonen
K. Barnard
M. Welling
Mark Girolami
N. Friedman
N. L. Johnson
R. M. Neal
S. Becker
S. Rogers
Samuel Kaski
Simon Rogers
T. Hofmann
Y. W. Teh
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010
Field of study

Combined analysis of multiple data sources has increasing application interest, in particular for distinguishing shared and source-specific aspects. We extend this rationale of classical canonical correlation analysis into a flexible, generative and non-parametric clustering setting, by introducing a novel non-parametric hierarchical mixture model. The lower level of the model describes each source with a flexible non-parametric mixture, and the top level combines these to describe commonalities of the sources. The lower-level clusters arise from hierarchical Dirichlet Processes, inducing an infinite-dimensional contingency table between the views. The commonalities between the sources are modeled by an infinite block model of the contingency table, interpretable as non-negative factorization of infinite matrices, or as a prior for infinite contingency tables. With Gaussian mixture components plugged in for continuous measurements, the model is applied to two views of genes, mRNA expression and abundance of the produced proteins, to expose groups of genes that are co-regulated in either or both of the views. Cluster analysis of co-expression is a standard simple way of screening for co-regulation, and the two-view analysis extends the approach to distinguishing between pre- and post-translational regulation

CUED - Cambridge University Engineering Department

Environmental aspects of the safe disposal of agro-industrial waste

Author: Dadashev M.N.
Filenko D.G.
Kapustin M.A.
Kobelev K.V.
Krupnov V.A.
Vinokourov V.A.
Publication venue: Федеральное государственное автономное образовательное учреждение высшего образования Российский университет дружбы народов (РУДН)
Publication date
Field of study

The article proposes using brewing industry waste as organo-mineral fertilizer. A technological line for recycling spent brewer's yeast and kieselguhr filtration sludge is substantiated. Natural zeolites and cement dust are recommended as sources of potassium. The main advantages of the proposed technology are: a high degree of environmental safety of both the technological process and the target product; a high degree of safe disposal of valuable secondary raw materials; the absence from the technological cycle of chemical reagents that are harmful or dangerous to animal and plant life; the use of inexpensive, readily available domestic equipment in the technological cycle; and environmental and economic benefits for processing enterprises.

RUDN Repository

Continuous Space Models for CLIR

Author: Aggarwal
Ballesteros
Blei
Bojar
Bromley
Chandar
Deerwester
Diamantaras
Dumais
Gabrilovich
Gao
Gao
Gupta
Gupta
Hermann
Hiemstra
Hinton
Hinton
Hofmann
Huang
Järvelin
Klementiev
Koehn
Lauly
Le
Manning
Mikolov
Mikolov
Mimno
Munteanu
Nie
Paolo Rosso
Parth Gupta
Platt
Rafael E. Banchs
Rocchio
Salakhutdinov
Skopal
Socher
Türe
Vinokourov
Xu
Yih
Zou
Publication venue: 'Elsevier BV'
Publication date: 01/01/2017
Field of study

We present and evaluate a novel technique for learning cross-lingual continuous space models to aid cross-language information retrieval (CLIR). Our model, which is referred to as external-data composition neural network (XCNN), is based on a composition function that is implemented on top of a deep neural network that provides a distributed learning framework. Different from most existing models, which rely only on available parallel data for training, our learning framework provides a natural way to exploit monolingual data and its associated relevance metadata for learning continuous space representations of language. Cross-language extensions of the obtained models can then be trained by using a small set of parallel data. This property is very helpful for resource-poor languages, therefore, we carry out experiments on the English-Hindi language pair. On the conducted comparative evaluation, the proposed model is shown to outperform state-of-the-art continuous space models with statistically significant margin on two different tasks: parallel sentence retrieval and ad-hoc retrieval.

We thank German Sanchis Trilles for helping in conducting experiments with machine translation. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GeForce Titan GPU used for this research. The research of the first author was supported by FPI grant of UPV. The research of the third author is supported by the SomEMBED TIN2015-71147-C2-1-P MINECO research project and by the Generalitat Valenciana under the grant ALMAMATER (PrometeolI/2014/030).

Gupta, P.; Banchs, R.; Rosso, P. (2017). Continuous Space Models for CLIR. Information Processing & Management. 53(2):359-370. https://doi.org/10.1016/j.ipm.2016.11.002

Crossref

RiuNet

Performance measurement framework for hierarchical text classification

Author: Chakrabarti
Cohen
D'Alessio
Dumais
Dumais
Gaussier
Greiner
Joachims
Joachims
Koller
Labrou
Larkey
Lewis
Lewis
McCallum
McCallum
McCallum
Mitchell
Mladenic
Rijsbergen
Robertson
Sasaki
Sebastiani
Toutanova
Vinokourov
Wang
Wang
Weigend
Yang
Publication venue: 'Wiley'
Publication date: 01/01/2003
Field of study

Hierarchical text classification or simply hierarchical classification refers to assigning a document to one or more suitable categories from a hierarchical category space. In our literature survey, we have found that the existing hierarchical classification experiments used a variety of measures to evaluate performance. These performance measures often assume independence between categories and do not consider documents misclassified into categories that are similar or not far from the correct categories in the category tree. In this paper, we therefore propose new performance measures for hierarchical classification. The proposed performance measures consist of category similarity measures and distance-based measures that consider the contributions of misclassified documents. Our experiments on hierarchical classification methods based on SVM classifiers and binary Naïve Bayes classifiers showed that SVM classifiers perform better than Naïve Bayes classifiers on the Reuters-21578 collection according to the extended measures. A new classifier-centric measure called blocking measure is also defined to examine the performance of subtree classifiers in a top-down level-based hierarchical classification method.

CiteSeerX

Crossref

Institutional Knowledge at Singapore Management University

Boosting multi-label hierarchical text categorization

Author: A. S. Weigend
A. Vinokourov
Andrea Esuli
C. Apté
D. D. Lewis
Fabrizio Sebastiani
M. Ceci
M. R. Spiegel
M. Ruiz
R. E. Schapire
R. E. Schapire
S. Chakrabarti
T. Y. Liu
Tiziano Fagni
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

Crossref

The issue of the strengthening public role in making environmentally significant decisions at the international and national level, with special reference to Russian legislation

Author: Aleksey Anisimov
Anatoliy Ryzhenkov
Anisimov A.P.
Anisimov A.P.
Baskin Y.Y.
Belokrylova E.A.
Belokrylova E.A.
Brinchuk M.M.
Gogaeva M.T.
Napolsky H.
Petrenko V.F.
Tikhomirov Y.A.
Vasilieva M.I.
Veselov A.K.
Vinokourov Y.E.
Zhemchuzhnova E.A.
Publication venue: 'Akademiai Kiado Zrt.'
Publication date
Field of study

Crossref

Classifying web documents in a hierarchy of categories: a comprehensive study

Author: A. McCallum
A. McCallum
A. S. Weigend
A. Sun
A. Vinokourov
C. Apté
D. D. Lewis
D. D. Lewis
D. Koller
D. Malerba
D. Mladenić
D. Mladenić
D. Mladenić
D. Tikk
Donato Malerba
E.-H. Han
F. Debole
F. Esposito
F. Sebastiani
G. Miller
G. Salton
H. Blockeel
H. T. Ng
J. Rocchio
J. Zhang
L. Galavotti
M. Ceci
M. Craven
M. E. Ruiz
M. F. Porter
M. Sahami
Michelangelo Ceci
P. Domingos
R. E. Schapire
R. E. Schapire
S. Dumais
S. Dumais
S. Kim
T. Hastie
T. Joachims
T. Joachims
T. Mitchell
T. Theeramunkong
V. Lertnattee
V. Vapnik
W. T. Chuang
Y. Yang
Y. Yang
Y. Yang
Z. Zheng
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

Crossref

String Kernels, Fisher Kernels and Finite State Automata

Author: Saunders C.
Shawe-Taylor J.
Vinokourov A.
Publication venue
Publication date: 01/01/2002
Field of study

In this paper we show how the generation of documents can be thought of as a k-stage Markov process, which leads to a Fisher kernel from which the n-gram and string kernels can be re-constructed. The Fisher kernel view gives a more flexible insight into the string kernel and suggests how it can be parametrised in a way that reflects the statistics of the training corpus. Furthermore, the probabilistic modelling approach suggests extending the Markov process to consider sub-sequences of varying length, rather than the standard fixed-length approach used in the string kernel. We give a procedure for determining which sub-sequences are informative features and hence generate a Finite State Machine model, which can again be used to obtain a Fisher kernel. By adjusting the parametrisation we can also influence the weighting received by the features. In this way we are able to obtain a logarithmic weighting in a Fisher kernel. Finally, experiments are reported comparing the different kernels using the standard Bag of Words kernel as a baseline

CiteSeerX

Southampton (e-Prints Soton)

A probabilistic framework for mismatch and profile string kernels

Author: Saunders C. J.
Soklakov A. N.
Vinokourov A.
Publication venue
Publication date: 01/01/2005
Field of study

There have recently been numerous applications of kernel methods in the field of bioinformatics. In particular, the problem of protein homology has served as a benchmark for the performance of many new kernels which operate directly on strings (such as amino-acid sequences). Several new kernels have been developed and successfully applied to this type of data, including spectrum, string, mismatch, and profile kernels. In this paper we introduce a general probabilistic framework for string kernels which uses the Fisher-kernel approach and includes spectrum, mismatch and profile kernels, among others, as special cases. The use of a probabilistic model, however, provides additional flexibility both in definition and for the re-weighting of features through feature selection methods, prior knowledge or semi-supervised approaches which use data repositories such as BLAST. We give details of the framework and also give preliminary experimental results which show the applicability of the technique.
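For readers unfamiliar with the kernels listed in this abstract, the simplest of them, the spectrum kernel, can be sketched in a few lines: each sequence is mapped to the counts of its contiguous length-k substrings (k-mers), and the kernel value is the inner product of the two count vectors. The sketch below is a plain, unweighted version; it is not the paper's probabilistic framework, and the function names are chosen here for illustration.

```python
from collections import Counter


def spectrum_features(seq: str, k: int) -> Counter:
    """Count every contiguous length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(s: str, t: str, k: int) -> int:
    """Inner product of the k-mer count vectors of s and t."""
    fs, ft = spectrum_features(s, k), spectrum_features(t, k)
    # Counter returns 0 for k-mers that do not occur in t.
    return sum(count * ft[kmer] for kmer, count in fs.items())


if __name__ == "__main__":
    # "ABAB" has 2-mers AB (x2), BA (x1); "BABA" has BA (x2), AB (x1).
    print(spectrum_kernel("ABAB", "BABA", 2))  # 2*1 + 1*2 = 4
```

Mismatch and profile kernels generalise this feature map by letting a k-mer also contribute to the counts of nearby k-mers; re-weighting the counts is where a probabilistic model such as the one described in the abstract would enter.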

Southampton (e-Prints Soton)