60 research outputs found

    Kernel Methods for Document Filtering

    This paper describes the algorithms implemented by the KerMIT consortium for its participation in the TREC 2002 Filtering track. The consortium submitted runs for the routing task using a linear SVM, for the batch task using the same SVM in combination with an innovative threshold-selection mechanism, and for the adaptive task using both a second-order perceptron and a combination of SVM and perceptron with uneven margin. Results seem to indicate that these algorithms performed relatively well on the extensive TREC benchmark.

    Infinite factorization of multiple non-parametric views

    Combined analysis of multiple data sources is of increasing applied interest, in particular for distinguishing shared and source-specific aspects. We extend this rationale of classical canonical correlation analysis into a flexible, generative and non-parametric clustering setting by introducing a novel non-parametric hierarchical mixture model. The lower level of the model describes each source with a flexible non-parametric mixture, and the top level combines these to describe commonalities of the sources. The lower-level clusters arise from hierarchical Dirichlet processes, inducing an infinite-dimensional contingency table between the views. The commonalities between the sources are modeled by an infinite block model of the contingency table, interpretable as a non-negative factorization of infinite matrices, or as a prior for infinite contingency tables. With Gaussian mixture components plugged in for continuous measurements, the model is applied to two views of genes, mRNA expression and abundance of the produced proteins, to expose groups of genes that are co-regulated in either or both of the views. Cluster analysis of co-expression is a standard, simple way of screening for co-regulation, and the two-view analysis extends the approach to distinguishing between pre- and post-translational regulation.

    Environmental aspects of the safe disposal of agro-industrial waste

    The article proposes using brewing-industry waste as organo-mineral fertilizers. A technological line for processing waste brewer's yeast and kieselguhr filtration sludge is substantiated. Natural zeolites and cement dust are recommended as a source of potassium. The main advantages of the proposed technology are: a high degree of environmental safety of both the technological process itself and the target product; a high degree of safe utilization of valuable secondary raw materials; the absence from the technological cycle of chemical reagents that are harmful or dangerous to animal and plant life; the use of inexpensive, readily available domestic equipment in the technological cycle; and environmental and economic benefits for processing enterprises.

    Continuous Space Models for CLIR

    We present and evaluate a novel technique for learning cross-lingual continuous space models to aid cross-language information retrieval (CLIR). Our model, which is referred to as external-data composition neural network (XCNN), is based on a composition function that is implemented on top of a deep neural network that provides a distributed learning framework. Unlike most existing models, which rely only on available parallel data for training, our learning framework provides a natural way to exploit monolingual data and its associated relevance metadata for learning continuous space representations of language. Cross-language extensions of the obtained models can then be trained by using a small set of parallel data. This property is very helpful for resource-poor languages; therefore, we carry out experiments on the English-Hindi language pair. In the comparative evaluation, the proposed model is shown to outperform state-of-the-art continuous space models by a statistically significant margin on two different tasks: parallel sentence retrieval and ad-hoc retrieval.
    We thank German Sanchis Trilles for helping in conducting experiments with machine translation. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GeForce Titan GPU used for this research. The research of the first author was supported by FPI grant of UPV. The research of the third author is supported by the SomEMBED TIN2015-71147-C2-1-P MINECO research project and by the Generalitat Valenciana under the grant ALMAMATER (PrometeolI/2014/030).
    Gupta, P.; Banchs, R.; Rosso, P. (2017). Continuous Space Models for CLIR. Information Processing & Management. 53(2):359-370. https://doi.org/10.1016/j.ipm.2016.11.002

    Performance measurement framework for hierarchical text classification

    Hierarchical text classification, or simply hierarchical classification, refers to assigning a document to one or more suitable categories from a hierarchical category space. In our literature survey, we found that existing hierarchical classification experiments used a variety of measures to evaluate performance. These performance measures often assume independence between categories and do not consider documents misclassified into categories that are similar to, or not far from, the correct categories in the category tree. In this paper, we therefore propose new performance measures for hierarchical classification. The proposed performance measures consist of category-similarity measures and distance-based measures that consider the contributions of misclassified documents. Our experiments on hierarchical classification methods based on SVM classifiers and binary Naive Bayes classifiers showed that SVM classifiers perform better than Naive Bayes classifiers on the Reuters-21578 collection according to the extended measures. A new classifier-centric measure called the blocking measure is also defined to examine the performance of subtree classifiers in a top-down level-based hierarchical classification method.
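
To make the idea of a distance-based measure concrete, here is a minimal sketch of giving partial credit to a misclassification that lands near the correct category in the tree. It is an illustration only, not the measure defined in the paper: the linear discount, the `max_dist` cutoff, the function names and the toy category tree are all assumptions.

```python
def path_to_root(node, parent):
    """Return the list of categories from node up to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def tree_distance(a, b, parent):
    """Number of edges between categories a and b in the category tree."""
    depth_from_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for j, n in enumerate(path_to_root(b, parent)):
        if n in depth_from_a:  # first common ancestor
            return depth_from_a[n] + j
    raise ValueError("categories are not in the same tree")


def partial_credit(predicted, actual, parent, max_dist=4):
    """Assumed linear discount: 1.0 for a correct label, 0.0 at max_dist or beyond."""
    d = tree_distance(predicted, actual, parent)
    return max(0.0, 1.0 - d / max_dist)


# Hypothetical toy tree: child -> parent.
PARENT = {
    "grain": "root", "energy": "root",
    "wheat": "grain", "corn": "grain",
    "crude": "energy",
}

credit_sibling = partial_credit("corn", "wheat", PARENT)
credit_far = partial_credit("crude", "wheat", PARENT)
```

Under these assumptions a document labelled `corn` instead of `wheat` still earns half credit, while one labelled `crude` earns none; a flat measure would score both as equally wrong.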

    String Kernels, Fisher Kernels and Finite State Automata

    In this paper we show how the generation of documents can be thought of as a k-stage Markov process, which leads to a Fisher kernel from which the n-gram and string kernels can be reconstructed. The Fisher kernel view gives a more flexible insight into the string kernel and suggests how it can be parametrised in a way that reflects the statistics of the training corpus. Furthermore, the probabilistic modelling approach suggests extending the Markov process to consider sub-sequences of varying length, rather than the standard fixed-length approach used in the string kernel. We give a procedure for determining which sub-sequences are informative features and hence generate a Finite State Machine model, which can again be used to obtain a Fisher kernel. By adjusting the parametrisation we can also influence the weighting received by the features. In this way we are able to obtain a logarithmic weighting in a Fisher kernel. Finally, experiments are reported comparing the different kernels, using the standard Bag of Words kernel as a baseline.

    A probabilistic framework for mismatch and profile string kernels

    There have recently been numerous applications of kernel methods in the field of bioinformatics. In particular, the problem of protein homology has served as a benchmark for the performance of many new kernels which operate directly on strings (such as amino-acid sequences). Several new kernels have been developed and successfully applied to this type of data, including spectrum, string, mismatch, and profile kernels. In this paper we introduce a general probabilistic framework for string kernels which uses the Fisher kernel approach and includes spectrum, mismatch and profile kernels, among others, as special cases. The use of a probabilistic model, however, provides additional flexibility both in definition and in the re-weighting of features through feature selection methods, prior knowledge or semi-supervised approaches which use data repositories such as BLAST. We give details of the framework and also give preliminary experimental results which show the applicability of the technique.
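
For readers unfamiliar with the kernels named in this abstract, the following is a minimal sketch of the plain k-spectrum kernel, the simplest of the special cases listed: two strings are compared by the inner product of their contiguous k-mer counts. It does not implement the probabilistic Fisher kernel framework of the paper, and the function names and example strings are illustrative assumptions.

```python
from collections import Counter


def spectrum_features(seq: str, k: int) -> Counter:
    """Count every contiguous length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(s: str, t: str, k: int = 3) -> int:
    """Inner product of the k-mer count vectors of s and t."""
    fs, ft = spectrum_features(s, k), spectrum_features(t, k)
    if len(fs) > len(ft):  # iterate over the smaller feature map
        fs, ft = ft, fs
    # Counter returns 0 for k-mers absent from the other string.
    return sum(count * ft[kmer] for kmer, count in fs.items())


similarity = spectrum_kernel("ABAB", "BABA", k=2)
```

Here `ABAB` contains `AB` twice and `BA` once, `BABA` contains `BA` twice and `AB` once, so the kernel value is 2·1 + 1·2 = 4. Mismatch and profile kernels relax the exact-match requirement on k-mers, which this sketch does not attempt.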