66 research outputs found

    Link Graph Analysis for Adult Images Classification

    To protect an image search engine's users from undesirable results, a classifier for adult images should be built. Such a classifier is created using information about links from websites to images, represented as a bipartite website-image graph. Each vertex is equipped with two scores: adultness and decentness. The scores of image vertices are initialized to zero; those of website vertices are initialized according to a text-based website classifier. An iterative algorithm that propagates scores within the website-image graph is described. The resulting scores are used to classify images by choosing an appropriate threshold. Experiments on Internet-scale data have shown that, at the same precision level, the algorithm increases classification recall by 17% in comparison with a simple baseline that classifies an image as adult if it is connected to at least one adult site. Comment: 7 pages. Young Scientists Conference, 4th Russian Summer School in Information Retrieval.
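    The iterative propagation described above can be sketched as follows. This is a minimal illustration, not the paper's exact update rule: the per-vertex averaging, the damping factor that blends a site's text-based prior with its images' scores, and the threshold value are all assumptions.

    ```python
    from collections import defaultdict

    def propagate_scores(links, site_scores, iterations=10, damping=0.5):
        """Propagate (adult, decent) scores over a bipartite website-image graph.

        links: list of (site, image) edges.
        site_scores: dict site -> (adult, decent) from a text-based classifier.
        Image scores start at (0.0, 0.0), as in the paper.
        """
        site_to_images = defaultdict(set)
        image_to_sites = defaultdict(set)
        for site, image in links:
            site_to_images[site].add(image)
            image_to_sites[image].add(site)

        sites = dict(site_scores)
        image_scores = {}
        for _ in range(iterations):
            # Each image averages the scores of the sites linking to it.
            image_scores = {}
            for img, srcs in image_to_sites.items():
                a = sum(sites[s][0] for s in srcs) / len(srcs)
                d = sum(sites[s][1] for s in srcs) / len(srcs)
                image_scores[img] = (a, d)
            # Each site blends its text-based prior with its images' scores
            # (the damping factor is an assumption for illustration).
            new_sites = {}
            for s, imgs in site_to_images.items():
                a = sum(image_scores[i][0] for i in imgs) / len(imgs)
                d = sum(image_scores[i][1] for i in imgs) / len(imgs)
                prior_a, prior_d = site_scores[s]
                new_sites[s] = (damping * prior_a + (1 - damping) * a,
                                damping * prior_d + (1 - damping) * d)
            sites = new_sites
        return image_scores

    def classify(image_scores, threshold=0.5):
        """Label an image adult when its propagated adultness score passes the threshold."""
        return {img: a >= threshold for img, (a, d) in image_scores.items()}
    ```

    An image linked only from high-adultness sites keeps a high score through the iterations, while an image shared between adult and decent sites settles between the two.
    
    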

    Game interpretation of Kolmogorov complexity

    The Kolmogorov complexity function K can be relativized using any oracle A, and most properties of K remain true for the relativized versions. In Section 1 we explain this observation by giving a game-theoretic interpretation and showing that all "natural" properties are either true for all sufficiently powerful oracles or false for all sufficiently powerful oracles. This result is a simple consequence of Martin's determinacy theorem, but its proof is instructive: it shows how one can prove statements about Kolmogorov complexity by constructing a special game and a winning strategy in that game. The technique is illustrated by several examples (total conditional complexity, bijection complexity, randomness extraction, contrasting plain and prefix complexities). Comment: 11 pages. Presented in 2009 at the conference on randomness in Madison.

    FLOating-Window Projective Separator (FloWPS): A Data Trimming Tool for Support Vector Machines (SVM) to Improve Robustness of the Classifier

    Here, we propose a heuristic data-trimming technique for SVM, termed FLOating-Window Projective Separator (FloWPS), tailored for personalized predictions based on molecular data. The procedure can operate on high-throughput genetic datasets such as gene expression or mutation profiles. Its application prevents the SVM from extrapolating by excluding non-informative features. FloWPS requires training on data for individuals with known clinical outcomes to create a clinically relevant classifier. The genetic profiles linked with the outcomes are split, as usual, into training and validation datasets. The unique property of FloWPS is that irrelevant features of the validation dataset that do not have a significant number of neighboring hits in the training dataset are removed from further analysis. Next, similarly to the k-nearest-neighbors (kNN) method, for each point of the validation dataset FloWPS takes into account only the proximal points of the training dataset. Thus, for every validation point, the training dataset is adjusted to form a floating window. FloWPS performance was tested on ten gene expression datasets for 992 cancer patients who either did or did not respond to different types of chemotherapy. We confirmed experimentally, by leave-one-out cross-validation, that FloWPS significantly increases the quality of a classifier built on the classical SVM in most applications, particularly for polynomial kernels.
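    The floating-window step can be sketched as below. This is a hypothetical simplification of the idea, not the published FloWPS procedure: the Euclidean distance, the fixed window size `k`, and the `bandwidth`/`min_hits` feature-trimming rule are all assumptions for illustration; a standard SVM would then be fit on the returned window.

    ```python
    import numpy as np

    def floating_window(X_train, y_train, x_val, k=20, min_hits=5, bandwidth=1.0):
        """For one validation point, select a local training window and drop
        features without enough neighboring hits (illustrative sketch).
        """
        # Take the k training samples nearest to the validation point (kNN-style).
        dists = np.linalg.norm(X_train - x_val, axis=1)
        window = np.argsort(dists)[:k]
        Xw, yw = X_train[window], y_train[window]
        # A feature is kept only if enough window samples lie within `bandwidth`
        # of the validation value along that feature, so the downstream SVM
        # never has to extrapolate along it.
        hits = np.abs(Xw - x_val) <= bandwidth
        keep = hits.sum(axis=0) >= min_hits
        return Xw[:, keep], yw, keep
    ```

    A per-point classifier (e.g. an SVM with a polynomial kernel) would then be trained on `Xw[:, keep], yw` and applied to `x_val[keep]`.
    
    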

    Induced Layered Clusters, Hereditary Mappings and Convex Geometries

    A method for structural clustering proposed by the authors is extended to the case when there are externally defined restrictions on the relations between sets and their elements. This framework appears to be related to the order-theoretic concepts of hereditary mappings and convex geometries, which enables us to give characterizations of those in terms of monotone linkage functions. Key words: layered cluster, monotone linkage, greedy optimization, convex geometry, hereditary mapping.

    Relation between Protein Structure, Sequence Homology and Composition of Amino Acids

    A method for the quantitative comparison of two classification rules applied to the protein folding problem is presented. Classifications of proteins based on sequence homology and on amino acid composition were compared and analyzed according to this approach. The coefficient of correlation between these classification methods and a procedure for estimating the robustness of the coefficient are discussed (RRR 6-95). One of the most powerful methods of protein structure prediction is model building by homology (Hilbert et al., 1993). Chothia and Lesk (1986) suggested that if two sequences can be aligned with 50% or greater residue identity, they have a similar fold. This threshold of 50% is usually used as a "safe definition of sequence homology" (Pascarella & Argos, 1992) and, in conventional opinion, grants reasonable confidence that a protein sequence has the chain conformation of the template, excluding less conserved regions. But it was shown that structure inform…
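    One standard way to correlate two classification rules applied to the same proteins is an association coefficient computed from their contingency table. The sketch below uses Cramér's V as an illustrative choice; the report's exact coefficient and robustness procedure may differ.

    ```python
    import math
    from collections import Counter

    def cramers_v(labels_a, labels_b):
        """Cramér's V association between two classifications of the same items:
        1.0 for perfectly matched partitions, 0.0 for independent ones.
        """
        n = len(labels_a)
        table = Counter(zip(labels_a, labels_b))
        row_tot, col_tot = Counter(labels_a), Counter(labels_b)
        # Chi-squared statistic of the (rule A) x (rule B) contingency table.
        chi2 = 0.0
        for r in row_tot:
            for c in col_tot:
                expected = row_tot[r] * col_tot[c] / n
                observed = table.get((r, c), 0)
                chi2 += (observed - expected) ** 2 / expected
        k = min(len(row_tot), len(col_tot)) - 1
        return math.sqrt(chi2 / (n * k)) if k > 0 else 0.0
    ```

    Robustness of the coefficient could then be estimated, for example, by recomputing it on resampled subsets of the proteins.
    
    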

    Optimization Algorithms for Separable Functions With Tree-Like Adjacency of Variables and Their Application to the Analysis of Massive Data Sets

    A massive data set is considered as a set of experimentally acquired values of a number of variables, each of which is associated with the respective node of an undirected adjacency graph that defines the fixed structure of the data set. The class of data analysis problems under consideration is outlined by the assumption that the ultimate aim of processing can be represented as a transformation of the original data array into a secondary array of the same structure but with node variables of, generally speaking, a different nature, i.e. different ranges. Such a generalized problem is posed as the formal problem of optimization (minimization or maximization) of a real-valued objective function of all the node variables. The objective function is assumed to consist of additive constituents of one or two arguments: node and edge functions, respectively. The former carry the data-dependent information on the sought-for values of the secondary variables, whereas the latter mean…
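    When the adjacency graph is a tree, such a separable objective can be minimized exactly by leaves-to-root dynamic programming. The sketch below assumes discrete node variables and a dictionary-based interface (`tree`, `node_costs`, `edge_costs` are hypothetical names, not the paper's notation), and returns only the optimal value for brevity.

    ```python
    def tree_min_sum(tree, node_costs, edge_costs, root=0):
        """Minimize sum of node and edge functions over a tree-structured
        adjacency graph by dynamic programming.

        tree: dict node -> list of children (rooted at `root`).
        node_costs[v][x]: cost of assigning label x to node v.
        edge_costs[(v, c)][x][y]: cost of labels (x at v, y at child c).
        """
        def solve(v):
            # best[x] = minimal cost of the subtree rooted at v, given label x at v.
            best = list(node_costs[v])
            for c in tree.get(v, []):
                child = solve(c)
                for x in range(len(best)):
                    # Each child contributes its own optimum, conditioned on x.
                    best[x] += min(edge_costs[(v, c)][x][y] + child[y]
                                   for y in range(len(child)))
            return best

        return min(solve(root))
    ```

    The cost is linear in the number of edges and quadratic in the number of labels per node, which is what makes the approach viable for massive tree-structured data sets.
    
    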

    Combinatorial clustering for textual data representation in machine learning models. http://www.datalaundering.com/download/theoretic.pdf

    In text stream analysis, one of the main problems is finding an effective method to classify documents quickly and correctly. This is why dimensionality reduction and related methods of representing significant information are critical to developing a good text classifier. In this report we describe a novel, purely combinatorial approach to obtaining a meaningful representation of text data. There are two basic ideas realized in the current development of this approach: (1) layered clusters, which induce over the entire data a stratification in a tower structure like a nesting doll (Russian matryoshka) [1][2], and (2) parallel clustering of documents and their features (word frequencies, in our case). The clusters are sub-matrices of the data that include each other according to the ordering given by the clustering model: the deepest cluster-matrix represents the largest weighted quasi-clique if the input data matrix is interpreted as a hypergraph; its effective weight is also the largest possible; the second cluster includes the first one and represents the second level of a quasi-clique with less valu…