Link Graph Analysis for Adult Images Classification
In order to protect an image search engine's users from undesirable results,
an adult-image classifier should be built. The information about links from
websites to images is employed to create such a classifier. These links are
represented as a bipartite website-image graph. Each vertex is equipped with
scores of adultness and decentness. The scores for image vertices are
initialized with zero; those for website vertices are initialized according to
a text-based website classifier. An iterative algorithm that propagates scores
within the website-image graph is described. The scores obtained are used to
classify images by choosing an appropriate threshold. Experiments on
Internet-scale data have shown that, at the same precision level, the algorithm
under consideration increases classification recall by 17% in comparison with a
simple algorithm that classifies an image as adult if it is connected with at
least one adult site.
Comment: 7 pages. Young Scientists Conference, 4th Russian Summer School in
Information Retrieval
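The iterative propagation described above can be sketched as follows. The exact update rule is not given in the abstract, so the averaging scheme, the damping factor, and the decision rule below are illustrative assumptions, not the paper's formulation:

```python
from collections import defaultdict

def propagate_scores(edges, site_scores, iters=10, damping=0.85):
    """Propagate (adult, decent) score pairs over a bipartite site-image graph.

    edges: list of (site, image) links.
    site_scores: dict site -> (adult, decent) prior from a text classifier.
    Image scores start at zero; each iteration, images take the mean score of
    the sites linking to them, and sites blend their prior with the mean score
    of their linked images (damping weights the prior).
    """
    sites = defaultdict(set)   # site  -> linked images
    images = defaultdict(set)  # image -> linking sites
    for s, i in edges:
        sites[s].add(i)
        images[i].add(s)

    cur_site = dict(site_scores)
    img_scores = {i: (0.0, 0.0) for i in images}
    for _ in range(iters):
        img_scores = {
            i: tuple(sum(cur_site[s][k] for s in ss) / len(ss) for k in (0, 1))
            for i, ss in images.items()
        }
        cur_site = {
            s: tuple(
                damping * site_scores[s][k]
                + (1 - damping) * sum(img_scores[i][k] for i in ii) / len(ii)
                for k in (0, 1)
            )
            for s, ii in sites.items()
        }
    return img_scores

def classify_adult(img_scores, threshold=0.5):
    """Threshold the adultness margin, as in the abstract's final step."""
    return {i for i, (adult, decent) in img_scores.items()
            if adult - decent > threshold}
```

With a toy graph where site "s1" is adult and "s2" is decent, an image linked only to "s1" ends up well above the threshold, while an image linked to both sits near zero margin and is not flagged.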
Game interpretation of Kolmogorov complexity
The Kolmogorov complexity function K can be relativized using any oracle A,
and most properties of K remain true for relativized versions. In section 1 we
provide an explanation for this observation by giving a game-theoretic
interpretation and showing that all "natural" properties are either true for
all sufficiently powerful oracles or false for all sufficiently powerful
oracles. This result is a simple consequence of Martin's determinacy theorem,
but its proof is instructive: it shows how one can prove statements about
Kolmogorov complexity by constructing a special game and a winning strategy in
this game. This technique is illustrated by several examples (total conditional
complexity, bijection complexity, randomness extraction, contrasting plain and
prefix complexities).
Comment: 11 pages. Presented in 2009 at the conference on randomness in Madison
FLOating-Window Projective Separator (FloWPS): A Data Trimming Tool for Support Vector Machines (SVM) to Improve Robustness of the Classifier
Here, we propose a heuristic data-trimming technique for SVM termed FLOating Window Projective Separator (FloWPS), tailored for personalized predictions based on molecular data. This procedure can operate with high-throughput genetic datasets such as gene expression or mutation profiles. Its application prevents the SVM from extrapolating by excluding non-informative features. FloWPS requires training on data for individuals with known clinical outcomes to create a clinically relevant classifier. The genetic profiles linked with the outcomes are split, as usual, into training and validation datasets. The unique property of FloWPS is that irrelevant features of the validation dataset that do not have a significant number of neighboring hits in the training dataset are removed from further analyses. Next, similarly to the k-nearest-neighbors (kNN) method, for each point of the validation dataset FloWPS takes into account only the proximal points of the training dataset. Thus, for every point of the validation dataset, the training dataset is adjusted to form a floating window. FloWPS performance was tested on ten gene expression datasets for 992 cancer patients either responding or not responding to different types of chemotherapy. We experimentally confirmed by leave-one-out cross-validation that FloWPS significantly increases the quality of a classifier built on the classical SVM in most of the applications, particularly for polynomial kernels
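The floating-window idea can be sketched as follows. FloWPS fits an SVM inside each window; to keep the sketch dependency-free, a majority vote over the window stands in for that local SVM, and the feature-trimming step is omitted, so this illustrates only the window construction, not the full method:

```python
import numpy as np

def floating_window_predict(X_train, y_train, X_val, k=5):
    """For each validation point, form a 'floating window' of its k nearest
    training samples and classify within that window only.

    X_train: (n, d) array of training profiles; y_train: (n,) labels.
    X_val:   (m, d) array of validation profiles.
    """
    preds = []
    for x in X_val:
        d = np.linalg.norm(X_train - x, axis=1)     # distances to all training points
        window = np.argsort(d)[:k]                  # indices of the k closest points
        labels, counts = np.unique(y_train[window], return_counts=True)
        preds.append(labels[np.argmax(counts)])     # stand-in for a window-local SVM
    return np.array(preds)
```

The design point is that the classifier never sees training samples far from the query, which is what prevents extrapolation outside the region populated by similar patients.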
Induced Layered Clusters, Hereditary Mappings and Convex Geometries
A method for structural clustering proposed by the authors is extended to the case when there are externally defined restrictions on the relations between sets and their elements. This framework appears to be related to the order-theoretic concepts of hereditary mappings and convex geometries, which enables us to give characterizations of those in terms of monotone linkage functions. Key words: layered cluster, monotone linkage, greedy optimization, convex geometry, hereditary mapping.
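The greedy optimization named in the key words can be illustrated with the classical series-of-deletions scheme for a monotone linkage function: maximize F(H) = min over x in H of linkage(x, H) by repeatedly deleting the element of minimum linkage and remembering the best prefix set. This is a generic sketch of that scheme, not the paper's extended restricted version:

```python
def extreme_subset(elements, linkage):
    """Greedy maximization of F(H) = min_{x in H} linkage(x, H).

    linkage(x, H) must be monotone in H; under that assumption the
    series of deletions visits a maximizer of the quasi-concave F.
    """
    H = set(elements)
    best_val, best_set = float("-inf"), set()
    while H:
        vals = {x: linkage(x, H) for x in H}
        worst = min(vals, key=vals.get)      # element of minimum linkage
        if vals[worst] > best_val:           # F(H) improved: remember H
            best_val, best_set = vals[worst], set(H)
        H = H - {worst}                      # delete and continue
    return best_set, best_val
```

With linkage(x, H) = number of neighbors of x inside H, this is exactly degree peeling, and the returned set is the subgraph maximizing the minimum degree.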
Relation between Protein Structure, Sequence Homology and Composition of Amino Acids
A method of quantitative comparison of two classification rules applied to the protein folding problem is presented. Classifications of proteins based on sequence homology and on amino acid composition were compared and analyzed according to this approach. The coefficient of correlation between these classification methods and the procedure for estimating the robustness of the coefficient are discussed. [RRR 6-95] One of the most powerful methods of protein structure prediction is model building by homology (Hilbert et al., 1993). Chothia and Lesk (1986) suggested that if two sequences can be aligned with 50% or greater residue identity, they have a similar fold. This threshold of 50% is usually used as a "safe definition of sequence homology" (Pascarella & Argos, 1992) and in conventional opinion grants reasonable confidence that a protein sequence has the chain conformation of the template, excluding less conserved regions. But it was shown that structure inform..
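A correlation coefficient between two classification rules applied to the same proteins can be computed from their 2x2 agreement table. The phi coefficient below is an illustrative stand-in; the abstract does not specify which coefficient the report uses:

```python
import math

def phi_coefficient(labels_a, labels_b):
    """Phi correlation between two binary classification rules on the same
    objects: +1 for identical rules, -1 for opposite rules, 0 when one rule
    carries no information about the other."""
    n11 = n10 = n01 = n00 = 0
    for a, b in zip(labels_a, labels_b):
        if a and b:
            n11 += 1          # both rules say "positive"
        elif a and not b:
            n10 += 1
        elif b:
            n01 += 1
        else:
            n00 += 1
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0
```

The robustness estimation mentioned in the abstract could then be approached by recomputing this coefficient over resampled protein subsets and inspecting its spread.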
Optimization Algorithms for Separable Functions With Tree-Like Adjacency of Variables and Their Application to the Analysis of Massive Data Sets
A massive data set is considered as a set of experimentally acquired values of a number of variables, each of which is associated with the respective node of an undirected adjacency graph that fixes the structure of the data set. The class of data analysis problems under consideration is outlined by the assumption that the ultimate aim of processing can be represented as a transformation of the original data array into a secondary array of the same structure but with node variables of, generally speaking, a different nature, i.e. different ranges. Such a generalized problem is posed as the formal problem of optimization (minimization or maximization) of a real-valued objective function of all the node variables. The objective function is assumed to consist of additive constituents of one or two arguments, respectively node and edge functions. The former carry the data-dependent information on the sought-for values of the secondary variables, whereas the latter ones are mean..
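When the adjacency graph is a tree and the node variables range over a finite label set, such a separable objective can be minimized exactly by leaf-to-root dynamic programming (min-sum message passing). This is a generic sketch of that standard technique, assuming finite labels; the paper's algorithms for continuous ranges are not reproduced here:

```python
def tree_min_sum(n, edges, node_cost, edge_cost, labels, root=0):
    """Exactly minimize sum_v node_cost(v, x_v) + sum_(u,v) edge_cost(u, v, x_u, x_v)
    over labelings x of a tree with nodes 0..n-1 and the given edge list."""
    adj = {v: [] for v in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    parent, children = {root: None}, {v: [] for v in range(n)}
    seen, stack, topo = {root}, [root], []
    while stack:                      # preorder: every parent before its children
        u = stack.pop()
        topo.append(u)
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                parent[w] = u
                children[u].append(w)
                stack.append(w)
    # m[v][x]: minimal cost of the subtree rooted at v given x_v = x
    m = {}
    for v in reversed(topo):          # children are processed before v
        m[v] = {x: node_cost(v, x)
                   + sum(min(edge_cost(v, c, x, y) + m[c][y] for y in labels)
                         for c in children[v])
                for x in labels}
    # backtrack an optimal assignment from the root down
    assign = {root: min(m[root], key=m[root].get)}
    for v in topo[1:]:
        p = parent[v]
        assign[v] = min(labels,
                        key=lambda y: edge_cost(p, v, assign[p], y) + m[v][y])
    return assign, m[root][assign[root]]
```

For a three-node chain with unit data-attachment costs and a smoothness edge term, the procedure recovers the labeling that trades one smoothness penalty for agreement with the observations.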
Combinatorial clustering for textual data representation in machine learning models (http://www.datalaundering.com/download/theoretic.pdf)
In text stream analysis, one of the main problems is finding an effective method to classify documents quickly and correctly. This is why dimensionality reduction and related methods of representing significant information are critical to developing a good text classifier. In this report we describe a novel, purely combinatorial approach to obtaining a meaningful representation of text data. There are two basic ideas realized in the current development of this approach: (1) layered clusters, which induce over the entire data a stratification in a tower structure, like a nesting doll (Russian Matreshka) [1][2], and (2) parallel clustering of documents and their features (frequencies of words, in our case). The clusters are sub-matrices of the data which include each other according to the ordering given by the clustering model: the deepest cluster-matrix represents the largest weighted quasi-clique if the input data matrix is interpreted as a hypergraph; its effective weight is also the largest possible; the second cluster includes the first one and represents the second level of a quasi-clique with less valu