On Finding the Jaccard Center
We initiate the study of finding the Jaccard center of a given collection N of sets. For two sets X, Y, the Jaccard index is defined as |X ∩ Y|/|X ∪ Y| and the corresponding distance is 1 − |X ∩ Y|/|X ∪ Y|. The Jaccard center is a set C minimizing the maximum distance to any set of N.
We show that the problem is NP-hard to solve exactly, and that it admits a PTAS while no FPTAS can exist unless P = NP.
Furthermore, we show that the problem is fixed parameter tractable in the maximum Hamming norm between Jaccard center and any input set. Our algorithms are based on a compression technique similar in spirit to coresets for the Euclidean 1-center problem.
In addition, we show that, in contrast to the Jaccard median problem previously studied by Chierichetti et al. (SODA 2010), the continuous version of the Jaccard center problem admits a simple polynomial-time algorithm.
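The Jaccard index and distance defined above can be sketched in a few lines of Python; the helper names and the empty-set convention are our own, not from the paper:

```python
def jaccard_index(x: set, y: set) -> float:
    """Jaccard index |X ∩ Y| / |X ∪ Y|; by convention 1.0 for two empty sets."""
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)

def jaccard_distance(x: set, y: set) -> float:
    """Jaccard distance 1 − |X ∩ Y| / |X ∪ Y|."""
    return 1.0 - jaccard_index(x, y)

# Example: |{2,3}| / |{1,2,3,4}| = 2/4 = 0.5
print(jaccard_index({1, 2, 3}, {2, 3, 4}))  # 0.5
```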
Polynomial Time Approximation Schemes for All 1-Center Problems on Metric Rational Set Similarities
In this paper, we investigate algorithms for finding centers of a given collection N of sets. In particular, we focus on metric rational set similarities, a broad class of similarity measures including Jaccard and Hamming. A rational set similarity S is called metric if D = 1 − S is a distance function. We study the 1-center problem on these metric spaces. The problem consists of finding a set C that minimizes the maximum distance of C to any set of N. We present a general framework that computes a (1 + ε)-approximation for any metric rational set similarity.
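To make the 1-center objective concrete, here is a minimal sketch that evaluates it under the Jaccard distance and picks the best candidate among the input sets themselves; restricting candidates to input sets gives a 2-approximation in any metric space (by the triangle inequality), whereas the framework in the paper achieves (1 + ε). All function names are illustrative, not from the paper:

```python
def jaccard_distance(x: set, y: set) -> float:
    union = x | y
    if not union:
        return 0.0  # two empty sets coincide
    return 1.0 - len(x & y) / len(union)

def objective(center: set, collection) -> float:
    """1-center objective: maximum distance from `center` to any input set."""
    return max(jaccard_distance(center, s) for s in collection)

def best_input_center(collection) -> set:
    # Candidate centers are the input sets themselves: a simple
    # 2-approximation baseline for the 1-center problem in any metric.
    return min(collection, key=lambda c: objective(c, collection))
```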
Predicting B Cell Receptor Substitution Profiles Using Public Repertoire Data
B cells develop high affinity receptors during the course of affinity
maturation, a cyclic process of mutation and selection. At the end of affinity
maturation, a number of cells sharing the same ancestor (i.e. in the same
"clonal family") are released from the germinal center; their amino acid
frequency profile reflects the allowed and disallowed substitutions at each
position. These clonal-family-specific frequency profiles, called "substitution
profiles", are useful for studying the course of affinity maturation as well as
for antibody engineering purposes. However, most often only a single sequence
is recovered from each clonal family in a sequencing experiment, making it
impossible to construct a clonal-family-specific substitution profile. Given
the public release of many high-quality large B cell receptor datasets, one may
ask whether it is possible to use such data in a prediction model for
clonal-family-specific substitution profiles. In this paper, we present the
method "Substitution Profiles Using Related Families" (SPURF), a penalized
tensor regression framework that integrates information from a rich assemblage
of datasets to predict the clonal-family-specific substitution profile for any
single input sequence. Using this framework, we show that substitution profiles
from similar clonal families can be leveraged together with simulated
substitution profiles and germline gene sequence information to improve
prediction. We fit this model on a large public dataset and validate the
robustness of our approach on an external dataset. Furthermore, we provide a
command-line tool in an open-source software package
(https://github.com/krdav/SPURF) implementing these ideas and providing easy
prediction using our pre-fit models.
SEED: efficient clustering of next-generation sequences.
Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads.
Results: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with a linear time and memory performance. When using SEED as a preprocessing tool on genome/transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oasis assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than non-preprocessed data as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms.
Availability: The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
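The spaced-seed hashing idea behind SEED can be sketched as follows: reads are bucketed by the characters at the "1" positions of a mask, so reads whose mismatches fall only on ignored positions collide into the same hash bucket. The mask below is made up for illustration and is not SEED's actual block spaced seed design:

```python
from collections import defaultdict

MASK = "1101101"  # hypothetical spaced seed: '1' = position included in the key

def seed_key(read: str, mask: str = MASK) -> str:
    """Project a read onto the '1' positions of the mask."""
    return "".join(base for base, m in zip(read, mask) if m == "1")

def bucket_reads(reads):
    """Group reads by spaced-seed key: reads differing only at ignored
    ('0') positions of the mask land in the same bucket."""
    buckets = defaultdict(list)
    for r in reads:
        buckets[seed_key(r)].append(r)
    return dict(buckets)
```

SEED's real clustering step additionally identifies virtual center sequences per bucket and checks the mismatch/overhang thresholds; this sketch only shows the hashing idea.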
Finding missing edges in networks based on their community structure
Many edge prediction methods have been proposed, based on various local or
global properties of the structure of an incomplete network. Community
structure is another significant feature of networks: Vertices in a community
are more densely connected than average. It is often true that vertices in the
same community have "similar" properties, which suggests that missing edges are
more likely to be found within communities than elsewhere. We use this insight
to propose a strategy for edge prediction that combines existing edge
prediction methods with community detection. We show that this method gives
better prediction accuracy than existing edge prediction methods alone.
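The strategy of combining an edge predictor with community structure can be sketched as follows. This is a simplified illustration, not the paper's exact scheme: the base predictor here is common neighbors, the communities are assumed precomputed by any detection method, and the boost factor `alpha` is an invented parameter:

```python
from itertools import combinations

def community_boosted_scores(adj, communities, alpha=2.0):
    """Score non-edges by common neighbors, boosting within-community pairs.

    `adj` maps each vertex to its set of neighbors; `communities` is a list
    of disjoint vertex sets from some community-detection method.
    """
    comm_of = {v: i for i, c in enumerate(communities) for v in c}
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u]:
            continue  # skip existing edges
        score = len(adj[u] & adj[v])  # common-neighbors score
        if comm_of[u] == comm_of[v]:
            score *= alpha  # missing edges are likelier inside communities
        scores[(u, v)] = score
    return scores
```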
Improvised Salient Object Detection and Manipulation
For salient-subject recognition, computer algorithms have relied heavily on scanning images systematically from top-left to bottom-right and applying brute force to locate objects of interest, which makes the process quite time-consuming. Here we discuss a novel approach and a simple solution to this problem. In this paper, we implement an approach to object detection and manipulation through a segmentation map, which helps to desaturate, or in other words wash out, the background of the image. Performance is evaluated using the Jaccard index against the well-known ground-truth target box technique.
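The Jaccard index used for evaluation here is the familiar intersection-over-union (IoU) measure; a minimal sketch for two axis-aligned boxes, assuming an (x1, y1, x2, y2) coordinate convention chosen for illustration:

```python
def box_iou(a, b):
    """Jaccard index (intersection over union) of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    """
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0
```

Evaluation against a pixel-level segmentation map works the same way, with the intersection and union counted over pixels instead of box areas.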