Multi-Dimensional Joins
We present three novel algorithms for performing multi-dimensional
joins, along with an in-depth survey and analysis of low-dimensional
spatial join techniques. The first algorithm, the Iterative Spatial Join,
performs a spatial join on low-dimensional data and is based
on a plane-sweep technique.
As we show analytically and experimentally,
the Iterative Spatial Join performs well when internal memory is
limited, compared to competing methods. This suggests that
the Iterative Spatial Join would be useful for very large data sets
or in situations where internal memory is a shared, and therefore
limited, resource, as in today's database engines, which share
internal memory among several concurrent queries. Furthermore, the
performance of the Iterative Spatial Join is predictable, and the
algorithm has no parameters that need tuning, unlike competing methods.
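The plane-sweep core underlying this kind of join can be sketched as follows. This is a minimal in-memory rectangle-intersection join, not the paper's external-memory algorithm; all names are illustrative.

```python
def sweep_join(rects_a, rects_b):
    """Report index pairs (i, j) of intersecting rectangles via a sweep over x.
    Rectangles are (x_min, y_min, x_max, y_max) tuples."""
    # Tag each rectangle with its source set and sort all by left edge.
    events = [(r[0], 0, i, r) for i, r in enumerate(rects_a)] + \
             [(r[0], 1, i, r) for i, r in enumerate(rects_b)]
    events.sort()
    active = [[], []]   # rectangles whose x-interval still spans the sweep line
    out = []
    for x, side, idx, r in events:
        # Drop rectangles the sweep line has passed.
        for s in (0, 1):
            active[s] = [(j, q) for j, q in active[s] if q[2] >= x]
        # Test the new rectangle against the opposite set's active list (y overlap).
        for j, q in active[1 - side]:
            if r[1] <= q[3] and q[1] <= r[3]:
                out.append((idx, j) if side == 0 else (j, idx))
        active[side].append((idx, r))
    return out
```

The external-memory variant would additionally spill and re-scan the active lists when they exceed available memory, which is where the iterative behavior comes in.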
The second algorithm, the Quickjoin algorithm,
performs a higher-dimensional
similarity join in which pairs of objects that lie within a
certain distance epsilon of each other are reported.
The Quickjoin algorithm overcomes drawbacks of competing methods,
such as requiring embedding methods on the data first or using
multi-dimensional indices, which limit
the ability to discriminate between objects in each
dimension, thereby degrading performance.
A formal analysis is provided of the Quickjoin method, and
experiments show that the Quickjoin method significantly outperforms
competing methods.
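A simplified sketch of the Quickjoin idea: recursively split the data by a pivot ball, keep "window" sets of objects within epsilon of the boundary (only these can form pairs that straddle the split), and brute-force small partitions. This assumes only a metric `dist` and omits the refinements of the full algorithm.

```python
import random

def quickjoin(objs, eps, dist, brute_size=16):
    """Report all pairs within eps of each other under the metric dist."""
    out = []

    def brute(xs):
        for i in range(len(xs)):
            for j in range(i + 1, len(xs)):
                if dist(xs[i], xs[j]) <= eps:
                    out.append((xs[i], xs[j]))

    def brute_cross(xs, ys):
        for x in xs:
            for y in ys:
                if dist(x, y) <= eps:
                    out.append((x, y))

    def join(xs):
        if len(xs) <= brute_size:
            brute(xs)
            return
        p = random.choice(xs)                # pivot
        rho = dist(p, random.choice(xs))     # ball radius from a sampled pair
        inside = [x for x in xs if dist(p, x) < rho]
        outside = [x for x in xs if dist(p, x) >= rho]
        if not inside or not outside:        # degenerate split: fall back
            brute(xs)
            return
        # Window sets: only objects within eps of the boundary can pair
        # across the partition.
        win_in = [x for x in inside if dist(p, x) >= rho - eps]
        win_out = [x for x in outside if dist(p, x) < rho + eps]
        join(inside)
        join(outside)
        brute_cross(win_in, win_out)

    join(list(objs))
    return out
```

Because only distances to the pivot are used, no embedding or multi-dimensional index is needed, which is the drawback of competing methods noted above.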
The third algorithm adapts
incremental join techniques to improve the
speed of calculating the Hausdorff distance, which
is used in applications such as image matching, image analysis,
and surface approximations.
The nearest neighbor incremental join technique for indices that
are based on hierarchical containment uses a priority queue
of index node pairs and bounds on the distance values between
pairs, both of which need to be modified in order to calculate the
Hausdorff distance. Results of experiments are described that
confirm the performance improvement.
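For intuition, here is a brute-force sketch of the Hausdorff distance with the classic early-break pruning, a simpler relative of the priority-queue bounds described above: once an element's nearest neighbor is closer than the running maximum, that element can no longer raise the result.

```python
def hausdorff(A, B, dist):
    """Hausdorff distance max(h(A,B), h(B,A)), where h(X,Y) is the largest
    nearest-neighbor distance from an element of X to the set Y."""
    def directed(X, Y):
        h = 0.0
        for x in X:
            best = float("inf")
            for y in Y:
                d = dist(x, y)
                if d < best:
                    best = d
                    if best <= h:   # pruned: x cannot raise the running maximum
                        break
            if best > h:
                h = best
        return h
    return max(directed(A, B), directed(B, A))
```

The index-based method replaces the inner scans with priority-queue traversal of node pairs, pruning whole subtrees using the same kind of bound.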
Finally, a survey is provided which, instead of just summarizing
the literature and presenting each technique in its entirety,
describes distinct components of the different techniques;
each technique is decomposed into an overall framework for
performing a spatial join.
Joint amalgamation of most parsimonious reconciled gene trees.
MOTIVATION
Traditionally, gene phylogenies have been reconstructed solely on the basis of molecular sequences; this, however, often does not provide enough information to distinguish between statistically equivalent relationships. To address this problem, several recent methods have incorporated information on the species phylogeny in gene tree reconstruction, leading to dramatic improvements in accuracy. Probabilistic methods are able to estimate all model parameters but are computationally expensive; parsimony methods, while generally more efficient, require prior estimates of the parameters and of the statistical support.
RESULTS
Here, we present the Tree Estimation using Reconciliation (TERA) algorithm, a parsimony-based, species-tree-aware method for gene tree reconstruction based on a scoring scheme combining duplication, transfer and loss costs with an estimate of the sequence likelihood. TERA explores all reconciled gene trees that can be amalgamated from a sample of gene trees. Using a large-scale simulated dataset, we demonstrate that TERA achieves the same accuracy as the corresponding probabilistic method while being faster, and outperforms other parsimony-based methods in both accuracy and speed. Running TERA on a set of 1099 homologous gene families from complete cyanobacterial genomes, we find that incorporating knowledge of the species tree results in a two-thirds reduction in the number of apparent transfer events.
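TERA's scoring builds on classical parsimony reconciliation. As background, here is a minimal sketch of LCA-based duplication inference, the duplication-only core that DTL methods extend with transfer and loss costs and, in TERA's case, sequence likelihood; this is not TERA itself, and the data layout is illustrative.

```python
def lca(u, v, parent, depth):
    """Lowest common ancestor in a species tree given parent links and depths."""
    while depth[u] > depth[v]:
        u = parent[u]
    while depth[v] > depth[u]:
        v = parent[v]
    while u != v:
        u, v = parent[u], parent[v]
    return u

def lca_reconcile(gene, parent, depth):
    """Map each gene-tree node to a species node and count duplications.
    Gene tree: nested 2-tuples; leaves are species names. A node is a
    duplication when it maps to the same species node as one of its children."""
    dups = 0
    def walk(n):
        nonlocal dups
        if not isinstance(n, tuple):
            return n
        a, b = walk(n[0]), walk(n[1])
        m = lca(a, b, parent, depth)
        if m == a or m == b:
            dups += 1
        return m
    root = walk(gene)
    return root, dups
```

For the gene tree ((A,B),(A,C)) on the species tree ((A,B)X,C)R, species A appears twice, so one duplication is inferred at the root.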
Conceptual robustness in simultaneous engineering: An extension of Taguchi's parameter design
Simultaneous engineering processes involve multifunctional teams; team members simultaneously make decisions about many parts of the product-production system and aspects of the product life cycle. This paper argues that such simultaneous distributed decisions should be based on communications about sets of possibilities rather than single solutions. By extending Taguchi's parameter design concepts, we develop a robust and distributed decision-making procedure based on such communications. The procedure shows how a member of a design team can make appropriate decisions based on incomplete information from the other members of the team. More specifically, it (1) treats variations among the designs considered by other members of the design team as conceptual noise; (2) shows how to incorporate such noises into decisions that are robust against these variations; (3) describes a method for using the same data to provide preference information back to the other team members; and (4) provides a procedure for determining whether to release the conceptually robust design or to wait for further decisions by others. The method is demonstrated by part of a distributed design process for a rotary CNC milling machine. While Taguchi's approach is used as a starting point because it is widely known, these results can be generalized to use other robust decision techniques.
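The idea of treating teammates' unresolved design choices as conceptual noise can be sketched as follows, with average loss over the noise combinations standing in for Taguchi's signal-to-noise analysis; the function names and loss model are illustrative, not the paper's procedure.

```python
from itertools import product
from statistics import mean

def robust_choice(options, noise_sets, loss):
    """Pick the option that minimizes average loss over every combination of
    the other team members' candidate designs (the 'conceptual noise')."""
    def expected_loss(option):
        return mean(loss(option, *combo) for combo in product(*noise_sets))
    return min(options, key=expected_loss)
```

The same per-option loss table could also be fed back to teammates as the preference information mentioned in point (3).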
Spatial Join Techniques
A variety of techniques for performing a spatial join are reviewed. Instead of just summarizing the literature and presenting each technique in its entirety, distinct components of the different techniques are described and each is decomposed into an overall framework for performing a spatial join. A typical spatial join technique consists of the following components: partitioning the data, performing internal-memory spatial joins on subsets of the data, and checking if the full polygons intersect. Each technique is decomposed into these components and each component addressed in a separate section so as to compare and contrast similar aspects of each technique. The goal of this survey is to describe the algorithms within each component in detail, comparing and contrasting competing methods, thereby enabling further analysis and experimentation with each component and allowing the best algorithms for a particular situation to be built piecemeal, or, even better, enabling an optimizer to choose which algorithms to use. Categories and Subject Descriptors: H.2.4 [Database Management]: Systems—Query processing; H.2.8 [Database Management]: Database Applications—Spatial databases and GIS
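The three components can be illustrated with a toy filter-and-refine join: a uniform grid stands in for whatever partitioning a real technique uses, the per-cell MBR test is the internal-memory join, and `intersects` is an assumed helper for the full-geometry check.

```python
from collections import defaultdict

def grid_join(polys_a, polys_b, mbr, intersects, cell=10.0):
    """Filter-and-refine spatial join sketch: (1) partition objects into grid
    cells by their MBRs, (2) join each cell's contents on MBR overlap (filter),
    (3) test the full geometries (refinement)."""
    def cells(box):
        x0, y0, x1, y1 = box
        for i in range(int(x0 // cell), int(x1 // cell) + 1):
            for j in range(int(y0 // cell), int(y1 // cell) + 1):
                yield (i, j)

    grid = defaultdict(lambda: ([], []))
    for side, polys in ((0, polys_a), (1, polys_b)):
        for k, p in enumerate(polys):
            for c in cells(mbr(p)):
                grid[c][side].append(k)

    seen, out = set(), []
    for ids_a, ids_b in grid.values():
        for i in ids_a:
            for j in ids_b:
                if (i, j) in seen:
                    continue                 # de-duplicate pairs seen in other cells
                seen.add((i, j))
                a, b = mbr(polys_a[i]), mbr(polys_b[j])
                if a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]:
                    if intersects(polys_a[i], polys_b[j]):   # refinement step
                        out.append((i, j))
    return out
```

Swapping the grid for a different partitioner, or the nested loop for a sweep, changes one component without touching the others, which is exactly the piecemeal composition the survey aims to enable.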
Tissue-specific and ubiquitous expression patterns from alternative promoters of human genes.
Transcriptome diversity provides the key to cellular identity. One important contribution to expression diversity is the use of alternative promoters, which creates mRNA isoforms by expanding the choice of transcription initiation sites of a gene. The proximity of the basal promoter to the transcription initiation site enables prediction of a promoter's location based on the gene annotations. We show that annotation of alternative promoters regulating expression of transcripts with distinct first exons enables a novel methodology to quantify expression levels and tissue specificity of mRNA isoforms. The use of distinct alternative first exons in 3,296 genes was examined using exon-microarray data from 11 human tissues. Comparing two transcripts from each gene we found that the activity of alternative promoters (i.e., P1 and P2) was not correlated through tissue specificity or level of expression. Furthermore, neither P1 nor P2 conferred any bias for tissue-specific or ubiquitous expression. Genes associated with specific diseases produced transcripts whose limited expression patterns were consistent with the tissue affected in disease. Notably, genes that were historically designated as tissue-specific or housekeeping had alternative isoforms that showed differential expression. Furthermore, only a small number of alternative promoters showed expression exclusive to a single tissue indicating that "tissue preference" provides a better description of promoter activity than tissue specificity. When compared to gene expression data in public databases, as few as 22% of the genes had detailed information for more than one isoform, whereas the remainder collapsed the expression patterns from individual transcripts into one profile. We describe a computational pipeline that uses microarray data to assess the level of expression and breadth of tissue profiles for transcripts with distinct first exons regulated by alternative promoters.
We conclude that alternative promoters provide individualized regulation that is confirmed through expression levels, tissue preference and chromatin modifications. Although the selective use of alternative promoters often goes uncharacterized in gene expression analyses, transcripts produced in this manner make unique contributions to the cell that require further exploration.
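For concreteness, one standard way to quantify the tissue specificity of an isoform's expression profile is the tau index; this is an illustrative choice of metric, not necessarily the score used in the paper's pipeline.

```python
def tau(expr):
    """Tissue-specificity index tau over per-tissue expression levels:
    0 for perfectly ubiquitous expression, 1 for expression confined to a
    single tissue. Values in between grade 'tissue preference'."""
    m = max(expr)
    return sum(1 - x / m for x in expr) / (len(expr) - 1)
```

Computing such a score separately for the P1- and P2-driven isoforms of a gene is one way to make the paper's "tissue preference" comparison quantitative.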
Resolution and reconciliation of non-binary gene trees with transfers, duplications and losses
Gene trees reconstructed from sequence alignments contain poorly supported branches when the phylogenetic signal in the sequences is insufficient to determine them all. When a species tree is available, the signal of gains and losses of genes can be used to correctly resolve the unsupported parts of the gene history. However, finding a most parsimonious binary resolution of a non-binary tree obtained by contracting the unsupported branches is NP-hard if transfer events are considered as possible gene scale events, in addition to gene origination, duplication and loss. We propose an exact, parameterized algorithm to solve this problem in single-exponential time, where the parameter is the number of connected branches of the gene tree that show low support from the sequence alignment or, equivalently, the maximum number of children of any node of the gene tree once the low-support branches have been collapsed. This improves on the best known algorithm by an exponential factor. We propose a way to choose among optimal solutions based on the available information. We show the usability of this principle on several simulated and biological datasets. The results are comparable in quality to several other tested methods having similar goals, but our approach provides a lower running time and a guarantee that the produced solution is optimal.
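The hardness stems from the number of binary resolutions of a polytomy, which grows as a double factorial in the number of children, the quantity the algorithm's parameter bounds. A quick sketch of that count:

```python
def num_resolutions(k):
    """Number of rooted binary trees resolving a polytomy with k children:
    (2k - 3)!! = 1 * 3 * 5 * ... * (2k - 3)."""
    n = 1
    for m in range(3, 2 * k - 2, 2):
        n *= m
    return n
```

Already at eight children there are 135,135 resolutions per polytomy, which is why naively enumerating resolutions and reconciling each one does not scale, and a parameterized exact algorithm is worthwhile.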
A fast method for calculating reliable event supports in tree reconciliations via Pareto optimality
Background: Given a gene and a species tree, reconciliation methods attempt to retrieve the macro-evolutionary events that best explain the discrepancies between the two tree topologies. The DTL parsimonious approach searches for a most parsimonious reconciliation between a gene tree and a (dated) species tree, considering four possible macro-evolutionary events (speciation, duplication, transfer, and loss) with specific costs. Unfortunately, many events are erroneously predicted due to errors in the input trees, inappropriate input cost values or because of the existence of several equally parsimonious scenarios. It is thus crucial to provide a measure of the reliability for predicted events. It has been recently proposed that the reliability of an event can be estimated via its frequency in the set of most parsimonious reconciliations obtained using a variety of reasonable input cost vectors. To compute such a support, a straightforward but time-consuming approach is to generate the costs slightly departing from the original ones, independently compute the set of all most parsimonious reconciliations for each vector, and combine these sets a posteriori. Another proposed approach uses Pareto-optimality to partition cost values into regions which induce reconciliations with the same number of DTL events. The support of an event is then defined as its frequency in the set of regions. However, often, the number of regions is not large enough to provide reliable supports. Results: We present here a method to compute efficiently event supports via a polynomial-sized graph, which can represent all reconciliations for several different costs. Moreover, two methods are proposed to take into account alternative input costs: either explicitly providing an input cost range or allowing a tolerance on the extra cost of a reconciliation relative to the optimum.
Our methods are faster than the region-based method, substantially faster than the sampling-costs approach, and have a higher event-prediction accuracy on simulated data. Conclusions: We propose a new approach to improve the accuracy of event supports for parsimonious reconciliation methods to account for uncertainty in the input costs. Furthermore, because of their speed, our methods can be used on large gene families. Our algorithms are implemented in the ecceTERA program, freely available from http://mbb.univ-montp2.fr/MBB/
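The straightforward sampling-costs baseline described above can be sketched as follows; `mp_events`, a callback returning the event set of a most parsimonious reconciliation for a given cost vector, is an assumed interface, not ecceTERA's actual API.

```python
from collections import Counter
from itertools import product

def event_supports(mp_events, base, spread=0.2, steps=3):
    """Perturb each event cost on a small grid around its base value,
    recompute the most parsimonious event set for every cost vector, and
    report each event's frequency across vectors as its support."""
    grids = [[c * (1 - spread + 2 * spread * i / (steps - 1)) for i in range(steps)]
             for c in base]
    counts, total = Counter(), 0
    for costs in product(*grids):
        for event in mp_events(costs):
            counts[event] += 1
        total += 1
    return {event: n / total for event, n in counts.items()}
```

Note the cost of this baseline: `steps ** len(base)` independent reconciliation runs, which is what the paper's polynomial-sized reconciliation graph avoids.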