
    Fast Statistical Alignment

    We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment (FSA) program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree, and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment (previously available only from alignment programs that use computationally expensive Markov chain Monte Carlo approaches), yet it can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/
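    As a rough illustration of the posterior-combination idea, here is a minimal two-sequence sketch: given match posteriors from a pair HMM, accept candidate matches in decreasing posterior order while keeping the alignment one-to-one and non-crossing. FSA's sequence annealing generalizes this to many sequences with a proper partial-order consistency check; the function name and threshold below are hypothetical, and the posterior matrix is assumed given.

```python
def greedy_posterior_alignment(posterior, threshold=0.5):
    """posterior[i][j]: P(x_i ~ y_j | x, y) from a pair HMM.
    Greedily accept matches in decreasing posterior order, keeping
    the alignment one-to-one and non-crossing (monotonic)."""
    candidates = sorted(
        ((p, i, j)
         for i, row in enumerate(posterior)
         for j, p in enumerate(row)
         if p >= threshold),
        reverse=True,
    )
    used_i, used_j, matches = set(), set(), []
    for p, i, j in candidates:
        if i in used_i or j in used_j:
            continue
        # Reject crossings: no accepted pair may straddle (i, j).
        if any((i2 < i) != (j2 < j) for i2, j2 in matches):
            continue
        matches.append((i, j))
        used_i.add(i)
        used_j.add(j)
    return sorted(matches)

# Toy posterior matrix for two sequences of length 3.
P = [[0.9, 0.1, 0.0],
     [0.2, 0.8, 0.3],
     [0.0, 0.2, 0.7]]
print(greedy_posterior_alignment(P))  # [(0, 0), (1, 1), (2, 2)]
```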

    Separating intrinsic alignment and galaxy-galaxy lensing

    The coherent physical alignment of galaxies is an important systematic for gravitational lensing studies as well as a probe of the physical mechanisms involved in galaxy formation and evolution. We develop a formalism for treating this intrinsic alignment (IA) in the context of galaxy-galaxy lensing and present an improved method for measuring IA contamination, which can arise when sources physically associated with the lens are placed behind the lens due to photometric redshift scatter. We apply the technique to recent Sloan Digital Sky Survey (SDSS) measurements of Luminous Red Galaxy lenses and typical (L*) source galaxies with photometric redshifts selected from the SDSS imaging data. Compared to previous measurements, this method has the advantage of being fully self-consistent in its treatment of the IA and lensing signals, solving for the two simultaneously. We find an IA signal consistent with zero, placing tight constraints on both the magnitude of the IA effect and its potential contamination of the lensing signal. While these constraints depend on source selection and redshift quality, the method can be applied to any measurement that uses photometric redshifts. We obtain a model-independent upper limit of roughly 10% IA contamination for projected separations of approximately 0.1-100 Mpc/h. With more stringent photo-z cuts and reasonable assumptions about the physics of intrinsic alignments, this upper limit is reduced to 1-2%. These limits are well below the statistical error of the current lensing measurements. Our results suggest that IA will not present intractable challenges to the next generation of galaxy-galaxy lensing experiments, and the methods presented here should continue to aid in our understanding of alignment processes and in the removal of IA from the lensing signal.
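    Schematically, the simultaneous treatment can be thought of as a contamination model in which the observed excess surface density mixes the lensing signal with an IA term weighted by the fraction of physically associated source-lens pairs, estimated from the boost factor B(r_p). The notation below is illustrative only, not the paper's exact estimator:

```latex
% Schematic contamination model (illustrative notation):
% the IA term is weighted by the physically-associated pair fraction.
\Delta\Sigma_{\mathrm{obs}}(r_p)
  = \Delta\Sigma_{\mathrm{lens}}(r_p)
  + \frac{B(r_p) - 1}{B(r_p)} \, \Delta\Sigma_{\mathrm{IA}}(r_p)
```

    Measuring the left-hand side and B(r_p) for source samples with different photometric-redshift cuts yields a system that can, in principle, be solved for the lensing and IA terms together.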

    Gapped alignment of protein sequence motifs through Monte Carlo optimization of a hidden Markov model

    BACKGROUND: Certain protein families are highly conserved across distantly related organisms and belong to large and functionally diverse superfamilies. The patterns of conservation present in these protein sequences presumably are due to selective constraints maintaining important but unknown structural mechanisms, with some constraints specific to each family and others shared by a larger subset or by the entire superfamily. To exploit these patterns as a source of functional information, we recently devised a statistically based approach called contrast hierarchical alignment and interaction network (CHAIN) analysis, which infers the strengths of various categories of selective constraints from co-conserved patterns in a multiple alignment. The power of this approach strongly depends on the quality of the multiple alignments, which thus motivated development of theoretical concepts and strategies to improve alignment of conserved motifs within large sets of distantly related sequences. RESULTS: Here we describe a hidden Markov model (HMM), an algebraic system, and Markov chain Monte Carlo (MCMC) sampling strategies for alignment of multiple sequence motifs. The MCMC sampling strategies are useful both for alignment optimization and for adjusting position-specific background amino acid frequencies for alignment uncertainties. Associated statistical formulations provide an objective measure of alignment quality as well as automatic gap penalty optimization. Improved alignments obtained in this way are compared with PSI-BLAST based alignments within the context of CHAIN analysis of three protein families: Giα subunits, prolyl oligopeptidases, and transitional endoplasmic reticulum (p97) AAA+ ATPases. CONCLUSION: While not entirely replacing PSI-BLAST based alignments, which likewise may be optimized for CHAIN analysis using this approach, these motif-based methods often more accurately align very distantly related sequences and thus can provide a better measure of selective constraints. In some instances, these new approaches also provide a better understanding of family-specific constraints, as we illustrate for p97 ATPases. Programs implementing these procedures and supplementary information are available from the authors.
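    A toy illustration of the MCMC idea: the classic Gibbs sampling strategy for ungapped motif alignment resamples one sequence's motif position against a position-specific model built from the others. This sketch assumes nothing from the paper beyond that general scheme (a DNA alphabet stands in for the protein case, and gaps and richer statistics are omitted):

```python
import math
import random

def gibbs_motif_sampler(seqs, w, iters=2000, alphabet="ACGT", seed=0):
    """Toy Gibbs sampler for ungapped motif alignment: hold one
    sequence out, build a position-specific model from the rest,
    and resample the held-out motif start from the posterior."""
    rng = random.Random(seed)
    pos = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(iters):
        k = rng.randrange(len(seqs))  # sequence to resample
        # Pseudocounted position-specific frequencies from the others.
        counts = [{a: 1.0 for a in alphabet} for _ in range(w)]
        for i, s in enumerate(seqs):
            if i == k:
                continue
            for j in range(w):
                counts[j][s[pos[i] + j]] += 1.0
        total = len(seqs) - 1 + len(alphabet)
        # Score every candidate start in the held-out sequence.
        s = seqs[k]
        weights = []
        for start in range(len(s) - w + 1):
            logp = sum(math.log(counts[j][s[start + j]] / total)
                       for j in range(w))
            weights.append(math.exp(logp))
        pos[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return pos

seqs = ["TTGACGGT", "ACGGTTTT", "GGGACGGA"]
print(gibbs_motif_sampler(seqs, w=5))  # sampled motif start positions
```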

    Combining Linguistic and Machine Learning Techniques for Word Alignment Improvement

    Alignment of words, i.e., detection of corresponding units between two sentences that are translations of each other, has been shown to be crucial for the success of many NLP applications such as statistical machine translation (MT), construction of bilingual lexicons, word-sense disambiguation, and projection of resources between languages. With the availability of large parallel texts, statistical word alignment systems have proven to be quite successful on many language pairs. However, these systems still face several challenges due to the complexity of the word alignment problem, the lack of sufficient training data, the difficulty of learning statistics correctly, translation divergences, and the lack of a means for incremental incorporation of linguistic knowledge. This thesis presents two new frameworks to improve existing word alignments using supervised learning techniques. In the first framework, two rule-based approaches are introduced. The first approach, Divergence Unraveling for Statistical MT (DUSTer), specifically targets translation divergences and corrects the alignment links related to them using a set of manually crafted, linguistically motivated rules. In the second approach, Alignment Link Projection (ALP), the rules are generated automatically by adapting transformation-based error-driven learning to the word alignment problem. By conditioning the rules on the initial alignment and linguistic properties of the words, ALP manages to categorize the errors of the initial system and correct them. The second framework, Multi-Align, is an alignment combination framework based on classifier ensembles. The thesis presents a neural-network based implementation of Multi-Align, called NeurAlign. By treating individual alignments as classifiers, NeurAlign builds an additional model to learn how to combine the input alignments effectively. The evaluations show that the proposed techniques yield significant improvements (up to 40% relative error reduction) over existing word alignment systems on four different language pairs, even with limited manually annotated data. Moreover, all three systems allow an easy integration of linguistic knowledge into statistical models without the need for large modifications to existing systems. Finally, the improvements are analyzed using various measures, including the impact of improved word alignments in an external application, phrase-based MT.
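    As a baseline flavor of alignment combination, the sketch below keeps a link when enough input aligners vote for it. NeurAlign replaces this fixed rule with a learned neural model over link features; the code is only the simplest instance of treating aligners as an ensemble, and the names are hypothetical.

```python
from collections import Counter

def combine_alignments(alignments, min_votes=2):
    """Simplified alignment combination: keep a link if at least
    `min_votes` input aligners propose it. A learned combiner
    (as in NeurAlign) would replace this majority-vote rule."""
    votes = Counter(link for a in alignments for link in a)
    return {link for link, v in votes.items() if v >= min_votes}

# Links are (source_index, target_index) pairs from three aligners.
a1 = {(0, 0), (1, 2), (2, 1)}
a2 = {(0, 0), (1, 1), (2, 1)}
a3 = {(0, 0), (1, 2), (3, 3)}
print(sorted(combine_alignments([a1, a2, a3])))
# [(0, 0), (1, 2), (2, 1)]
```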

    Development of a Statistical Theory-Based Capital Cost Estimating Methodology for Light Rail Transit Corridor Evaluation Under Varying Alignment Characteristics

    The context of this research is the investigation and application of an approach to develop an effective evaluation methodology for establishing the investment worthiness of a range of potential Light Rail Transit (LRT) major system improvements (alternatives). Central to addressing mobility needs in a corridor is the ability to estimate capital costs at the planning level through a reliable and replicable methodology. This research extends the present state of practice, which relies primarily on either cost averages (derived by reviewing cost data of implemented LRT projects) or cost categories in high and low cost ranges. Current methodologies often cannot produce accurate estimates due to the lack of engineering data at the planning level of project development. This research strives to improve current practice by developing a prediction model for system costs based on specific project alignment characteristics. The review of the literature reflects a wide range of estimates of capital cost within each of the contemporary mass transit modes. The primary problem addressed in this research is the challenge of producing capital cost estimates at the planning level for the LRT mode of public transportation in the study corridor. Furthermore, the capital cost estimates for each mode of public transportation under consideration must be sensitive to a range of independent variables, such as vertical and horizontal alignment characteristics, environmentally sensitive areas, urban design, and other unique cost-controlling factors. The currently available methodologies for estimating capital cost at the planning level, by transit mode for alternative alignments, have limitations. The focus of this research is the development of a statistical theory-based capital cost-estimating methodology for use at the planning level for transit system evaluations. Model development activities include sample size selection, model framework and selection, and model development and testing. The developed model utilizes statistical theory to enhance the quality of capital cost estimation for LRT investments under varying alignment characteristics. This research identified that alignment guideway length and station elements (by grade type) are the best predictors of LRT cost per mile at the planning level of project development. To validate the regression model developed for this research, one LRT system was removed from the data set and run through the final multiple linear regression equation to assess the model's predictive accuracy. Comparing the model's estimated cost to the project's final construction cost resulted in a 26.9% error. This percentage error seems somewhat high but is acceptable at the planning level, since a 30% contingency (or higher) is typically applied to early-stage cost estimates. Additionally, a comparison was made for all LRT systems used in the model estimation: the percent error ranges from 2.4% to 111.5%, with just over 60% of the projects' predicted cost estimates falling within 30% of actual cost. The model appears to be a useful tool for estimating LRT cost per mile at the planning level when only limited alignment data are available. However, further development of improved predictive models will become possible as additional LRT system data become available.
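    The workflow described above is a multiple linear regression of cost per mile on alignment characteristics, validated by holding one system out. A minimal sketch of that workflow follows, with invented numbers and simplified features (the paper's actual predictors are guideway length and station elements by grade type):

```python
import numpy as np

# Hypothetical planning-level data (all numbers invented):
# guideway miles and station count per project, against
# capital cost per mile in $M.
X = np.array([
    [10.2, 12],
    [ 7.8,  9],
    [15.1, 18],
    [ 5.5,  6],
    [12.0, 14],
    [ 9.3, 10],
])
y = np.array([48.0, 35.0, 72.0, 30.0, 58.0, 44.0])

def fit_ols(X, y):
    """Ordinary least squares with an intercept column."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def loo_percent_errors(X, y):
    """Leave-one-out check, mirroring the hold-one-system-out
    validation described in the abstract."""
    errs = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        beta = fit_ols(X[mask], y[mask])
        pred = beta[0] + X[i] @ beta[1:]
        errs.append(abs(pred - y[i]) / y[i] * 100.0)
    return errs

print(fit_ols(X, y))                                 # fitted coefficients
print([round(e, 1) for e in loo_percent_errors(X, y)])  # % errors per project
```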

    Tracking relevant alignment characteristics for machine translation

    In most statistical machine translation (SMT) systems, bilingual segments are extracted via word alignment. In this paper we compare alignments tuned directly according to alignment F-score and BLEU score in order to investigate the alignment characteristics that are helpful in translation. We report results for two different SMT systems (a phrase-based and an n-gram-based system) on Chinese-to-English IWSLT data and Spanish-to-English European Parliament data. We give alignment hints to improve BLEU score, depending on the SMT system used and the type of corpus.
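    For reference, the alignment F-score used for tuning is computed over sets of alignment links. A minimal version (precision and recall over links, with an alpha weight; this sketch ignores the sure/possible link distinction used in some evaluations) might look like this:

```python
def alignment_fscore(hypothesis, reference, alpha=0.5):
    """Alignment F-measure over sets of (source, target) links.
    alpha weights precision vs. recall; alpha=0.5 gives the
    balanced F-score commonly reported."""
    hyp, ref = set(hypothesis), set(reference)
    if not hyp or not ref:
        return 0.0
    precision = len(hyp & ref) / len(hyp)
    recall = len(hyp & ref) / len(ref)
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)

hyp = {(0, 0), (1, 2), (2, 1), (3, 3)}
ref = {(0, 0), (1, 1), (2, 1), (3, 3)}
print(alignment_fscore(hyp, ref))  # 0.75
```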

    Parallel Treebanks in Phrase-Based Statistical Machine Translation

    Given much recent discussion and the shift in focus of the field, it is becoming apparent that the incorporation of syntax is the way forward for the current state of the art in machine translation (MT). Parallel treebanks are a relatively recent innovation and appear to be ideal candidates for MT training material. However, until recently there has been no means of building them other than by hand. In this paper, we describe how we make use of new tools to automatically build a large parallel treebank and extract a set of linguistically motivated phrase pairs from it. We show that adding these phrase pairs to the translation model of a baseline phrase-based statistical MT (PBSMT) system leads to significant improvements in translation quality. We describe further experiments on incorporating parallel treebank information into PBSMT, such as word alignments. We investigate the conditions under which the incorporation of parallel treebank data performs optimally. Finally, we discuss the potential of parallel treebanks in other paradigms of MT.
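    For context, phrase pairs in PBSMT are conventionally extracted from word alignments with a consistency criterion: a source/target span pair is kept only if no alignment link crosses its boundary. A compact sketch of that standard heuristic (without the usual extension over unaligned words, and without the treebank constraints the paper adds):

```python
def extract_phrase_pairs(alignment, src_len, tgt_len, max_len=4):
    """Standard phrase-pair extraction from a word alignment:
    a source span and target span form a pair if no alignment
    link crosses the span boundary (the consistency criterion)."""
    pairs = []
    for i1 in range(src_len):
        for i2 in range(i1, min(i1 + max_len, src_len)):
            # Target positions linked to the source span.
            tgt = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt:
                continue
            j1, j2 = min(tgt), max(tgt)
            if j2 - j1 >= max_len:
                continue
            # Consistency: no word in [j1, j2] links outside [i1, i2].
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.append(((i1, i2), (j1, j2)))
    return pairs

# Links (source_index, target_index) for a 3-word / 3-word sentence pair.
links = {(0, 0), (1, 2), (2, 1)}
for sp, tp in extract_phrase_pairs(links, 3, 3):
    print(sp, tp)
```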

    Fiber-Flux Diffusion Density for White Matter Tracts Analysis: Application to Mild Anomalies Localization in Contact Sports Players

    We present the concept of fiber-flux density for locally quantifying white matter (WM) fiber bundles. By combining scalar diffusivity measures (e.g., fractional anisotropy) with fiber-flux measurements, we define new local descriptors called Fiber-Flux Diffusion Density (FFDD) vectors. Applying each descriptor throughout fiber bundles allows along-tract coupling of a specific diffusion measure with geometrical properties, such as fiber orientation and coherence. A key step in the proposed framework is the construction of an FFDD dissimilarity measure for sub-voxel alignment of fiber bundles, based on the fast marching method (FMM). The obtained aligned WM tract profiles enable meaningful inter-subject comparisons and group-wise statistical analysis. We demonstrate our method using two different datasets of contact sports players. Along-tract pairwise comparison as well as group-wise analysis, with respect to non-player healthy controls, reveal significant and spatially consistent FFDD anomalies. Comparing our method with along-tract FA analysis shows improved sensitivity to subtle structural anomalies in football players over standard FA measurements.
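    As a loose sketch of the descriptor idea only (not the published FFDD construction), one can couple a scalar diffusivity measure with local fiber geometry by scaling each streamline point's unit tangent by its FA value, then compare aligned profiles pointwise. Every name below is hypothetical, and the Euclidean dissimilarity merely stands in for the paper's FMM-based alignment cost:

```python
import numpy as np

def ffdd_like_descriptor(tangents, fa):
    """Couple a scalar diffusivity measure (here FA) with local fiber
    direction by scaling each point's unit tangent by its FA value."""
    t = np.asarray(tangents, dtype=float)
    t /= np.linalg.norm(t, axis=1, keepdims=True)    # unit tangents
    return t * np.asarray(fa, dtype=float)[:, None]  # FA-weighted vectors

def dissimilarity(d1, d2):
    """Pointwise Euclidean dissimilarity between two aligned
    descriptor profiles."""
    return np.linalg.norm(np.asarray(d1) - np.asarray(d2), axis=1)

tang = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 1.0, 0.0]]
fa = [0.45, 0.50, 0.40]
d = ffdd_like_descriptor(tang, fa)
print(dissimilarity(d, d))  # zeros: identical profiles
```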