426 research outputs found

    Survey on Instruction Selection: An Extensive and Modern Literature Review

    Instruction selection is one of three optimisation problems involved in the code generator backend of a compiler. The instruction selector is responsible for transforming an input program from its target-independent representation into a target-specific form by making best use of the available machine instructions. Hence instruction selection is a crucial part of efficient code generation. Despite ongoing research since the late 1960s, the last comprehensive survey on the field was written more than 30 years ago. As new approaches and techniques have appeared since its publication, there is a need for a new, up-to-date review of the current body of literature. This report addresses that need by performing an extensive review and categorisation of existing research. The report therefore supersedes and extends the previous surveys, and also attempts to identify where future research should be directed.
    Comment: Major changes: merged simulation chapter with macro expansion chapter; addressed misunderstandings of several approaches; completely rewrote many parts of the chapters; strengthened the discussion of many approaches; revised the drawing of all trees and graphs to put the root at the top instead of at the bottom; added appendix for listing the approaches in a table. See doc for more inf

    Efficient Algorithms for Unrelated Parallel Machine Scheduling Considering Time of Use Pricing and Demand Charges

    There is an ever-increasing focus on sustainability and energy consumption worldwide. Manufacturing is one of the major areas where energy reduction is not only environmentally beneficial but also financially beneficial. Industrial consumers pay for their electricity according to prices that fluctuate throughout the day. These price fluctuations are in place to shift consumption away from “peak” times, when electricity is in the highest demand. In addition to this consumption cost, industrial consumers are charged according to their highest level of demand in a given window of time in the form of demand charges. This paper presents multiple solution methods to solve a parallel machine shop scheduling problem to minimize the total energy cost of the production schedule under Time of Use (TOU) and demand charge pricing. The greedy heuristic and genetic algorithm developed are designed to provide efficient solutions to this problem. The results of these methods are compared to a previously developed integer program (IP) solved using CPLEX with respect to the quality of the solution and the computational time required to solve it. Findings of these tests show that the greedy heuristic handles the test problems with only a small optimality gap to the genetic algorithm and optimal IP solution. The largest test problems could not be solved by the genetic algorithm in the provided time period due to difficulty generating an initial solution population. However, when successful, the genetic algorithm performed comparably to the CPLEX solver in terms of solution quality yet provided faster solve times
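The cost structure described in this abstract (a time-of-use consumption charge plus a demand charge on the peak load) can be illustrated with a small sketch. This is not the paper's heuristic: the function names, the one-hour slots, the identical machines, and the rule of placing the most energy-hungry job first are all assumptions made for illustration.

```python
from itertools import product


def total_cost(load, prices, demand_rate):
    """TOU consumption cost plus a demand charge on the peak load.

    Assumes one-hour slots, so load in kW equals energy in kWh per slot.
    """
    return sum(l * p for l, p in zip(load, prices)) + demand_rate * max(load)


def greedy_schedule(jobs, n_machines, prices, demand_rate):
    """Place jobs one at a time where the total cost grows least.

    jobs maps a job name to (duration_in_slots, power_kw). Machines are
    treated as identical here, which simplifies the unrelated-machine case.
    """
    horizon = len(prices)
    busy = [[False] * horizon for _ in range(n_machines)]
    load = [0.0] * horizon
    plan = {}
    # Hypothetical ordering rule: largest energy requirement first.
    ordered = sorted(jobs.items(), key=lambda kv: -kv[1][0] * kv[1][1])
    for name, (dur, power) in ordered:
        best = None
        for m, start in product(range(n_machines), range(horizon - dur + 1)):
            slots = range(start, start + dur)
            if any(busy[m][t] for t in slots):
                continue  # machine m is occupied in this window
            trial = load[:]
            for t in slots:
                trial[t] += power
            cost = total_cost(trial, prices, demand_rate)
            if best is None or cost < best[0]:
                best = (cost, m, start)
        if best is None:
            raise ValueError(f"no feasible slot for job {name!r}")
        _, m, start = best
        for t in range(start, start + dur):
            busy[m][t] = True
            load[t] += power
        plan[name] = (m, start)
    return plan, total_cost(load, prices, demand_rate)
```

With a low demand rate the sketch stacks jobs in the cheap slots; with a high demand rate it spreads them out to keep the peak low, even though some energy is then bought at peak prices.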

    Interruptional Activity and Simulation of Transposable Elements

    Transposable elements (TEs) are interspersed DNA sequences that can move or copy to new positions within a genome. The active TEs along with the remnants of many transposition events over millions of years constitute 46.69% of the human genome. TEs are believed to promote speciation and their activities play a significant role in human disease. The 22 AluY and 6 AluS TE subfamilies have been the most active TEs in recent human history, and their transposition has been implicated in several inherited human diseases and in various forms of cancer through integration into genes. Therefore, understanding the transposition activities is very important. Recently, some work has been done to quantify the activity levels of active Alu transposable elements based on variation in the sequence. Here, given this activity data, an analysis of TE activity based on the position of mutations is conducted. Two different methods/simulations are created to computationally predict so-called harmful mutation regions in the consensus sequence of a TE; that is, mutations that occur in these regions decrease the transposition activities dramatically. The methods are applied to AluY, the youngest and most active Alu subfamily, to identify the harmful regions lying in its consensus, and verifications are presented using the activity of AluY elements and the secondary structure of the AluYa5 RNA, providing evidence that the method is successfully identifying harmful mutation regions. A supplementary simulation also shows that the identified harmful regions covering the AluYa5 RNA functional regions are not occurring by chance. Therefore, mutations within the harmful regions alter the mobile activity levels of active AluY elements. One of the methods is then applied to two additional TE families, the Alu family and the L1 family, to detect the harmful regions in these elements computationally.
Understanding and predicting the evolution of these TEs is of interest in understanding their powerful evolutionary force in shaping their host genomes. In this thesis, a formal model of TE fragments and their interruptions is devised that provides definitions that are compatible with biological nomenclature, while still providing a suitable formal foundation for computational analysis. Essentially, this model is used for fixing terminology that was misleading in the literature, and it helps to describe further TE problems in a precise way. Indeed, later chapters include two other models built on top of this model: the sequential interruption model and the recursive interruption model, both used to analyze TE activity throughout evolution. The sequential interruption model is defined between TEs that occur in a genomic sequence to estimate how often TEs interrupt other TEs, which has been shown to be useful in predicting their ages and their activity throughout evolution. Here, this prediction from the sequential interruptions is shown to be closely related to a classic matrix optimization problem: the Linear Ordering Problem (LOP). By applying a well-studied method of solving the LOP, Tabu search, to the sequential interruption model, a relative age order of all TEs in the human genome is predicted from a single genome. A comparison of the TE ordering between Tabu search and the method used in [47] shows that Tabu search solves the TE problem far more efficiently, while still achieving a more accurate result. As a result of the improved efficiency, a prediction on all human TEs is constructed, whereas it was previously only predicted for a small fraction of the human TEs. When many insertions occurred throughout the evolution of a genomic sequence, the interruptions nest in a recursive pattern. The nested TEs are very helpful in revealing the age of the TEs, but cannot be fully represented by the sequential interruption model.
In the recursive interruption model, a specific context-free grammar is defined, describing a general and simple way to capture the recursive nature in which TEs nest themselves into other TEs. Then, each production of the context-free grammar is associated with a probability to convert the context-free grammar into a stochastic context-free grammar that maximizes the applications of the productions corresponding to TE interruptions. A modified version of an algorithm to parse context-free grammars, the CYK algorithm, that takes into account these probabilities is then used to find the most likely parse tree(s) predicting the TE nesting in an efficient fashion. The recursive interruption model produces small parse trees representing local TE interruptions in a genome. These parse trees are a natural way of grouping TE fragments in a genomic sequence together to form interruptions. Next, some tree adjustment operations are given to simplify these parse trees and obtain more standard evolutionary trees. Then an overall TE-interaction network is created by merging these standard evolutionary trees into a weighted directed graph. This TE-interaction network is a rich representation of the predicted interactions between all TEs throughout evolution and is a powerful tool to predict the insertion evolution of these TEs. It is applied to the human genome, but can be easily applied to other genomes. Furthermore, it can also be applied to multiple related genomes where common TEs exist in order to study the interactions between TEs and the genomes. Lastly, a simulation of TE transpositions throughout evolution is developed. This is especially helpful in understanding the dynamics of how TEs evolve and impact their host genomes. Also, it is used as a verification technique for the previous theoretical models in the thesis.
By feeding the simulated TE remnants and activity data into the theoretical models, a relative age order is predicted using the sequential interruption model, and a quantified correlation between this predicted order and the input age order in the simulation can be calculated. Then, a TE-interaction network is constructed using the recursive interruption model on the simulated data, which can also be converted into a linear age order by feeding the adjacency matrix of the network to Tabu search. Another correlation is calculated between the predicted age order from the recursive interruption model and the input age order. An average correlation of ten simulations is calculated for each model, which suggests that in general, the recursive interruption model performs better than the sequential interruption model in predicting a correct relative age order of TEs. Indeed, the recursive interruption model achieves an average correlation value of ρ = 0.939 with the correct simulated answer
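The link between interruption counts and the Linear Ordering Problem mentioned in this abstract can be made concrete with a minimal sketch. It assumes a hypothetical matrix in which entry `[i][j]` counts how often TE `i` interrupts TE `j`, and it orders TEs so that as many interruptions as possible point from earlier (younger) to later (older) positions. For brevity it uses a plain first-improvement local search over insertion moves rather than Tabu search, so it illustrates the objective and is not the thesis's method.

```python
def lop_score(matrix, order):
    """Sum of matrix entries that end up above the diagonal under `order`."""
    n = len(order)
    return sum(
        matrix[order[a]][order[b]]
        for a in range(n)
        for b in range(a + 1, n)
    )


def local_search(matrix, order=None):
    """Hill climbing for the LOP using insertion moves.

    Repeatedly moves one element to another position whenever doing so
    raises the score, and stops at a local optimum.
    """
    n = len(matrix)
    order = list(range(n)) if order is None else list(order)
    best = lop_score(matrix, order)
    improved = True
    while improved:
        improved = False
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                cand = order[:]
                cand.insert(j, cand.pop(i))  # move element at i to position j
                score = lop_score(matrix, cand)
                if score > best:
                    order, best, improved = cand, score, True
                    break
            if improved:
                break
    return order, best
```

Tabu search differs from this sketch mainly in that it also accepts non-improving moves while forbidding recently reversed ones, which lets it escape the local optima where plain hill climbing stops.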

    Prospects and limitations of full-text index structures in genome analysis

    The combination of continual advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article presents a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, and their practical limitations are explained and compared
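As a small illustration of what a full-text index structure does, the sketch below builds a suffix array and answers pattern queries by binary search. The abstract does not say which structures the article covers in detail, so the choice of a suffix array is an assumption, and the naive construction by sorting suffixes is only practical for short strings.

```python
def build_suffix_array(text):
    """Return start positions of all suffixes in lexicographic order.

    Naive construction: sorting full suffixes is simple but slow for
    genome-scale inputs, where specialised algorithms are used instead.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return all start positions of `pattern`, using two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])
```

Once the index is built, each query needs only a logarithmic number of string comparisons, at the price of storing one integer per text position. That is one instance of the memory-time trade-off the abstract refers to.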

    Computational analysis of human genomic variants and lncRNAs from sequence data

    High-throughput sequencing technologies have been developed and applied to human genome studies for nearly 20 years. These technologies have provided numerous research applications and have significantly expanded our knowledge about the human genome. In this thesis, computational methods that utilize sequence data to study human genomic variants and transcripts were evaluated and developed. Indel stands for insertion and deletion, two types of common genomic variants that are widespread in the human genome. Detecting indels from human genomes is a crucial step in diagnosing indel-related genomic disorders and may potentially identify novel indel markers for studying certain diseases. Compared with previous techniques, high-throughput sequencing technologies, especially next-generation sequencing (NGS) technology, make it possible to detect indels accurately and efficiently across wide ranges of the genome. In the first part of the thesis, tools with indel calling abilities are evaluated with an assortment of indels and different NGS settings. The results show that the selection of tools and NGS settings affects indel detection significantly, which provides suggestions for tool selection and future development. In bioinformatics analysis, an indel’s position can be marked inconsistently on the reference genome, which may result in an indel having different but equivalent representations and cause trouble for downstream analysis. This problem is related to the complex sequence context of the indels, for example, short tandem repeats (STRs), where the same short stretch of nucleotides is amplified. In the second part of the thesis, a novel computational tool, VarSCAT, is described, which has various functions for annotating the sequence context of variants, including ambiguous positions, STRs, and other sequence context features.
Analysis of several high-confidence human variant sets with VarSCAT reveals that a large number of genomic variants, especially indels, have sequence features associated with STRs. In the human genome, not all genes and their transcripts are translated into proteins. Long non-coding ribonucleic acid (lncRNA) is a typical example. Sequence recognition built with machine learning models has improved significantly in recent years. In the last part of the thesis, several machine learning-based lncRNA prediction tools were evaluated on their predictions of the coding potential of transcripts. The results suggest that tools based on deep learning identify lncRNAs best.
Computational analysis of human genomic variants and lncRNAs from sequence data. High-throughput sequencing technologies have been developed and applied to human genome research for nearly 20 years. These technologies have enabled wide-ranging study of the human genome and have significantly increased our knowledge of it. In this dissertation, computational methods that use sequence data to study human genomic variants and transcripts were evaluated and developed. Indel is a collective term for insertion variants and deletion variants, which occur throughout the genome. Identifying indels is crucial for diagnosing genetic abnormalities and for finding new indel markers associated with various diseases. Compared with earlier technologies, high-throughput sequencing technologies, especially next-generation sequencing (NGS), make it possible to detect indels more accurately and efficiently across wider genomic regions. In the first part of the dissertation, computational tools for indel calling were evaluated using a wide selection of indels and different NGS settings. The results showed that the choice of tools and the NGS settings affected indel detection significantly, and they can therefore guide tool selection and development. In bioinformatics analysis, the position of the same indel can be marked at different locations on the reference genome, which can cause problems for downstream analysis, such as the evaluation of indel calls. This problem is related to sequence context, because variants can be located in short tandem repeats (STRs), where the same short nucleotide stretch has been amplified. In the second part of the dissertation, the computational tool VarSCAT was developed, which has various functions, including the examination of ambiguous position information, flanking regions, and STR regions. Analysis of high-confidence human variant sets with VarSCAT revealed that the features of many genetic variants, especially indels, are associated with STR regions. Not all human genes and their gene products, such as non-coding RNAs (lncRNAs), are translated into protein. Machine learning methods and sequence recognition have improved considerably in recent years. In the last part of the dissertation, the predictions of several machine learning-based lncRNA prediction tools were evaluated. The results suggest that tools based on deep learning identify lncRNAs best
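The ambiguity in indel positions described in this abstract can be shown with a short sketch: inside a tandem repeat, deleting the same number of bases at several neighbouring positions yields the same resulting sequence. The function below is a hypothetical illustration and not part of VarSCAT. It assumes a 0-based start coordinate and handles deletions only.

```python
def equivalent_deletion_range(ref, start, length):
    """Return the leftmost and rightmost 0-based start positions at which
    deleting `length` bases gives the same sequence as deleting at `start`.
    """
    left = start
    # Shifting left by one is safe when the base entering the window on the
    # left equals the base leaving it on the right.
    while left > 0 and ref[left - 1] == ref[left + length - 1]:
        left -= 1
    right = start
    # Shifting right by one is safe when the base leaving on the left equals
    # the base entering on the right.
    while right + length < len(ref) and ref[right] == ref[right + length]:
        right += 1
    return left, right
```

For the reference `GCACACACT`, deleting two bases at any start position from 1 to 6 produces `GCACACT`, so a variant caller could report any of those positions for the same event. Reporting the leftmost position is a common normalization convention that removes the ambiguity.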

    A Survey on Region Extractors from Web Documents

    Extracting information from web documents has become a research area in which new proposals appear year after year. This has motivated several researchers to work on surveys that attempt to provide an overall picture of the many existing proposals. Unfortunately, none of these surveys provides a complete picture, because they do not take region extractors into account. These tools act as preprocessors, because they help information extractors focus on the regions of a web document that contain relevant information. With the increasing complexity of web documents, region extractors are becoming a must to extract information from many websites. Beyond information extraction, region extractors have also found their way into information retrieval, focused web crawling, topic distillation, adaptive content delivery, mashups, and metasearch engines. In this paper, we survey the existing proposals regarding region extractors and compare them side by side.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-