
    MEASURING OFFICE COMPLEXITY

    An "office" can be described in terms of at least four different (but related) sets of descriptors: the physical, the social, the organizational, and the work-related. This paper focuses on work-related aspects of offices, and presents two measures of complexity in office work. The first measure, operational complexity, gauges the average difficulty, in terms of the cognitive resources required, to perform a "chunk" of office work. Independent of this, sequential complexity measures the potential number of task sequences which could be used to accomplish a given chunk of work. Sequential complexity increases as does the number of "special cases," "special cases of special cases," etc. for which the chunk of office work need be performed. In other words, it focuses on the complexity of interrelationships between individual office tasks, while operational complexity is concerned with the complexity of the individual tasks themselves. We then combine these measures into a an aggregate measure of overall complexity, combined complexity. The application of these measures is illustrated, using descriptions of order entry processes, for two hypothetical firms, employing job shop and assembly-line technologies, respectively. While these three measures hardly comprise an exhaustive catalogue of complexity in the "office" (or even in office work), we believe they provide a useful basis for both practical application and further theoretical extension.Information Systems Working Papers Serie

    Application of Cosine Similarity in Bioinformatics

    Finding sequences similar to an input query sequence (DNA or protein) in a sequence data set is an important problem in bioinformatics. It gives researchers an intuition of what could be related and of how the search space can be reduced for further tasks. An exact brute-force nearest-neighbor algorithm for this task has complexity O(m * n), where n is the database size and m is the query size. Such an algorithm faces time-complexity issues as the database and query sizes increase. Furthermore, the use of alignment-based similarity measures such as minimum edit distance adds further complexity to the exact algorithm. In this thesis, an alignment-free method based on similarity measures such as cosine similarity and squared Euclidean distance, representing sequences as vectors, was investigated. A cosine-similarity-based locality-sensitive hashing technique was used to reduce the number of pairwise comparisons while finding sequences similar to an input query. We evaluated our algorithm on a protein dataset of 100,000 sequences and found that our cosine-similarity-based algorithm is 28 times faster than the exact algorithm and 13 times faster than the BLASTP [3] algorithm at finding similar sequences with percent identity greater than 90%, while achieving 99.5% accuracy. We also developed a greedy incremental clustering algorithm based on our cosine-similarity nearest-neighbor algorithm for removing redundant sequences from a protein dataset. We compared our clustering algorithm with a popular clustering algorithm, CD-HIT. The clustering results on a protein dataset of 100,000 sequences show that our algorithm generated clusters with accuracy almost equal to that of CD-HIT. We further demonstrate two bioinformatics applications where our cosine-similarity-based algorithm can be used: an analysis of assembly data from various assemblers and the clustering of a protein dataset. Using our algorithm, we successfully compared the quality of assembly data from multiple de novo and genome-guided assemblers.
    Adviser: Jitender Deogun
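
    To convey the core idea, the sketch below builds k-mer count vectors and random-hyperplane LSH signatures; the parameter choices and helper names are illustrative assumptions, not the thesis's implementation:

        import math
        import random
        from collections import Counter

        def kmer_vector(seq, k=3):
            """Represent a sequence as a sparse k-mer count vector."""
            return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

        def cosine(u, v):
            dot = sum(c * v.get(x, 0) for x, c in u.items())
            norm = math.sqrt(sum(c * c for c in u.values())) * \
                   math.sqrt(sum(c * c for c in v.values()))
            return dot / norm

        def signature(vec, hyperplanes):
            """One sign bit per random hyperplane; vectors with high cosine
            similarity agree on most bits, so bucketing by signature prunes
            the pairwise comparisons needed for a nearest-neighbor query."""
            return tuple(int(sum(h.get(x, 0.0) * c for x, c in vec.items()) >= 0)
                         for h in hyperplanes)

        random.seed(1)
        kmers = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]
        planes = [{km: random.gauss(0, 1) for km in kmers} for _ in range(16)]

        q, d = kmer_vector("ACGTACGTGACT"), kmer_vector("ACGTACGTGACA")
        print(cosine(q, d), signature(q, planes) == signature(d, planes))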

    Are we there yet? Reliably estimating the completeness of plant genome sequences

    Genome sequencing is becoming cheaper and faster thanks to the introduction of next-generation sequencing techniques. Dozens of new plant genome sequences have been released in recent years, ranging from small to gigantic repeat-rich or polyploid genomes. Most genome projects have a dual purpose: delivering a contiguous, complete genome assembly and creating a full catalog of correctly predicted genes. Frequently, the completeness of a species' gene catalog is measured using a set of marker genes that are expected to be present. This expectation can be defined along an evolutionary gradient, ranging from highly conserved genes to species-specific genes. Large-scale population resequencing studies have revealed that gene space is fairly variable even between closely related individuals, which limits the definition of the expected gene space, and, consequently, the accuracy of estimates used to assess genome and gene space completeness. We argue that, based on the desired applications of a genome sequencing project, different completeness scores for the genome assembly and/or gene space should be determined. Using examples from several dicot and monocot genomes, we outline some pitfalls and recommendations regarding methods to estimate completeness during different steps of genome assembly and annotation.
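
    At its simplest, a marker-based completeness score of the kind discussed reduces to a recovered-fraction calculation. The sketch below is a toy illustration only (the marker names and the exact-match detection step are hypothetical; real assessments rely on homology searches against conserved gene sets):

        def completeness(predicted_genes, marker_set):
            """Fraction of expected marker genes recovered in the annotation."""
            return len(marker_set & predicted_genes) / len(marker_set)

        markers = {"marker_%03d" % i for i in range(100)}    # expected gene space
        annotation = {"marker_%03d" % i for i in range(90)}  # hypothetical catalog
        print("gene-space completeness: %.0f%%" % (100 * completeness(annotation, markers)))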

    Assembly and Disassembly Planning by using Fuzzy Logic & Genetic Algorithms

    The authors propose a hybrid Fuzzy Logic-Genetic Algorithm (FL-GA) methodology to plan the automatic assembly and disassembly sequence of products. The GA-Fuzzy Logic approach is implemented at two levels. The first level of hybridization consists of the development of a Fuzzy controller for the parameters of an assembly or disassembly planner based on GAs. This controller acts on the mutation probability and crossover rate in order to adapt their values dynamically while the algorithm runs. The second level consists of the identification of the optimal assembly or disassembly sequence by a Fuzzy function, in order to obtain closer control of the technological knowledge of the assembly/disassembly process. Two case studies were analyzed in order to test the efficiency of the Fuzzy-GA methodology.
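
    A minimal sketch of the first hybridization level, assuming triangular membership functions and two illustrative rules (neither the rules nor the parameter ranges are taken from the paper):

        def fuzzy_low(x):  return max(0.0, 1.0 - 2.0 * x)   # triangular memberships on [0, 1]
        def fuzzy_high(x): return max(0.0, 2.0 * x - 1.0)

        def fuzzy_controller(diversity, stagnation):
            """Map normalized population diversity and fitness stagnation to
            (mutation probability, crossover rate) via two fuzzy rules:
              IF diversity LOW OR stagnation HIGH -> raise mutation
              IF diversity HIGH                   -> favor crossover"""
            raise_mut = max(fuzzy_low(diversity), fuzzy_high(stagnation))
            favor_xover = fuzzy_high(diversity)
            p_mut = 0.01 + 0.19 * raise_mut      # defuzzify activation into [0.01, 0.20]
            p_xover = 0.60 + 0.35 * favor_xover  # defuzzify activation into [0.60, 0.95]
            return p_mut, p_xover

        # Each generation, the GA would call this before producing offspring:
        print(fuzzy_controller(diversity=0.2, stagnation=0.8))  # -> higher mutation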

    A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data

    Deep shotgun sequencing and analysis of genomes, transcriptomes, amplified single-cell genomes, and metagenomes has enabled investigation of a wide range of organisms and ecosystems. However, sampling variation in short-read data sets and the high sequencing error rates of modern sequencers present many new computational challenges in data interpretation. These challenges have led to the development of new classes of mapping tools and de novo assemblers, and these algorithms are in turn challenged by the continued improvement in sequencing throughput. Here we describe digital normalization, a single-pass computational algorithm that systematizes coverage in shotgun sequencing data sets, thereby decreasing sampling variation, discarding redundant data, and removing the majority of errors. Digital normalization substantially reduces the size of shotgun data sets and decreases the memory and time requirements for de novo sequence assembly, all without significantly impacting the content of the generated contigs. We apply digital normalization to the assembly of microbial genomic data, amplified single-cell genomic data, and transcriptomic data. Our implementation is freely available for use and modification.
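
    The single-pass loop can be sketched as follows; treat it as a schematic under stated assumptions (a plain dictionary stands in for the compact k-mer counting structure a real implementation would use, and reads shorter than k are not handled):

        from statistics import median

        def normalize(reads, k=20, C=20):
            """Keep a read only if its estimated (median) k-mer coverage is
            still below the target C; only kept reads update the counts."""
            counts, kept = {}, []
            for read in reads:
                kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
                cov = median(counts.get(km, 0) for km in kmers)
                if cov < C:                      # read still adds new coverage
                    kept.append(read)
                    for km in kmers:
                        counts[km] = counts.get(km, 0) + 1
            return kept

    Because sequencing errors create rare k-mers, discarding reads from already-saturated regions removes redundancy while retaining reads that carry genuinely new information.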

    Thermodynamic Depth of Causal States: When Paddling around in Occam's Pool, Shallowness Is a Virtue

    Thermodynamic depth is an appealing but flawed structural complexity measure. It depends on a set of macroscopic states for a system, but neither its original introduction by Lloyd and Pagels nor any follow-up work has considered how to select these states. Depth, therefore, is at root arbitrary. Computational mechanics, an alternative approach to structural complexity, provides a definition for a system's minimal, necessary causal states and a procedure for finding them. We show that the rate of increase in thermodynamic depth, or dive, is the system's reverse-time Shannon entropy rate, and so depth only measures degrees of macroscopic randomness, not structure. To fix this we redefine the depth in terms of the causal-state representation (ε-machines) and show that this representation gives the minimum dive consistent with accurate prediction. Thus, ε-machines are optimally shallow.
    Comment: 11 pages, 9 figures, RevTeX
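
    In symbols (the notation below is a standard computational-mechanics choice, not quoted from the paper), the central claim can be written as:

        % Growth rate of thermodynamic depth D ("dive") equals the
        % reverse-time Shannon entropy rate of the macrostate process:
        \mathrm{dive} \;\equiv\; \frac{dD}{dt}
          \;=\; \hat{h}_\mu
          \;=\; \lim_{L \to \infty} H\!\left[ X_0 \mid X_1, X_2, \ldots, X_L \right]
        % and the \epsilon-machine attains the minimal dive among all
        % representations consistent with accurate prediction.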

    SEED: efficient clustering of next-generation sequences.

    Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem for studying the population sizes of DNA/RNA molecules and for reducing redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads.
    Results: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced-seed method, called block spaced seeds. Its clustering component operates on hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short-read sequences in <4 h with linear time and memory performance. When used as a preprocessing tool on genome/transcriptome assembly data, it reduced the time and memory requirements of the Velvet/Oases assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than those from non-preprocessed data, as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results, with 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms.
    Availability: The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/
    Contact: [email protected]
    Supplementary information: Supplementary data are available at Bioinformatics online.
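
    To convey the flavor of center-based clustering over spaced-seed hash keys, here is a toy sketch; the seed pattern and the greedy loop are illustrative assumptions and omit SEED's block structure and mismatch/overhang handling:

        SEED_PATTERN = "1101011011"  # '1' = position used in the key, '0' = wildcard

        def spaced_key(seq, pattern=SEED_PATTERN):
            """Project a read onto the '1' positions so that reads differing
            only at wildcard positions hash to the same key."""
            return "".join(c for c, p in zip(seq, pattern) if p == "1")

        def greedy_cluster(reads):
            """Assign each read to the first virtual center sharing its key."""
            centers, clusters = {}, {}
            for r in reads:
                center = centers.setdefault(spaced_key(r), r)  # first hit becomes center
                clusters.setdefault(center, []).append(r)
            return clusters

        reads = ["ACGTACGTAC", "ACCTACGTAC", "TTTTTTTTTT"]  # 2nd differs only at a wildcard
        print(greedy_cluster(reads))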