MEASURING OFFICE COMPLEXITY
An "office" can be described in terms of at least four different (but related) sets of descriptors:
the physical, the social, the organizational, and the work-related. This paper focuses on work-related
aspects of offices, and presents two measures of complexity in office work. The first measure,
operational complexity, gauges the average difficulty, in terms of the cognitive resources required, to
perform a "chunk" of office work. Independent of this, sequential complexity measures the potential
number of task sequences which could be used to accomplish a given chunk of work. Sequential
complexity increases as does the number of "special cases," "special cases of special cases," etc. for
which the chunk of office work need be performed. In other words, it focuses on the complexity of
interrelationships between individual office tasks, while operational complexity is concerned with the
complexity of the individual tasks themselves. We then combine these measures into an aggregate
measure of overall complexity, combined complexity. The application of these measures is
illustrated, using descriptions of order entry processes, for two hypothetical firms, employing job shop
and assembly-line technologies, respectively. While these three measures hardly comprise an
exhaustive catalogue of complexity in the "office" (or even in office work), we believe they provide a
useful basis for both practical application and further theoretical extension.
Information Systems Working Papers Series
Application of Cosine Similarity in Bioinformatics
Finding sequences similar to an input query sequence (DNA or protein) in a sequence data set is an important problem in bioinformatics. It gives researchers an intuition of what could be related or how the search space can be reduced for further tasks. An exact brute-force nearest-neighbor algorithm used for this task has complexity O(m * n), where n is the database size and m is the query size. Such an algorithm faces time-complexity issues as the database and query sizes increase. Furthermore, the use of alignment-based similarity measures such as minimum edit distance adds further complexity to the exact algorithm. In this thesis, an alignment-free method was investigated that represents sequences as vectors and uses similarity measures such as cosine similarity and squared Euclidean distance. A cosine-similarity based locality-sensitive hashing technique was used to reduce the number of pairwise comparisons when finding sequences similar to an input query. We evaluated our algorithm on a protein dataset of 100,000 sequences and found that our cosine-similarity based algorithm is 28 times faster than the exact algorithm and 13 times faster than the BLASTP[3] algorithm for finding similar sequences with percent identity greater than 90%. It also has 99.5% accuracy. We also developed a greedy incremental clustering algorithm based on our cosine-similarity nearest-neighbor algorithm for removing redundant sequences in a protein dataset. We compared our clustering algorithm with the popular clustering algorithm CD-HIT. The clustering results on a protein dataset of 100,000 sequences show that our clustering algorithm generated clusters with accuracy almost equal to that of CD-HIT. We further demonstrated two bioinformatics applications where our cosine-similarity based algorithm can be used: an analysis of assembly data of various assemblers and a clustering of a protein dataset.
Using our algorithm, we successfully compared the quality of assembly data of multiple de novo and genome-guided assemblers.
Adviser: Jitender Deogu
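The alignment-free idea in the abstract above (sequences as vectors, cosine similarity, locality-sensitive hashing) can be illustrated with a minimal sketch. It is not the thesis's implementation: the k-mer count representation, the DNA default alphabet, the random-hyperplane hashing scheme, and all function names are assumptions chosen for illustration.

```python
import math
import random
from collections import Counter
from itertools import product


def kmer_vector(seq, k=2, alphabet="ACGT"):
    """Count vector over all k-mers of the alphabet, in a fixed order.

    Assumption: k-mer counts are used as the vector representation;
    pass a 20-letter alphabet for protein sequences.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def make_hyperplanes(dim, n_bits, seed=0):
    """Random Gaussian hyperplanes; one signature bit per hyperplane."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vec, hyperplanes):
    """Bit tuple recording on which side of each hyperplane the vector lies.

    Vectors separated by a small angle tend to share signatures, so only
    sequences in the same bucket need an exact cosine comparison.
    """
    return tuple(
        int(sum(h * x for h, x in zip(plane, vec)) >= 0)
        for plane in hyperplanes
    )
```

In use, every database sequence would be hashed once into a bucket keyed by its signature, and a query would be compared exactly only against the sequences in its own bucket. The number of signature bits trades recall against the number of candidates that remain.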
Are we there yet? : reliably estimating the completeness of plant genome sequences
Genome sequencing is becoming cheaper and faster thanks to the introduction of next-generation sequencing techniques. Dozens of new plant genome sequences have been released in recent years, ranging from small to gigantic repeat-rich or polyploid genomes. Most genome projects have a dual purpose: delivering a contiguous, complete genome assembly and creating a full catalog of correctly predicted genes. Frequently, the completeness of a species' gene catalog is measured using a set of marker genes that are expected to be present. This expectation can be defined along an evolutionary gradient, ranging from highly conserved genes to species-specific genes. Large-scale population resequencing studies have revealed that gene space is fairly variable even between closely related individuals, which limits the definition of the expected gene space, and, consequently, the accuracy of estimates used to assess genome and gene space completeness. We argue that, based on the desired applications of a genome sequencing project, different completeness scores for the genome assembly and/or gene space should be determined. Using examples from several dicot and monocot genomes, we outline some pitfalls and recommendations regarding methods to estimate completeness during different steps of genome assembly and annotation.
Assembly and Disassembly Planning by using Fuzzy Logic & Genetic Algorithms
The authors propose the implementation of a hybrid Fuzzy Logic-Genetic
Algorithm (FL-GA) methodology to plan the automatic assembly and disassembly
sequence of products. The GA-Fuzzy Logic approach is implemented on two
levels. The first level of hybridization consists of the development of a Fuzzy
controller for the parameters of an assembly or disassembly planner based on
GAs. This controller acts on mutation probability and crossover rate in order
to adapt their values dynamically while the algorithm runs. The second level
consists of the identification of the optimal assembly or disassembly sequence
by a Fuzzy function, in order to obtain a closer control of the technological
knowledge of the assembly/disassembly process. Two case studies were analyzed
in order to test the efficiency of the Fuzzy-GA methodologies.
A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data
Deep shotgun sequencing and analysis of genomes, transcriptomes, amplified
single-cell genomes, and metagenomes has enabled investigation of a wide range
of organisms and ecosystems. However, sampling variation in short-read data
sets and high sequencing error rates of modern sequencers present many new
computational challenges in data interpretation. These challenges have led to
the development of new classes of mapping tools and de novo assemblers.
These algorithms are challenged by the continued improvement in sequencing
throughput. We here describe digital normalization, a single-pass computational
algorithm that systematizes coverage in shotgun sequencing data sets, thereby
decreasing sampling variation, discarding redundant data, and removing the
majority of errors. Digital normalization substantially reduces the size of
shotgun data sets and decreases the memory and time requirements for de
novo sequence assembly, all without significantly impacting content of the
generated contigs. We apply digital normalization to the assembly of microbial
genomic data, amplified single-cell genomic data, and transcriptomic data. Our
implementation is freely available for use and modification.
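The single-pass coverage filter described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes a read is kept only while the median abundance of its k-mers, among reads kept so far, is below a cutoff, and it uses an exact in-memory counter where a production tool would likely use a memory-bounded approximate counter. The function names and the default values of `k` and `cutoff` are hypothetical.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """All overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalize(reads, k=4, cutoff=2):
    """Single pass over reads; keep a read only if its estimated coverage is low.

    Coverage is estimated as the median count of the read's k-mers among
    the reads kept so far (assumption: exact counts in a Counter).
    """
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k, nothing to estimate
        if median(counts[x] for x in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads contribute to counts
    return kept
```

For example, five copies of the read `ACGTACGT` followed by one `TTTTGGGG` reduce to two copies of the first read plus the second read with the defaults above. Redundant high-coverage reads are discarded while the low-coverage read survives.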
Thermodynamic Depth of Causal States: When Paddling around in Occam's Pool Shallowness Is a Virtue
Thermodynamic depth is an appealing but flawed structural complexity measure.
It depends on a set of macroscopic states for a system, but neither its
original introduction by Lloyd and Pagels nor any follow-up work has considered
how to select these states. Depth, therefore, is at root arbitrary.
Computational mechanics, an alternative approach to structural complexity,
provides a definition for a system's minimal, necessary causal states and a
procedure for finding them. We show that the rate of increase in thermodynamic
depth, or dive, is the system's reverse-time Shannon entropy rate, and so
depth only measures degrees of macroscopic randomness, not structure. To fix
this we redefine the depth in terms of the causal state
representation (ε-machines) and show that this representation gives
the minimum dive consistent with accurate prediction. Thus, ε-machines
are optimally shallow.
Comment: 11 pages, 9 figures, RevTe
SEED: efficient clustering of next-generation sequences.
Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads. Results: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with linear time and memory performance. When using SEED as a preprocessing tool on genome/transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oasis assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than non-preprocessed data, as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results, with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms. Availability: The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/[email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
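The N50 values cited above summarize contig length distributions. As a minimal sketch, assuming the common definition (the largest length L such that contigs of length at least L together cover at least half of the total assembly length), N50 can be computed as below. The function name is hypothetical and the abstract does not state which exact variant of the statistic was used.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Walk the contigs from longest to shortest and return the length at
    which the running total first reaches half of the assembly size.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 is undefined for an empty list of contigs")
```

For contig lengths 2, 3, 4, 5 and 6 (total 20), the two longest contigs already cover 11 bases, so the N50 is 5.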