Population Stratification of a Common APOBEC Gene Deletion Polymorphism
The APOBEC3 gene family plays a role in innate cellular immunity, inhibiting retroviral infection, hepatitis B virus propagation, and the retrotransposition of endogenous elements. We present a detailed sequence and population genetic analysis of a 29.5-kb common human deletion polymorphism that removes the APOBEC3B gene. We developed a PCR-based genotyping assay, characterized 1,277 human diversity samples, and found that the frequency of the deletion allele varies significantly among major continental groups (global F_ST = 0.2843). The deletion is rare in Africans and Europeans (frequency of 0.9% and 6%), more common in East Asians and Amerindians (36.9% and 57.7%), and almost fixed in Oceanic populations (92.9%). Despite a worldwide frequency of 22.5%, analysis of data from the International HapMap Project reveals that no single existing tag single nucleotide polymorphism may serve as a surrogate for the deletion variant, emphasizing that without careful analysis its phenotypic impact may be overlooked in association studies. Application of haplotype-based tests for selection revealed potential pitfalls in the direct application of existing methods to the analysis of genomic structural variation. These data emphasize the importance of directly genotyping structural variation in association studies and of accurately resolving variant breakpoints before proceeding with more detailed population-genetic analysis.
Non-alignment comparison of human and high primate genomes
Compositional spectra (CS) analysis based on k-mer scoring of DNA sequences
was employed in this study for dot-plot comparison of human and primate
genomes. The detection of extended conserved synteny regions was based on
continuous fuzzy similarity rather than on chains of discrete anchors (genes or
highly conserved noncoding elements). In addition to the high correspondence
found in the comparisons of whole-genome sequences, a good similarity was also
found after masking gene sequences, indicating that CS analysis manages to
reveal phylogenetic signal in the organization of the noncoding part of the genome
sequences, including repetitive DNA and the genome "dark matter". Obviously,
the possibility to reveal parallel ordering depends on the signal of common
ancestor sequence organization varying locally along the corresponding segments
of the compared genomes. We explored two sources contributing to this signal:
sequence composition (GC content) and sequence organization (abundances of
k-mers in the usual A,T,G,C or purine-pyrimidine alphabets). Whole-genome
comparisons based on GC distribution along the analyzed sequences indeed give
reasonable results, but combining it with k-mer abundances dramatically
improves the ordering quality, indicating that compositional and organizational
heterogeneity comprise complementary sources of information on evolutionary
conserved similarity of genome sequences.
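
The two signal sources named in the abstract above, GC content and k-mer abundances, can be illustrated with a minimal sketch. This is not the authors' compositional spectra method, which relies on continuous fuzzy similarity; it is a simplified exact-match version. The function names (`gc_content`, `kmer_spectrum`, `spectrum_distance`), the choice of k, and the Manhattan distance are assumptions made for illustration only.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of G and C among the A, C, G, T bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def kmer_spectrum(seq, k=3):
    """Normalized abundances of all 4**k k-mers over the A, C, G, T alphabet.

    Illustrative simplification: exact k-mer matches only, and k-mers
    containing other symbols (for example N) are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def spectrum_distance(a, b):
    """Manhattan distance between two spectra (0 means identical)."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

In a dot-plot setting, one would compute such a spectrum for each window along both genomes and mark window pairs whose distance falls below a chosen threshold; the window size and threshold are further assumptions that the abstract does not specify.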
A Fine-grained Data Set and Analysis of Tangling in Bug Fixing Commits
Context: Tangled commits are changes to software that address multiple
concerns at once. For researchers interested in bugs, tangled commits mean that
they actually study not only bugs, but also other concerns irrelevant for the
study of bugs.
Objective: We want to improve our understanding of the prevalence of tangling
and the types of changes that are tangled within bug fixing commits.
Methods: We use a crowd sourcing approach for manual labeling to validate
which changes contribute to bug fixes for each line in bug fixing commits. Each
line is labeled by four participants. If at least three participants agree on
the same label, we have consensus.
Results: We estimate that between 17% and 32% of all changes in bug fixing
commits modify the source code to fix the underlying problem. However, when we
only consider changes to the production code files this ratio increases to 66%
to 87%. We find that about 11% of lines are hard to label, leading to active
disagreements between participants. Due to confirmed tangling and the
uncertainty in our data, we estimate that 3% to 47% of data is noisy without
manual untangling, depending on the use case.
Conclusion: Tangled commits have a high prevalence in bug fixes and can lead
to a large amount of noise in the data. Prior research indicates that this
noise may alter results. As researchers, we should be skeptics and assume that
unvalidated data is likely very noisy, until proven otherwise.
Comment: Status: Accepted at Empirical Software Engineering
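
The consensus rule described in the Methods above (four labels per line, consensus when at least three agree) can be sketched as follows. The function name `consensus_label` and the example label strings are hypothetical and are not taken from the study's tooling.

```python
from collections import Counter


def consensus_label(labels, min_agree=3):
    """Return the label shared by at least `min_agree` participants, else None.

    `labels` holds one label per participant for a single changed line.
    With four participants and min_agree=3, at most one label can qualify.
    """
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agree else None
```

A line labeled `["bugfix", "bugfix", "bugfix", "test"]` reaches consensus on `"bugfix"`, while a two-versus-two split returns `None` and would count as a line without consensus.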
Erratum: Corrigendum: Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution
International Chicken Genome Sequencing Consortium.
The Original Article was published on 09 December 2004.
Nature 432, 695–716 (2004).
In Table 5 of this Article, the last four values listed in the “Copy number” column were incorrect. These should be: LTR elements, 30,000; DNA transposons, 20,000; simple repeats, 140,000; and satellites, 4,000. These errors do not affect any of the conclusions in our paper.
Additional information.
The online version of the original article can be found at 10.1038/nature0315
Understanding the knowledge gaps of software engineers: An empirical analysis based on SWEBOK
Context: Knowledge level and productivity of the software engineering (SE) workforce are the subject of regular discussions among practitioners, educators, and researchers. There have been many efforts to measure and narrow the knowledge gap between SE education and industrial needs. Objective: Although the existing efforts for aligning SE education and industrial needs have provided valuable insights, there is a need for analyzing the SE topics in a more “fine-grained” manner; i.e., knowing that SE university graduates should know more about requirements engineering is important, but it is more valuable to know the exact topics of requirements engineering that are most important in the industry. Method: We achieve the above objective by assessing the knowledge gaps of software engineers through an opinion survey on levels of knowledge learned in universities versus skills needed in industry. We designed the survey by using the SE knowledge areas (KAs) from the latest version of the Software Engineering Body of Knowledge (SWEBOK v3), which classifies the SE knowledge into 12 KAs, which are themselves broken down into 67 subareas (sub-KAs) in total. Our analysis is based on (opinion) data gathered from 129 practitioners, who are mostly based in Turkey. Results: Based on our findings, we recommend that educators include more material on software maintenance, software configuration management, and testing in their SE curriculum. Based on the literature as well as the current trends in industry, we provide actionable suggestions to improve the SE curriculum and reduce the knowledge gap.
Recent Segmental Duplications in the Working Draft Assembly of the Brown Norway Rat
We assessed the content, structure, and distribution of segmental duplications (≥90% sequence identity, ≥5 kb length) within the published version of the Rattus norvegicus genome assembly (v.3.1). The overall fraction of duplicated sequence within the rat assembly (2.92%) is greater than that of the mouse (1%–1.2%) but significantly less than that of human (~5%). Duplications were nonuniformly distributed, occurring predominantly as tandem and tightly clustered intrachromosomal duplications. Regions containing extensive interchromosomal duplications were observed, particularly within subtelomeric and pericentromeric regions. We identified 41 discrete genomic regions greater than 1 Mb in size, termed “duplication blocks.” These appear to have been the target of extensive duplication over millions of years of evolution. Gene content within duplicated regions (~1%) was lower than expected based on the genome representation. Interestingly, sequence contigs lacking chromosome assignment (“the unplaced chromosome”) showed a marked enrichment for segmental duplication (45% of 75.2 Mb), indicating that segmental duplications have been problematic for sequence and assembly of the rat genome. Further targeted efforts are required to resolve the organization and complexity of these regions.
Closing the Gap Between Software Engineering Education and Industrial Needs
According to different reports, many recent software engineering graduates often face difficulties when beginning their professional careers, due to misalignment of the skills learnt in their university education with what is needed in industry. To address that need, many studies have been conducted to align software engineering education with industry needs. To synthesize that body of knowledge, we present in this paper a systematic literature review (SLR) which summarizes the findings of 33 studies in this area. By doing a meta-analysis of all those studies and using data from 12 countries and over 4,000 data points, this study will enable educators and hiring managers to adapt their education / hiring efforts to best prepare the software engineering workforce.
Using Continuous Integration and Automated Test Techniques for a Robust C4ISR System
We have used CI (Continuous Integration) and various software testing techniques to achieve a robust C4ISR (Command, Control, Communications, Computers, Intelligence, Surveillance, and Reconnaissance) multi-platform system. Because of rapid changes in the C4ISR domain and in software technology, frequent critical design adjustments, and in turn vast code modifications or additions, become inevitable. Defect fixes might also incur code changes. These unavoidable code modifications may pose a significant risk to the reliability of a mission-critical system. Also, in order to stay competitive in the C4ISR market, a company must make recurring releases without sacrificing quality. We have designed and implemented an XML-driven automated test framework that enabled us to develop numerous high-quality tests rapidly. By using CI with automated software test techniques, we have aimed at speeding up the delivery of high-quality and robust software by shortening the integration procedure, which is one of the main bottlenecks in the industry. This work describes how we have used CI and software test techniques in a large-scale, multi-platform, multi-language, distributed C4ISR project and what the benefits of such a system are.