Assessing the Quality of Software Development Tutorials Available on the Web
Both expert and novice software developers frequently access software development resources available on the Web in order to look up or learn new APIs, tools, and techniques. Software quality is affected negatively when developers fail to find high-quality information relevant to their problem. While a substantial amount of freely available resources can be accessed online, some of these resources contain information that suffers from error proneness, copyright infringement, security concerns, or incompatible versions. Use of such toxic information can have a strong negative effect on developers' efficacy. This dissertation focuses specifically on software tutorials, aiming to automatically evaluate the quality of such documents available on the Web. To achieve this goal, we present two contributions: 1) scalable detection of duplicated code snippets; 2) automatic identification of valid version ranges.
Software tutorials consist of a combination of source code snippets and natural language text. The code snippets in a tutorial can originate from different sources, perhaps carrying stringent licensing requirements or known security vulnerabilities. Developers, typically unaware of this, can reuse these code snippets in their projects. First, in this thesis, we present our work on a Web-scale code clone search technique that detects duplicate code snippets between large-scale document and source code corpora in order to trace toxic code snippets.
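The clone-search idea above can be illustrated with a minimal token-shingling sketch (the function names and parameters here are illustrative, not the thesis's actual technique): snippets are tokenized, consecutive token windows are hashed into "shingles", and similarity between snippets is the Jaccard overlap of their shingle sets.

```python
import hashlib
import re

def shingles(code: str, k: int = 5) -> set:
    """Tokenize a snippet and hash each k-token window (a 'shingle')."""
    tokens = re.findall(r"[A-Za-z_]\w*|\S", code)
    return {
        hashlib.md5(" ".join(tokens[i:i + k]).encode()).hexdigest()
        for i in range(max(1, len(tokens) - k + 1))
    }

def jaccard(a: set, b: set) -> float:
    """Similarity of two shingle sets; 1.0 means identical token streams."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Two snippets differing only in a variable name share most shingles.
snippet1 = "for (int i = 0; i < n; i++) { sum += a[i]; }"
snippet2 = "for (int j = 0; j < n; j++) { sum += a[j]; }"
print(jaccard(shingles(snippet1), shingles(snippet2)))
```

Scaling this to Web-scale corpora requires indexing the shingles rather than comparing all pairs, which is where techniques such as inverted indexes or locality-sensitive hashing come in.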
As software libraries and APIs evolve over time, existing software development tutorials can become outdated. It is difficult for software developers, and especially for novices, to determine the expected version of the software implicit in a specific tutorial in order to decide whether the tutorial is applicable to their software development environment. To overcome this challenge, in this thesis we present a novel technique for automatic identification of the valid version range of software development tutorials on the Web.
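One way to frame version-range identification is as an intersection of API lifespans: every API symbol a tutorial uses constrains the versions the tutorial can target. The sketch below assumes a hypothetical `API_LIFESPAN` table with simplified integer versions; a real system would mine this information from the library's release history.

```python
# Hypothetical API index: symbol -> (first version present, first version removed).
# A real system would mine these lifespans from the library's release history.
API_LIFESPAN = {
    "requests.get": (1, None),       # available since v1, never removed
    "requests.async.map": (1, 3),    # removed in v3
    "requests.Session": (2, None),   # introduced in v2
}

def valid_version_range(symbols):
    """Intersect the lifespans of every API symbol a tutorial uses."""
    lo, hi = 1, None
    for sym in symbols:
        first, removed = API_LIFESPAN[sym]
        lo = max(lo, first)
        if removed is not None:
            hi = removed - 1 if hi is None else min(hi, removed - 1)
    return (lo, hi)  # hi = None means still valid in the latest release

print(valid_version_range(["requests.get", "requests.Session"]))    # (2, None)
print(valid_version_range(["requests.get", "requests.async.map"]))  # (1, 2)
```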
A fast and scalable binary similarity method for open source libraries
Usage of third-party open source software has become increasingly popular in recent years, due to the need for faster development cycles and the availability of good-quality libraries. Those libraries are integrated as dependencies, often in the form of binary artifacts. This is especially common in embedded software applications. Dependencies, however, can proliferate and also add new attack surfaces to an application due to vulnerabilities in the library code. Hence the need arises for binary similarity analysis methods to detect libraries compiled into applications.
Binary similarity detection methods are related to text similarity methods and build upon the research in that area. In this research we focus on fuzzy matching methods, which have been used widely and successfully in text similarity analysis. In particular, we propose using locality-sensitive hashing schemes in combination with normalized binary code features. The normalization allows us to apply the similarity comparison across binaries produced by different compilers, using different optimization flags, and built for various machine architectures.
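A minimal sketch of this idea, assuming a toy instruction normalizer and a MinHash-style LSH scheme (the feature extraction here is far simpler than what the thesis describes, and all names are invented for illustration): registers and literals are replaced with placeholders, each code block's feature set is summarized by a MinHash signature, and signatures are cut into bands so that blocks sharing any band become candidate matches.

```python
import random

random.seed(0)
N_HASHES, BANDS = 32, 8  # 8 bands of 4 rows each

# One random hash function per MinHash row (universal hashing over a prime field).
P = (1 << 61) - 1
COEFFS = [(random.randrange(1, P), random.randrange(P)) for _ in range(N_HASHES)]

def normalize(instructions):
    """Replace registers and literals with placeholders so the same code
    compiled with different flags yields the same feature set."""
    out = set()
    for ins in instructions:
        for reg in ("eax", "ebx", "ecx", "rdi", "rsi"):
            ins = ins.replace(reg, "REG")
        out.add(" ".join("IMM" if tok.lstrip("-").isdigit() else tok
                         for tok in ins.split()))
    return out

def minhash(features):
    """MinHash signature: the minimum of each row's hash over all features."""
    hs = [hash(f) & 0xFFFFFFFF for f in features]
    return [min((a * h + b) % P for h in hs) for a, b in COEFFS]

def lsh_bands(signature):
    """Split the signature into bands; two blocks are candidates if any band matches."""
    rows = N_HASHES // BANDS
    return [tuple(signature[i:i + rows]) for i in range(0, N_HASHES, rows)]

blk1 = ["mov eax 5", "add eax ebx", "ret"]
blk2 = ["mov ecx 7", "add ecx rdi", "ret"]  # same logic, different registers/constants
sig1, sig2 = minhash(normalize(blk1)), minhash(normalize(blk2))
print(set(lsh_bands(sig1)) & set(lsh_bands(sig2)))  # shared bands -> candidate match
```

Because the two blocks normalize to identical feature sets, their signatures collide in every band; blocks that are only partially similar would collide in some bands with a probability that grows with their Jaccard similarity.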
To improve the matching precision, we use weighted code features. Machine learning is used to optimize the feature weights to create clusters of semantically similar code blocks extracted from different binaries. The machine learning is performed in an offline process to increase scalability and performance of the matching system.
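One simple way to realize weighted feature matching is a weighted Jaccard similarity, where discriminative features contribute more than ubiquitous ones. The weights below are invented placeholders standing in for the learned ones:

```python
def weighted_jaccard(a, b, weights):
    """Jaccard similarity where each feature contributes its learned weight;
    unknown features default to weight 1.0."""
    inter = sum(weights.get(f, 1.0) for f in a & b)
    union = sum(weights.get(f, 1.0) for f in a | b)
    return inter / union if union else 0.0

# Invented weights: rare, discriminative features get a higher weight than
# boilerplate ones such as a bare return.
weights = {"call memcpy": 3.0, "ret": 0.1}
a = {"call memcpy", "ret", "add REG REG"}
b = {"call memcpy", "ret"}
print(weighted_jaccard(a, b, weights))
```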
Using the above methods, we build a database of binary similarity code signatures for open source libraries. The database is used to match, by similarity, code blocks from an application to known libraries. One of the goals of our system is to facilitate a fast and scalable similarity matching process. This allows integrating the system into continuous software development, testing, and integration pipelines.
The evaluation shows that our results are comparable to those of other systems proposed in related research in terms of precision, while maintaining the performance required in continuous integration systems.
Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions
Nanopore sequencing technology has the potential to render other sequencing
technologies obsolete with its ability to generate long reads and provide
portability. However, high error rates of the technology pose a challenge while
generating accurate genome assemblies. The tools used for nanopore sequence
analysis are of critical importance as they should overcome the high error
rates of the technology. Our goal in this work is to comprehensively analyze
current publicly available tools for nanopore sequence analysis to understand
their advantages, disadvantages, and performance bottlenecks. It is important
to understand where the current tools do not perform well to develop better
tools. To this end, we 1) analyze the multiple steps and the associated tools
in the genome assembly pipeline using nanopore sequence data, and 2) provide
guidelines for determining the appropriate tools for each step. We analyze
various combinations of different tools and expose the tradeoffs between
accuracy, performance, memory usage and scalability. We conclude that our
observations can guide researchers and practitioners in making conscious and
effective choices for each step of the genome assembly pipeline using nanopore
sequence data. Also, with the help of bottlenecks we have found, developers can
improve the current tools or build new ones that are both accurate and fast, in
order to overcome the high error rates of the nanopore sequencing technology.Comment: To appear in Briefings in Bioinformatics (BIB), 201
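The assembly pipeline the analysis walks through can be summarized as a sequence of stages, each with competing tool choices. The tool names below are examples drawn from the nanopore literature and are illustrative; they are not necessarily the paper's final recommendations:

```python
# Canonical nanopore assembly pipeline stages with example tools from the
# literature (illustrative choices, not the paper's recommendations).
PIPELINE = [
    ("basecalling",      ["Metrichor", "Nanonet", "Scrappie"]),
    ("read overlap",     ["GraphMap", "minimap"]),
    ("layout/assembly",  ["Canu", "miniasm"]),
    ("error correction", ["Nanocorrect", "Racon"]),
    ("polishing",        ["Nanopolish"]),
]

def describe(pipeline):
    """Render each stage with its candidate tools for a quick overview."""
    return [f"{stage}: {' | '.join(tools)}" for stage, tools in pipeline]

for line in describe(PIPELINE):
    print(line)
```

The paper's point is that tool choice at each stage trades off accuracy, speed, memory usage, and scalability, so a combination must be evaluated end to end rather than stage by stage.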
Technology dictates algorithms: Recent developments in read alignment
Massively parallel sequencing techniques have revolutionized biological and
medical sciences by providing unprecedented insight into the genomes of humans,
animals, and microbes. Modern sequencing platforms generate enormous amounts of
genomic data in the form of nucleotide sequences or reads. Aligning reads onto
reference genomes enables the identification of individual-specific genetic
variants and is an essential step of the majority of genomic analysis
pipelines. Aligned reads are essential for answering important biological
questions, such as detecting mutations driving various human diseases and
complex traits as well as identifying species present in metagenomic samples.
The read alignment problem is extremely challenging due to the large size of
analyzed datasets and numerous technological limitations of sequencing
platforms, and researchers have developed novel bioinformatics algorithms to
tackle these difficulties. Importantly, computational algorithms have evolved
and diversified in accordance with technological advances, leading to today's
diverse array of bioinformatics tools. Our review provides a survey of
algorithmic foundations and methodologies across 107 alignment methods
published between 1988 and 2020, for both short and long reads. We provide
rigorous experimental evaluation of 11 read aligners to demonstrate the effect
of these underlying algorithms on speed and efficiency of read aligners. We
separately discuss how longer read lengths produce unique advantages and
limitations to read alignment techniques. We also discuss how general alignment
algorithms have been tailored to the specific needs of various domains in
biology, including whole transcriptome, adaptive immune repertoire, and human
microbiome studies.
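The seed-and-extend strategy underlying most of the surveyed read aligners can be sketched minimally: build a hash-table index of reference k-mers (seeds), anchor the read at each seed hit, and verify the anchored placement against the reference. Real aligners use far more sophisticated indexes (suffix arrays, FM-indexes, minimizers) and gapped extension with dynamic programming; this sketch allows mismatches only.

```python
from collections import defaultdict

def build_index(ref, k=4):
    """Hash-table seed index: k-mer -> list of reference positions."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index

def align(read, ref, index, k=4, max_mismatches=1):
    """Seed-and-extend: anchor the read at each seed hit, then verify the
    whole read against the reference, allowing a few mismatches."""
    for i in range(len(read) - k + 1):
        for pos in index.get(read[i:i + k], []):
            start = pos - i  # where the read would begin on the reference
            if start < 0 or start + len(read) > len(ref):
                continue
            window = ref[start:start + len(read)]
            if sum(a != b for a, b in zip(read, window)) <= max_mismatches:
                return start
    return None  # unmapped

ref = "ACGTACGTTAGCCGATTACA"
idx = build_index(ref)
print(align("TAGCCGAT", ref, idx))  # exact hit -> 8
print(align("TAGCCGAA", ref, idx))  # one mismatch, still anchored -> 8
```

Longer reads change the trade-offs discussed in the survey: more seeds per read make anchoring easier, but the higher error rates of long-read technologies force verification to tolerate many more differences than this mismatch count does.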
Rethinking Software Network Data Planes in the Era of Microservices
The abstract is in the attachment.