14 research outputs found

    Dealing with clones in software : a practical approach from detection towards management

    Although duplicated fragments of code, also called code clones, are considered one of the prominent code smells in software, cloning is widely practiced in industrial development. The larger the system, the more people involved in its development, and the more parts developed by different teams, the greater the possibility of cloned code in the system. While code cloning has particular benefits in software development, research shows that it can be a source of various troubles in evolving software. Therefore, investigating and understanding the clones in a software system is important for managing them efficiently. However, when the system is fairly large, it is challenging to identify and manage those clones properly. Among the various types of clones that may exist in software, research shows that detecting near-miss clones, where there may be minor to significant differences among the cloned fragments (e.g., renaming of identifiers and additions, deletions, or modifications of statements), is costly in terms of time and memory. Thus, there is great demand for state-of-the-art technologies for dealing with clones in software. Over the years, several tools have been developed to detect and visualize exact and similar clones. However, these tools are usually standalone and do not integrate well with a software developer's workflow. In this thesis, first, a study is presented on the effectiveness of a fingerprint-based data similarity measurement technique named 'simhash' in detecting clones in large-scale code bases. Based on the positive outcome of the study, a time-efficient detection approach is proposed to find exact and near-miss clones in software, especially in large-scale software systems. The novel detection approach has been made available as a highly configurable, fully fledged standalone clone detection tool named 'SimCad', which can be configured to detect clones in both source code and non-source-code data. Second, we show a robust use of the clone detection approach studied earlier by packaging its detection service as a portable library named 'SimLib'. This library can provide tightly coupled (integrated) clone detection functionality to other applications, as opposed to the loosely coupled service provided by a typical standalone tool. Because it is highly configurable and easily extensible, this library allows the user to customize its clone detection process for data with diverse characteristics. We performed a user study to gather feedback on the installation and use of the 'SimLib' API (Application Programming Interface) and to uncover its potential use as a third-party clone detection library. Third, we investigated what tools and techniques are currently in use to detect and manage clones and to understand their evolution. The goal was to find how those tools and techniques can be made available within a developer's own software development platform for convenient identification, tracking, and management of clones in the software. Based on that, we developed a clone-aware software development platform named 'SimEclipse' to promote the practical use of code clone research and to provide better support for clone management in software. Finally, we evaluated 'SimEclipse' by conducting a user study on its effectiveness, usability, and information management. We believe that both researchers and developers would enjoy and benefit from using these tools in different aspects of code clone research and in managing cloned code in software systems.
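
The fingerprint idea behind 'simhash' mentioned in this abstract can be illustrated with a minimal sketch. This is not the SimCad or SimLib implementation: the whitespace-insensitive word tokenizer, the use of MD5 as the per-token hash, the 64-bit width, and the function names `simhash` and `hamming_distance` are all assumptions made for illustration. Each token votes on every bit of the fingerprint, so fragments that share most tokens tend to produce fingerprints that differ in few bits.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a `bits`-wide simhash fingerprint of `text` (bits <= 128)."""
    weights = [0] * bits
    # Tokenizing on word characters is an illustrative choice only.
    for token in re.findall(r"\w+", text):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest, "big")  # 128-bit integer
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:  # keep the bits with a positive vote total
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical code fragments: the second renames one identifier.
original = "int total = 0; for (int i = 0; i < n; i++) total += values[i];"
renamed = "int sum = 0; for (int i = 0; i < n; i++) sum += values[i];"
print(hamming_distance(simhash(original), simhash(renamed)))
```

A detector built on this idea would treat two fragments as clone candidates when the distance falls below a chosen threshold; the threshold itself is a tuning parameter that the abstract does not specify.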

    Exploring Hybrid Parallel Systems for Probabilistic Record Linkage

    Record linkage is a technique widely used to gather data stored in disparate data sources that presumably pertain to the same real-world entity. This integration can be done deterministically or probabilistically, depending on the existence of common key attributes among all data sources involved. The probabilistic approach is very time-consuming due to the number of records that must be compared, specifically in big data scenarios. In this paper, we propose and evaluate a methodology that simultaneously exploits multicore and multi-GPU architectures in order to perform the probabilistic linkage of large-scale Brazilian governmental databases. We present some algorithmic optimizations that provide high accuracy and improve performance by defining the best algorithm-architecture combination for a problem given its input size. We also discuss performance results obtained with different data samples, showing that a hybrid approach outperforms other configurations, providing an average speedup of 7.9 when linking up to 20.000 million records.
    This work has been partially supported by CNPq, FAPESB, Bill & Melinda Gates Foundation, The Royal Society (UK), Medical Research Council (UK), NVIDIA Hardware Grant Program, Generalitat Valenciana (Grant PROMETEOII/2014/003), Spanish Government and European Commission through TEC2015-67387-C4-1-R (MINECO/FEDER), and network CAPAP-H. We have also worked in cooperation with the EU-COST Programme Action IC1305, "Network for Sustainable Ultrascale Computing (NESUS)".
    Boratto, M.; Alonso-Jordá, P.; Pinto, C.; Melo, P.; Barreto, M.; Denaxas, S. (2019). Exploring Hybrid Parallel Systems for Probabilistic Record Linkage. The Journal of Supercomputing. 75:1137-1149. https://doi.org/10.1007/s11227-018-2328-3

    A Crowdsourcing Approach to Promote Safe Walking for Visually Impaired People

    Visually impaired people have difficulty walking freely because of obstacles or stairways along their walking paths, which can lead to accidental falls. Many researchers have devoted themselves to promoting safe walking for visually impaired people by using smartphones and computer vision. In this research, we propose an alternative approach to achieve the same goal: we take advantage of the power of crowdsourcing combined with machine learning. Specifically, by using smartphones carried by a large number of sighted people, we can collect tri-axial accelerometer data along with the corresponding GPS coordinates over large geographic areas. Machine learning techniques are then used to analyze the data, turning them into a special topographic map in which the regions of outdoor stairways are marked. With the map installed on the smartphones carried by visually impaired people, the Android app we developed can monitor their current outdoor locations and trigger an acoustic alert when they are getting close to stairways.

    An Indexing Scheme and Descriptor for 3D Object Retrieval Based on Local Shape Querying

    A binary descriptor indexing scheme based on Hamming distance, called the Hamming tree, is presented for local shape queries. A new binary clutter-resistant descriptor named Quick Intersection Count Change Image (QUICCI) is also introduced. This local shape descriptor is extremely small and fast to compare. Additionally, a novel distance function called Weighted Hamming, applicable to QUICCI images, is proposed for retrieval applications. The effectiveness of the indexing scheme and QUICCI is demonstrated on 828 million QUICCI images derived from the SHREC2017 dataset, while the clutter resistance of QUICCI is shown using the clutterbox experiment. Comment: 13 pages, 13 figures, to be published in a Special Issue in Computers & Graphics

    Effectiveness of Similarity Digest Algorithms for Binary Code Similarity in Memory Forensic Analysis

    Nowadays, any organization connected to the Internet is susceptible to cybersecurity incidents and must therefore have an incident response plan. This plan helps to prevent, detect, prioritize, and manage cybersecurity incidents. One of the steps in managing these incidents is the eradication phase, which neutralizes the persistence of attacks, assesses their scope, and identifies the degree of compromise. A key point of this phase is identifying, through triage, the information that is relevant to the incident. This is usually done by comparing the available items against known information, thereby focusing on the items that are relevant to the investigation (called evidence). This goal can be achieved by studying two sources of information: on the one hand, persistent data, such as data on hard disks or USB devices; on the other hand, volatile data, such as data in RAM. Unlike persistent data analysis, volatile data analysis makes it possible to determine the scope of some types of attack that do not store their code on persistent devices, or of cases where the executable files stored on disk are encrypted and their code is only revealed while it is in memory and being executed. There is a limitation in using cryptographic hashes, which are commonly used to identify evidence in persistent data, to identify memory evidence. This limitation arises because the evidence will never be identical, since execution constantly modifies the contents of memory. Moreover, it is impossible to acquire memory more than once with all programs at the same point of execution. Therefore, hashes are an invalid identification method for memory triage. As a solution to this problem, this thesis proposes the use of similarity digest algorithms, which measure the similarity between two inputs approximately. The main contributions of this thesis are threefold. First, a study of the problem domain is carried out, evaluating memory management and how memory is modified during execution. Next, similarity digest algorithms are studied, developing a classification of their phases and of the attacks against these algorithms, and correlating the characteristics of the first classification with the identified attacks. Finally, two methods are proposed for preprocessing the contents of memory dumps to improve the identification of the items of interest for analysis. In conclusion, this thesis shows that the modification of scattered bytes negatively affects similarity computations between memory evidence. This modification is caused mainly by the operating system's memory manager. Furthermore, it shows that the proposed techniques for preprocessing memory dump contents improve the process of identifying evidence in memory.

    Partial 3D Object Retrieval using Local Binary QUICCI Descriptors and Dissimilarity Tree Indexing

    A complete pipeline is presented for accurate and efficient partial 3D object retrieval based on Quick Intersection Count Change Image (QUICCI) binary local descriptors and a novel indexing tree. It is shown how a modification to the QUICCI query descriptor makes it ideal for partial retrieval. An indexing structure called Dissimilarity Tree is proposed, which can significantly accelerate searching the large space of local descriptors; this is applicable to QUICCI and other binary descriptors. The index exploits the distribution of bits within descriptors for efficient retrieval. The retrieval pipeline is tested on the artificial part of the SHREC'16 dataset with near-ideal retrieval results. Comment: 19 pages, 17 figures, to be published in Computers & Graphics

    Symmetry-Adapted Machine Learning for Information Security

    Symmetry-adapted machine learning has shown encouraging ability to mitigate the security risks in information and communication technology (ICT) systems. It is a subset of artificial intelligence (AI) that relies on the principles of processing future events by learning from past events or historical data. The autonomous nature of symmetry-adapted machine learning supports effective data processing and analysis for security detection in ICT systems without the interference of human authorities. Many industries are developing machine-learning-adapted solutions to support security for smart hardware, distributed computing, and the cloud. In our Special Issue book, we focus on the deployment of symmetry-adapted machine learning for information security in various application areas. This security approach can support effective methods to handle the dynamic nature of security attacks by extracting and analyzing data to identify hidden patterns in the data. The main topics of this Issue include malware classification, an intrusion detection system, image watermarking, color image watermarking, battlefield target aggregation behavior recognition model, IP camera, Internet of Things (IoT) security, service function chain, indoor positioning system, and crypto-analysis.

    Accelerating Binary String Comparisons with a Scalable, Streaming-Based System Architecture Based on FPGAs

    Pilz S, Porrmann F, Kaiser M, Hagemeyer J, Hogan JM, Rückert U. Accelerating Binary String Comparisons with a Scalable, Streaming-Based System Architecture Based on FPGAs. Algorithms. 2020;13(2): 47.
    This paper is concerned with Field Programmable Gate Array (FPGA)-based systems for energy-efficient high-throughput string comparison. Modern applications that involve comparisons across large data sets, such as large sequence sets in molecular biology, are by their nature computationally intensive. In this work, we present a scalable FPGA-based system architecture to accelerate the comparison of binary strings. The current architecture supports arbitrary lengths in the range of 16 to 2048 bits, covering a wide range of possible applications. In our example application, we consider DNA sequences embedded in a binary vector space through Locality Sensitive Hashing (LSH), one of several possible encodings that enable us to avoid more costly character-based operations. Here the resulting encoding is a 512-bit binary signature with comparisons based on the Hamming distance. In this approach, most of the load arises from the calculation of the O(m * n) Hamming distances between the signatures, where m is the number of queries and n is the number of signatures contained in the database. Signature generation only needs to be performed once, and we do not consider it further, focusing instead on accelerating the signature comparisons. The proposed FPGA-based architecture is optimized for high throughput using hundreds of computing elements arranged in a systolic array. These core computing elements can be adapted to support other string comparison algorithms with little effort, while the other infrastructure stays the same. On a Xilinx Virtex UltraScale+ FPGA (XCVU9P-2), a peak throughput of 75.4 billion comparisons per second of 512-bit signatures was achieved, using a design with 384 parallel processing elements and a clock frequency of 200 MHz. This makes our FPGA design 86 times faster than a highly optimized CPU implementation. Compared to a GPU design executed on an NVIDIA GTX1060, it performs nearly five times faster.
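
The O(m * n) workload described in this abstract can be sketched in software as follows. This is a plain Python illustration of the computation, not the FPGA design: signatures are modeled as 512-bit integers, and the names `hamming_distance`, `all_pairs_distances`, and `ideal_peak_rate` are hypothetical. The last function also checks the reported figures under the assumption, not stated in the abstract, that each processing element completes one comparison per clock cycle.

```python
import random

SIGNATURE_BITS = 512  # signature width named in the abstract


def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two signatures stored as integers."""
    return bin(a ^ b).count("1")


def all_pairs_distances(queries, database):
    """Compute all m * n distances: one row per query, one column per entry."""
    return [[hamming_distance(q, s) for s in database] for q in queries]


def ideal_peak_rate(elements: int, clock_hz: int) -> int:
    """Upper bound on comparisons per second, assuming one per element per cycle."""
    return elements * clock_hz


rng = random.Random(0)  # fixed seed so the random signatures are reproducible
database = [rng.getrandbits(SIGNATURE_BITS) for _ in range(5)]
queries = [database[2], rng.getrandbits(SIGNATURE_BITS)]

distances = all_pairs_distances(queries, database)
print(distances[0][2])                      # 0: a signature matches itself
print(ideal_peak_rate(384, 200_000_000))    # 76800000000
```

Under that one-comparison-per-cycle assumption, 384 elements at 200 MHz give an upper bound of 76.8 billion comparisons per second, which is consistent with the reported peak of 75.4 billion.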

    Differentially Private One Permutation Hashing and Bin-wise Consistent Weighted Sampling

    Minwise hashing (MinHash) is a standard algorithm widely used in industry for large-scale search and learning applications with the binary (0/1) Jaccard similarity. One common use of MinHash is for processing massive n-gram text representations so that practitioners do not have to materialize the original data (which would be prohibitive). Another popular use of MinHash is for building hash tables to enable sub-linear time approximate near neighbor (ANN) search. MinHash has also been used as a tool for building large-scale machine learning systems. The standard implementation of MinHash requires applying $K$ random permutations. In comparison, the method of one permutation hashing (OPH) is an efficient alternative to MinHash which splits the data vectors into $K$ bins and generates hash values within each bin. OPH is substantially more efficient and also more convenient to use. In this paper, we combine differential privacy (DP) with OPH (as well as MinHash) to propose the DP-OPH framework with three variants: DP-OPH-fix, DP-OPH-re and DP-OPH-rand, depending on which densification strategy is adopted to deal with empty bins in OPH. A detailed roadmap to the algorithm design is presented along with the privacy analysis. An analytical comparison of our proposed DP-OPH methods with DP minwise hashing (DP-MH) is provided to justify the advantage of DP-OPH. Experiments on similarity search confirm the merits of DP-OPH and guide the choice of the proper variant in different practical scenarios. Our technique is also extended to bin-wise consistent weighted sampling (BCWS) to develop a new DP algorithm called DP-BCWS for non-binary data. Experiments on classification tasks demonstrate that DP-BCWS is able to achieve excellent utility at around $\epsilon = 5\sim 10$, where $\epsilon$ is the standard parameter in the language of $(\epsilon, \delta)$-DP.
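
The contrast this abstract draws between MinHash with K permutations and OPH with K bins can be sketched as below. This is a minimal, non-private illustration only: it includes neither the differential privacy mechanism nor any densification strategy, so empty OPH bins are simply left as `None`. The function names, the explicit permutation lists, and the requirement that K divide the dimension are assumptions made for the example.

```python
import random
from typing import Iterable, List, Optional


def minhash(items: Iterable[int], dim: int, k: int, seed: int = 0) -> List[int]:
    """MinHash sketch of a non-empty set: k permutations, one minimum each."""
    rng = random.Random(seed)
    members = list(items)
    sketch = []
    for _ in range(k):
        perm = list(range(dim))
        rng.shuffle(perm)  # perm[i] is the permuted position of element i
        sketch.append(min(perm[i] for i in members))
    return sketch


def one_permutation_hash(
    items: Iterable[int], dim: int, k: int, seed: int = 0
) -> List[Optional[int]]:
    """OPH sketch: one permutation, k equal bins, smallest offset per bin."""
    if dim % k:
        raise ValueError("this sketch assumes k divides dim")
    rng = random.Random(seed)
    perm = list(range(dim))
    rng.shuffle(perm)
    bin_size = dim // k
    sketch: List[Optional[int]] = [None] * k  # None marks an empty bin
    for i in items:
        bin_index, offset = divmod(perm[i], bin_size)
        if sketch[bin_index] is None or offset < sketch[bin_index]:
            sketch[bin_index] = offset
    return sketch


def estimate_jaccard(sketch_a: List[int], sketch_b: List[int]) -> float:
    """Fraction of matching MinHash positions, an estimate of Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


a = {1, 5, 9, 20, 33}
b = {1, 5, 9, 21, 40}
print(estimate_jaccard(minhash(a, 64, 16), minhash(b, 64, 16)))
print(one_permutation_hash(a, 64, 8))
```

Both sets must be sketched with the same seed so that they share permutations. MinHash shuffles the index space K times per sketch, whereas OPH shuffles it once, which is the efficiency difference the abstract refers to.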