5 research outputs found
CSISE: cloud-based semantic image search engine
Title from PDF of title page, viewed on March 27, 2014
Thesis advisor: Yugyung Lee
Vita
Includes bibliographical references (pages 53-56)
Thesis (M.S.)--School of Computing and Engineering. University of Missouri--Kansas City, 2013

Due to the rapid exponential growth of data, two challenges we face today are how to handle big data and how to analyze large data sets. An IBM study showed that 90% of the data in the world today was created in the last two years alone. We have especially seen the exponential growth of images on the Web, e.g., more than 6 billion images in Flickr, 1.5 billion in the Google image engine, and more than 1 billion in Instagram [1]. Since big data is not only a matter of size but also of heterogeneous data types and sources, image searching over big data may not be scalable in practical settings. We envision Cloud computing as a new way to transform the big data challenge into a great opportunity. In this thesis, we perform efficient and accurate classification of a large collection of images using Cloud computing, which in turn supports semantic image searching. A novel approach with enhanced accuracy is proposed that uses semantic technology to classify images by analyzing both metadata and image data. A two-level classification model was designed: (i) semantic classification is performed on image metadata using TF-IDF, and (ii) image classification is performed using a hybrid image processing model that combines Euclidean distance and SURF FLANN measurements. A Cloud-based Semantic Image Search Engine (CSISE) was also developed to search for an image using the proposed semantic model over a dynamic image repository that connects online image search engines including Google Image Search, Flickr, and Picasa. A series of experiments was performed in a large-scale Hadoop environment on IBM's cloud over half a million logo images of 76 types. The experimental results show that the performance of the CSISE engine (based on the proposed method) is comparable to popular online image search engines and more accurate (average precision of 71%) than existing approaches.

Abstract -- Contents -- Illustrations -- Tables -- Acknowledgements -- Introduction -- Related work -- Cloud-based semantic image search engine model -- Cloud-based semantic image search engine (CSISE) implementation -- Experimental results and evaluation -- Conclusion and future work -- References
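To make the two-level idea concrete, here is a small, hypothetical Python sketch of both levels: TF-IDF similarity over image metadata and a SURF + FLANN keypoint match score between two images. It is not the CSISE implementation; it assumes scikit-learn and an OpenCV build that ships the non-free xfeatures2d (SURF) module, and it omits the Euclidean-distance component over global image features that the hybrid model also uses.

    import cv2
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def metadata_scores(query_text, candidate_texts):
        """Level 1: rank candidate images by TF-IDF cosine similarity of their metadata."""
        tfidf = TfidfVectorizer(stop_words="english").fit_transform([query_text] + candidate_texts)
        return cosine_similarity(tfidf[0:1], tfidf[1:]).ravel()

    def surf_flann_score(img_a, img_b, ratio=0.7):
        """Level 2: fraction of SURF keypoints in img_a that have a good FLANN match in img_b."""
        surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
        _, desc_a = surf.detectAndCompute(img_a, None)
        _, desc_b = surf.detectAndCompute(img_b, None)
        if desc_a is None or desc_b is None:
            return 0.0
        flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))  # KD-tree index
        matches = flann.knnMatch(desc_a, desc_b, k=2)
        # Lowe's ratio test to keep only distinctive matches
        good = [p[0] for p in matches if len(p) == 2 and p[0].distance < ratio * p[1].distance]
        return len(good) / max(len(desc_a), 1)

One plausible way to combine the two levels is to narrow the repository with metadata_scores first and run the more expensive surf_flann_score only on the shortlisted images; the abstract does not spell out the exact pipeline, so this ordering is an assumption.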
GraphEvo: Evaluating Software Evolution Using Machine Learning Based Call Graph Analytics And Network Portrait Divergence
Title from PDF of title page, viewed September 9, 2022
Dissertation advisor: Yugyung Lee
Vita
Includes bibliographical references (pages 151-168)
Dissertation (Ph.D.)--Department of Computer Science and Electrical Engineering. University of Missouri--Kansas City, 2022

Understanding software evolution is essential for software development tasks, including debugging, maintenance, and testing. Unfortunately, as software changes it grows larger and more complicated, which makes it harder to understand. Software Defect Prediction (SDP) over the codebase is one of the most common ways artificial intelligence (AI) is used to improve the quality of agile products, yet graph-based software metrics are seldom used in defect prediction.
In this dissertation, we propose a graph-based software framework called GraphEvo, built on deep learning models for graphs. We apply recent advances in network comparison to software networks via the information-theoretic metric Network Portrait Divergence (NPD), which captures structural changes in call-graph-based software networks. The NPD-based method determines which software changes are significant, how many execution paths are affected, and how the tests evolve with respect to the code; all of these factors affect how reliable the software is. To validate the NPD-based approach, version control history and Pull Requests (PRs) are used.
GraphEvo's most significant contributions are: (i) finding and showing how software has changed over time using call graphs; (ii) using machine learning and deep learning techniques to understand the software and predict how many defects are in each code entity (such as a class); (iii) using the NPD-based tooling to create a public bug dataset and applying machine learning to see how well it can predict software defects; and (iv) helping the PR review process by showing how code changes and the tests that accompany them relate.
We evaluated GraphEvo (i) across 66 software releases from five popular Java open-source systems to show that it works; (ii) on 9 Java projects with deep learning to build an SDP model; (iii) on 19 Java projects of different sizes and types from GitHub, augmented with bug information from other sources; and (iv) on 627 PRs from 14 Java projects to see how vital tests are in PRs. These comprehensive experiments show that GraphEvo works well for debugging, maintaining, and testing software. We also received favorable responses from user studies in which we asked software developers and testers what they thought of GraphEvo.

Introduction -- Characterizing and understanding software evolution using call graphs -- Defect prediction using deep learning with NPD for software evolution -- NPD-based tooling, extendible defect dataset and its assessment -- Reviewing pull requests with path-based NPD and test
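As a rough illustration of the kind of call-graph comparison NPD enables, the following Python sketch builds a network portrait for each graph and compares the two with a Jensen-Shannon divergence. This is not GraphEvo's code: networkx, numpy, and scipy are assumed to be available, and the column weighting is a simplified reading of the published Network Portrait Divergence definition (Bagrow and Bovy, 2019), so details may differ from the dissertation's tooling.

    import networkx as nx
    import numpy as np
    from scipy.spatial.distance import jensenshannon

    def portrait(g):
        """B[l, k] = number of nodes that have exactly k nodes at shortest-path distance l."""
        n = g.number_of_nodes()
        lengths = dict(nx.all_pairs_shortest_path_length(g))
        diameter = max(d for row in lengths.values() for d in row.values())
        b = np.zeros((diameter + 1, n + 1))
        for src, row in lengths.items():
            counts = np.bincount(list(row.values()), minlength=diameter + 1)
            for l, k in enumerate(counts):
                b[l, k] += 1
        return b

    def portrait_divergence(g1, g2):
        """Jensen-Shannon divergence between the k-weighted, normalized portraits of two graphs."""
        b1, b2 = portrait(g1), portrait(g2)
        rows, cols = max(b1.shape[0], b2.shape[0]), max(b1.shape[1], b2.shape[1])
        p, q = np.zeros((rows, cols)), np.zeros((rows, cols))
        p[:b1.shape[0], :b1.shape[1]] = b1
        q[:b2.shape[0], :b2.shape[1]] = b2
        k = np.arange(cols)                       # weight column k by the number of nodes it counts
        p, q = (p * k).ravel(), (q * k).ravel()
        p, q = p / p.sum(), q / q.sum()
        return jensenshannon(p, q, base=2) ** 2   # scipy returns the JS distance, so square it

Comparing the call graphs of two consecutive releases with such a divergence yields a single value in [0, 1]: a value near 0 suggests little structural change, while a larger value flags releases whose call structure changed substantially.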
A Fine-grained Data Set and Analysis of Tangling in Bug Fixing Commits
Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs.
Objective: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits.
Methods: We use a crowdsourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus.
Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label, leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case.
Conclusion: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise.
Comment: Status: Accepted at Empirical Software Engineering
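The three-of-four agreement rule from the Methods section is simple enough to state as code. The following hypothetical Python helper shows how a consensus label would be derived for a single changed line; the label names are illustrative, not the study's actual label set.

    from collections import Counter

    def consensus(labels, min_agreement=3):
        """Return the agreed label for one line, or None if no label reaches the threshold."""
        assert len(labels) == 4, "each line is labeled by four participants"
        label, count = Counter(labels).most_common(1)[0]
        return label if count >= min_agreement else None

    print(consensus(["bugfix", "bugfix", "bugfix", "refactoring"]))    # -> bugfix
    print(consensus(["bugfix", "test", "refactoring", "whitespace"]))  # -> None (active disagreement)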
A fine-grained data set and analysis of tangling in bug fixing commits
Abstract
Context: Tangled commits are changes to software that address multiple concerns at once. For researchers interested in bugs, tangled commits mean that they actually study not only bugs, but also other concerns irrelevant for the study of bugs.
Objectives: We want to improve our understanding of the prevalence of tangling and the types of changes that are tangled within bug fixing commits.
Methods: We use a crowdsourcing approach for manual labeling to validate which changes contribute to bug fixes for each line in bug fixing commits. Each line is labeled by four participants. If at least three participants agree on the same label, we have consensus.
Results: We estimate that between 17% and 32% of all changes in bug fixing commits modify the source code to fix the underlying problem. However, when we only consider changes to the production code files this ratio increases to 66% to 87%. We find that about 11% of lines are hard to label, leading to active disagreements between participants. Due to confirmed tangling and the uncertainty in our data, we estimate that 3% to 47% of data is noisy without manual untangling, depending on the use case.
Conclusions: Tangled commits have a high prevalence in bug fixes and can lead to a large amount of noise in the data. Prior research indicates that this noise may alter results. As researchers, we should be skeptics and assume that unvalidated data is likely very noisy, until proven otherwise.