8 research outputs found

    Evidence-based Software Process Recovery

    Developing a large software system involves many complicated, varied, and inter-dependent tasks, and these tasks are typically implemented using a combination of defined processes, semi-automated tools, and ad hoc practices. Stakeholders in the development process, including software developers, managers, and customers, often want to be able to track the actual practices being employed within a project. For example, a customer may wish to be sure that the process is ISO 9000 compliant, a manager may wish to track the amount of testing that has been done in the current iteration, and a developer may wish to determine who has recently been working on a subsystem that has had several major bugs appear in it. However, extracting the software development processes from an existing project is expensive if one must rely upon manual inspection of artifacts and interviews of developers and their managers. Previously, researchers have suggested the live observation and instrumentation of a project to allow for more measurement, but this is costly, invasive, and also requires a live running project. In this work, we propose an approach that we call software process recovery that is based on after-the-fact analysis of various kinds of software development artifacts. We use a variety of supervised and unsupervised techniques from machine learning, topic analysis, natural language processing, and statistics on software repositories such as version control systems, bug trackers, and mailing list archives. We show how we can combine all of these methods to recover process signals that we map back to software development processes such as the Unified Process. The Unified Process has been visualized using a time-line view that shows effort per parallel discipline occurring across time. This visualization is called the Unified Process diagram. We use this diagram as inspiration to produce Recovered Unified Process Views (RUPV) that are a concrete version of this theoretical Unified Process diagram. We then validate these methods using case studies of multiple open source software systems.

    DATA MINING AND RE-IDENTIFICATION: ANALYSIS OF DATABASE QUERY PATTERNS THAT POSE A THREAT TO ANONYMISED INFORMATION

    To maintain the globally connected culture in place today, a number of sectors are built on the gathering and sharing of data. Personal and sensitive data are collected and shared about the individuals using the services offered by these sectors. Data controllers rely on the robustness of anonymisation measures to keep personal and sensitive attributes in the shared dataset privacy-safe. Typically, the dataset is stripped of direct identifiers such as names and National Insurance (NI) numbers, such that individuals in the dataset are not uniquely identifiable. However, details in the dataset perceived by data controllers to have no negative data privacy impact can be used by attackers to perform a re-identification attack. Such an attack uses the details shared in the dataset in conjunction with a secondary data source to rebuild a personally identifiable profile for individual(s) in the supposedly anonymised shared dataset. There have been a few publicised cases of re-identification attacks, and from the information reported about these attacks, it is unknown what constitutes a re-identification attack from a technical perspective other than its outcome. The work in this thesis explores real cases of successful re-identification attacks to analyse and build a technical profile of what re-identification entails. Using the Netflix Prize Data and the re-identification of Governor William Weld as case studies, synthetic datasets are created to represent the anonymised databases shared in each of these re-identification attack cases. An exploratory study to technically represent re-identification attacks as database queries in SQL is conducted. This involves the research performing re-identification attacks on the synthetic databases by executing a series of SQL queries. With a hypothesis that there is enough similarity in the patterns of SQL database queries that lead to re-identification attacks on anonymised databases, this research employs data mining techniques and machine learning algorithms to train classifiers to recognise re-identification patterns in SQL queries. Four classification algorithms, Multilayer Perceptron (MLP), Naive Bayes (NB), K-Nearest Neighbors (KNN), and Logistic Regression (LR), are trained in this research to recognise and predict attempts of re-identification attacks. The results of the performance evaluation and unseen data testing indicate that the MLP, Multinomial Naive Bayes (MNB), and LR classifiers are most effective at recognising patterns of re-identification attacks. During performance evaluation, the MLP classifier achieved an accuracy of 100%, the MNB achieved 79.3%, and the LR achieved 100%. The unseen data testing shows that the MLP, MNB, and LR classifiers are able to predict new instances of re-identification attack attempts 79%, 71%, and 79% of the time respectively, indicating good generalisation performance. To the best of this research’s knowledge, the work in this thesis is the only effort to date to automate the recognition and prediction of re-identification attack attempts on anonymised databases. The novel system developed in this research can be implemented to improve the monitoring of anonymised databases in data sharing environments.
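    The idea of expressing a re-identification attack as SQL, as the abstract above describes, can be illustrated with a minimal sketch. It is not taken from the thesis: the table names (`released`, `public_roll`), the column names, the choice of ZIP code, date of birth, and sex as linking fields, and the helper `link_records` are all assumptions made for illustration, and any rows passed in are expected to be synthetic.

```python
import sqlite3


def link_records(released_rows, public_rows):
    """Link a de-identified table to a named public table (hypothetical schema).

    released_rows: (zip, dob, sex, diagnosis) tuples with no direct identifiers.
    public_rows:   (name, zip, dob, sex) tuples from a secondary data source.
    Returns (name, diagnosis) pairs for released records whose combination of
    zip, dob, and sex is unique within the released table.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE released (zip TEXT, dob TEXT, sex TEXT, diagnosis TEXT)"
    )
    conn.execute(
        "CREATE TABLE public_roll (name TEXT, zip TEXT, dob TEXT, sex TEXT)"
    )
    conn.executemany("INSERT INTO released VALUES (?, ?, ?, ?)", released_rows)
    conn.executemany("INSERT INTO public_roll VALUES (?, ?, ?, ?)", public_rows)

    # Join on the shared fields, keeping only released records that are
    # unique on those fields, since only those yield an unambiguous match.
    query = """
        SELECT p.name, r.diagnosis
        FROM released AS r
        JOIN public_roll AS p
          ON r.zip = p.zip AND r.dob = p.dob AND r.sex = p.sex
        WHERE (SELECT COUNT(*) FROM released AS r2
               WHERE r2.zip = r.zip AND r2.dob = r.dob AND r2.sex = r.sex) = 1
        ORDER BY p.name
    """
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows
```

    A query of this shape (a join between the shared dataset and a secondary source on fields that were not treated as identifiers) is one plausible example of the kind of pattern a classifier over SQL queries might be trained to flag; the features actually used in the thesis are not stated in the abstract.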

    Personality Identification from Social Media Using Deep Learning: A Review

    Social media helps people scattered around the world share ideas and information, and thus helps create communities, groups, and virtual networks. Identifying personality is significant in many types of applications, such as detecting the mental state or character of a person, predicting job satisfaction and professional and personal relationship success, and building recommendation systems. Personality is also an important factor in determining individual variation in thoughts, feelings, and conduct systems. According to the Global social media research survey of 2018, there are approximately 3.196 billion social media users worldwide. The numbers are estimated to grow rapidly with the use of mobile smart devices and advances in technology. Support vector machine (SVM), Naive Bayes (NB), multilayer perceptron neural network, and convolutional neural network (CNN) are some of the machine learning techniques used for personality identification in the literature. This paper reviews studies that identify the personality of social media users with machine learning approaches, including recent studies that aimed to predict the personality of online social media (OSM) users.

    Mining social structures from genealogical data


    Essentials of Business Analytics


    Advances in knowledge discovery and data mining Part II

    19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II