    The relationship between search based software engineering and predictive modeling

    Search Based Software Engineering (SBSE) is an approach to software engineering in which search based optimization algorithms are used to identify optimal or near optimal solutions and to yield insight. SBSE techniques can cater for multiple, possibly competing objectives and/or constraints, and for applications where the potential solution space is large and complex. This paper provides a brief overview of SBSE, explaining some of the ways in which it has already been applied to the construction of predictive models. There is a mutually beneficial relationship between predictive models and SBSE. The paper sets out eleven open problem areas for Search Based Predictive Modeling and describes how predictive models also have a role to play in improving SBSE.
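
    As a purely illustrative aside (not taken from the paper), the sketch below shows the flavour of search based construction of a predictive model: a simple hill-climbing search selects a feature subset for a toy classifier. The synthetic data, fitness function and size penalty are all assumptions made for the example.

```python
# Toy sketch: search-based feature selection via hill climbing, where the
# search looks for a feature subset maximizing a simple predictive fitness
# score on synthetic data (all names and data here are made up).
import random

random.seed(0)

# Synthetic data: 6 features, only features 0 and 3 carry signal.
def make_point():
    x = [random.gauss(0, 1) for _ in range(6)]
    y = 1 if x[0] + x[3] > 0 else 0
    return x, y

data = [make_point() for _ in range(300)]

def fitness(subset):
    """Accuracy of a trivial classifier that thresholds the sum of the
    selected features; stands in for a real predictive model."""
    if not subset:
        return 0.0
    correct = 0
    for x, y in data:
        pred = 1 if sum(x[i] for i in subset) > 0 else 0
        correct += (pred == y)
    # Penalize larger subsets slightly so the search prefers compact models.
    return correct / len(data) - 0.01 * len(subset)

def hill_climb(n_features=6, steps=200):
    current = {i for i in range(n_features) if random.random() < 0.5}
    best, best_fit = set(current), fitness(current)
    for _ in range(steps):
        neighbour = set(current)
        # Flip one feature in or out of the subset.
        neighbour.symmetric_difference_update({random.randrange(n_features)})
        if fitness(neighbour) >= fitness(current):
            current = neighbour
            if fitness(current) > best_fit:
                best, best_fit = set(current), fitness(current)
    return best, best_fit

print(hill_climb())   # typically recovers the informative subset {0, 3}
```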

    On Formal Methods for Large-Scale Product Configuration

    In product development companies, mass customization is widely used to achieve better customer satisfaction while keeping costs down. To efficiently implement mass customization, product platforms are often used. A product platform allows building a wide range of products from a set of predefined components. The process of matching these components to customers' needs is called product configuration. Not all components can be combined with each other, due to restrictions of various kinds, for example geometrical, marketing and legal reasons. Product design engineers develop configuration constraints to describe such restrictions. The number of constraints and the complexity of the relations between them are immense for a complex product such as a vehicle. Thus, it is both error-prone and time consuming to analyze, author and verify the constraints manually. Software tools based on formal methods can help engineers avoid making errors when working with configuration constraints, and thus design a correct product faster. This thesis introduces a number of formal methods to help engineers maintain, verify and analyze product configuration constraints. These methods provide automatic verification of constraints and computational support for analyzing and refactoring constraints. The methods also allow verifying the correctness of one specific type of constraint, item usage rules, for sets of mutually-exclusive required items, and automatic verification of the equivalence of different formulations of the constraints. The thesis also introduces three methods for efficient enumeration of valid partial configurations, with benchmarking of the methods on an industrial dataset. Handling large-scale industrial product configuration problems demands high efficiency from the software methods. This thesis investigates a number of search-based and knowledge-compilation-based methods for working with large product configuration instances, including Boolean satisfiability solvers, binary decision diagrams and decomposable negation normal form. This thesis also proposes a novel method based on supervisory control theory for efficient reasoning about product configuration data. The methods were implemented in a tool to investigate their applicability for handling large product configuration problems. It was found that search-based Boolean satisfiability solvers with incremental capabilities are well suited for industrial configuration problems. The methods proposed in this thesis exhibit good performance on practical configuration problems, and have the potential to be implemented in industry to support product design engineers in creating and maintaining configuration constraints, and to speed up the development of product platforms and new products.
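
    As a toy illustration of the configuration problem described above (my own construction, not the thesis tooling), the sketch below brute-forces the valid configurations of a four-component platform under a few item usage rules; real industrial instances require the SAT, BDD or DNNF engines the thesis investigates.

```python
# Illustrative only: enumerate valid configurations of a tiny, made-up
# product platform by brute force over all component combinations.
from itertools import product

components = ["engine_petrol", "engine_diesel", "towbar", "sport_pack"]

def valid(cfg):
    c = dict(zip(components, cfg))
    # Exactly one engine must be chosen (mutually exclusive required items).
    if c["engine_petrol"] == c["engine_diesel"]:
        return False
    # Item usage rule: the sport pack requires the petrol engine.
    if c["sport_pack"] and not c["engine_petrol"]:
        return False
    # Item usage rule: the towbar is not allowed together with the sport pack.
    if c["towbar"] and c["sport_pack"]:
        return False
    return True

valid_configs = [dict(zip(components, cfg))
                 for cfg in product([False, True], repeat=len(components))
                 if valid(cfg)]

for cfg in valid_configs:
    print({name for name, chosen in cfg.items() if chosen})
```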

    Industrial Symbiosis Recommender Systems

    For a long time, humanity has lived upon the paradigm that the amounts of natural resources are unlimited and that the environment has ample regenerative capacity. However, the shift towards sustainability has resulted in a worldwide adoption of policies addressing resource efficiency and the preservation of natural resources. One of the key environmental and economic sustainable operations currently promoted and enacted in European Union policy is Industrial Symbiosis. In industrial symbiosis, firms aim to reduce the total material and energy footprint by circulating traditional secondary production process outputs of firms to become part of an input for the production process of other firms. This thesis directs attention to the design considerations for recommender systems in the highly dynamic domain of industrial symbiosis. Recommender systems are a promising technology that may facilitate multiple facets of industrial symbiosis creation, as they reduce the complexity of decision making. This typical strength of recommender systems has been responsible for improved sales and a higher return on investment. This provides the prospect for industrial symbiosis recommenders to increase the number of synergistic transactions that reduce the total environmental impact of the process industry in particular.
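
    A minimal sketch of how such a recommender might match firms (an assumption for illustration; the thesis does not prescribe this particular method): score candidate receivers by the overlap between a firm's waste output descriptors and other firms' input requirements.

```python
# Hypothetical sketch: recommend industrial symbiosis candidates by matching
# a firm's output descriptors against other firms' input requirements using
# Jaccard similarity over resource tags (firms and tags are made up).
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

outputs = {"brewery":  {"spent_grain", "organic", "wet"},
           "sawmill":  {"wood_chips", "dry", "combustible"}}
inputs  = {"cattle_farm":  {"organic", "feed", "wet"},
           "pellet_plant": {"wood_chips", "combustible"},
           "greenhouse":   {"co2", "heat"}}

def recommend(firm, top_k=2):
    scores = [(other, jaccard(outputs[firm], need))
              for other, need in inputs.items()]
    return sorted(scores, key=lambda s: -s[1])[:top_k]

print(recommend("brewery"))   # cattle_farm ranks first
print(recommend("sawmill"))   # pellet_plant ranks first
```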

    Approaches to grid-based SAT solving

    In this work we develop techniques for using distributed computing resources to efficiently solve instances of the propositional satisfiability problem (SAT). The computing resources considered in this work are assumed to be geographically distributed and connected by a non-dedicated network. Such systems are typically referred to as computational grid environments. The time a modern SAT solver consumes while solving an instance varies according to a random distribution. Unlike many other methods for distributed SAT solving, this work identifies this random distribution as a valuable resource for solving-time reduction. Methods which exploit randomness in the run times of a search algorithm, such as the ones discussed in this work, are examples of multi-search. The main contribution of this work is in developing and analyzing the multi-search approach in SAT solving and showing its efficiency with several experiments. For the purpose of the analysis, the work introduces a grid simulation model which captures several properties of a grid environment that are not observed in more traditional parallel computing systems. The work develops two algorithmic frameworks for multi-search in SAT. The first, SDSAT, is based on using properties of the distribution of the solving time so that the expected time required to solve an instance is reduced. Based on the analysis of SDSAT, the work proposes an algorithm for efficiently using a large number of computing resources simultaneously to solve collections of SAT instances. The analysis of SDSAT also motivates the second algorithmic framework, CL-SDSAT. The framework is used to efficiently solve many industrial SAT instances by carefully combining information learned in the distributed SAT solvers. All methods described in the work are directly applicable in a wide range of grid environments and can be used together with virtually unmodified state-of-the-art SAT solvers. The methods are experimentally verified using standard benchmark SAT instances in a production-level grid environment. The experiments show that, using the relatively simple methods developed in this work, SAT instances which cannot be solved efficiently in sequential settings can now be solved in a grid environment.
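
    The core intuition behind multi-search can be shown with a small simulation (the runtime distribution below is an assumed heavy-tailed model, not data from the thesis): running k independent randomized copies of a solver and taking the first one to finish reduces the expected solving time.

```python
# Illustrative simulation only: when a randomized solver's runtime follows a
# heavy-tailed distribution, the expected minimum over k independent copies
# (multi-search) is much smaller than the expected time of a single run.
import random

random.seed(1)

def solver_runtime():
    # Assumed Pareto-like heavy-tailed runtime model, in seconds.
    u = 1.0 - random.random()          # u in (0, 1]
    return 10.0 * u ** -0.7

def expected_time(k, trials=20000):
    total = 0.0
    for _ in range(trials):
        total += min(solver_runtime() for _ in range(k))
    return total / trials

for k in (1, 2, 4, 8, 16):
    print(f"copies={k:2d}  expected solving time ~ {expected_time(k):7.1f} s")
```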

    USING SOCIAL ANNOTATIONS TO IMPROVE WEB SEARCH

    Web-based tagging systems, which include social bookmarking systems such as Delicious, have become increasingly popular. These systems allow participants to annotate or tag web resources. This research examined the use of social annotations to improve the quality of web searches. The research involved three components. First, social annotations were used to index resources. Two annotation-based indexing methods were proposed: annotation-based indexing and full-text-with-annotation indexing. Second, social annotations were used to improve search result ranking. Six annotation-based ranking methods were proposed: Popularity Count, Propagate Popularity Count, Query Weighted Popularity Count, Query Weighted Propagate Popularity Count, Match Tag Count and Normalized Match Tag Count. Third, social annotations were used to both index and rank resources. The result of the first experiment suggested that both static features and similarity features should be considered when using social annotations to re-rank search results. The result of the second experiment showed that using only annotations as an index of resources may not be a good idea. Since social annotations can be viewed as high-level concepts of the content, combining them with the content of a resource can add important concepts to the resource. Finally, the result of the third experiment confirmed that combining the use of social annotations to rank search results with their use as resource index augmentation provided a promising ranking of search results. It showed that social annotations can benefit web search.
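
    A simplified sketch of the flavour of annotation-based ranking (my own reading for illustration; the thesis's exact formulas for methods such as Match Tag Count and Normalized Match Tag Count may differ from this):

```python
# Illustrative only: rank resources by how strongly their social tags match
# the query, in a raw and a normalized variant.
from collections import Counter

# resource -> multiset of tags assigned by users of a social bookmarking site
annotations = {
    "doc_a": Counter({"python": 12, "tutorial": 5, "web": 2}),
    "doc_b": Counter({"python": 3, "snake": 9}),
    "doc_c": Counter({"web": 7, "search": 6, "python": 1}),
}

def match_tag_count(query_terms, tags):
    """Number of tag assignments that match a query term."""
    return sum(tags[t] for t in query_terms)

def normalized_match_tag_count(query_terms, tags):
    """Matching tag assignments as a fraction of all tag assignments."""
    total = sum(tags.values())
    return match_tag_count(query_terms, tags) / total if total else 0.0

def rank(query_terms, scorer):
    scored = [(doc, scorer(query_terms, tags)) for doc, tags in annotations.items()]
    return sorted(scored, key=lambda s: -s[1])

print(rank({"python", "web"}, match_tag_count))
print(rank({"python", "web"}, normalized_match_tag_count))
```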

    Machine learning for network based intrusion detection: an investigation into discrepancies in findings with the KDD Cup '99 data set and multi-objective evolution of neural network classifier ensembles from imbalanced data.

    For the last decade it has become commonplace to evaluate machine learning techniques for network based intrusion detection on the KDD Cup '99 data set. This data set has served well to demonstrate that machine learning can be useful in intrusion detection. However, it has undergone some criticism in the literature, and it is out of date. Therefore, some researchers question the validity of the findings reported based on this data set. Furthermore, as identified in this thesis, there are also discrepancies in the findings reported in the literature. In some cases the results are contradictory. Consequently, it is difficult to analyse the current body of research to determine the value in the findings. This thesis reports on an empirical investigation to determine the underlying causes of the discrepancies. Several methodological factors, such as choice of data subset, validation method and data preprocessing, are identified and are found to affect the results significantly. These findings have also enabled a better interpretation of the current body of research. Furthermore, the criticisms in the literature are addressed and future use of the data set is discussed, which is important since researchers continue to use it due to a lack of better publicly available alternatives. Due to the nature of the intrusion detection domain, there is an extreme imbalance among the classes in the KDD Cup '99 data set, which poses a significant challenge to machine learning. In other domains, researchers have demonstrated that well known techniques such as Artificial Neural Networks (ANNs) and Decision Trees (DTs) often fail to learn the minor class(es) due to class imbalance. However, this has not previously been recognized as an issue in intrusion detection. This thesis reports on an empirical investigation that demonstrates that it is the class imbalance that causes the poor detection of some classes of intrusion reported in the literature. An alternative approach to training ANNs is proposed in this thesis, using Genetic Algorithms (GAs) to evolve the weights of the ANNs, referred to as an Evolutionary Neural Network (ENN). When employing evaluation functions that calculate the fitness proportionally to the instances of each class, thereby avoiding a bias towards the major class(es) in the data set, significantly improved true positive rates are obtained whilst maintaining a low false positive rate. These findings demonstrate that the issues of learning from imbalanced data are due not to limitations of the ANNs but rather to the training algorithm. Moreover, the ENN is capable of detecting a class of intrusion that has been reported in the literature to be undetectable by ANNs. One limitation of the ENN is a lack of control over the classification trade-off the ANNs obtain. This is identified as a general issue with current approaches to creating classifiers. Striving to create a single best classifier that obtains the highest accuracy may give an unfruitful classification trade-off, which is demonstrated clearly in this thesis. Therefore, an extension of the ENN is proposed, using a Multi-Objective GA (MOGA), which treats the classification rate on each class as a separate objective. This approach produces a Pareto front of non-dominated solutions that exhibit different classification trade-offs, from which the user can select one with the desired properties. The multi-objective approach is also utilised to evolve classifier ensembles, which yields an improved Pareto front of solutions. Furthermore, the selection of classifier members for the ensembles is investigated, demonstrating how this affects the performance of the resultant ensembles. This is key to explaining why some classifier combinations fail to give fruitful solutions.
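
    A minimal sketch of the key idea behind class-proportional fitness (a toy construction under assumed data, not the thesis code): evolve the weights of a small classifier with a GA whose fitness averages the per-class detection rates, so that minority classes are not swamped by the majority class.

```python
# Toy sketch: a GA evolves the weights of a one-layer classifier; fitness is
# the mean per-class detection rate, avoiding bias towards the major class.
import random

random.seed(2)

# Imbalanced synthetic data: class 1 is rare (~5% of samples).
def sample():
    y = 1 if random.random() < 0.05 else 0
    x = [random.gauss(2.0 * y, 1.0), random.gauss(-1.0 * y, 1.0)]
    return x, y

data = [sample() for _ in range(1000)]

def predict(w, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + w[2] > 0 else 0

def balanced_fitness(w):
    per_class = {}
    for x, y in data:
        hit, n = per_class.get(y, (0, 0))
        per_class[y] = (hit + (predict(w, x) == y), n + 1)
    # Average of per-class recalls: each class counts equally.
    return sum(hit / n for hit, n in per_class.values()) / len(per_class)

def evolve(pop_size=30, generations=40, sigma=0.3):
    pop = [[random.gauss(0, 1) for _ in range(3)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=balanced_fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = [[g + random.gauss(0, sigma) for g in random.choice(parents)]
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=balanced_fitness)

best = evolve()
print("balanced fitness:", round(balanced_fitness(best), 3))
```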

    Exploring variability in medical imaging

    Although recent successes of deep learning and novel machine learning techniques have improved the performance of classification and (anomaly) detection in computer vision problems, the application of these methods in the medical imaging pipeline remains a very challenging task. One of the main reasons for this is the amount of variability that is encountered and encapsulated in human anatomy and subsequently reflected in medical images. This fundamental factor impacts most stages in modern medical imaging processing pipelines. The variability of human anatomy makes it virtually impossible to build large datasets for each disease with labels and annotations for fully supervised machine learning. An efficient way to cope with this is to try to learn only from normal samples; such data is much easier to collect. A case study of such an automatic anomaly detection system based on normative learning is presented in this work. We present a framework for detecting fetal cardiac anomalies during ultrasound screening using generative models, which are trained using only normal/healthy subjects. However, despite significant improvements in automatic abnormality detection systems, clinical routine continues to rely exclusively on the contribution of overburdened medical experts to diagnose and localise abnormalities. Integrating human expert knowledge into the medical imaging processing pipeline entails uncertainty, which is mainly correlated with inter-observer variability. From the perspective of building an automated medical imaging system, it is still an open issue to what extent this kind of variability and the resulting uncertainty are introduced during the training of a model and how it affects the final performance of the task. Consequently, it is very important to explore the effect of inter-observer variability both on the reliable estimation of a model's uncertainty and on the model's performance in a specific machine learning task. A thorough investigation of this issue is presented in this work by leveraging automated estimates of machine learning model uncertainty, inter-observer variability and segmentation task performance in lung CT scan images. Finally, an overview of existing anomaly detection methods in medical imaging is presented. This state-of-the-art survey includes both conventional pattern recognition methods and deep learning based methods, and is one of the first literature surveys attempted in this specific research area.
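
    The normative-learning principle can be sketched with a deliberately simple stand-in model (an assumption for illustration; the thesis uses deep generative models on fetal ultrasound, not a Gaussian on toy features): fit a density to features of normal subjects only, and flag test samples that score poorly under it.

```python
# Schematic only: fit a normative Gaussian model to features of "normal"
# samples and flag test samples whose Mahalanobis distance is unusually large.
import numpy as np

rng = np.random.default_rng(3)

# Features extracted from normal training images (assumed 2-D here).
normal_train = rng.normal(loc=[0.0, 0.0], scale=[1.0, 0.5], size=(500, 2))

mu = normal_train.mean(axis=0)
cov = np.cov(normal_train, rowvar=False)
cov_inv = np.linalg.inv(cov)

def anomaly_score(x):
    """Mahalanobis distance to the normative distribution."""
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Threshold chosen from the normal data itself (e.g. the 99th percentile).
threshold = np.percentile([anomaly_score(x) for x in normal_train], 99)

test = np.array([[0.2, -0.1],    # looks normal
                 [4.0, 3.0]])    # far from the normative model
for x in test:
    print(x, "anomalous" if anomaly_score(x) > threshold else "normal")
```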

    Similarity and diversity: two sides of the same coin in the evaluation of data streams

    Information systems represent the primary instrument of growth for companies that operate in the e-commerce environment. The data streams generated by the users that interact with their websites are the primary source for defining user behavioral models. Typical examples of services integrated into these websites are recommender systems, where these models are exploited to generate recommendations of items of potential interest to users; user segmentation systems, where the models are used to group users on the basis of their preferences; and fraud detection systems, where these models are exploited to determine the legitimacy of a financial transaction. Even though in the literature diversity and similarity are considered two sides of the same coin, almost all approaches take them into account in a mutually exclusive manner rather than jointly. The aim of this thesis is to demonstrate that considering both sides of this coin is instead essential to overcome some well-known problems that afflict the state-of-the-art approaches used to implement these services, improving their performance. Its contributions are the following: with regard to recommender systems, the detection of diversity in a user profile is used to discard incoherent items, improving accuracy, while the exploitation of the similarity of the predicted items is used to re-rank the recommendations, improving their effectiveness; with regard to user segmentation systems, the detection of diversity overcomes the problem of the non-reliability of the data source, while the exploitation of similarity reduces the problems of understandability and triviality of the obtained segments; lastly, concerning fraud detection systems, the joint use of both diversity and similarity in the evaluation of a new transaction overcomes the problems of data scarcity, non-stationary data and unbalanced class distribution.
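
    A rough sketch of the general idea (my own simplified reading, not the thesis's algorithms): diversity within a user profile is used to discard incoherent items, and similarity to the cleaned profile is then used to re-rank candidate recommendations.

```python
# Illustrative only: item vectors, thresholds and names are made up.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

profile = {"thriller_1": np.array([1.0, 0.1, 0.0]),
           "thriller_2": np.array([0.9, 0.2, 0.1]),
           "gift_toy":   np.array([0.0, 0.1, 1.0])}   # incoherent purchase

# Diversity step: drop profile items weakly similar to the rest of the profile.
def coherent_profile(profile, min_mean_sim=0.4):
    kept = {}
    for name, vec in profile.items():
        others = [v for n, v in profile.items() if n != name]
        if np.mean([cosine(vec, o) for o in others]) >= min_mean_sim:
            kept[name] = vec
    return kept

candidates = {"thriller_3": np.array([0.95, 0.15, 0.05]),
              "romance_1":  np.array([0.1, 1.0, 0.0])}

# Similarity step: re-rank candidates against the cleaned profile.
clean = coherent_profile(profile)
ranking = sorted(candidates.items(),
                 key=lambda kv: -np.mean([cosine(kv[1], v) for v in clean.values()]))
print([name for name, _ in ranking])   # thriller_3 ranks first
```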
