    Mining domain knowledge from app descriptions

    Domain analysis aims at obtaining knowledge about a particular domain in the early stages of software development. A key challenge in domain analysis is extracting features automatically from related product artifacts. Compared with other kinds of artifacts, a high volume of descriptions can easily be collected from app marketplaces (such as Google Play and the Apple App Store) when developing a new mobile application (app), so obtaining features and relationships from them with data technologies is essential to the success of domain analysis. In this paper, we propose an approach to mine domain knowledge from app descriptions automatically. In our approach, the feature information in a single app description is first extracted and formally described by a Concern-based Description Model (CDM); this step relies on predefined feature-extraction rules and a modified topic-modeling method. The overall knowledge in the domain is then identified by classifying, clustering, and merging the knowledge in the set of CDMs and topics, and the results are formalized as a Data-based Raw Domain Model (DRDM). Furthermore, we propose a quantified evaluation method for prioritizing the knowledge in the DRDM. The proposed approach is validated by a series of experiments.
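
    The pipeline hinges on topic modeling over description text. As a rough illustration (a minimal sketch using plain LDA from scikit-learn, not the paper's modified topic model or its CDM/DRDM formalisms, with made-up descriptions), recurring terms per topic can serve as candidate domain features:

    # Minimal LDA sketch: surface recurring terms in app descriptions
    # as candidate domain features. Illustrative only.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    descriptions = [  # stand-in app-store descriptions
        "track your runs and share routes with friends",
        "record workouts, set goals, and monitor heart rate",
        "chat with friends and share photos instantly",
        "send messages, make video calls, and share files",
    ]

    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(descriptions)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

    terms = vectorizer.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        top = [terms[i] for i in weights.argsort()[::-1][:5]]
        print(f"topic {k}: {', '.join(top)}")  # candidate domain features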

    Mining Domain Knowledge: Using Functional Dependencies to Profile Data

    Poor data quality is one of the primary issues facing big data projects. Cleaning data and improving its quality can be expensive and time-intensive: in data warehouse projects, data cleaning is estimated to account for 30% to 80% of a project's development time and budget. Data quality mining, one method for identifying errors, has grown increasingly popular over the past 20 years. Our research-in-progress aims to identify multi-field errors by mining functional dependencies. Existing research on data quality mining and functional dependencies has focused on improving algorithms so that they identify a higher percentage of complex errors. The proposed process instead strives to expedite error identification and to increase a user's domain knowledge in order to reduce the costs associated with cleaning; it will also include an assessment of when further cleaning is unlikely to be cost-effective.
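
    To make the idea concrete, here is a minimal sketch (not the authors' algorithm) of using one candidate functional dependency to surface multi-field errors: if zip_code -> city is expected to hold, any zip code mapped to two different cities flags rows worth cleaning. Field names and data are hypothetical.

    from collections import defaultdict

    rows = [  # stand-in records
        {"zip_code": "45040", "city": "Mason"},
        {"zip_code": "45040", "city": "Mason"},
        {"zip_code": "45040", "city": "Masno"},  # likely typo: violates the FD
        {"zip_code": "10001", "city": "New York"},
    ]

    def fd_violations(rows, lhs, rhs):
        """Group rows by lhs and report groups whose rhs value is not
        unique, i.e. where the dependency lhs -> rhs is violated."""
        groups = defaultdict(set)
        for row in rows:
            groups[row[lhs]].add(row[rhs])
        return {k: v for k, v in groups.items() if len(v) > 1}

    print(fd_violations(rows, "zip_code", "city"))  # {'45040': {'Mason', 'Masno'}}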

    Derivation of Monotone Decision Models from Non-Monotone Data

    The objective of data mining is the extraction of knowledge from databases. In practice, one often encounters difficulties with models that are constructed purely by search, without incorporating knowledge about the domain of application. In economic decision making, such as credit loan approval or risk analysis, one often requires models that are monotone with respect to the decision variables involved. If the model is obtained by a blind search through the data, it mostly does not have this property, even if the underlying database is monotone. In this paper, we present methods to enforce the monotonicity of decision models. We propose measures to express the degree of monotonicity of the data and an algorithm to make data sets monotone. In addition, it is shown that monotone decision trees derived from cleaned data perform better than trees derived from raw data.

    Keywords: decision models; knowledge; decision theory; operational research; data mining
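
    One way to make a "degree of monotonicity" concrete (an assumed measure for illustration, not necessarily the paper's definition) is the fraction of comparable example pairs whose labels respect attribute dominance:

    # Sketch: degree of monotonicity as 1 - (violating pairs / comparable pairs).
    # x dominates y if every attribute of x >= the matching attribute of y;
    # monotonicity then requires label(x) >= label(y).
    from itertools import combinations

    data = [  # hypothetical (attributes, label) pairs, e.g. loan applicants
        ((3, 1), 0),
        ((5, 2), 1),
        ((6, 4), 1),
        ((7, 5), 0),  # violates monotonicity w.r.t. the two rows above
    ]

    def dominates(x, y):
        return all(a >= b for a, b in zip(x, y))

    comparable = violations = 0
    for (x, lx), (y, ly) in combinations(data, 2):
        if dominates(x, y):
            comparable += 1
            violations += lx < ly
        elif dominates(y, x):
            comparable += 1
            violations += ly < lx

    print(f"degree of monotonicity: {1 - violations / comparable:.2f}")  # 0.67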

    Simplifying Deep-Learning-Based Model for Code Search

    To accelerate software development, developers frequently search for and reuse existing code snippets from large-scale codebases such as GitHub. Over the years, researchers have proposed many information retrieval (IR) based models for code search, which match keywords in a query with code text, but these fail to bridge the semantic gap between query and code. To address this challenge, Gu et al. proposed a deep-learning-based model named DeepCS. It jointly embeds method code and natural language descriptions into a shared vector space, where methods related to a natural language query are retrieved according to their vector similarities. However, DeepCS's working process is complicated and time-consuming. To overcome this issue, we propose a simplified model, CodeMatcher, which leverages IR techniques while retaining many of DeepCS's features. CodeMatcher combines query keywords in their original order, performs a fuzzy search on the name and body strings of methods, and returns the best-matched methods, favoring those that match a longer sequence of query keywords. We verified its effectiveness on a large-scale codebase of about 41k repositories. Experimental results show that CodeMatcher outperforms DeepCS by 97% in terms of MRR (Mean Reciprocal Rank, a widely used accuracy measure for code search) and is over 66 times faster. Compared with the state-of-the-art IR-based model CodeHow, CodeMatcher also improves MRR by 73%. We also observed that fusing the advantages of IR-based and deep-learning-based models is promising, because they complement each other by nature, and that improving the quality of method naming helps code search, since method names play an important role in connecting query and code.
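
    MRR, the accuracy measure used in these comparisons, is straightforward to compute; the sketch below uses hypothetical ranks, not the paper's data:

    def mean_reciprocal_rank(first_relevant_ranks):
        """Average the reciprocal of the 1-based rank of the first relevant
        result per query; None means no relevant result was returned."""
        return sum(0 if r is None else 1 / r
                   for r in first_relevant_ranks) / len(first_relevant_ranks)

    print(mean_reciprocal_rank([1, 3, None, 2]))  # (1 + 1/3 + 0 + 1/2) / 4 ≈ 0.458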

    Customer profiles: extracting usage models from log files

    The "Customer Profiles" project was executed under the supervision of Embedded Systems Innovation by TNO (TNO-ESI) at ASML. It was a full-time, nine-month graduation assignment in the context of the post-master program in Software Technology offered by the Eindhoven University of Technology. The project goal was to obtain insight into the actual usage of systems by analyzing log files. The project resulted in a prototype, a portable architecture, a domain analysis, and suggestions on how to improve the process of extracting customer profiles.

    The most important project artifact is the prototype, which shows the feasibility of applying process mining and resource-tracing techniques to obtain insight into the actual usage of a system by analyzing log files. The prototype supports a set of activities, such as data collection, data preprocessing, information extraction, and information aggregation, that work together to produce a customer profile model expressing the typical and atypical behavior of the participants in the production environment as captured in the log files; this model constitutes the prototype's output. The validation phase showed that the prototype output exceeded the stakeholders' expectations. ASML profited from the prototype output, and TNO-ESI will reuse the approach for different customers.

    The success of the prototype output led to a new requirement: a portable system architecture. Therefore, as part of the project, a portable system architecture supporting the extraction of customer profiles was designed. The architecture is based on the Pipes and Filters architectural pattern. The system architecture and design are the result of a broad architectural and system analysis, which balances the stakeholder requirements against common practices in software architecture and software development. As part of the architecture, components supporting different functionalities, such as Data Source, Event Parser, Event Enricher, and Event Combiner, were designed.

    A lot of domain knowledge was gained during the project and transformed into a comprehensive domain analysis. The domain analysis covers the most common aspects of applying process mining to extract customer profiles, such as mapping issues, missing information, and the minimal log data requirements. As part of the domain analysis, an evaluation of process mining algorithms was performed; it showed that the heuristics miner and the genetic miner are the most appropriate process mining algorithms for extracting customer profiles.

    To improve the process of extracting customer profiles, a list of suggestions was created. The suggestions focus on the most common problems in logging infrastructures and in process mining techniques. One suggestion is a conscious manufacturer decision on the log file content: the manufacturer should define the ratio, the context (based on the minimal log data requirements), and the scope of the logging infrastructure. Another important suggestion is to use unique identifiers across the entire logging domain. A further suggestion advocates logging at the use-case (end-user activity) level. The last, but not least, suggestion is consistent, accurate, and standardized timestamps in the logging infrastructure. During the project experiments it was observed that the maturity level of current process mining tools is not yet appropriate for industrial usage.
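
    As a rough illustration of the kind of log preprocessing such a pipeline implies (an assumed sketch with made-up log lines, not ASML's actual log format or the prototype's implementation), events can be grouped per case and counted as directly-follows pairs, the frequency relation that algorithms such as the heuristics miner build on:

    from collections import Counter

    log_lines = [  # stand-in log content: timestamp, case id, activity
        "2024-01-01T10:00:00 case=1 activity=load_wafer",
        "2024-01-01T10:00:05 case=1 activity=expose",
        "2024-01-01T10:00:09 case=1 activity=unload_wafer",
        "2024-01-01T10:00:02 case=2 activity=load_wafer",
        "2024-01-01T10:00:07 case=2 activity=expose",
    ]

    traces = {}
    for line in sorted(log_lines):  # ISO timestamps sort lexicographically
        ts, case, act = line.split()
        traces.setdefault(case.split("=")[1], []).append(act.split("=")[1])

    directly_follows = Counter(
        (a, b) for trace in traces.values() for a, b in zip(trace, trace[1:])
    )
    print(directly_follows)  # e.g. ('load_wafer', 'expose') occurs twice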

    Secure and Distributed Approach for Mining Association Rules

    Data mining is the process of extracting trends from data sources. Domain experts can use those trends to derive business intelligence. Big organizations store data on multiple servers, and the data is often horizontally distributed. Mining such databases provides useful and actionable knowledge that can support well-informed decisions. In particular, secure mining of association rules can reveal interesting information that helps enterprises make expert decisions. In this paper, we propose an algorithm and a secure mechanism for mining association rules to derive knowledge. We also incorporate auditing of data in the proposed system. We built a prototype application that demonstrates the secure mining of association rules with support and confidence, statistical measures that indicate the usefulness of the rules. The empirical results are encouraging.
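
    Support and confidence, the measures mentioned above, are easy to state; this is an illustrative sketch over made-up transactions (the paper's secure, distributed protocol is not shown):

    transactions = [  # stand-in market-basket data
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
    ]

    def support(itemset):
        """Fraction of transactions containing every item in itemset."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs):
        """Confidence of the rule lhs -> rhs: support(lhs | rhs) / support(lhs)."""
        return support(lhs | rhs) / support(lhs)

    rule = ({"bread"}, {"milk"})
    print(f"support={support(rule[0] | rule[1]):.2f}, "
          f"confidence={confidence(*rule):.2f}")  # support=0.50, confidence=0.67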

    Semantic-enhanced web-page recommender systems

    University of Technology, Sydney, Faculty of Engineering and Information Technology.

    This thesis presents a new framework for a semantic-enhanced Web-page recommender (WPR) system and a suite of enabling techniques, which include semantic network models of domain knowledge and Web usage knowledge, querying techniques, and Web-page recommendation strategies. The framework enables the system to automatically discover and construct the domain and Web usage knowledge bases, and to generate effective Web-page recommendations. The main contributions of the framework are fourfold: (1) it removes the dependence of knowledge base construction on human experts; (2) it enriches the pool of candidate Web pages for effective recommendations by using semantic knowledge of both Web pages and Web usage; (3) it resolves the inconsistency problem facing contemporary WPR systems, which heavily employ heterogeneous representations of knowledge bases, by consistently representing all knowledge bases in a formal Web ontology language, namely OWL; and (4) it can generate effective Web-page recommendations based on a set of carefully designed recommendation strategies. A prototype of the semantic-enhanced WPR system is developed and presented, and experimental comparisons with existing WPR approaches demonstrate the significantly improved performance of WPR systems based on the framework and its enabling techniques.
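
    As a rough illustration of semantic knowledge driving page recommendation (an assumed sketch, not the thesis framework: OWL knowledge bases serialize to RDF triples, queried here with the rdflib library and a hypothetical example.org vocabulary):

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")  # hypothetical vocabulary
    g = Graph()
    for page, topic in [("home_loans", "finance"), ("savings", "finance"),
                        ("tennis_news", "sport")]:
        g.add((EX[page], EX.hasTopic, EX[topic]))

    def recommend(current_page):
        """Return pages that share at least one topic with current_page."""
        related = set()
        for topic in g.objects(EX[current_page], EX.hasTopic):
            related |= set(g.subjects(EX.hasTopic, topic))
        related.discard(EX[current_page])
        return related

    print(recommend("home_loans"))  # {URIRef('http://example.org/savings')}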

    Natural Language Requirements Processing: A 4D Vision

    The future evolution of the application of natural language processing technologies in requirements engineering can be viewed along four dimensions: discipline, dynamism, domain knowledge, and datasets.