
    On the role of pre and post-processing in environmental data mining

    The quality of discovered knowledge depends heavily on data quality. Unfortunately, real data tend to contain noise, uncertainty, errors, redundancies, and even irrelevant information. The more complex the reality to be analyzed, the higher the risk of obtaining low-quality data. Knowledge Discovery from Databases (KDD) offers a global framework for preparing data in the right form to perform correct analyses. On the other hand, the quality of decisions taken upon KDD results depends not only on the quality of the results themselves but also on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex, and environmental users particularly require clarity in their results. This paper provides some details on how this can be achieved and discusses the role of pre- and post-processing in the whole process of knowledge discovery in environmental systems.
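    To make the pre-processing step concrete, here is a minimal sketch (not from the paper) of typical cleaning operations on an environmental dataset, using pandas; the column names and value ranges are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical environmental readings; column names are illustrative only.
df = pd.DataFrame({
    "station": ["A", "A", "B", "B", "B"],
    "temperature": [21.3, 21.3, np.nan, 19.8, 540.0],  # duplicate, gap, outlier
})

df = df.drop_duplicates()                          # remove redundant records
df["temperature"] = df["temperature"].fillna(      # impute missing values
    df["temperature"].median())

# Clip implausible readings to a physically plausible range (assumed bounds).
df["temperature"] = df["temperature"].clip(lower=-60, upper=60)
print(df)
```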

    Automated assessment of non-native learner essays: Investigating the role of linguistic features

    Automatic essay scoring (AES) refers to the process of scoring free-text responses to given prompts, taking human grader scores as the gold standard. Writing such essays is an essential component of many language and aptitude exams. Hence, AES has become an active and established area of research, and many proprietary systems are used in real-life applications today. However, not much is known about which specific linguistic features are useful for prediction and how much of this is consistent across datasets. This article addresses that by exploring the role of various linguistic features in automatic essay scoring using two publicly available datasets of non-native English essays written in test-taking scenarios. The linguistic properties are modeled by encoding lexical, syntactic, discourse and error types of learner language in the feature set. Predictive models are then developed using these features on both datasets and the most predictive features are compared. While the results show that the feature set used yields good predictive models with both datasets, the question "what are the most predictive features?" has a different answer for each dataset.
    (Accepted for publication in the International Journal of Artificial Intelligence in Education (IJAIED), to appear in early 2017; journal URL: http://www.springer.com/computer/ai/journal/40593)
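    As an illustration of the feature-based approach (a generic sketch, not the authors' system), the following scores essays from a few simple lexical features with scikit-learn; the features, data, and model choice are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge

def lexical_features(essay: str) -> list[float]:
    """Toy lexical features: length, mean word length, type-token ratio."""
    words = essay.lower().split()
    if not words:
        return [0.0, 0.0, 0.0]
    return [
        float(len(words)),                        # essay length
        sum(len(w) for w in words) / len(words),  # mean word length
        len(set(words)) / len(words),             # type-token ratio
    ]

# Hypothetical training data: essays with human-assigned scores.
essays = ["The cat sat on the mat.", "Globalization has reshaped economies ..."]
scores = [2.0, 4.5]

X = np.array([lexical_features(e) for e in essays])
model = Ridge(alpha=1.0).fit(X, scores)
print(model.predict([lexical_features("A new essay to grade.")]))
```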

    Mining developer communication data streams

    This paper explores the concept of modelling a software development project as a process that produces a continuous stream of data. In terms of the Jazz repository used in this research, one aspect of that stream is developer communication. Such data can be used to create an evolving social network characterized by a range of metrics. This paper presents the application of data stream mining techniques to identify the metrics that are most useful for predicting build outcomes. Results are presented from applying the Hoeffding Tree classification method in conjunction with the Adaptive Sliding Window (ADWIN) method for detecting concept drift. The results indicate that only a small number of the available metrics have any significance for predicting the outcome of a build.
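    A minimal sketch of this stream-mining setup using the open-source `river` library (illustrative only; it is not the tooling named in the paper, and the API varies slightly across versions). The metric names and stream records are hypothetical:

```python
from river import drift, tree

model = tree.HoeffdingTreeClassifier()  # incremental (Hoeffding) decision tree
adwin = drift.ADWIN()                   # ADaptive WINdowing drift detector

# Hypothetical stream: per-build social-network metrics and the build outcome.
stream = [
    ({"degree": 3.0, "betweenness": 0.1}, True),
    ({"degree": 1.0, "betweenness": 0.9}, False),
    # ... one record per build, arriving over time
]

for x, y in stream:
    y_pred = model.predict_one(x)    # prequential ("test-then-train") setup
    model.learn_one(x, y)
    adwin.update(int(y_pred == y))   # feed the accuracy signal to ADWIN
    if adwin.drift_detected:         # concept drift: start a fresh tree
        model = tree.HoeffdingTreeClassifier()
```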

    Data analytics and algorithms in policing in England and Wales: Towards a new policy framework

    RUSI was commissioned by the Centre for Data Ethics and Innovation (CDEI) to conduct an independent study into the use of data analytics by police forces in England and Wales, with a focus on algorithmic bias. The primary purpose of the project is to inform CDEI’s review of bias in algorithmic decision-making, which is focusing on four sectors, including policing, and working towards a draft framework for the ethical development and deployment of data analytics tools for policing.

    This paper focuses on advanced algorithms used by the police to derive insights, inform operational decision-making or make predictions. Biometric technology, including live facial recognition, DNA analysis and fingerprint matching, is outside the direct scope of this study, as are covert surveillance capabilities and digital forensics technology, such as mobile phone data extraction and computer forensics. However, because many of the policy issues discussed in this paper stem from general underlying data protection and human rights frameworks, these issues will also be relevant to other police technologies, and their use must be considered in parallel to the tools examined in this paper.

    The project involved engaging closely with senior police officers, government officials, academics, legal experts, regulatory and oversight bodies and civil society organisations. Sixty-nine participants took part in the research in the form of semi-structured interviews, focus groups and roundtable discussions. The project has revealed widespread concern across the UK law enforcement community regarding the lack of official national guidance for the use of algorithms in policing, with respondents suggesting that this gap should be addressed as a matter of urgency.

    Any future policy framework should be principles-based and complement existing police guidance in a ‘tech-agnostic’ way. Rather than establishing prescriptive rules and standards for different data technologies, the framework should establish standardised processes to ensure that data analytics projects follow recommended routes for the empirical evaluation of algorithms within their operational context and are evaluated against legal requirements and ethical standards. The new guidance should focus on ensuring multi-disciplinary legal, ethical and operational input from the outset of a police technology project; a standard process for model development, testing and evaluation; a clear focus on human–machine interaction and the ultimate interventions a data-driven process may inform; and ongoing tracking and mitigation of discrimination risk.

    Investigating effort prediction of web-based applications using CBR on the ISBSG dataset

    As web-based applications become more popular and more sophisticated, so too does the need for early, accurate estimates of the effort required to build such systems. Case-based reasoning (CBR) has been shown to be a reasonably effective estimation strategy, although it has not been widely explored in the context of web applications. This paper reports on a study carried out on a subset of the ISBSG dataset to examine the optimal number of analogies that should be used in making a prediction. The results show that it is not possible to select such a value with confidence and that, in common with findings in other domains, the effectiveness of CBR is hampered by other factors, including the characteristics of the underlying dataset (such as the spread of data and the presence of outliers) and the calculation employed to evaluate the distance function (in particular, the treatment of numeric and categorical data).
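    As an illustration of the CBR estimation strategy (a generic sketch, not the paper's exact procedure), the following predicts effort as the mean of the k nearest analogies, using one common convention for mixing numeric and categorical features in the distance function; all case data are made up:

```python
import numpy as np

def distance(a, b, numeric_keys, categorical_keys, ranges):
    """Mixed distance: range-normalized absolute difference for numeric
    features, 0/1 mismatch for categorical ones (one common convention)."""
    d = 0.0
    for k in numeric_keys:
        d += abs(a[k] - b[k]) / ranges[k]
    for k in categorical_keys:
        d += 0.0 if a[k] == b[k] else 1.0
    return d

def cbr_estimate(target, cases, efforts, k=3, **dist_kwargs):
    """Predict effort as the mean of the k closest historical cases."""
    dists = [distance(target, c, **dist_kwargs) for c in cases]
    nearest = np.argsort(dists)[:k]
    return float(np.mean([efforts[i] for i in nearest]))

# Hypothetical project cases: size in web pages, team experience, platform.
cases = [
    {"pages": 20, "experience": 3, "platform": "java"},
    {"pages": 35, "experience": 1, "platform": "php"},
    {"pages": 18, "experience": 4, "platform": "java"},
]
efforts = [120.0, 310.0, 95.0]  # person-hours (made up)
ranges = {"pages": 50.0, "experience": 5.0}

print(cbr_estimate({"pages": 22, "experience": 3, "platform": "java"},
                   cases, efforts, k=2,
                   numeric_keys=["pages", "experience"],
                   categorical_keys=["platform"], ranges=ranges))
```

    The question the paper studies, how large k should be, corresponds to the `k` parameter above; its finding is that no single value can be chosen with confidence.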

    A System for Accessible Artificial Intelligence

    While artificial intelligence (AI) has become widespread, many commercial AI systems are not yet accessible to individual researchers or the general public because of the deep knowledge of the systems required to use them. We believe that AI has matured to the point where it should be an accessible technology for everyone. We present an ongoing project whose ultimate goal is to deliver an open-source, user-friendly AI system that is specialized for machine learning analysis of complex data in the biomedical and health care domains. We discuss how genetic programming can aid in this endeavor, and highlight specific examples where genetic programming has automated machine learning analyses in previous projects.
    (14 pages, 5 figures; submitted to the Genetic Programming Theory and Practice 2017 workshop)
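    One widely known open-source example of genetic programming automating machine-learning analysis is TPOT, which comes from the same research community; whether it is among the "previous projects" meant here is an assumption. A minimal usage sketch (the API may differ across TPOT versions):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Genetic programming evolves a pipeline of preprocessors and estimators.
tpot = TPOTClassifier(generations=5, population_size=20,
                      random_state=42, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

# Export the best evolved pipeline as a standalone Python script.
tpot.export("best_pipeline.py")
```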