7 research outputs found

    SOME APPROACHES TO TEXT MINING AND THEIR POTENTIAL FOR SEMANTIC WEB APPLICATIONS

    In this paper we describe some approaches to text mining that are supported by JBowl, an original software system developed in Java for information retrieval and text mining, as well as its possible use in a distributed environment. JBowl is being developed as open source software with the intention of providing an easily extensible, modular framework for pre-processing, indexing and further exploration of large text collections. The overall architecture of the system is described, followed by some typical use case scenarios that have been used in previous projects. Then, the basic principles and technologies used for service-oriented computing, web services and semantic web services are presented. We further discuss how the JBowl system can be adapted to a distributed environment using technologies that are already available, and what benefits such an adaptation can bring. This is particularly important in the context of a new integrated EU-funded project, KP-Lab (Knowledge Practices Laboratory), which is briefly presented together with the role of the proposed text mining services that are currently being designed and developed there.

    Exploiting the architectural characteristics of software components to improve software reuse

    PhD Thesis. Software development is a costly process for all but the most trivial systems. One of the commonly known ways of minimizing development costs is to re-use previously built software components. However, a significant problem that source-code re-users encounter is the difficulty of finding components that not only provide the functionality they need but also conform to the architecture of the system they are building. To facilitate finding reusable components there is a need to establish an appropriate mechanism for matching the key architectural characteristics of the available source-code components against the characteristics of the system being built. This research develops a precise characterization of the architectural characteristics of source-code components, and investigates a new way to describe how appropriate components for re-use can be identified and categorized. Umm Al-Qura University

    From text mining to knowledge mining: An integrated framework of concept extraction and categorization for domain ontology

    Organizations struggle with challenges arising from a regulatory, social and economic environment that is complex and continuously changing. These challenges increase the demand for managing organizational knowledge, for example how to provide employees with the necessary job-specific knowledge at the right time and in the right format. Employees have to update their knowledge and improve their competencies continuously. Knowledge repositories play a key role in knowledge management because they primarily contain the organizations' intellectual assets (explicit knowledge), while employees hold tacit knowledge, which is difficult to extract and codify. Business processes are also important for the management of organizational knowledge, as they contain both explicit and tacit knowledge elements. One of the key questions is how to handle this hidden knowledge in order to improve organizational knowledge, especially employees' knowledge, by providing the most appropriate learning and/or training materials, and how to ensure that the knowledge in business processes is the same as that in knowledge repositories and in employees' heads. These are the major themes of this thesis.

    New Fundamental Technologies in Data Mining

    The progress of data mining technology and its wide public popularity establish a need for a comprehensive text on the subject. The series of books entitled "Data Mining" addresses this need by presenting in-depth descriptions of novel mining algorithms and many useful applications. In addition to helping readers understand each section deeply, the two books present useful hints and strategies for solving problems in the following chapters. The contributing authors have highlighted many future research directions that will foster multi-disciplinary collaborations and hence will lead to significant development in the field of data mining.

    A Corpus Driven Computational Intelligence Framework for Deception Detection in Financial Text

    Financial fraud rampages onwards seemingly uncontained. The annual cost of fraud in the UK is estimated to be as high as £193bn a year [1]. From a data science perspective, and hitherto less explored, this thesis demonstrates how the use of linguistic features to drive data mining algorithms can aid in unravelling fraud. To this end, the spotlight is turned on Financial Statement Fraud (FSF), known to be the costliest type of fraud [2]. A new corpus of 6.3 million words is composed of 102 annual reports/10-K (narrative sections) from firms formally indicted for FSF juxtaposed with 306 non-fraud firms of similar size and industrial grouping. Unlike other similar studies, this thesis takes a wide-angled view and extracts a range of features of different categories from the corpus. These linguistic correlates of deception are uncovered using a variety of techniques and tools. Corpus linguistics methodology is applied to extract keywords and to examine linguistic structure. N-grams are extracted to draw out collocations. Readability measurement in financial text is advanced through the extraction of new indices that probe the text at a deeper level. Cognitive and perceptual processes are also picked out. Tone, intention and liquidity are gauged using customised word lists. Linguistic ratios are derived from grammatical constructs and word categories. An attempt is also made to determine 'what' was said as opposed to 'how'. Further, a new module is developed to condense synonyms into concepts. Lastly, frequency counts from keywords unearthed from a previous content analysis study on financial narrative are also used. These features are then used to drive machine learning based classification and clustering algorithms to determine if they aid in discriminating a fraud from a non-fraud firm. The results derived from the battery of models built typically exceed a classification accuracy of 70%. The above process is amalgamated into a framework.
The process outlined, driven by empirical data, demonstrates in a practical way how linguistic analysis could aid in fraud detection and also constitutes a unique contribution to deception detection studies.

    A multi-factor model for range estimation in electric vehicles

    Electric vehicles (EVs) are well-known for their challenges related to trip planning and energy consumption estimation. Range anxiety is currently a barrier to the adoption of EVs. One of the issues influencing range anxiety is the inaccuracy of the remaining driving range (RDR) estimate in on-board displays. RDR displays are important as they can help drivers with trip planning. The RDR is a parameter that changes under environmental and behavioural conditions. Several factors (for example, weather, and traffic) can influence the energy consumption of an EV that are not considered during the RDR estimation in traditional on-board computers or third-party applications, such as navigation or mapping applications. The need for accurate RDR estimation is growing, since this can reduce the range anxiety of drivers. One way of overcoming range anxiety is to provide trip planning applications that provide accurate estimations of the RDR, based on various factors, and which adapt to the users' driving behaviour. Existing models used for estimating the RDR are often simplified, and do not consider all the factors that can influence it. Collecting data for each factor also presents several challenges. Powerful computing resources are required to collect, transform, and analyse the disparate datasets that are required for each factor. The aim of this research was to design a Multi-factor Model for range estimation in EVs. Five main factors that influence the energy consumption of EVs were identified from literature, namely, Route and Terrain, Driving Behaviour, Weather and Environment, Vehicle Modelling, and Battery Modelling. These factors were used throughout this research to guide the data collection and analysis processes. A Multi-factor Model was proposed based on four main components that collect, process, analyse, and visualise data from available data sources to produce estimates relating to trip planning.
A proof-of-concept RDR system was developed and evaluated in field experiments, to demonstrate that the Multi-factor Model addresses the main aim of this research. The experiments were performed to collect data for each of the five factors, and to analyse their impact on energy consumption. Several machine learning techniques were used, and evaluated, for accuracy in estimating the energy consumption, from which the RDR can be derived, for a specified trip. A case study was conducted with an electric mobility programme (uYilo) in Port Elizabeth, South Africa (SA). The case study was used to investigate whether the available resources at uYilo were sufficient to provide data for each of the five factors. Several challenges were noted during the data collection. These were shortages of software applications, a lack of quality data, technical interoperability and data access between the data collection instruments and systems. Data access was a problem in some cases, since proprietary systems restrict access to external developers. The theoretical contribution of this research is a list of factors that influence RDR and a classification of machine learning techniques that can be used to estimate the RDR. The practical contributions of this research include a database of EV trips, proof-of-concept RDR estimation system, and a deployed machine learning model that can be accessed by researchers and EV practitioners. Four research papers were published and presented at local and international conferences. In addition, one conference paper was published in an accredited journal: NextComp 2017 (Appendix C), Conference Paper, Pointe aux Piments (Mauritius); SATNAC 2017 (Appendix F), Conference Paper, Barcelona (Spain); GITMA 2018 (Appendix B), Conference Paper, Mexico City (Mexico); SATNAC 2018 (Appendix G), Conference Paper, George (South Africa), and IFIP World Computer Congress 2018 (Appendix E), Journal Article