
    Classification for accuracy and insight: A weighted sum approach

    This research presents a classifier that aims to provide insight into a dataset in addition to achieving classification accuracies comparable to other algorithms. The classifier, called Automated Weighted Sum (AWSum), uses a weighted sum approach in which feature values are assigned weights that are summed and compared to a threshold in order to classify an example. Though naive, this approach is scalable, achieves accurate classifications on standard datasets and also provides a degree of insight. By insight we mean that the technique provides an appreciation of the influence each feature value has on the class values, relative to the others. AWSum's focus on the feature value space allows it to identify feature values, and combinations of feature values, that are sensitive and important for a classification. This is particularly useful in fields such as medicine, where this sort of micro-focus and understanding is critical in classification.
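    The weighted-sum scheme described above can be sketched in a few lines of Python. This is an illustrative sketch only: here the weight of each feature value is taken to be the difference between its relative frequency in the two classes, which is an assumption — the published AWSum weighting and threshold selection may differ, and all function names are hypothetical.

    ```python
    from collections import defaultdict

    def train_weights(examples, labels):
        """Assign each (feature_index, value) pair a weight in [-1, 1]: the
        difference between its relative frequency in the positive and negative
        class. (An illustrative choice; AWSum's published weighting may differ.)"""
        pos, neg = defaultdict(int), defaultdict(int)
        n_pos = sum(1 for y in labels if y == 1) or 1
        n_neg = sum(1 for y in labels if y == 0) or 1
        for x, y in zip(examples, labels):
            for i, v in enumerate(x):
                (pos if y == 1 else neg)[(i, v)] += 1
        keys = set(pos) | set(neg)
        return {k: pos[k] / n_pos - neg[k] / n_neg for k in keys}

    def classify(example, weights, threshold=0.0):
        """Sum the weights of the example's feature values and compare
        the total to a threshold, as the abstract describes."""
        score = sum(weights.get((i, v), 0.0) for i, v in enumerate(example))
        return 1 if score > threshold else 0
    ```

    Because each feature value carries an explicit weight, the same table that drives classification also shows which values push an example towards each class — the kind of insight the abstract emphasises.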

    AWSum - applying data mining in a health care scenario

    This paper investigates the application of a new data mining algorithm, Automated Weighted Sum (AWSum), to diabetes screening data, to explore its use in providing researchers with new insight into the disease and, secondarily, to explore the algorithm's potential for generating prognostic models for clinical use. Many data mining classifiers produce high levels of predictive accuracy, but their application to health research and clinical practice is limited because they are complex, produce results that are difficult to interpret, and are difficult to integrate with current knowledge and practices. This is because most focus on accuracy at the expense of informing the user about the influences that lead to their classification results. By providing this information on influences, a researcher can be pointed to potentially interesting new avenues for investigation. AWSum measures influence by calculating a weight for each feature value that represents its influence on a class value relative to other class values. The results, although based on limited data, indicated that the approach has potential uses for research and has some characteristics that may be useful in the future development of prognostic models.

    Novel data mining techniques for incomplete clinical data in diabetes management

    An important part of health care involves the upkeep and interpretation of medical databases containing patient records for clinical decision making, diagnosis and follow-up treatment. Missing clinical entries make it difficult to apply data mining algorithms for clinical decision support. This study demonstrates that higher predictive accuracy is possible with conventional data mining algorithms if missing values are dealt with appropriately. We propose a novel algorithm that uses a convolution of sub-problems to stage a super problem, in which classes are defined by the Cartesian product of the class values of the underlying problems, and Incomplete Information Dismissal and Data Completion techniques are applied to reduce features and impute missing values. Predictive accuracies using Decision Branch, Nearest Neighborhood and Naïve Bayesian classifiers were compared in predicting diabetes, cardiovascular disease and hypertension. Data are derived from the Diabetes Screening Complications Research Initiative (DiScRi) conducted at a regional Australian university, involving more than 2400 patient records with more than one hundred clinical risk factors (attributes). The results show substantial improvements in the accuracy achieved with each classifier for an effective diagnosis of diabetes, cardiovascular disease and hypertension, compared to that achieved without substituting missing values. The gain in accuracy is 7% for diabetes, 21% for cardiovascular disease and 24% for hypertension, and our integrated novel approach has resulted in more than 90% accuracy for the diagnosis of any of the three conditions. This work advances data mining research towards an integrated and holistic management of diabetes.
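    As a minimal illustration of the "Data Completion" idea, the sketch below imputes each missing entry with its column's most common observed value. This is a simple stand-in under stated assumptions: the study's actual completion techniques (and the sub-problem convolution) are considerably more elaborate, and the function name is hypothetical.

    ```python
    from collections import Counter

    def impute_mode(rows):
        """Replace None entries with the most common observed value in the
        same column — a basic form of missing-value imputation."""
        cols = list(zip(*rows))
        modes = []
        for col in cols:
            observed = [v for v in col if v is not None]
            # If a column is entirely missing, there is nothing to impute with.
            modes.append(Counter(observed).most_common(1)[0][0] if observed else None)
        return [[v if v is not None else modes[i] for i, v in enumerate(row)]
                for row in rows]
    ```

    A completed table like this can then be fed to any of the classifiers the study compares, which is the step the abstract credits with the reported accuracy gains.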

    Multivariate data-driven decision guidance for clinical scientists

    Clinical decision support is gaining widespread attention as medical institutions and governing bodies turn towards better information management for effective and efficient healthcare delivery and quality-assured outcomes. The mass of data across all stages, from disease diagnosis to palliative care, is further indication of the opportunities and challenges created for effective data management, analysis, prediction and optimization techniques as part of knowledge management in clinical environments. A Data-driven Decision Guidance Management System (DD-DGMS) architecture can combine these solutions into a single closed-loop integrated platform that empowers clinical scientists to seamlessly explore a multivariate data space in search of novel patterns and correlations to inform their research and practice. The paper describes the components of such an architecture, which includes a robust data warehouse as an infrastructure for comprehensive clinical knowledge management. The proposed DD-DGMS architecture incorporates the dynamic dimensional data model as its elemental core. Given the heterogeneous nature of clinical contexts and the corresponding data, the dimensional data model presents itself as an adaptive model that facilitates knowledge discovery, distribution and application, which is essential for clinical decision support. The paper reports on a trial of the DD-DGMS system prototype conducted on diabetes screening data, which further establishes the relevance of the proposed architecture to a clinical context.

    Generalization of cyberbullying traces

    Cyberbullying is a common problem in today's ubiquitous online communities. Automatically filtering it out of online conversations has proven a challenge, and the efforts have led to the creation of many different datasets, which are distributed as resources to train classifiers.
    However, without a consensus on the definition of cyberbullying, each of these datasets ends up documenting a different form of the behavior. This makes it difficult to compare the results of classifiers trained on different datasets, or to apply such a classifier to a different dataset. In this thesis, we use a variety of these datasets to explore the differences in their definitions of cyberbullying and the impact these differences have on the language used in the messages. We then explore the portability of a classifier trained on one dataset to another, in order to gain insight into the generalization power of classifiers trained on each of them. Finally, we study various architectures of ensemble models combining these classifiers in order to understand how they interact with each other. Our results show that by combining all datasets into a single, bigger one, we can achieve better generalization than by using an ensemble model of individual classifiers trained on each dataset.
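    The two strategies the thesis compares — a majority-vote ensemble of per-dataset classifiers versus one classifier trained on the pooled datasets — can be sketched structurally as follows. The keyword-marker "classifier" here is a deliberately toy stand-in for the real text classifiers studied, and all names are illustrative.

    ```python
    from collections import Counter

    def train_keyword_model(texts, labels):
        """Toy classifier: flag a message if it contains any word seen only in
        bullying examples. Stands in for the thesis's real text classifiers."""
        bully = {w for t, y in zip(texts, labels) if y == 1 for w in t.split()}
        clean = {w for t, y in zip(texts, labels) if y == 0 for w in t.split()}
        markers = bully - clean
        return lambda text: int(any(w in markers for w in text.split()))

    def ensemble_predict(models, text):
        """Architecture 1: majority vote over classifiers trained on
        separate datasets."""
        votes = [m(text) for m in models]
        return Counter(votes).most_common(1)[0][0]

    def pooled_model(datasets):
        """Architecture 2 (the one found to generalize better): concatenate
        all datasets and train a single classifier on the combined data."""
        texts = [t for ds in datasets for t, _ in ds]
        labels = [y for ds in datasets for _, y in ds]
        return train_keyword_model(texts, labels)
    ```

    The pooled model sees every dataset's notion of cyberbullying during training, whereas each ensemble member only ever sees one definition — a structural difference consistent with the generalization result reported above.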

    Digital writing technologies in higher education: theory, research, and practice

    This open access book serves as a comprehensive guide to digital writing technology, featuring contributions from over 20 renowned researchers from various disciplines around the world. The book is designed to provide a state-of-the-art synthesis of developments in digital writing in higher education, making it an essential resource for anyone interested in this rapidly evolving field. In the first part of the book, the authors offer an overview of the impact that digitalization has had on writing, covering more than 25 key technological innovations and their implications for writing practices and pedagogical uses. Drawing on these chapters, the second part of the book explores the theoretical underpinnings of digital writing technology, such as writing and learning, writing quality, formulation support, writing and thinking, and writing processes. The authors provide insightful analysis of the impact of these developments and offer valuable insights into the future of writing. Overall, this book provides a cohesive and consistent theoretical view of the new realities of digital writing, complementing existing literature on the digitalization of writing. It is an essential resource for scholars, educators, and practitioners interested in the intersection of technology and writing.

    The functional analysis of lexical frames: corpus-extracted linguistic data to support the teaching of English for Academic Purposes (EAP)

    Lexical Frames are discontinuous sequences of words that form a structure (frame) around variable gaps (slots) – for example, the (aim, purpose, objective) of the (GRAY; BIBER, 2013). Lexical Frames (LFs) have great pedagogical value for the production of written academic genres in different areas of expertise.
    Several studies have already been carried out on formulaic language in academic contexts (BIBER et al., 1999; HYLAND, 2008; CORTES, 2004; CORTES, 2013). Few studies, however, have focused on the functional analysis of LFs in abstracts from different specialized areas based on a model that combines principles from two major areas: Studies on Discourse Genres and Corpus Linguistics. Accordingly, this study investigates the use and distribution of the LFs used in the linguistic realization of the rhetorical functions expressed in the sections of abstracts from three areas of knowledge: (1) Computer and Information Sciences, (2) Physics and (3) Medicine and Health Sciences. In particular, the study seeks to identify and functionally analyze these formulaic blocks extracted from the corpora in a data-driven approach. To this end, three corpora of abstracts written in English, from the target areas and published in peer-reviewed journals, were compiled. Each corpus, of 1 million words, was compiled with AntCorGen (ANTHONY, 2019) and analyzed using Sketch Engine (KILGARRIFF et al., 2004) and AntConc 4.0.10 (ANTHONY, 2022). A total of 717 LFs were extracted from the three corpora studied. Of these, 159 are from the area of Computer and Information Sciences; 154 from Physics; and 404 from Medicine and Health Sciences. As for the rhetorical pattern, it was possible to verify that the sections that are conventional in the structured abstracts of the target areas are the same as those listed by Swales and Feak (2009). Observation of the sample of 150 LFs, regarding the rhetorical functions they perform in the abstracts, indicated the existence of two major categories of discontinuous multiword units: (i) transparent LFs and (ii) opaque LFs.
    Transparent LFs are units whose rhetorical function is more easily identifiable from (i) the fixed elements that constitute their structure; (ii) the words that fill the variable slots; and (iii) the contexts of occurrence. Regarding their typology, transparent LFs can be divided into two types: rhetorically transparent LFs (RTLFs) and terminologically transparent LFs (TTLFs). RTLFs linguistically perform the rhetorical functions expressed in the genres, especially the functions related to the presentation of the objectives of the work, and for this reason they are more closely linked to a rhetorical move or section. They contain lexical words that indicate rhetorical function (for example, aim, purpose, results). TTLFs, which are more closely linked to the specialized areas, perform the function of referring to terms, procedures and concepts established in the specialized domains. As for frequency of occurrence in the corpora, TTLFs are less frequent than RTLFs, requiring lower cut-off points so that they can be extracted from the corpus. They contain lexical words that indicate a link to a specialized area (for example, risk, hazard, patients). It is suggested that the data obtained in this study be used to support the teaching of English for Academic Purposes (EAP).
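    A corpus-driven extraction of frames like "the * of the" can be sketched as below. This toy version, under stated assumptions, only handles four-word frames with the variable slot in second position, whereas Gray and Biber's procedure covers other slot positions, longer frames, and corpus-wide frequency and dispersion cut-offs; the function name is illustrative.

    ```python
    from collections import defaultdict

    def extract_frames(tokens, min_freq=2):
        """Collect 4-word frames with one variable slot (e.g. 'the * of the'),
        keeping each frame's slot fillers when it occurs at least min_freq times."""
        fillers = defaultdict(list)
        for i in range(len(tokens) - 3):
            w1, w2, w3, w4 = tokens[i:i + 4]
            # Treat position 2 as the variable slot; record the word filling it.
            fillers[(w1, '*', w3, w4)].append(w2)
        return {frame: words for frame, words in fillers.items()
                if len(words) >= min_freq}
    ```

    Lowering `min_freq` corresponds to the lower cut-off points the study needed in order to surface the rarer terminologically transparent frames.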

    AWSum - Data mining for insight

    Many classifiers achieve high levels of accuracy but have limited use in real-world problems because they provide little insight into data sets, are difficult to interpret and require expertise to use. In areas such as health informatics, analysts require not only accurate classifications but also some insight into the influences on the classification, which can then be used to direct research and formulate interventions. This research investigates the practical applications of Automated Weighted Sum (AWSum), a classifier that gives accuracy comparable to other techniques whilst providing insight into the data. AWSum achieves this by calculating a weight for each feature value that represents its influence on the class value. The merits of AWSum in classification and insight are tested on a Cystic Fibrosis dataset with positive results. © 2008 Springer-Verlag Berlin Heidelberg

    A Classification Algorithm that Derives Weighted Sum Scores for Insight into Disease

    Data mining is often performed with datasets associated with diseases in order to increase the insight that can ultimately lead to improved prevention or treatment. Classification algorithms can achieve high levels of predictive accuracy but have limited application for facilitating the insight that leads to a deeper understanding of aspects of a disease. This is because the representation of knowledge that arises from classification algorithms is too opaque, too complex or too sparse to facilitate insight. Clustering, association and visualisation approaches give clinicians greater scope to engage in a way that leads to insight; however, predictive accuracy is compromised or non-existent. This research investigates the practical applications of Automated Weighted Sum (AWSum), a classification algorithm that provides accuracy comparable to other techniques whilst providing some insight into the data. This is achieved by calculating a weight for each feature value that represents its influence on the class value. Clinicians are very familiar with weighted-sum scoring scales, so the internal representation is intuitive and easily understood. This paper presents results from the use of the AWSum approach with data from patients suffering from Cystic Fibrosis.