
    State of the art in selection of variables and functional forms in multivariable analysis-outstanding issues

    Background: How to select variables and identify functional forms for continuous variables is a key concern when creating a multivariable model. Ad hoc ‘traditional’ approaches to variable selection have been in use for at least 50 years. Similarly, methods for determining functional forms for continuous variables were first suggested many years ago. More recently, many alternative approaches to address these two challenges have been proposed, but knowledge of their properties and meaningful comparisons between them are scarce. Before a state of the art can be defined and evidence-supported guidance provided to researchers who have only a basic level of statistical knowledge, many outstanding issues in multivariable modelling remain to be addressed. Our main aims are to identify and illustrate such gaps in the literature and present them at a moderate technical level to the wide community of practitioners, researchers and students of statistics. Methods: We briefly discuss general issues in building descriptive regression models, strategies for variable selection, different ways of choosing functional forms for continuous variables and methods for combining the selection of variables and functions. We discuss two examples, taken from the medical literature, to illustrate problems in the practice of modelling. Results: Our overview revealed that there is not yet enough evidence on which to base recommendations for the selection of variables and functional forms in multivariable analysis. Such evidence may come from comparisons between alternative methods. In particular, we highlight seven important topics that require further investigation and make suggestions for the direction of further research. Conclusions: Selection of variables and of functional forms are important topics in multivariable analysis. To define a state of the art and to provide evidence-supported guidance to researchers who have only a basic level of statistical knowledge, further comparative research is required.
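
    As a concrete illustration of the kind of ‘traditional’ variable-selection strategy the abstract refers to, the sketch below runs backward elimination driven by AIC on synthetic data. The variable names, the data and the use of AIC as the criterion are assumptions made for this example only; the paper compares many approaches and does not prescribe this one.

```python
# A minimal sketch of backward elimination driven by AIC, on synthetic data
# with invented variable names; an illustration of one traditional strategy,
# not the procedure recommended or evaluated in the paper.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "age":    rng.normal(50, 10, n),
    "bmi":    rng.normal(27, 4, n),
    "noise1": rng.normal(size=n),
    "noise2": rng.normal(size=n),
})
y = 0.05 * X["age"] + 0.10 * X["bmi"] + rng.normal(size=n)  # two real effects

def backward_eliminate(X, y):
    """Remove one variable per pass as long as doing so lowers the model AIC."""
    selected = list(X.columns)
    best_aic = sm.OLS(y, sm.add_constant(X[selected])).fit().aic
    while len(selected) > 1:
        # AIC of every model that drops exactly one of the current variables
        trials = {
            v: sm.OLS(y, sm.add_constant(X[[c for c in selected if c != v]])).fit().aic
            for v in selected
        }
        drop, aic = min(trials.items(), key=lambda kv: kv[1])
        if aic >= best_aic:        # no single removal improves AIC, so stop
            break
        best_aic = aic
        selected.remove(drop)
    return selected

print(backward_eliminate(X, y))    # typically keeps 'age' and 'bmi'
```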

    Statistical modelling of clickstream behaviour to inform real-time advertising decisions

    Online user browsing generates vast quantities of typically unexploited data. Investigating this data and uncovering the valuable information it contains can be of substantial value to online businesses, and statistics plays a key role in this process. The data takes the form of an anonymous digital footprint associated with each unique visitor, resulting in 10^6 unique profiles across 10^7 individual page visits on a daily basis. Exploring, cleaning and transforming data of this scale and high dimensionality (2TB+ of memory) is particularly challenging, and requires cluster computing. We outline a variable selection method to summarise clickstream behaviour with a single value, and make comparisons to other dimension reduction techniques. We illustrate how to apply generalised linear models and zero-inflated models to predict sponsored search advert clicks based on keywords. We consider the problem of predicting customer purchases (known as conversions) from the customer’s journey or clickstream, which is the sequence of pages seen during a single visit to a website. We consider each page as a discrete state with probabilities of transitions between the pages, providing the basis for a simple Markov model. Further, hidden Markov models (HMMs) are applied to relate the observed clickstream to a sequence of hidden states, uncovering meta-states of user activity. We can also apply conventional logistic regression to model conversions in terms of summaries of the profile’s browsing behaviour, and we incorporate all of these into a set of tools that can address a wide range of conversion types and allow the predictive capability of each model to be compared directly. In real time, predicting which profiles are likely to follow behaviour patterns similar to those of known conversions has a critical impact on targeted advertising. We illustrate these analyses with results from real data collected by an Audience Management Platform (AMP), Carbon.
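
    To make the simple Markov model concrete, here is a minimal sketch that estimates page-to-page transition probabilities from toy clickstreams and uses them to score the chance of reaching a conversion page. The page names, sessions and the 'checkout' conversion state are invented for illustration; the Carbon/AMP data and the thesis's actual models are not reproduced here.

```python
# Minimal sketch, with invented page names and toy sessions: treat each page
# as a discrete state and estimate first-order transition probabilities from
# observed clickstreams, the basis of the simple Markov model described above.
from collections import defaultdict

sessions = [                       # each list is one visitor's clickstream
    ["home", "search", "product", "basket", "checkout"],
    ["home", "product", "home", "exit"],
    ["search", "product", "basket", "exit"],
]

counts = defaultdict(lambda: defaultdict(int))
for clickstream in sessions:
    for src, dst in zip(clickstream, clickstream[1:]):
        counts[src][dst] += 1      # count observed page-to-page transitions

# Normalise counts into transition probabilities P(next page | current page)
transition = {
    src: {dst: c / sum(dsts.values()) for dst, c in dsts.items()}
    for src, dsts in counts.items()
}

def p_reach(state, target, probs, depth=10):
    """Probability of reaching `target` within `depth` steps (naive recursion)."""
    if state == target:
        return 1.0
    if depth == 0 or state not in probs:
        return 0.0
    return sum(p * p_reach(nxt, target, probs, depth - 1)
               for nxt, p in probs[state].items())

print(p_reach("home", "checkout", transition))  # chance a 'home' entry converts
```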

    A modular architecture for systematic text categorisation

    This work examines and attempts to overcome issues caused by the lack of formal standardisation when defining text categorisation techniques and detailing how they might be appropriately integrated with each other. Despite text categorisation’s long history, the concept of automation is relatively new, coinciding with the evolution of computing technology and the subsequent increase in the quantity and availability of electronic textual data. Nevertheless, insufficient descriptions of the diverse algorithms discovered have led to an acknowledged ambiguity when trying to accurately replicate methods, which has made reliable comparative evaluations impossible. Existing interpretations of general data mining and text categorisation methodologies are analysed in the first half of the thesis and common elements are extracted to create a distinct set of significant stages. Their possible interactions are logically determined and a unique universal architecture is generated that encapsulates all complexities and highlights the critical components. A variety of text-related algorithms are also comprehensively surveyed and grouped according to the stage they belong to, in order to demonstrate how they can be mapped. The second part reviews several open-source data mining applications, placing an emphasis on their ability to handle the proposed architecture, their potential for expansion and their text processing capabilities. Finding these inflexible and too elaborate to be readily adapted, designs for a novel framework are introduced that focus on rapid prototyping through lightweight customisations and reusable atomic components. As a consequence of the inadequacies of existing options, a rudimentary implementation is realised along with a selection of text categorisation modules. Finally, a series of experiments is conducted that validates the feasibility of the outlined methodology and the importance of its composition, whilst also establishing the practicality of the framework for research purposes. The simplicity of the experiments and the results gathered clearly indicate the potential benefits that can be gained when a formalised approach is utilised.
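
    The sketch below illustrates, in general terms, the idea of composing reusable atomic stages into a categorisation pipeline. The stage names, the stopword list and the one-keyword classifier are invented for this example and are not the components of the thesis's framework.

```python
# Hedged illustration of composing reusable atomic components: each text
# categorisation stage is an interchangeable callable, and a pipeline is just
# their composition. All names here are invented for the sketch.
from typing import Callable, List

Stage = Callable[[object], object]

def tokenise(text: str) -> List[str]:
    return text.lower().split()

def remove_stopwords(tokens: List[str]) -> List[str]:
    stopwords = {"the", "a", "of", "and"}
    return [t for t in tokens if t not in stopwords]

def keyword_classifier(tokens: List[str]) -> str:
    # Toy final stage: categorise by the presence of a single keyword.
    return "sport" if "football" in tokens else "other"

def pipeline(*stages: Stage) -> Stage:
    """Compose stages left to right into a single categorisation function."""
    def run(document):
        for stage in stages:
            document = stage(document)
        return document
    return run

categorise = pipeline(tokenise, remove_stopwords, keyword_classifier)
print(categorise("The rules of football and the offside trap"))  # -> 'sport'
```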

    An improved text classification modelling approach to identify security messages in heterogeneous projects

    Security remains under-addressed in many organisations, as illustrated by the number of large-scale software security breaches. Preventing breaches can begin during software development if attention is paid to security during the software’s design and implementation. One approach to security assurance during software development is to examine communications between developers as a means of studying the security concerns of the project. Prior research has investigated models for classifying project communication messages (e.g., issues or commits) as security related or not. A known problem is that these models are project-specific, limiting their use by other projects or organisations. We investigate whether we can build a generic classification model that can generalise across projects. We define a set of security keywords by extracting them from relevant security sources, dividing them into four categories: asset, attack/threat, control/mitigation, and implicit. Using different combinations of these categories and including them in the training dataset, we built a classification model and evaluated it on industrial, open-source, and research-based datasets containing over 45 different products. Our model based on harvested security keywords as a feature set shows average recall from 55 to 86%, minimum recall from 43 to 71%, maximum recall from 60 to 100%, an average f-score between 3.4 and 88%, an average g-measure of at least 66% across all the datasets, and an average AUC of ROC from 69 to 89%. In addition, models that use externally sourced features outperformed models that use project-specific features on average by a margin of 26–44% in recall, 22–50% in g-measure, 0.4–28% in f-score, and 15–19% in AUC of ROC. Further, our results outperform a state-of-the-art prediction model for security bug reports in all cases. Using sound statistical and effect size tests, we find that (1) using harvested security keywords as features to train a text classification model improves classification models and helps them generalise to other projects significantly; (2) including these features in the training dataset before model construction improves classification models significantly; and (3) different security categories are predictors for different projects. Finally, we introduce new and promising approaches to construct models that can generalise across different independent projects.
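
    The core idea, restricting the feature space to a harvested security-keyword vocabulary so the classifier does not depend on project-specific vocabulary, can be sketched as follows. The keyword list, messages and labels are invented for illustration; the paper's datasets, keyword categories and evaluation procedure are not reproduced.

```python
# Hedged sketch: restrict features to a harvested security-keyword vocabulary
# and train a standard text classifier on project messages. All data below is
# invented for the example.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

security_keywords = [            # toy stand-in for the harvested keyword set
    "password", "token", "exploit", "injection", "encrypt", "overflow",
]
messages = [
    "fix sql injection in login form",
    "update readme with build instructions",
    "rotate leaked api token and encrypt secrets",
    "improve button alignment on settings page",
]
labels = [1, 0, 1, 0]            # 1 = security related, 0 = not

# Only the harvested keywords become features, so the resulting model does not
# depend on project-specific vocabulary and can be applied to unseen projects.
vectoriser = CountVectorizer(vocabulary=security_keywords)
X = vectoriser.fit_transform(messages)

model = LogisticRegression().fit(X, labels)
print("recall on the toy data:", recall_score(labels, model.predict(X)))
```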

    Multidimensional Epidemiological Transformations: Addressing Location-Privacy in Public Health Practice

    The following publications arose directly from this research:
    AbdelMalik P, Boulos MNK: Multidimensional point transform for public health practice. Methods of Information in Medicine. (In press; ePub ahead of print available online) http://dx.doi.org/10.3414/ME11-01-0001
    AbdelMalik P, Boulos MNK, Jones R: The Perceived Impact of Location Privacy: A web-based survey of public health perspectives and requirements in the UK and Canada. BMC Public Health, 8:156 (2008) http://www.biomedcentral.com/1471-2458/8/156
    The following papers were co-authored in relation to this research:
    Khaled El Emam, Ann Brown, Philip AbdelMalik, Angelica Neisa, Mark Walker, Jim Bottomley, Tyson Roffey: A method for managing re-identification risk from small geographic areas in Canada. BMC Medical Informatics and Decision Making. 10:18 (2010) http://www.biomedcentral.com/1472-6947/10/18
    Maged N. Kamel Boulos, Andrew J. Curtis, Philip AbdelMalik: Musings on privacy issues in health research involving disaggregate geographic data about individuals. International Journal of Health Geographics. 8:46 (2009) http://www.ij-healthgeographics.com/content/pdf/1476-072X-8-46.pdf
    Khaled El Emam, Ann Brown, Philip AbdelMalik: Evaluating predictors of geographic area population size cut-offs to manage re-identification risk. Journal of the American Medical Informatics Association, 16:256-266 (2009)

    The ability to control one’s own personally identifiable information is a worthwhile human right that is becoming increasingly vulnerable. However just as significant, if not more so, is the right to health. With increasing globalisation and threats of natural disasters and acts of terrorism, this right is also becoming increasingly vulnerable. Public health practice – which is charged with the protection, promotion and mitigation of the health of society and its individuals – has been at odds with the right to privacy. This is particularly significant when location privacy is under consideration. Spatial information is an important aspect of public health, yet the increasing availability of spatial imagery and location-sensitive applications and technologies has brought location privacy to the forefront, threatening to negatively impact the practice of public health by inhibiting or severely limiting data-sharing. This study begins by reviewing the current relevant legislation as it pertains to public health and investigates the public health community’s perceptions of location-privacy barriers to the practice. Bureaucracy and legislation are identified by survey participants as the two greatest privacy-related barriers to public health. In response to this clash, a number of solutions and workarounds are proposed in the literature to compensate for location privacy. However, as their weaknesses are outlined, a novel approach - the multidimensional point transform - that works synergistically on multiple dimensions, including location, to anonymise data is developed and demonstrated. Finally, a framework for guiding decisions on data-sharing and identifying requirements is proposed and a sample implementation is demonstrated through a fictitious scenario. For each aspect of the study, a tool prototype and/or design for implementation is proposed and explained, and the need for further development of these is highlighted. In summary, this study provides a multi-disciplinary and multidimensional solution to the clash between privacy and data-sharing in public health practice. Partially sponsored by the Public Health Agency of Canada.

    The state of research on folksonomies in the field of Library and Information Science : a Systematic Literature Review

    Purpose – The purpose of this thesis is to provide an overview of all relevant peer-reviewed articles on folksonomies, social tagging and social bookmarking as knowledge organisation systems within the field of Library and Information Science, by reviewing the current state of research on these systems of managing knowledge. Method – I use the systematic literature review method in order to systematically and transparently review and synthesise data extracted from 39 articles found through the discovery system LUBsearch, in order to find out which methods, theories and systems are represented and to what degree, which subfields can be distinguished, how present research within these subfields is, and which larger conclusions can be drawn from research conducted between 2003 and 2013 on folksonomies. Findings – Many of the studies are exploratory or take the form of literature reviews and discussions; other frequently used methods are questionnaires and surveys, although these are often combined with other methods. Furthermore, of the 39 studies, 22 were quantitative, 15 were qualitative and 2 used mixed methods. I also found that an underwhelmingly small number of theories was used explicitly: merely 11 articles explicitly used theories, and only one theory was used twice. No key authors on the topic were identified, though Knowledge Organization, Information Processing & Management and the Journal of the American Society for Information Science and Technology were recognised as key journals for research on folksonomies. There have been plenty of studies on how tags and folksonomies have affected other knowledge organisation systems, or on how pre-existing systems have been used to create new ones. Other well-represented subfields include studies on the quality or characteristics of tags or text, and studies aiming to improve folksonomies, search methods or tags. Value – I provide an overview of what has been researched and where the focus of said research has been during the last decade, present suggestions for future research, and identify possible dangers to be wary of, which I argue will benefit folksonomies and knowledge organisation as a whole.

    Identification, Categorisation and Forecasting of Court Decisions

    Masha Medvedeva’s PhD dissertation ‘Identification, Categorisation and Forecasting of Court Decisions’ focuses on the automatic prediction and analysis of judicial decisions. In her thesis she discusses her work on forecasting, categorising and analysing outcomes of the European Court of Human Rights (ECtHR) and case law across Dutch national courts. Her dissertation demonstrates the potential of such research, but also highlights its limitations, identifies the challenges of working with legal data, and attempts to establish a more standard way of conducting research in the automatic prediction of judicial decisions. Medvedeva provides an analysis of the systems for predicting court decisions available today, and finds that the majority of them are unable to forecast future decisions of the court while claiming to be able to do so. In response she provides an online platform, JURI Says, which was developed during her PhD and is available at jurisays.com. The system forecasts decisions of the ECtHR based on information available many years before the verdict is made, and is thus able to predict court decisions that have not yet been made, which is a novelty in the field. In her dissertation Medvedeva argues against ‘robo-judges’ and replacing judges with algorithms, discusses how predicting decisions and making decisions are very different processes, and shows how automated systems are very vulnerable to abuse.

    Trust and Reputation for Successful Software Self-Organisation

    An increasing number of dynamic software evolution approaches are commonly based on integrating or utilising new pieces of software. This requires the resolution of issues such as ensuring awareness of newly available software pieces and selecting the most appropriate software pieces to use. Other chapters in this book discuss dynamic software evolution focusing primarily on awareness, integration and utilisation of new software pieces, paying less attention to how the selection among different software pieces is made. The selection issue is quite important since, in the increasingly dynamic software world, quite a few new software pieces appear over time, some of which are of lower utility, of lower quality or even potentially harmful and malicious (for example, a new piece of software may contain hidden spyware or it may be a virus). In this chapter, we describe how computational trust and reputation can be used to avoid choosing new pieces of software that may be malicious or of lower quality. We start by describing computational models of trust and reputation and subsequently we apply them in two application domains: firstly, in quality assessment of open source software, discussing the case where different trustors have different understandings of trust and trust estimation methods; and secondly, in the protection of open collaborative software, such as Wikipedia.
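
    As a rough illustration of what a computational reputation score can look like in this setting, the sketch below ranks candidate software components with a beta-reputation-style estimate built from counts of positive and negative interaction reports. This particular model and the report counts are assumptions made for the example; the chapter's own trust and reputation models are not reproduced here.

```python
# Illustrative sketch only: a beta-reputation-style score, one common
# computational trust model, used here to rank candidate software pieces by
# positive/negative interaction reports. Everything below is an assumption,
# not the chapter's own model or data.
def beta_reputation(positive: int, negative: int) -> float:
    """Expected trustworthiness under a Beta(positive+1, negative+1) belief."""
    return (positive + 1) / (positive + negative + 2)

# Hypothetical interaction reports gathered for three candidate components.
candidates = {
    "component-a": (48, 2),    # (positive reports, negative reports)
    "component-b": (5, 0),
    "component-c": (120, 60),
}

ranked = sorted(candidates.items(),
                key=lambda kv: beta_reputation(*kv[1]),
                reverse=True)
for name, (pos, neg) in ranked:
    print(f"{name}: trust={beta_reputation(pos, neg):.2f}")
```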

    Ontology-aided business document classification
