
    A Stand-Alone Methodology for Data Exploration in Support of Data Mining and Analytics

    With the emergence of Big Data (data high in volume, variety, and velocity), new analysis techniques need to be developed to use the collected data effectively. Knowledge discovery from databases is a broader methodology encompassing a process for gathering knowledge from that data. Analytics pairs that knowledge with decision making to improve overall outcomes. Organizations have conclusive evidence that analytics provide competitive advantages and improve overall performance. This paper proposes a stand-alone methodology for data exploration, one part of the data mining process used in knowledge discovery from databases and analytics. The goal of the methodology is to reduce the time required to gain meaningful information about a previously unanalyzed data set using tabular summaries and visualizations. The reduced time will enable faster implementation of analytics in an organization. Two case studies using a prototype implementation are presented, showing the benefits of the methodology.
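    The abstract's core idea, automatically producing tabular summaries of a previously unanalyzed data set, can be illustrated with a minimal sketch. The column names and records below are hypothetical, and this is only a toy stand-in for the paper's prototype.

```python
# Minimal sketch of automated tabular summaries for an unanalyzed
# data set; column names and values here are invented examples.
import statistics

def summarize(rows):
    """Return per-column summary statistics for a list of dict records."""
    summary = {}
    for col in rows[0]:
        values = [r[col] for r in rows]
        if all(isinstance(v, (int, float)) for v in values):
            # Numeric column: basic location/spread summary.
            summary[col] = {
                "min": min(values),
                "max": max(values),
                "mean": statistics.mean(values),
            }
        else:
            # Categorical column: distinct values and their counts.
            summary[col] = {v: values.count(v) for v in set(values)}
    return summary

data = [
    {"age": 34, "segment": "A"},
    {"age": 41, "segment": "B"},
    {"age": 29, "segment": "A"},
]
print(summarize(data))
```

    Such summaries give an analyst a first orientation in the data before any modeling is attempted; visualizations (histograms, bar charts) would typically follow the same per-column logic.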

    Using simulation studies to evaluate statistical methods

    Simulation studies are computer experiments that involve creating data by pseudorandom sampling. The key strength of simulation studies is the ability to understand the behaviour of statistical methods because some 'truth' (usually some parameter/s of interest) is known from the process of generating the data. This allows us to consider properties of methods, such as bias. While widely used, simulation studies are often poorly designed, analysed and reported. This tutorial outlines the rationale for using simulation studies and offers guidance for design, execution, analysis, reporting and presentation. In particular, this tutorial provides: a structured approach for planning and reporting simulation studies, which involves defining aims, data-generating mechanisms, estimands, methods and performance measures ('ADEMP'); coherent terminology for simulation studies; guidance on coding simulation studies; a critical discussion of key performance measures and their estimation; guidance on structuring tabular and graphical presentation of results; and new graphical presentations. With a view to describing recent practice, we review 100 articles taken from Volume 34 of Statistics in Medicine that included at least one simulation study and identify areas for improvement.
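    A toy simulation study in the ADEMP spirit can make the abstract concrete. The aim, data-generating mechanism, and estimand below are chosen purely for illustration (estimating the bias of the 1/n variance estimator), not taken from the tutorial itself.

```python
# Toy simulation study: because the data-generating mechanism is
# known, the 'truth' is known, so bias can be estimated directly.
# Aim: estimate the bias of the 1/n variance estimator.
# DGM:  n = 5 independent draws from Normal(0, 1).
# Estimand: the variance, whose true value is 1.
import random

random.seed(1)
n_sim, n, true_var = 2000, 5, 1.0

estimates = []
for _ in range(n_sim):
    sample = [random.gauss(0, 1) for _ in range(n)]
    m = sum(sample) / n
    # Biased 1/n form (dividing by n rather than n - 1).
    estimates.append(sum((x - m) ** 2 for x in sample) / n)

bias = sum(estimates) / n_sim - true_var
print(f"estimated bias: {bias:.3f}")  # theory predicts -1/n = -0.2
```

    A full ADEMP-style study would also report a Monte Carlo standard error for the bias estimate and repeat the exercise across several data-generating mechanisms.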

    Machine-learning-aided prediction of brain metastases development in non-small-cell lung cancers

    Purpose Non–small-cell lung cancer (NSCLC) shows a high incidence of brain metastases (BM). Early detection is crucial to improve clinical prospects. We trained and validated classifier models to identify patients with a high risk of developing BM, as they could potentially benefit from surveillance brain MRI. Methods Consecutive patients with an initial diagnosis of NSCLC from January 2011 to April 2019 and an in-house chest-CT scan (staging) were retrospectively recruited at a German lung cancer center. Brain imaging was performed at initial diagnosis and in case of neurological symptoms (follow-up). Subjects lost to follow-up or still alive without BM at the data cut-off point (12/2020) were excluded. Covariates included clinical and/or 3D-radiomics features of the primary tumor from staging chest-CT. Four machine learning models for prediction (80/20 training) were compared. Gini importance and SHAP were used as measures of importance; sensitivity, specificity, area under the precision-recall curve, and the Matthews correlation coefficient as evaluation metrics. Results Three hundred and ninety-five patients comprised the clinical cohort. Predictive models based on clinical features offered the best performance (tuned to maximize recall: sensitivity ∼70%, specificity ∼60%). Radiomics features failed to provide sufficient information, likely due to the heterogeneity of imaging data. Adenocarcinoma histology, lymph node invasion, and histological tumor grade were positively correlated with the prediction of BM; age and squamous cell carcinoma histology were negatively correlated. A subgroup discovery analysis identified 2 candidate patient subpopulations appearing to present a higher risk of BM (female patients + adenocarcinoma histology, adenocarcinoma patients + no other distant metastases). Conclusion Analysis of the importance of input features suggests that the models are learning the relevant relationships between clinical features and the development of BM. Increasing the number of samples should be prioritized to improve performance. Employed prospectively at initial diagnosis, such models can help select high-risk subgroups for surveillance brain MRI.
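    The evaluation metrics named in the abstract (sensitivity, specificity, Matthews correlation coefficient) follow directly from confusion-matrix counts. The counts below are hypothetical, chosen only to roughly match the reported operating point; they are not the study's data.

```python
# Sensitivity, specificity, and Matthews correlation coefficient
# computed from confusion-matrix counts.
import math

def evaluation_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # recall on the positive class
    specificity = tn / (tn + fp)          # recall on the negative class
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc

# Hypothetical counts approximating sensitivity ~70%, specificity ~60%.
sens, spec, mcc = evaluation_metrics(tp=70, fp=40, tn=60, fn=30)
print(sens, spec, mcc)
```

    MCC is often preferred over accuracy for imbalanced clinical cohorts because it only rewards models that do well on both classes.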

    TOWARD AUTOMATED THREAT MODELING BY ADVERSARY NETWORK INFRASTRUCTURE DISCOVERY

    Threat modeling can help defenders ascertain potential attacker capabilities and resources, allowing better protection of critical networks and systems from sophisticated cyber-attacks. One aspect of the adversary profile that is of interest to defenders is the means to conduct a cyber-attack, including malware capabilities and network infrastructure. Even though most defenders collect data on cyber incidents, extracting knowledge about adversaries to build and improve the threat model can be time-consuming. This thesis applies machine learning methods to historical cyber incident data to enable automated threat modeling of adversary network infrastructure. Using network data of attacker command and control servers based on real-world cyber incidents, specific adversary datasets can be created and enriched using the capabilities of internet-scanning search engines. Mixing these datasets with data from benign or non-associated hosts with similar port-service mappings allows for building an interpretable machine learning model of attackers. Additionally, creating internet-scanning search engine queries based on machine learning model predictions allows for automating threat modeling of adversary infrastructure. Automated threat modeling of adversary network infrastructure allows searching for unknown or emerging threat actor network infrastructure on the Internet.
    Major, Ukrainian Ground Forces. Approved for public release. Distribution is unlimited.
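    The pipeline described above, turning port-service observations into features and then into a search-engine query, can be sketched very roughly. The hosts, ports, and query syntax below are invented for illustration; they do not come from the thesis data, and a real model would be far richer than this frequency comparison.

```python
# Illustrative sketch: compare port frequencies between attacker and
# benign hosts, then emit a query for the most discriminative port.
# All hosts, ports, and the query format are hypothetical.

attacker_hosts = [{"ports": {80, 443, 50050}}, {"ports": {22, 50050}}]
benign_hosts = [{"ports": {80, 443}}, {"ports": {22, 80}}]

def port_frequency(hosts):
    """Fraction of hosts on which each port is open."""
    freq = {}
    for h in hosts:
        for p in h["ports"]:
            freq[p] = freq.get(p, 0) + 1 / len(hosts)
    return freq

atk, ben = port_frequency(attacker_hosts), port_frequency(benign_hosts)

# Rank ports by how much more often they appear on attacker hosts.
scores = {p: atk[p] - ben.get(p, 0.0) for p in atk}
top_port = max(scores, key=scores.get)

# Turn the model's most predictive feature into a scan-engine query
# (query syntax here is a generic placeholder).
query = f"port:{top_port}"
print(top_port, query)
```

    In this toy example port 50050 (a default Cobalt Strike team-server port) surfaces as the discriminative feature; the thesis's interpretable models serve the same role of pointing queries at attacker-specific service fingerprints.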

    A Multidisciplinary Approach to the Reuse of Open Learning Resources

    Educational standards are having a significant impact on e-Learning. They allow for better exchange of information among different organizations and institutions. They simplify reusing and repurposing learning materials. They give teachers the possibility of personalizing them according to the student’s background and learning speed. Thanks to these standards, off-the-shelf content can be adapted to a particular student cohort’s context and learning needs. The same course content can be presented in different languages. Overall, all the parties involved in the learning-teaching process (students, teachers and institutions) can benefit from these standards, and so online education can be improved. To materialize the benefits of standards, learning resources should be structured according to them. Unfortunately, a large number of existing e-Learning materials lack the intrinsic logical structure required, and further, even when they have the structure, they are not encoded as required. These problems make it virtually impossible to share these materials. This thesis addresses the following research question: How to make the best use of existing open learning resources available on the Internet by taking advantage of educational standards and specifications, and thus improve content reusability? In order to answer this question, I combine different technologies, techniques and standards that make the sharing of publicly available learning resources possible in innovative ways. I developed and implemented a three-stage tool to tackle the above problem. By applying information extraction techniques and open e-Learning standards to legacy learning resources, the tool has proven to improve content reusability. In so doing, it contributes to the understanding of how these technologies can be used in real scenarios and shows how online education can benefit from them.
    In particular, three main components were created which enable the conversion process from unstructured educational content into a standard-compliant form in a systematic and automatic way. An increasing number of repositories with educational resources are available, including Wikiversity and the Massachusetts Institute of Technology OpenCourseWare. Wikiversity is an open repository containing over 6,000 learning resources in several disciplines and for all age groups [1]. I used the OpenCourseWare repository to evaluate the effectiveness of my software components and ideas. The results show that it is possible to create standard-compliant learning objects from the publicly available web pages, improving their searchability, interoperability and reusability.
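    The extraction stage described above, recovering a logical structure from legacy web content, can be sketched minimally. The HTML snippet is invented, and real course pages would need far more robust handling than this heading scan.

```python
# Minimal sketch of structure extraction from a legacy HTML learning
# resource: collect heading text as candidate section titles, a first
# step toward packaging content as a standards-compliant learning object.
from html.parser import HTMLParser

class SectionExtractor(HTMLParser):
    """Collects the text of h1-h3 headings as candidate section titles."""

    def __init__(self):
        super().__init__()
        self.sections = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3"):
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading and data.strip():
            self.sections.append(data.strip())

page = "<h1>Linear Algebra</h1><p>Intro.</p><h2>Vectors</h2><p>Text.</p>"
extractor = SectionExtractor()
extractor.feed(page)
print(extractor.sections)
```

    The recovered section list could then be mapped onto a standard packaging format (e.g. an IMS/SCORM-style manifest) in the later stages of such a tool.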

    Predicting Thrombosis and Bleeding
