
    Scalable Text Mining with Sparse Generative Models

    The information age has brought a deluge of data. Much of this is in text form, insurmountable in scope for humans and incomprehensible in structure for computers. Text mining is an expanding field of research that seeks to utilize the information contained in vast document collections. General data mining methods based on machine learning face challenges with the scale of text data, posing a need for scalable text mining methods. This thesis proposes a solution to scalable text mining: generative models combined with sparse computation. A unifying formalization for generative text models is defined, bringing together research traditions that have used formally equivalent models but ignored parallel developments. This framework allows the use of methods developed in different processing tasks, such as retrieval and classification, yielding effective solutions across different text mining tasks. Sparse computation using inverted indices is proposed for inference on probabilistic models. This reduces the computational complexity of common text mining operations according to sparsity, yielding probabilistic models with the scalability of modern search engines. The proposed combination provides sparse generative models: a solution for text mining that is general, effective, and scalable. Extensive experimentation on text classification and ranked retrieval datasets is conducted, showing that the proposed solution matches or outperforms the leading task-specific methods in effectiveness, with an order of magnitude decrease in classification times for Wikipedia article categorization with a million classes. The developed methods were further applied in two 2014 Kaggle data mining prize competitions with over a hundred competing teams, earning first and second places.
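    The core idea of the abstract, inference that touches only the classes sharing terms with the query by walking an inverted index, can be illustrated with a minimal sketch. This is an assumption-laden toy (a multinomial model with additive smoothing, invented class and term names), not the thesis's exact formulation:

    ```python
    import math
    from collections import defaultdict

    # Sketch: a multinomial generative classifier whose inference walks an
    # inverted index, so scoring cost scales with the number of classes that
    # share terms with the query rather than with the total number of classes.
    class SparseGenerativeClassifier:
        def __init__(self, alpha=0.1):
            self.alpha = alpha                    # additive smoothing (assumed)
            self.postings = defaultdict(dict)     # term -> {class: count}
            self.class_totals = defaultdict(int)  # class -> total token count
            self.vocab = set()

        def fit(self, labeled_docs):
            for label, tokens in labeled_docs:
                for t in tokens:
                    self.postings[t][label] = self.postings[t].get(label, 0) + 1
                    self.class_totals[label] += 1
                    self.vocab.add(t)

        def predict(self, tokens):
            v = len(self.vocab)
            # Sparse step: candidate classes come from posting lists only.
            candidates = set()
            for t in tokens:
                candidates.update(self.postings.get(t, {}))
            best, best_score = None, -math.inf
            for c in candidates:
                score = 0.0
                for t in tokens:
                    count = self.postings.get(t, {}).get(c, 0)
                    score += math.log((count + self.alpha) /
                                      (self.class_totals[c] + self.alpha * v))
                if score > best_score:
                    best, best_score = c, score
            return best

    clf = SparseGenerativeClassifier()
    clf.fit([("sports", ["goal", "match", "team", "score"]),
             ("finance", ["stock", "market", "trade", "price"])])
    print(clf.predict(["market", "price"]))   # -> finance
    ```

    With a million classes, the candidate set for a short query is typically a tiny fraction of all classes, which is where the claimed speedup would come from.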

    Supply chain risk management : systematic literature review and a conceptual framework for capturing interdependencies between risks

    The purpose of this research is to conduct a comprehensive and systematic review of the literature in the field of 'Supply Chain Risk Management' and identify important gaps for future research. Furthermore, a conceptual risk management framework is proposed that encompasses a holistic view of the field. The 'Systematic Literature Review' method is used to examine quality articles published over a period of almost 15 years (2000 - June 2014). The findings of the study are validated through text mining software. The systematic literature review has identified the progress of research based on various descriptive and thematic typologies. The review and text mining analysis have also provided insight into major research gaps. Based on the identified gaps, a framework is developed that can help researchers model interdependencies between risk factors.

    Measuring impact of academic research in computer and information science on society

    Academic research in computer and information science (CIS) has contributed immensely to all aspects of society. As academic research today is substantially supported by various government sources, recent political changes have created ambivalence amongst academics about the future of research funding. With uncertainty looming, it is important to develop a framework to extract and measure information relating to the impact of CIS research on society, in order to justify public funding and demonstrate the actual contribution and impact of CIS research outside academia. A new method combining discourse analysis and text mining of a collection of over 1,000 pages of impact case study documents, written in free-text format for the Research Excellence Framework (REF) 2014, was developed to identify the most commonly used categories or headings for reporting the impact of CIS research by UK Universities (UKU). According to the research reported in REF 2014, between 2008 and 2013 UKU acquired 83 patents in various areas of CIS, created 64 spin-offs, generated £857.5 million in different financial forms, created substantial employment, reached over 6 billion users worldwide, and helped save over £1 billion through improved processes across various sectors internationally.

    EXACT2: the semantics of biomedical protocols

    © 2014 Soldatova et al.; licensee BioMed Central. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. This article has been made available through the Brunel Open Access Publishing Fund. Background: The reliability and reproducibility of experimental procedures is a cornerstone of scientific practice. There is a pressing technological need for better representation of biomedical protocols to enable other agents (human or machine) to better reproduce results. A framework that ensures that all information required for the replication of experimental protocols is recorded is essential to achieving reproducibility. Methods: We have developed the ontology EXACT2 (EXperimental ACTions), designed to capture the full semantics of biomedical protocols required for their reproducibility. To construct EXACT2 we manually inspected hundreds of published and commercial biomedical protocols from several areas of biomedicine. After establishing a clear pattern for extracting the required information, we utilized text-mining tools to translate the protocols into a machine-amenable format. We have verified the utility of EXACT2 through the successful processing of previously 'unseen' (not used for the construction of EXACT2) protocols. Results: The paper reports on a fundamentally new version of EXACT2 that supports the semantically-defined representation of biomedical protocols. The ability of EXACT2 to capture the semantics of biomedical procedures was verified through a text mining use case, in which EXACT2 is used as a reference model for text mining tools to identify terms pertinent to experimental actions, and their properties, in biomedical protocols expressed in natural language. An EXACT2-based framework for the translation of biomedical protocols to a machine-amenable format is proposed. Conclusions: The EXACT2 ontology is sufficient to record, in a machine-processable form, the essential information about biomedical protocols. EXACT2 defines explicit semantics of experimental actions and can be used by various computer applications. It can serve as a reference model for the translation of biomedical protocols in natural language into a semantically-defined format. This work has been partially funded by the Brunel University BRIEF award and a grant from Occams Resources.

    Exploring Text Mining and Analytics for Applications in Public Security: An in-depth dive into a systematic literature review

    Text mining and related analytics emerge as a technological approach to support human activities in extracting useful knowledge from texts in several formats. From a managerial point of view, it can help organizations in planning and decision-making processes, providing information that was not previously evident in textual materials produced internally or externally. In this context, within the public/governmental scope, public security agencies are great beneficiaries of the tools associated with text mining in several respects, from applications in the criminal area to the collection of people's opinions and sentiments about the actions taken to promote their welfare. This article reports details of a systematic literature review focused on identifying the main areas of text mining application in public security, the most recurrent technological tools, and future research directions. The searches covered four major article databases (Scopus, Web of Science, IEEE Xplore, and ACM Digital Library), selecting 194 materials published between 2014 and the first half of 2021, among journals, conferences, and book chapters. There were several findings concerning the targets of the literature review, as presented in the results of this article.

    A Longitudinal Analysis of Job Skills for Entry-Level Data Analysts

    The explosive growth of the data analytics field has continued over the past decade with no signs of slowing down. Given the fast pace of technology change and the need for IT professionals to constantly keep up with the field, it is important to analyze the job skills and knowledge required in the data analyst and business intelligence (BI) analyst job market. In this research, we examine over 9,000 job postings for entry-level data analytics jobs over five years (2014-2018). Using a text mining approach and a custom text mining dictionary, we identify a preliminary set of analytic competencies sought in practice. Further, the longitudinal data also demonstrates how these key skills have evolved over time. We find that the three biggest trends are growing proficiency requirements for Python, Tableau, and R. We also find that an increasing number of jobs emphasize data visualization. Some skills, like Microsoft Access, SAP, and Cognos, declined in popularity over the time frame studied. Using the results of the study, universities can make informed curriculum decisions, and instructors can decide what skills to teach based on industry needs. Our custom text mining dictionary adds to the growing literature and can assist other researchers in this space.
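    The dictionary-based counting the abstract describes can be sketched as follows. The dictionary entries, pattern syntax, and once-per-posting counting rule here are illustrative assumptions, not the authors' actual dictionary:

    ```python
    import re
    from collections import Counter

    # Hypothetical skill dictionary: maps surface patterns (regex, applied to
    # lowercased text) to a canonical skill name. Real dictionaries map many
    # variants ("MS Access", "Microsoft Access") to one canonical entry.
    SKILL_DICT = {
        r"python": "Python",
        r"tableau": "Tableau",
        r"\br\b": "R",                     # standalone "R" only
        r"data visualization": "Data Visualization",
        r"ms access": "Microsoft Access",
        r"microsoft access": "Microsoft Access",
    }

    def count_skills(postings):
        """Count, for each skill, the number of postings mentioning it."""
        counts = Counter()
        for text in postings:
            low = text.lower()
            seen = set()
            for pattern, skill in SKILL_DICT.items():
                if re.search(pattern, low):
                    seen.add(skill)
            counts.update(seen)            # each skill counted once per posting
        return counts

    postings = ["Seeking an analyst with Python and Tableau experience",
                "Entry level: R and data visualization skills required"]
    print(count_skills(postings))
    ```

    Running the same counter over postings grouped by year would yield the longitudinal skill trends the study reports.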

    A patent time series processing component for technology intelligence by trend identification functionality

    © 2014, Springer-Verlag London. Technology intelligence denotes the concept and applications that transform data hidden in patents or scientific literature into technical insight for technology strategy-making support. The existing frameworks and applications of technology intelligence mainly focus on obtaining text-based knowledge with text mining components. However, how the corresponding technological trend of that knowledge evolves over time is seldom taken into consideration. In order to capture hidden trend turning points and improve the framework of existing technology intelligence, this paper proposes a patent time series processing component with trend identification functionality. We use a piecewise linear representation method to generate and quantify the trend of patent publication activities, then utilize the outcome to identify trend turning points and provide trend tags to the existing text mining component, thus making it possible to combine text-based and time-based knowledge to support technology strategy making more satisfactorily. A case study using Australian patents (1983-2012) in the Information and Communications Technology industry is presented to demonstrate the feasibility of the component when dealing with real-world tasks. The result shows that the new component identifies the trend reasonably well, and at the same time uncovers valuable trend turning points in historical patent time series.
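    Piecewise linear representation of a time series has several standard variants; the paper does not spell out which is used here, so the sketch below assumes a common top-down split: recursively break the series at the point of maximum deviation from the straight line between segment endpoints, and report breakpoints as candidate trend turning points. The patent counts are invented toy data:

    ```python
    # Top-down piecewise linear segmentation (one assumed PLR variant).
    def max_deviation(series, lo, hi):
        """Index and size of the largest vertical deviation from the chord lo->hi."""
        y0, y1 = series[lo], series[hi]
        best_i, best_d = None, 0.0
        for i in range(lo + 1, hi):
            y_line = y0 + (y1 - y0) * (i - lo) / (hi - lo)
            d = abs(series[i] - y_line)
            if d > best_d:
                best_i, best_d = i, d
        return best_i, best_d

    def segment(series, lo, hi, tol, breaks):
        i, d = max_deviation(series, lo, hi)
        if i is not None and d > tol:
            segment(series, lo, i, tol, breaks)   # refine left half
            breaks.append(i)                      # record turning point
            segment(series, i, hi, tol, breaks)   # refine right half

    def turning_points(series, tol=1.0):
        """Breakpoint indices where the linear trend changes, in order."""
        breaks = []
        segment(series, 0, len(series) - 1, tol, breaks)
        return breaks

    # Toy annual patent counts: flat, then sharp growth, then gentle decline.
    counts = [2, 3, 4, 5, 20, 35, 50, 48, 46, 44]
    print(turning_points(counts))   # -> [3, 6]
    ```

    In the paper's setting, the resulting breakpoints would be attached as trend tags to the text mining component's output, linking time-based and text-based knowledge.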

    The evolution of Latino threat narrative from 1997 to 2014

    This study presents preliminary findings of a project focusing on the evolution of the Latino threat narrative (LTN), a social process of portraying Latinos with derogatory terms. A total of 440,984 newspaper articles about Latinos across 13 news outlets from 1997 to 2014 were analyzed using text mining. The results of this study demonstrate a potential association between LTN in print news media and significant political and social events, including the September 11, 2001 terror attacks; the passage of restrictive immigration legislation in 2001, 2002, 2005, and 2006; and mass protests against immigration reform in 2006. The study also reveals greater intensity in the use of LTN-related words during the (Republican) Bush administration than during the immediately preceding and following (Democratic) administrations. This is the first work to use text mining techniques to explore the Latino threat narrative at large scale over a long period of time.