8 research outputs found

    Social-Child-Case Document Clustering based on Topic Modeling using Latent Dirichlet Allocation

    Children are the future of the nation, and the treatment and learning they receive shape that future. Today there are many kinds of social problems involving children. To find the right solution to a case, social workers usually consult social-child-case (SCC) documents, looking for similar past cases and adapting their solutions. Reading through a large collection of documents to find similar cases, however, is tedious and time-consuming. This work therefore aims to categorize those documents into groups according to case type. We use topic modeling with the Latent Dirichlet Allocation (LDA) approach to extract topics from the documents and cluster them by similarity. The coherence score and a perplexity graph are used to select the best model. The result is a model with 5 topics that match the targeted case types. This supports the reuse of knowledge about SCC handling by making it easier to find documents describing similar cases.
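
    The abstract does not name its tooling, so the following is only a minimal sketch of the workflow it describes, choosing the number of LDA topics by coherence score and perplexity; gensim, the toy documents, and the u_mass coherence measure are assumptions, not details taken from the paper.

        # Sketch: select an LDA topic count by coherence and perplexity (gensim assumed).
        from gensim.corpora import Dictionary
        from gensim.models import CoherenceModel, LdaModel

        docs = [                                     # toy stand-ins for the SCC documents
            "child neglect parent poverty school dropout",
            "child abuse violence home protection shelter",
            "street child begging poverty dropout",
            "abuse violence victim protection counselling",
            "neglect parent school support poverty",
            "victim trafficking protection shelter counselling",
        ]
        texts = [d.split() for d in docs]
        dictionary = Dictionary(texts)
        corpus = [dictionary.doc2bow(t) for t in texts]

        for k in range(2, 7):                        # candidate numbers of topics
            lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                           passes=10, random_state=42)
            coherence = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                                       coherence="u_mass").get_coherence()
            bound = lda.log_perplexity(corpus)       # per-word likelihood bound
            # higher coherence and lower perplexity are preferred
            print(k, round(coherence, 3), round(2 ** (-bound), 1))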

    Textual data summarization using the Self-Organized Co-Clustering model

    Recently, several studies have demonstrated the use of co-clustering, a data mining technique that simultaneously produces row-clusters of observations and column-clusters of features. The present work introduces a novel co-clustering model for easily summarizing textual data in document-term format. In addition to highlighting homogeneous co-clusters, as other existing algorithms do, we also distinguish noisy co-clusters from significant ones, which is particularly useful for sparse document-term matrices. Furthermore, our model imposes a structure among the significant co-clusters, providing improved interpretability to users. The proposed approach is competitive with state-of-the-art methods for document and term clustering and offers user-friendly results. The model relies on the Poisson distribution and on a constrained version of the Latent Block Model, a probabilistic approach to co-clustering. A Stochastic Expectation-Maximization algorithm is proposed for the model's inference, along with a model selection criterion to choose the number of co-clusters. Both simulated and real data sets illustrate the efficiency of this model through its ability to easily identify relevant co-clusters.
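
    The Self-Organized Co-Clustering model itself is not, to my knowledge, available in a standard Python package, so the sketch below only illustrates the general co-clustering idea (joint row and column assignments on a document-term matrix) using scikit-learn's spectral co-clustering, which is a different, swapped-in algorithm, not the paper's Poisson Latent Block Model.

        # Illustration only: generic co-clustering of a document-term matrix with
        # scikit-learn's SpectralCoclustering (not the paper's Latent Block Model).
        from sklearn.cluster import SpectralCoclustering
        from sklearn.feature_extraction.text import CountVectorizer

        docs = ["stock price market trading fund",
                "team goal match season coach",
                "market fund price index stock",
                "coach player team match goal"]
        vectorizer = CountVectorizer()
        X = vectorizer.fit_transform(docs)                # document-term matrix

        model = SpectralCoclustering(n_clusters=2, random_state=0).fit(X)
        print(model.row_labels_)                          # co-cluster of each document
        print(dict(zip(vectorizer.get_feature_names_out(),
                       model.column_labels_)))            # co-cluster of each term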

    Measuring LDA topic stability from clusters of replicated runs

    Background: Unstructured textual data is growing rapidly, and Latent Dirichlet Allocation (LDA) topic modeling is a popular method for analysing it. Past work suggests that the instability of LDA topics may lead to systematic errors. Aim: We propose a method that relies on replicated LDA runs and clustering, and that provides a stability metric for the topics. Method: We generate k LDA topics and replicate this process n times, resulting in n*k topics. We then use K-medoids to cluster the n*k topics into k clusters. The k clusters now represent the original LDA topics, and we present them like normal LDA topics by showing the ten most probable words. For the clusters, we try multiple stability metrics, of which we recommend Rank-Biased Overlap, which shows the stability of the topics inside the clusters. Results: We provide an initial validation in which our method is applied to 270,000 Mozilla Firefox commit messages with k=20 and n=20. We show how our topic stability metrics relate to the contents of the topics. Conclusions: Advances in text mining enable us to analyse large masses of text in software engineering, but non-deterministic algorithms such as LDA may lead to unreplicable conclusions. Our approach makes LDA stability transparent and is complementary to, rather than an alternative to, prior work that focuses on LDA parameter tuning.
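
    The abstract names Rank-Biased Overlap without giving its form, so here is a minimal sketch of a truncated RBO between two topics represented as ranked top-word lists; the word lists, the truncation (no extrapolated residual), and p=0.9 are illustrative assumptions rather than the paper's exact settings.

        # Sketch: truncated Rank-Biased Overlap between two topics' top-word lists.
        def rbo(list_a, list_b, p=0.9):
            """Truncated RBO: overlap at each depth d weighted by p**(d-1)."""
            depth = min(len(list_a), len(list_b))
            score = 0.0
            for d in range(1, depth + 1):
                overlap = len(set(list_a[:d]) & set(list_b[:d]))  # agreement of the top-d prefixes
                score += (p ** (d - 1)) * overlap / d
            return (1 - p) * score          # identical length-10 lists score 1 - p**10 under this truncated form

        topic_run_1 = ["fix", "bug", "crash", "test", "patch", "build", "tab", "css", "menu", "update"]
        topic_run_2 = ["bug", "fix", "test", "crash", "patch", "menu", "build", "css", "ui", "update"]
        print(rbo(topic_run_1, topic_run_2))   # higher = more stable topic across replicated runs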

    Using Topic Models to Study the History of a Video Game Genre: Towards a Non-Periodic Representation of Changes in the Textual Content of Computer Role-Playing Games between 1992 and 2017

    This paper shows how mathematical tools still rarely used by historians (topic models, hierarchical classification, self-organizing maps) can be combined and exploited to study a dated but heterogeneous historical corpus and to characterize its evolution over time. The analysis focuses on changes in the vocabulary used by a set of 21 popular western role-playing video games published between 1992 and 2017, comprising a total of 17.5 million words. Through this example, we aim to shed new light on the canonical history of the genre and to propose an alternative model to periodization for uncovering the tensions and influences with which each particular game must negotiate.

    Unsupervised Identification of Crime Problems from Police Free-text Data

    We present a novel exploratory application of unsupervised machine-learning methods to identify clusters of specific crime problems from unstructured modus operandi free-text data within a single administrative crime classification. To illustrate our proposed approach, we analyse police-recorded free-text narrative descriptions of residential burglaries occurring over a two-year period in a major metropolitan area of the UK. Results of our analyses demonstrate that topic modelling algorithms are capable of clustering substantively different burglary problems without prior knowledge of such groupings. Subsequently, we describe a prototype dashboard that allows replication of our analytical workflow and could be applied to support operational decision making in the identification of specific crime problems. This approach to grouping distinct types of offences within existing offence categories, we argue, has the potential to support crime analysts in proactively analysing large volumes of modus operandi free-text data—with the ultimate aims of developing a greater understanding of crime problems and supporting the design of tailored crime reduction interventions.
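
    The paper does not publish its pipeline here, so the following is only a rough sketch of the core step it describes, grouping free-text modus operandi narratives by their dominant topic; scikit-learn and the invented toy narratives are assumptions, not material from the study.

        # Sketch: cluster free-text narratives by dominant LDA topic (scikit-learn assumed).
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        narratives = [
            "offender forced rear window entry untidy search jewellery taken",
            "front door lock snapped car keys and vehicle stolen from driveway",
            "rear window smashed untidy search cash and jewellery taken",
            "lock snapped on front door keys taken vehicle stolen",
        ]
        vectorizer = CountVectorizer(stop_words="english")
        X = vectorizer.fit_transform(narratives)

        lda = LatentDirichletAllocation(n_components=2, random_state=0)
        doc_topic = lda.fit_transform(X)              # document-topic proportions
        dominant = doc_topic.argmax(axis=1)           # each record's dominant topic = one crime problem

        terms = vectorizer.get_feature_names_out()
        for t, weights in enumerate(lda.components_):
            top = [terms[i] for i in weights.argsort()[-5:][::-1]]
            print(t, top, (dominant == t).sum())      # topic id, top words, records assigned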

    The nexus between quality of customer relationship management systems and customers' satisfaction: Evidence from online customers’ reviews

    Customer Relationship Management (CRM) is a management method that aims to establish, develop, and improve relationships with targeted customers in order to maximize corporate profitability and customer value. Many CRM systems are on the market, developed from a combination of business requirements, customer needs, and industry best practices. The impact of CRM systems on customer satisfaction and competitive advantage, as well as their tangible and intangible benefits, has been widely investigated in previous studies. However, there is a lack of studies assessing the quality dimensions these systems must meet to serve an organization's CRM strategy. This study investigates customers' satisfaction with CRM systems through online reviews. We collected 5172 online customer reviews of 8 CRM systems from the Google Play store. Satisfaction factors were extracted using Latent Dirichlet Allocation (LDA) and grouped into three dimensions: information quality, system quality, and service quality. Data segmentation was performed using Learning Vector Quantization (LVQ), and feature selection was performed with the entropy-weight approach. We then used the Adaptive Neuro-Fuzzy Inference System (ANFIS), a hybrid of fuzzy logic and neural networks, to assess the relationship between these dimensions and customer satisfaction. The results are discussed and research implications are provided. The authors are thankful to the Deanship of Scientific Research at Najran University for funding this work under the Research Groups Funding program, grant code NU/RG/SERC/12/44.
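
    The abstract names the entropy-weight approach for feature selection without detail; below is a minimal sketch of the standard entropy-weight calculation over a small made-up factor matrix (rows = reviews or apps, columns = candidate factors). The toy scores and the use of the resulting weights are assumptions, not the paper's exact pipeline.

        # Sketch: standard entropy-weight method (toy data; not the paper's exact setup).
        import numpy as np

        def entropy_weights(X):
            """Columns = criteria/factors, rows = alternatives; returns one weight per column."""
            X = np.asarray(X, dtype=float)
            n = X.shape[0]
            P = X / X.sum(axis=0, keepdims=True)              # column-wise proportions
            logP = np.where(P > 0, np.log(P), 0.0)            # convention: 0 * log 0 = 0
            entropy = -(P * logP).sum(axis=0) / np.log(n)     # entropy of each criterion
            diversity = 1.0 - entropy                         # more spread -> more informative
            return diversity / diversity.sum()                # normalized weights

        scores = [[4.1, 3.0, 4.5],    # e.g. information / system / service quality scores
                  [3.9, 4.2, 2.1],
                  [4.0, 2.8, 4.4]]
        print(entropy_weights(scores))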

    Innovative Heuristics to Improve the Latent Dirichlet Allocation Methodology for Textual Analysis and a New Modernized Topic Modeling Approach

    Natural Language Processing offers complex methods for mining the vast trove of documents created and made available every day. Topic modeling seeks to identify the topics within textual corpora with limited human input, in order to speed analysis. Current topic modeling techniques used in Natural Language Processing have limitations in their pre-processing steps. This dissertation studies topic modeling techniques and those pre-processing limitations, and introduces new algorithms that improve on existing topic modeling techniques while remaining competitive in computational complexity. The research makes four contributions to the field of Natural Language Processing and topic modeling. First, it identifies the need for a more robust "stopwords" list and proposes a heuristic for creating one. Second, it introduces a new dimensionality-reduction technique that exploits the number of words within a document to infer the importance of word choice. Third, it develops an algorithm for determining the number of topics within a corpus, demonstrated on a standard topic modeling data set. These techniques produce a higher-quality result from the Latent Dirichlet Allocation topic modeling technique. Fourth, it introduces a novel heuristic utilizing Principal Component Analysis that is capable of determining the number of topics within a corpus that produces stable sets of topic words.
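
    The dissertation's PCA heuristic is not spelled out in the abstract. As a loose illustration of the general idea only (estimating a topic count from the spectrum of the document-term matrix), the sketch below picks the number of TruncatedSVD components needed to reach a fixed share of explained variance; the 80% threshold, TF-IDF weighting, and toy corpus are assumptions and not the author's method.

        # Sketch: estimate a candidate topic count from the spectrum of a TF-IDF
        # document-term matrix (a stand-in illustration, not the dissertation's heuristic).
        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.decomposition import TruncatedSVD

        docs = ["stock market price trading", "team goal season match",
                "price index market fund", "coach team player match",
                "election vote party campaign", "party campaign poll vote"]
        X = TfidfVectorizer().fit_transform(docs)

        svd = TruncatedSVD(n_components=min(X.shape) - 1, random_state=0).fit(X)
        cumulative = np.cumsum(svd.explained_variance_ratio_)
        k = int(np.searchsorted(cumulative, 0.80)) + 1    # smallest k reaching 80% of variance
        print(k)                                          # candidate number of topics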

    A multi-disciplinary co-design approach to social media sensemaking with text mining

    This thesis presents the development of a bespoke social media analytics platform called Sentinel using an event-driven co-design approach. The performance and outputs of this system, along with its integration into the routine research methodology of its users, were used to evaluate how the application of an event-driven co-design approach to system design improves the degree to which Social Web data can be converted into actionable intelligence, with respect to robustness, agility, and usability. The thesis includes a systematic review of the state-of-the-art technology that can support real-time text analysis of social media data, used to position the text analysis elements of the Sentinel Pipeline. This is followed by research chapters that focus on combinations of robustness, agility, and usability as themes, covering the iterative development of the system through the event-driven co-design lifecycle. Robustness and agility are covered during initial infrastructure design and early prototyping of bottom-up and top-down semantic enrichment. Robustness and usability are then considered during the development of the Semantic Search component of the Sentinel Platform, which exploits the semantic enrichment developed in the prototype, alpha, and beta systems. Finally, agility and usability guide the work that builds on the Semantic Search functionality to produce a data download facility for rapidly collecting corpora for further qualitative research. These iterations are evaluated using a number of case studies undertaken in conjunction with a wider research programme in the field of crime and security that the Sentinel platform was designed to support. The findings from these case studies are used in the co-design process to inform how developments should evolve. As part of this research programme, the Sentinel platform has supported the production of a number of research papers authored by stakeholders, highlighting the impact the system has had in the field of crime and security research.