56 research outputs found

    Semantic-aware Retrieval Standards based on Dirichlet Compound Model to Rank Notifications by Level of Urgency

    There is a growing number of notifications generated from a wide range of sources. However, to our knowledge, there is no well-known generalizable standard for detecting the most urgent notifications. Establishing reusable standards is crucial for applications in which the recommendation (notification) is critical due to its level of urgency and sensitivity (e.g. the medical domain). To tackle this problem, this thesis aims to establish Information Retrieval (IR) standards for the notification (recommendation) task by taking semantic dimensions (terms, opinions, concepts and user interaction) into consideration. The technical research contributions of this thesis include, but are not limited to, the development of a semantic IR framework based on the Dirichlet Compound Model (DCM), namely FDCM; the extension of FDCM to the recommendation scenario (RFDCM); and novel opinion-aware ranking models. Transparency, explainability and generalizability are some of the benefits that a mathematically well-defined solution such as DCM offers. The FDCM framework is based on a robust aggregation parameter which effectively combines semantic retrieval scores using Query Performance Predictors (QPPs). Our experimental results confirm the effectiveness of such an approach in recommender systems and semantic retrieval. One of the main findings of this thesis is that the concept-based extension of FDCM (term-only + concept-only) consistently outperformed both the term-only and concept-only baselines on biomedical data. Moreover, we show that semantic IR is beneficial for collaborative filtering, and it could therefore help data scientists develop hybrid, consolidated IR systems comprising both content-based and collaborative filtering aspects of recommendation.
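    The abstract does not spell out FDCM's aggregation formula; the sketch below is only a minimal illustration of the general idea (all names and the clarity-style predictor are assumptions, not the thesis' method): a QPP score estimates how well-specified a query is, and that estimate drives the weight used to blend term-based and concept-based retrieval scores.

```python
import math

def clarity_qpp(query_term_probs, collection_term_probs):
    """Simplified clarity-style Query Performance Predictor (QPP):
    KL divergence between the query language model and the collection
    language model. Higher values suggest a less ambiguous query."""
    return sum(p * math.log(p / collection_term_probs[t])
               for t, p in query_term_probs.items())

def aggregate_scores(term_score, concept_score, qpp_value, scale=1.0):
    """Blend term-based and concept-based retrieval scores for one
    document. The weight is driven by the QPP value: the clearer the
    query appears, the more the term-based score is trusted."""
    alpha = 1.0 / (1.0 + math.exp(-scale * qpp_value))  # squash to (0, 1)
    return alpha * term_score + (1.0 - alpha) * concept_score
```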

    The development of computer science: a sociocultural perspective


    Text generation for small data regimes

    In Natural Language Processing (NLP), models trained on downstream text classification tasks usually require enormous amounts of data to perform well. Neural Network (NN) models are among the approaches that can continually be trained to produce better results, yet a major factor in improving those results is the ability to scale over large datasets. Given that deep NNs are known to be data-hungry, having more training samples is always beneficial. For a classification model to perform well, it may require thousands or even millions of textual training examples. Transfer learning enables us to leverage knowledge gained from general data collections to perform well on target tasks. In NLP, training language models on large data collections has been shown to achieve great results when tuned to different task-specific datasets (Wang et al., 2018a, 2019). However, even with transfer learning, adequate training data remains a prerequisite for training machine learning models. Nonetheless, we show that small textual datasets can be augmented to a degree sufficient to achieve improved classification performance. In this thesis, we make multiple contributions to data augmentation. Firstly, we transform the data generation task into an optimization problem that maximizes the usefulness of the generated output, using Monte Carlo Tree Search (MCTS) as the optimization strategy and incorporating entropy as one of the optimization criteria. Secondly, we propose a language generation approach for targeted data generation with the participation of the training classifier. With a user in the loop, we find that manual annotation of a small proportion of the generated data is enough to boost classification performance. Thirdly, under a self-learning scheme, we replace the user with an automated approach in which the classifier is trained on its own pseudo-labels. Finally, we extend the data generation approach to the knowledge distillation domain by generating samples that a teacher model can label confidently but its student cannot.
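    The abstract names entropy as one of the optimization criteria. As a minimal, hypothetical sketch of that single ingredient (not the thesis' actual MCTS pipeline), generated candidates can be ranked by the training classifier's predictive entropy, preferring samples the classifier is most uncertain about:

```python
import math

def predictive_entropy(probs):
    """Shannon entropy of a classifier's predicted class distribution.
    High entropy means the classifier is uncertain about this sample."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def rank_candidates(candidates, classify):
    """Rank generated texts by a stand-in usefulness score: entropy of
    the classifier's prediction. `classify` is an assumed callable that
    maps a text to a list of class probabilities."""
    scored = [(predictive_entropy(classify(text)), text) for text in candidates]
    return [text for _, text in sorted(scored, reverse=True)]
```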

    Identifying reusable knowledge in developer instant messaging communication.

    Context and background: Software engineering is a complex and knowledge-intensive activity. Required knowledge (e.g., about technologies, frameworks, and design decisions) changes fast, and the knowledge needs of those who design, code, test and maintain software constantly evolve. On the other hand, software developers use a wide range of processes, practices and tools in which they explicitly and implicitly “produce” and capture different types of knowledge. Problem: Software developers use instant messaging tools (e.g., Slack, Microsoft Teams and Gitter) to discuss development-related problems, share experiences and collaborate in projects. This communication takes place in chat rooms that accumulate potentially relevant knowledge to be reused by other developers. Therefore, in this research we analyze whether there is reusable knowledge in developer instant messaging communication by exploring (a) which instant messaging platforms can be a source of reusable knowledge, and (b) software engineering themes that represent the main discussions of developers in instant messaging communication. We also analyze how this reusable knowledge can be identified with the use of topic modeling (a natural language processing technique to discover abstract topics in text) by (c) surveying the literature on how topic modeling has been applied in software engineering research, and (d) evaluating how topic models perform with developer instant messages. Method: First, we conducted a Field Study through an exploratory case study and a reflexive thematic analysis to check whether there is reusable knowledge in developer instant messaging communication, and if so, what this knowledge (main themes discussed) is. Then, we conducted a Sample Study to explore how reusable knowledge in developer instant messaging communication can be identified. In this study, we applied a literature survey and software repository mining (i.e., short-text topic modeling). Findings and contributions: We (a) developed a comparison framework for instant messaging tools, (b) identified a map of the main themes discussed in chat rooms of an instant messaging tool (Gitter, a platform used by software developers), (c) provided a comprehensive literature review that offers insights and references on the use of topic modeling in software engineering, and (d) provided an evaluation of the performance of topic models applied to developer instant messages, based on topic coherence metrics and human judgment of topic quality.
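    As a rough illustration of the kind of pipeline the Sample Study describes (the data and parameter choices here are invented, and the thesis evaluates dedicated short-text topic models, whereas this sketch uses plain LDA from Gensim), a topic model fitted to chat messages can be scored with a coherence metric:

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

# Hypothetical pre-tokenized chat messages (a real pipeline would also
# strip stop words, code snippets, URLs, etc.).
messages = [
    ["maven", "build", "fails", "dependency"],
    ["docker", "container", "port", "mapping"],
    ["maven", "dependency", "version", "conflict"],
    ["docker", "compose", "volume", "mount"],
]

dictionary = Dictionary(messages)
corpus = [dictionary.doc2bow(msg) for msg in messages]

# Train a small LDA model; short texts usually call for few topics.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               random_state=0, passes=10)

# Topic coherence (c_v) is one of the quality metrics the study mentions.
coherence = CoherenceModel(model=lda, texts=messages, dictionary=dictionary,
                           coherence="c_v").get_coherence()
print(f"c_v coherence: {coherence:.3f}")
```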

    Uncertainty in Artificial Intelligence: Proceedings of the Thirty-Fourth Conference


    Bayesian nonparametric models for name disambiguation and supervised learning

    This thesis presents new Bayesian nonparametric models, and approaches for their development, for the problems of name disambiguation and supervised learning. Bayesian nonparametric methods form an increasingly popular approach for solving problems that demand a high degree of model flexibility. However, this field is relatively new, and there are many areas that need further investigation. Previous work on Bayesian nonparametrics has neither fully explored the problems of entity disambiguation and supervised learning nor the advantages of nested hierarchical models. Entity disambiguation is a widely encountered problem in which different references need to be linked to a real underlying entity. This problem is often unsupervised, as there is no prior information about the entities. Further to this, effective use of Bayesian nonparametrics offers a new approach to tackling supervised problems, which are frequently encountered. The main original contribution of this thesis is a set of new structured Dirichlet process mixture models for name disambiguation and supervised learning that can also have a wide range of applications. These models use techniques from Bayesian statistics, including hierarchical and nested Dirichlet processes, generalised linear models, Markov chain Monte Carlo methods and optimisation techniques such as BFGS. The new models have tangible advantages over existing methods in the field, as shown by experiments on real-world datasets including citation databases and classification and regression datasets. I develop the unsupervised author-topic space model for author disambiguation, which uses free text to perform disambiguation, unlike traditional author disambiguation approaches. The model incorporates a name variant model that is based on a nonparametric Dirichlet language model. It handles novel, unseen name variants and can model the unknown authors of the text of the documents. Through this, the model can disambiguate authors with no prior knowledge of the number of true authors in the dataset, and it can do so even when authors have identical names. I use a model for nesting Dirichlet processes named the hybrid NDP-HDP. This model allows Dirichlet processes to be clustered together and adds an additional level of structure to the hierarchical Dirichlet process. I also develop a new hierarchical extension to the hybrid NDP-HDP, which I develop into the grouped author-topic model for the entity disambiguation task. The grouped author-topic model uses clusters to model the co-occurrence of entities in documents, which can be interpreted as research groups. Since this model does not require entities to be linked to specific words in a document, it overcomes the problems of some existing author-topic models. The model incorporates a new method for modelling name variants, so that domain-specific name variant models can be used. Lastly, I develop extensions to supervised latent Dirichlet allocation (LDA), a type of supervised topic model. The keyword-supervised LDA model predicts document responses more accurately by directly modelling the effect of individual words and their contexts. The supervised HDP model gains model flexibility by using Bayesian nonparametrics for supervised learning. These models are evaluated on a number of classification and regression problems, and the results show that they outperform existing supervised topic modelling approaches. The models can also be extended to use information similar to that of the previous models, incorporating additional information such as entities and document titles to improve prediction.
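    For readers unfamiliar with the building block behind these models, here is a minimal, illustrative sketch (not one of the thesis' models) of sampling cluster assignments from a Dirichlet process prior via the Chinese restaurant process, the mechanism that lets the number of clusters (e.g. distinct authors) grow with the data rather than being fixed in advance:

```python
import random

def crp_assignments(n_points, alpha=1.0, seed=0):
    """Sample cluster assignments from a Chinese restaurant process,
    the predictive distribution of a Dirichlet process: point i joins
    an existing cluster with probability proportional to its size, or
    opens a new cluster with probability proportional to alpha."""
    rng = random.Random(seed)
    sizes = []          # sizes[k] = number of points in cluster k
    assignments = []
    for _ in range(n_points):
        weights = sizes + [alpha]  # existing clusters, then a new one
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)        # open a new cluster
        else:
            sizes[k] += 1
        assignments.append(k)
    return assignments

print(crp_assignments(10, alpha=1.0))  # e.g. [0, 0, 1, 0, 2, ...]
```

    A larger alpha yields more clusters on average; the full models in the thesis place such priors inside nested and hierarchical structures rather than using the plain process shown here.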