Semantic-aware Retrieval Standards based on Dirichlet Compound Model to Rank Notifications by Level of Urgency
There is a growing number of notifications generated from a wide range of sources. However, to our knowledge, there is no well-known, generalizable standard for detecting the most urgent notifications. Establishing reusable standards is crucial for applications in which the recommendation (notification) is critical due to its level of urgency and sensitivity (e.g. the medical domain). To tackle this problem, this thesis aims to establish Information Retrieval (IR) standards for the notification (recommendation) task by taking semantic dimensions (terms, opinions, concepts and user interaction) into consideration. The technical research contributions of this thesis include, but are not limited to, the development of a semantic IR framework based on the Dirichlet Compound Model (DCM), namely FDCM; the extension of FDCM to the recommendation scenario (RFDCM); and novel opinion-aware ranking models. Transparency, explainability and generalizability are some of the benefits that a mathematically well-defined solution such as DCM offers. The FDCM framework is based on a robust aggregation parameter which effectively combines the semantic retrieval scores using Query Performance Predictors (QPPs). Our experimental results confirm the effectiveness of such an approach in recommender systems and semantic retrieval. One of the main findings of this thesis is that the concept-based extension of FDCM (terms + concepts) consistently outperformed both term-only and concept-only baselines on biomedical data. Moreover, we show that semantic IR is beneficial for collaborative filtering, and therefore could help data scientists develop hybrid, consolidated IR systems comprising content-based and collaborative filtering aspects of recommendation.
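The QPP-weighted aggregation of semantic retrieval scores described above can be illustrated with a toy sketch. Everything here is an assumption for illustration: the clarity-style predictor, the squashing step, and all statistics are invented stand-ins, not the thesis's actual FDCM implementation.

```python
# Hedged sketch: combine a term-based and a concept-based retrieval score
# with a weight derived from a query performance predictor (QPP).
# All names, values and formulas below are illustrative assumptions.
import math

def qpp_clarity(query_terms, collection_freq, doc_count):
    """Toy QPP: average inverse document frequency of the query terms
    (higher -> more discriminative, 'easier' query)."""
    idfs = [math.log(doc_count / (1 + collection_freq.get(t, 0)))
            for t in query_terms]
    return sum(idfs) / len(idfs)

def aggregate_scores(term_score, concept_score, alpha):
    """Linear aggregation of the two semantic dimensions,
    with alpha in (0, 1) derived from the QPP."""
    return alpha * term_score + (1 - alpha) * concept_score

# Example usage with made-up collection statistics.
collection_freq = {"aspirin": 120, "headache": 300}
alpha_raw = qpp_clarity(["aspirin", "headache"], collection_freq, 10_000)
alpha = 1 / (1 + math.exp(-(alpha_raw - 4)))   # squash into (0, 1)
score = aggregate_scores(term_score=2.1, concept_score=1.4, alpha=alpha)
```

The linear mix is only one plausible aggregation; the point is that the QPP, not a fixed constant, decides how much weight the term-based evidence receives for a given query.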
Identifying reusable knowledge in developer instant messaging communication
Context and background: Software engineering is a complex and knowledge-intensive
activity. Required knowledge (e.g., about technologies, frameworks, and design decisions)
changes fast and the knowledge needs of those who design, code, test and maintain
software constantly evolve. At the same time, software developers use a wide range of
processes, practices and tools through which they explicitly and implicitly 'produce' and
capture different types of knowledge.
Problem: Software developers use instant messaging tools (e.g., Slack, Microsoft
Teams and Gitter) to discuss development-related problems, share experiences and to
collaborate in projects. This communication takes place in chat rooms that accumulate
potentially relevant knowledge to be reused by other developers. Therefore, in this
research we analyze whether there is reusable knowledge in developer instant messaging
communication by exploring (a) which instant messaging platforms can be a source
of reusable knowledge, and (b) software engineering themes that represent the main
discussions of developers in instant messaging communication. We also analyze how
this reusable knowledge can be identified with the use of topic modeling (a natural
language processing technique to discover abstract topics in text) by (c) surveying the
literature on how topic modeling has been applied in software engineering research, and
(d) evaluating how topic models perform with developer instant messages.
Method: First, we conducted a Field Study through an exploratory case study and a
reflexive thematic analysis to check whether there is reusable knowledge in developer
instant messaging communication, and if so, what this knowledge (main themes discussed)
is. Then, we conducted a Sample Study to explore how reusable knowledge in
developer instant messaging communication can be identified. In this study, we applied
a literature survey and software repository mining (i.e., short-text topic modeling).
Findings and contributions: We (a) developed a comparison framework for instant
messaging tools, (b) identified a map of the main themes discussed in chat rooms of an
instant messaging tool (Gitter, a platform used by software developers), (c) provided a
comprehensive literature review that offers insights and references on the use of topic
modeling in software engineering, and (d) provided an evaluation of the performance of
topic models applied to developer instant messages based on topic coherence metrics
and human judgment of topic quality.
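One family of topic coherence metrics mentioned above can be sketched in a few lines. The UMass-style measure below scores a topic by how often its top words co-occur in the same documents; the documents and topic words are invented for illustration and are not the study's data.

```python
# Hedged sketch of a UMass-style topic coherence metric: sum of log
# conditional co-occurrence ratios over word pairs in a topic.
# Higher (closer to 0) means the topic's words co-occur more often.
import math
from itertools import combinations

def umass_coherence(topic_words, documents, eps=1.0):
    doc_sets = [set(d) for d in documents]
    score = 0.0
    for wi, wj in combinations(topic_words, 2):
        d_j = sum(1 for d in doc_sets if wj in d)           # docs with wj
        d_ij = sum(1 for d in doc_sets if wi in d and wj in d)  # co-occurrence
        score += math.log((d_ij + eps) / d_j)               # smoothed ratio
    return score

# Toy developer-chat "documents" (bags of words).
docs = [["git", "merge", "branch"], ["git", "push", "remote"],
        ["docker", "image", "container"], ["git", "branch", "remote"]]
coherent = umass_coherence(["git", "branch"], docs)
incoherent = umass_coherence(["git", "container"], docs)
```

In practice such automatic scores are compared against human judgment, which is exactly the kind of evaluation the study above reports.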
Inventing Intelligence: On the History of Complex Information Processing and Artificial Intelligence in the United States in the Mid-Twentieth Century
In the mid-1950s, researchers in the United States melded formal theories of problem solving and intelligence with another powerful new tool for control: the electronic digital computer. Several branches of western mathematical science emerged from this nexus, including computer science (1960s–), data science (1990s–) and artificial intelligence (AI). This thesis offers an account of the origins and politics of AI in the mid-twentieth century United States, which focuses on its imbrications in systems of societal control. In an effort to denaturalize the power relations upon which the field came into being, I situate AI's canonical origin story in relation to the structural and intellectual priorities of the U.S. military and American industry during the Cold War, circa 1952 to 1961.
This thesis offers a detailed and comparative account of the early careers, research interests, and key outputs of four researchers often credited with laying the foundations for AI and machine learning: Herbert A. Simon, Frank Rosenblatt, John McCarthy and Marvin Minsky. It chronicles the distinct ways in which each sought to formalise and simulate human mental behaviour using digital electronic computers. Rather than assess their contributions as discontinuous with what came before, as in mythologies of AI's genesis, I establish continuities with, and borrowings from, management science and operations research (Simon), Hayekian economics and instrumentalist statistics (Rosenblatt), automatic coding techniques and pedagogy (McCarthy), and cybernetics (Minsky), along with the broadscale mobilization of Cold War-era civilian-led military science generally.
I assess how Minsky's 1961 paper 'Steps Toward Artificial Intelligence' simultaneously consolidated and obscured these entanglements as it set in motion an initial research agenda for AI in the following two decades. I argue that mind-computer metaphors, and research in complex information processing generally, played an important role in normalizing the small- and large-scale structuring of social behaviour using mathematics in the United States from the second half of the twentieth century onward.
Text generation for small data regimes
In Natural Language Processing (NLP), models trained on downstream text classification tasks usually require enormous amounts of data to perform well. Neural network (NN) models in particular keep improving as more data become available: deep NNs are known to be data-hungry, so a huge factor in improving results is the ability to scale over large datasets. For a classification model to perform well, it may require thousands or even millions of textual training examples. Transfer learning enables us to leverage knowledge gained from general data collections to perform well on target tasks. In NLP, language models trained on large data collections have been shown to achieve great results when tuned to different task-specific datasets (Wang et al., 2018a, 2019). However, even with transfer learning, adequate training data remains a prerequisite for training machine learning models. Nonetheless, we show that small textual datasets can be augmented to a degree sufficient to achieve improved classification performance. In this thesis, we make multiple contributions to data augmentation. Firstly, we transform the data generation task into an optimization problem which maximizes the usefulness of the generated output, using Monte Carlo Tree Search (MCTS) as the optimization strategy and incorporating entropy as one of the optimization criteria. Secondly, we propose a language generation approach for targeted data generation with the participation of the training classifier. With a user in the loop, we find that manual annotation of a small proportion of the generated data is enough to boost classification performance. Thirdly, under a self-learning scheme, we replace the user with an automated approach in which the classifier is trained on its own pseudo-labels. Finally, we extend the data generation approach to the knowledge distillation domain, by generating samples that a teacher model can confidently label but its student cannot.
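The self-learning scheme mentioned above, where a classifier is retrained on its own confident pseudo-labels, can be sketched with a toy model. The nearest-centroid "classifier", the 1-D features, and the confidence threshold below are all illustrative stand-ins, not the thesis's actual generators or classifiers.

```python
# Hedged sketch of pseudo-label self-training: predict labels for
# unlabelled (e.g. generated) samples, keep only confident predictions,
# and retrain on the union. All data and thresholds are invented.
def fit_centroids(samples):
    """samples: list of (x, label). Returns {label: mean of x}."""
    sums, counts = {}, {}
    for x, y in samples:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Nearest-centroid label plus a crude distance-margin confidence."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(x - centroids[runner_up]) - abs(x - centroids[best])
    return best, margin

labelled = [(0.0, "neg"), (0.2, "neg"), (1.0, "pos"), (1.2, "pos")]
generated = [0.1, 1.1, 0.55]          # unlabelled "generated" samples

centroids = fit_centroids(labelled)
pseudo = [(x, y) for x in generated
          for y, m in [predict(centroids, x)] if m > 0.3]  # confident only
centroids = fit_centroids(labelled + pseudo)  # retrain with pseudo-labels
```

The ambiguous sample (0.55) falls below the margin threshold and is discarded, which is the essential filtering step that keeps self-training from amplifying its own mistakes.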
Bayesian nonparametric models for name disambiguation and supervised learning
This thesis presents new Bayesian nonparametric models and approaches for their development,
for the problems of name disambiguation and supervised learning. Bayesian
nonparametric methods form an increasingly popular approach for solving problems
that demand a high amount of model flexibility. However, this field is relatively new,
and there are many areas that need further investigation. Previous work on Bayesian
nonparametrics has neither fully explored the problems of entity disambiguation and
supervised learning nor the advantages of nested hierarchical models. Entity disambiguation
is a widely encountered problem where different references need to be linked
to a real underlying entity. This problem is often unsupervised as there is no previously
known information about the entities. Further to this, effective use of Bayesian
nonparametrics offers a new approach to tackling supervised problems, which are frequently
encountered.
The main original contribution of this thesis is a set of new structured Dirichlet process
mixture models for name disambiguation and supervised learning that can also
have a wide range of applications. These models use techniques from Bayesian statistics,
including hierarchical and nested Dirichlet processes, generalised linear models,
Markov chain Monte Carlo methods and optimisation techniques such as BFGS. The
new models have tangible advantages over existing methods in the field as shown with
experiments on real-world datasets including citation databases and classification and
regression datasets.
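The Dirichlet process mixtures underlying these models share one key property: the number of clusters is not fixed in advance but grows with the data. A minimal Chinese restaurant process (CRP) sketch illustrates this; the concentration parameter and sample size below are arbitrary, and real inference (as in the thesis) would use MCMC over such assignments, not a single forward draw.

```python
# Hedged sketch of the Chinese restaurant process view of a Dirichlet
# process: item i joins an existing cluster with probability
# proportional to its size, or a new cluster with probability
# proportional to alpha. Parameters are illustrative.
import random

def crp_assignments(n, alpha, rng):
    counts = []                       # customers per table (cluster sizes)
    labels = []                       # table index per customer
    for i in range(n):
        weights = counts + [alpha]    # existing tables + one new table
        r = rng.random() * (i + alpha)
        acc, k = 0.0, 0
        for k, w in enumerate(weights):
            acc += w
            if r < acc:
                break
        if k == len(counts):          # a new table was opened
            counts.append(0)
        counts[k] += 1
        labels.append(k)
    return labels, counts

rng = random.Random(0)
labels, counts = crp_assignments(100, alpha=1.0, rng=rng)
```

The rich-get-richer weighting is what makes a handful of large clusters emerge without specifying their number, the property the disambiguation models above exploit when the true number of authors is unknown.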
I develop the unsupervised author-topic space model for author disambiguation that
uses free-text to perform disambiguation unlike traditional author disambiguation approaches.
The model incorporates a name variant model that is based on a nonparametric
Dirichlet language model. The model handles novel, unseen name variants and
can model the unknown authors of the documents' text. Through this, the model
can disambiguate authors with no prior knowledge of the number of true authors in the
dataset. In addition, it can do this when the authors have identical names.
I use a model for nesting Dirichlet processes named the hybrid NDP-HDP. This
model allows Dirichlet processes to be clustered together and adds an additional level of
structure to the hierarchical Dirichlet process. I also develop a new hierarchical extension
to the hybrid NDP-HDP. I develop this model into the grouped author-topic model
for the entity disambiguation task. The grouped author-topic model uses clusters to model the co-occurrence of entities in documents, which can be interpreted as research
groups. Since this model does not require entities to be linked to specific words in a
document, it overcomes the problems of some existing author-topic models. The model
incorporates a new method for modelling name variants, so that domain-specific name
variant models can be used.
Lastly, I develop extensions to supervised latent Dirichlet allocation, a type of supervised
topic model. The keyword-supervised LDA model predicts document responses
more accurately by modelling the effect of individual words and their contexts directly.
The supervised HDP model has more model flexibility by using Bayesian nonparametrics
for supervised learning. These models are evaluated on a number of classification
and regression problems, and the results show that they outperform existing supervised
topic modelling approaches. The models can also be extended to use information similar
to that used by the previous models, incorporating additional information such as entities
and document titles to improve prediction.
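The prediction step of a supervised LDA-style model can be sketched compactly: in standard sLDA the document response is a linear function of the empirical topic proportions. The topic assignments and regression coefficients below are invented for illustration; the thesis's keyword-supervised and HDP variants refine this basic form rather than follow it exactly.

```python
# Hedged sketch of sLDA-style response prediction: y_hat = eta . z_bar,
# where z_bar is the document's empirical topic-proportion vector.
# All values are illustrative.
def empirical_topic_proportions(z, num_topics):
    """z: per-word topic assignments. Returns the proportion vector z_bar."""
    props = [0.0] * num_topics
    for t in z:
        props[t] += 1.0 / len(z)
    return props

def predict_response(z, eta):
    z_bar = empirical_topic_proportions(z, len(eta))
    return sum(e * p for e, p in zip(eta, z_bar))

z = [0, 0, 1, 2, 1, 0]        # topic of each word in one document
eta = [1.0, -0.5, 2.0]        # per-topic regression coefficients
y_hat = predict_response(z, eta)
```

Here half the words sit in topic 0, so that topic's coefficient dominates the prediction; a keyword-supervised variant would additionally let individual words and their contexts shift the response.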