637 research outputs found

    Information retrieval and machine learning methods for academic expert finding

    Get PDF
    In the context of academic expert finding, this paper investigates and compares the performance of information retrieval (IR) and machine learning (ML) methods, including deep learning, to approach the problem of identifying academic figures who are experts in different domains when a potential user requests their expertise. IR-based methods construct multifaceted textual profiles for each expert by clustering information from their scientific publications. Several methods fully tailored for this problem are presented in this paper. In contrast, ML-based methods treat expert finding as a classification task, training automatic text classifiers using publications authored by experts. By comparing these approaches, we contribute to a deeper understanding of academic-expert-finding techniques and their applicability in knowledge discovery. These methods are tested with two large datasets from the biomedical field: PMSC-UGR and CORD-19. The results show how IR techniques were, in general, more robust with both datasets and more suitable than the ML-based ones, with some exceptions showing good performance.Agencia Estatal de Investigación | Ref. PID2019-106758GB-C31Agencia Estatal de Investigación | Ref. PID2020-113230RB-C22FEDER/Junta de Andalucía | Ref. A-TIC-146-UGR2

    Understanding comparative questions and retrieving argumentative answers

    Get PDF
    Making decisions is an integral part of everyday life, yet it can be a difficult and complex process. While peoples’ wants and needs are unlimited, resources are often scarce, making it necessary to research the possible alternatives and weigh the pros and cons before making a decision. Nowadays, the Internet has become the main source of information when it comes to comparing alternatives, making search engines the primary means for collecting new information. However, relying only on term matching is not sufficient to adequately address requests for comparisons. Therefore, search systems should go beyond this approach to effectively address comparative information needs. In this dissertation, I explore from different perspectives how search systems can respond to comparative questions. First, I examine approaches to identifying comparative questions and study their underlying information needs. Second, I investigate a methodology to identify important constituents of comparative questions like the to-be-compared options and to detect the stance of answers towards these comparison options. Then, I address ambiguous comparative search queries by studying an interactive clarification search interface. And finally, addressing answering comparative questions, I investigate retrieval approaches that consider not only the topical relevance of potential answers but also account for the presence of arguments towards the comparison options mentioned in the questions. By addressing these facets, I aim to provide a comprehensive understanding of how to effectively satisfy the information needs of searchers seeking to compare different alternatives

    Entity Linking in Low-Annotation Data Settings

    Get PDF
    Recent advances in natural language processing have focused on applying and adapting large pretrained language models to specific tasks. These models, such as BERT (Devlin et al., 2019) and BART (Lewis et al., 2020a), are pretrained on massive amounts of unlabeled text across a variety of domains. The impact of these pretrained models is visible in the task of entity linking, where a mention of an entity in unstructured text is matched to the relevant entry in a knowledge base. State-of-the-art linkers, such as Wu et al. (2020) and De Cao et al. (2021), leverage pretrained models as a foundation for their systems. However, these models are also trained on large amounts of annotated data, which is crucial to their performance. Often these large datasets consist of domains that are easily annotated, such as Wikipedia or newswire text. However, tailoring NLP tools to a narrow variety of textual domains severely restricts their use in the real world. Many other domains, such as medicine or law, do not have large amounts of entity linking annotations available. Entity linking, which serves to bridge the gap between massive unstructured amounts of text and structured repositories of knowledge, is equally crucial in these domains. Yet tools trained on newswire or Wikipedia annotations are unlikely to be well-suited for identifying medical conditions mentioned in clinical notes. As most annotation efforts focus on English, similar challenges can be noted in building systems for non-English text. There is often a relatively small amount of annotated data in these domains. With this being the case, looking to other types of domain-specific data, such as unannotated text or highly-curated structured knowledge bases, is often required. In these settings, it is crucial to translate lessons taken from tools tailored for high-annotation domains into algorithms that are suited for low-annotation domains. This requires both leveraging broader types of data and understanding the unique challenges present in each domain

    Exploring the State of the Art in Legal QA Systems

    Full text link
    Answering questions related to the legal domain is a complex task, primarily due to the intricate nature and diverse range of legal document systems. Providing an accurate answer to a legal query typically necessitates specialized knowledge in the relevant domain, which makes this task all the more challenging, even for human experts. QA (Question answering systems) are designed to generate answers to questions asked in human languages. They use natural language processing to understand questions and search through information to find relevant answers. QA has various practical applications, including customer service, education, research, and cross-lingual communication. However, they face challenges such as improving natural language understanding and handling complex and ambiguous questions. Answering questions related to the legal domain is a complex task, primarily due to the intricate nature and diverse range of legal document systems. Providing an accurate answer to a legal query typically necessitates specialized knowledge in the relevant domain, which makes this task all the more challenging, even for human experts. At this time, there is a lack of surveys that discuss legal question answering. To address this problem, we provide a comprehensive survey that reviews 14 benchmark datasets for question-answering in the legal field as well as presents a comprehensive review of the state-of-the-art Legal Question Answering deep learning models. We cover the different architectures and techniques used in these studies and the performance and limitations of these models. Moreover, we have established a public GitHub repository where we regularly upload the most recent articles, open data, and source code. The repository is available at: \url{https://github.com/abdoelsayed2016/Legal-Question-Answering-Review}

    Machine Learning Algorithm for the Scansion of Old Saxon Poetry

    Get PDF
    Several scholars designed tools to perform the automatic scansion of poetry in many languages, but none of these tools deal with Old Saxon or Old English. This project aims to be a first attempt to create a tool for these languages. We implemented a Bidirectional Long Short-Term Memory (BiLSTM) model to perform the automatic scansion of Old Saxon and Old English poems. Since this model uses supervised learning, we manually annotated the Heliand manuscript, and we used the resulting corpus as labeled dataset to train the model. The evaluation of the performance of the algorithm reached a 97% for the accuracy and a 99% of weighted average for precision, recall and F1 Score. In addition, we tested the model with some verses from the Old Saxon Genesis and some from The Battle of Brunanburh, and we observed that the model predicted almost all Old Saxon metrical patterns correctly misclassified the majority of the Old English input verses

    Continuous Rationale Management

    Get PDF
    Continuous Software Engineering (CSE) is a software life cycle model open to frequent changes in requirements or technology. During CSE, software developers continuously make decisions on the requirements and design of the software or the development process. They establish essential decision knowledge, which they need to document and share so that it supports the evolution and changes of the software. The management of decision knowledge is called rationale management. Rationale management provides an opportunity to support the change process during CSE. However, rationale management is not well integrated into CSE. The overall goal of this dissertation is to provide workflows and tool support for continuous rationale management. The dissertation contributes an interview study with practitioners from the industry, which investigates rationale management problems, current practices, and features to support continuous rationale management beneficial for practitioners. Problems of rationale management in practice are threefold: First, documenting decision knowledge is intrusive in the development process and an additional effort. Second, the high amount of distributed decision knowledge documentation is difficult to access and use. Third, the documented knowledge can be of low quality, e.g., outdated, which impedes its use. The dissertation contributes a systematic mapping study on recommendation and classification approaches to treat the rationale management problems. The major contribution of this dissertation is a validated approach for continuous rationale management consisting of the ConRat life cycle model extension and the comprehensive ConDec tool support. To reduce intrusiveness and additional effort, ConRat integrates rationale management activities into existing workflows, such as requirements elicitation, development, and meetings. ConDec integrates into standard development tools instead of providing a separate tool. ConDec enables lightweight capturing and use of decision knowledge from various artifacts and reduces the developers' effort through automatic text classification, recommendation, and nudging mechanisms for rationale management. To enable access and use of distributed decision knowledge documentation, ConRat defines a knowledge model of decision knowledge and other artifacts. ConDec instantiates the model as a knowledge graph and offers interactive knowledge views with useful tailoring, e.g., transitive linking. To operationalize high quality, ConRat introduces the rationale backlog, the definition of done for knowledge documentation, and metrics for intra-rationale completeness and decision coverage of requirements and code. ConDec implements these agile concepts for rationale management and a knowledge dashboard. ConDec also supports consistent changes through change impact analysis. The dissertation shows the feasibility, effectiveness, and user acceptance of ConRat and ConDec in six case study projects in an industrial setting. Besides, it comprehensively analyses the rationale documentation created in the projects. The validation indicates that ConRat and ConDec benefit CSE projects. Based on the dissertation, continuous rationale management should become a standard part of CSE, like automated testing or continuous integration

    Performance Challenges with Data Visualizations in Browser Environment

    Get PDF
    Information exists in many forms, from text, to equations, videos, audio, and graphical mediums. With graphical or visual mediums, it is becoming easier to absorb information where the alternatives are textual descriptions. Graphs are important vehicles of transporting information. In order to create a good graph, certain attributes need to be taken into account, such as which variables are being displayed over which axis, visual elements, and their sizes are also important to consider. In modern times with the internet and the amount of data being generated, how can all this data be fitted into a single graph? That question is the motivation for this thesis. Presenting large data in visualizations involves a great deal of thought, effort, and ingenuity on how to proceed with what information to convey. There are times when obtaining data for such visualization come with their own challenges. This thesis investigates the obstacles facing an internal tool within a company in regard to their data retrieval method. As well as the objective to research an efficient and easy-to-use method for presenting large data on a webpage

    Semantic-aware Retrieval Standards based on Dirichlet Compound Model to Rank Notifications by Level of Urgency

    Get PDF
    There is a growing number of notifications generated from a wide range of sources. However, to our knowledge, there is no well-known generalizable standard for detecting the most urgent notifications. Establishing reusable standards is crucial for applications in which the recommendation (notification) is critical due to the level of urgency and sensitivity (e.g. medical domain). To tackle this problem, this thesis aims to establish Information Retrieval (IR) standards for notification (recommendation) task by taking semantic dimensions (terms, opinions, concepts and user interaction) into consideration. The technical research contributions of this thesis include but not limited to the development of a semantic IR framework based on Dirichlet Compound Model (DCM); namely FDCM, extending FDCM to the recommendation scenario (RFDCM) and proposing novel opinion-aware ranking models. Transparency, explainability and generalizability are some benefits that the use of a mathematically well-defined solution such as DCM offers. The FDCM framework is based on a robust aggregation parameter which effectively combines the semantic retrieval scores using Query Performance Predictors (QPPs). Our experimental results confirm the effectiveness of such approach in recommendation systems and semantic retrieval. One of the main findings of this thesis is that the concept-based extension (term-only + concept-only) of FDCM consistently outperformed both terms-only and concept-only baselines concerning biomedical data. Moreover, we show that semantic IR is beneficial for collaborative filtering and therefore it could help data scientists to develop hybrid and consolidated IR systems comprising content-based and collaborative filtering aspects of recommendation

    Intelligent Software Tooling For Improving Software Development

    Get PDF
    Software has eaten the world with many of the necessities and quality of life services people use requiring software. Therefore, tools that improve the software development experience can have a significant impact on the world such as generating code and test cases, detecting bugs, question and answering, etc. The success of Deep Learning (DL) over the past decade has shown huge advancements in automation across many domains, including Software Development processes. One of the main reasons behind this success is the availability of large datasets such as open-source code available through GitHub or image datasets of mobile Graphical User Interfaces (GUIs) with RICO and ReDRAW to be trained on. Therefore, the central research question my dissertation explores is: In what ways can the software development process be improved through leveraging DL techniques on the vast amounts of unstructured software engineering artifacts? We coin the approaches that leverage DL to automate or augment various software development task as Intelligent Software Tools. To guide our research of these intelligent software tools, we performed a systematic literature review to understand the current landscape of research on applying DL techniques to software tasks and any gaps that exist. From this literature review, we found code generation to be one of the most studied tasks with other tasks and artifacts such as impact analysis or tasks involving images and videos to be understudied. Therefore, we set out to explore the application of DL to these understudied tasks and artifacts as well as the limitations of DL models under the well studied task code completion, a subfield in code generation. Specifically, we developed a tool for automatically detecting duplicate mobile bug reports from user submitted videos. We used the popular Convolutional Neural Network (CNN) to learn important features from a large collection of mobile screenshots. Using this model, we could then compute similarity between a newly submitted bug report and existing ones to produce a ranked list of duplicate candidates that can be reviewed by a developer. Next, we explored impact analysis, a critical software maintenance task that identifies potential adverse effects of a given code change on the larger software system. To this end, we created Athena, a novel approach to impact analysis that integrates knowledge of a software system through its call-graph along with high-level representations of the code inside the system to improve impact analysis performance. Lastly, we explored the task of code completion, which has seen heavy interest from industry and academia. Specifically, we explored various methods that modify the positional encoding scheme of the Transformer architecture for allowing these models to incorporate longer sequences of tokens when predicting completions than seen during their training as this can significantly improve training times
    • …
    corecore