4,980 research outputs found

    Using Deep Learning for Title-Based Semantic Subject Indexing to Reach Competitive Performance to Full-Text

    For (semi-)automated subject indexing systems in digital libraries, it is often more practical to use metadata such as the title of a publication instead of the full-text or the abstract. Therefore, it is desirable to have good text mining and text classification algorithms that already perform well on the title of a publication alone. So far, the classification performance on titles is not competitive with the performance on full-texts if the same number of training samples is used. However, it is much easier to obtain title data in large quantities and to use it for training than full-text data. In this paper, we investigate how models trained on increasing amounts of title data compare to models trained on a constant number of full-texts. We evaluate this question on two large-scale datasets, one from the medical domain (PubMed) and one from economics (EconBiz). In these datasets, the titles and annotations of millions of publications are available, and they outnumber the available full-texts by a factor of 20 and 15, respectively. To exploit these large amounts of data to their full potential, we develop three strong deep learning classifiers and evaluate their performance on the two datasets. The results are promising. On the EconBiz dataset, all three classifiers outperform their full-text counterparts by a large margin. The best title-based classifier outperforms the best full-text method by 9.4%. On the PubMed dataset, the best title-based method almost reaches the performance of the best full-text classifier, with a difference of only 2.9%.

    Large-scale fine-grained semantic indexing of biomedical literature based on weakly-supervised deep learning

    Semantic indexing of biomedical literature is usually done at the level of MeSH descriptors, representing topics of interest for the biomedical community. Several related but distinct biomedical concepts are often grouped together in a single coarse-grained descriptor and are treated as a single topic for semantic indexing. This study proposes a new method for the automated refinement of subject annotations at the level of concepts, investigating deep learning approaches. Lacking labelled data for this task, our method relies on weak supervision based on concept occurrence in the abstract of an article. The proposed approach is evaluated on an extended large-scale retrospective scenario, taking advantage of concepts that eventually become MeSH descriptors, for which annotations become available in MEDLINE/PubMed. The results suggest that concept occurrence is a strong heuristic for automated subject annotation refinement and can be further enhanced when combined with dictionary-based heuristics. In addition, such heuristics can be useful as weak supervision for developing deep learning models that can achieve further improvement in some cases. Comment: 48 pages, 5 figures, 9 tables, 1 algorithm

    Recommendations for item set completion: On the semantics of item co-occurrence with data sparsity, input size, and input modalities

    We address the problem of recommending relevant items to a user in order to "complete" a partial set of items already known. We consider the two scenarios of citation and subject label recommendation, which resemble different semantics of item co-occurrence: relatedness for co-citations and diversity for subject labels. We assess the influence of the completeness of an already known partial item set on the recommender performance. We also investigate data sparsity through a pruning parameter and the influence of using additional metadata. As recommender models, we focus on different autoencoders, which are particularly suited for reconstructing missing items in a set. We extend autoencoders to exploit a multi-modal input of text and structured data. Our experiments on six real-world datasets show that supplying the partial item set as input is helpful when item co-occurrence resembles relatedness, while metadata are effective when co-occurrence implies diversity. This outcome means that the semantics of item co-occurrence is an important factor. The simple item co-occurrence model is a strong baseline for citation recommendation. However, autoencoders have the advantage that they can exploit additional metadata besides the partial item set as input while achieving comparable performance. For the subject label recommendation task, the title is the most important attribute. Adding more input modalities sometimes even harms the result. In conclusion, it is crucial to consider the semantics of the item co-occurrence when choosing an appropriate recommendation model and to decide carefully which metadata to exploit.

    Overview of BioASQ 2021-MESINESP track. Evaluation of advance hierarchical classification techniques for scientific literature, patents and clinical trials

    CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania. There is a pressing need to exploit recent advances in natural language processing technologies, in particular language models and deep learning approaches, to enable improved retrieval, classification and ultimately access to information contained in multiple, heterogeneous types of documents. This is particularly true for the field of biomedicine and clinical research, where medical experts and scientists need to carry out complex search queries against a variety of document collections, including literature, patents, clinical trials or other kinds of content such as EHRs. Indexing documents with structured controlled vocabularies used for semantic search engines and query expansion purposes is a critical task for enabling sophisticated user queries and even cross-language retrieval. Due to the complexity of the medical domain and the use of very large hierarchical indexing terminologies, implementing efficient automatic systems to aid manual indexing is extremely difficult. This paper provides a summary of the MESINESP task results on medical semantic indexing in Spanish (BioASQ/CLEF 2021 Challenge). MESINESP was carried out in direct collaboration with literature content databases and medical indexing experts using the DeCS vocabulary, a resource similar to MeSH terms. Seven participating teams used advanced technologies including extreme multilabel classification and deep language models to solve this challenge, which can be viewed as a multi-label classification problem. As MESINESP resources, we have released a Gold Standard collection of 243,000 documents with a total of 2179 manual annotations divided into train, development and test subsets covering literature, patents as well as clinical trial summaries, under a cross-genre training and data labeling scenario. Manual indexing of the evaluation subsets was carried out by three independent experts using a specially developed indexing interface called ASIT. Additionally, we have published a collection of large-scale automatic semantic annotations based on NER systems of these documents with mentions of drugs/medications (170,000), symptoms (137,000), diseases (840,000) and clinical procedures (415,000). In addition to a summary of the technologies used by the teams, this paper …

    A visual analytics platform for competitive intelligence

    Silva, D., & Bação, F. (2023). MapIntel: A visual analytics platform for competitive intelligence. Expert Systems, [e13445]. https://doi.org/https://www.authorea.com/doi/full/10.22541/au.166785335.50477185, https://doi.org/10.1111/exsy.13445. Funding information: This work was supported by the Fundação para a Ciência e Tecnologia of Ministério da Ciência e Tecnologia e Ensino Superior (research grant under the DSAIPA/DS/0116/2019 project). Competitive Intelligence allows an organization to keep up with market trends and foresee business opportunities. This practice is mainly performed by analysts scanning for any piece of valuable information in a myriad of dispersed and unstructured sources. Here we present MapIntel, a system for acquiring intelligence from vast collections of text data by representing each document as a multidimensional vector that captures its own semantics. The system is designed to handle complex Natural Language queries and visual exploration of the corpus, potentially aiding overburdened analysts in finding meaningful insights to help decision-making. The system's searching module uses a retriever and re-ranker engine that first finds the closest neighbours to the query embedding and then sifts the results through a cross-encoder model that identifies the most relevant documents. The browsing or visualization module also leverages the embeddings by projecting them onto two dimensions while preserving the multidimensional landscape, resulting in a map where semantically related documents form topical clusters, which we capture using topic modelling. This map aims at promoting a fast overview of the corpus while allowing a more detailed exploration and an interactive information encountering process. We evaluate the system and its components on the 20 newsgroups data set, using the semantic document labels provided, and demonstrate the superiority of Transformer-based components. Finally, we present a prototype of the system in Python and show how some of its features can be used to acquire intelligence from a news article corpus we collected during a period of 8 months.
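    The two-stage search described in this abstract (nearest neighbours of the query embedding first, then re-scoring of the shortlisted candidates) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the MapIntel implementation: the two-dimensional vectors are made-up toy embeddings, and `overlap_score` is a hypothetical token-overlap stand-in for the cross-encoder model, which the abstract does not specify.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec: Sequence[float],
             doc_vecs: Dict[str, Sequence[float]],
             k: int) -> List[str]:
    """Stage 1: return the ids of the k documents closest to the query embedding."""
    ranked = sorted(doc_vecs,
                    key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
                    reverse=True)
    return ranked[:k]


def rerank(query: str,
           candidates: List[str],
           texts: Dict[str, str],
           score_pair: Callable[[str, str], float]) -> List[str]:
    """Stage 2: reorder the shortlist with a (query, document) pair scorer."""
    return sorted(candidates,
                  key=lambda doc_id: score_pair(query, texts[doc_id]),
                  reverse=True)


def overlap_score(query: str, text: str) -> float:
    """Hypothetical stand-in for a cross-encoder: fraction of query tokens in the text."""
    q_tokens = set(query.lower().split())
    t_tokens = set(text.lower().split())
    return len(q_tokens & t_tokens) / len(q_tokens) if q_tokens else 0.0


# Toy corpus and made-up embeddings (assumptions for illustration only).
texts = {
    "a": "central bank raises interest rates",
    "b": "bank of the river floods",
    "c": "football season opens",
}
doc_vecs = {"a": [0.9, 0.1], "b": [1.0, 0.2], "c": [0.0, 1.0]}
query = "bank interest rates"
query_vec = [1.0, 0.2]

shortlist = retrieve(query_vec, doc_vecs, k=2)               # embedding neighbours
final = rerank(query, shortlist, texts, overlap_score)       # pairwise re-scoring
print(shortlist, final)
```

    In this toy setup the embedding stage alone ranks document "b" first, and the pairwise scorer then moves "a" ahead of it, which is the kind of correction the second stage is meant to provide.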

    Fusion architectures for automatic subject indexing under concept drift: Analysis and empirical results on short texts

    Indexing documents with controlled vocabularies enables a wealth of semantic applications for digital libraries. Due to the rapid growth of scientific publications, machine learning-based methods are required that assign subject descriptors automatically. While stability of generative processes behind the underlying data is often assumed tacitly, it is being violated in practice. Addressing this problem, this article studies explicit and implicit concept drift, that is, settings with new descriptor terms and new types of documents, respectively. First, the existence of concept drift in automatic subject indexing is discussed in detail and demonstrated by example. Subsequently, architectures for automatic indexing are analyzed in this regard, highlighting individual strengths and weaknesses. The results of the theoretical analysis justify research on fusion of different indexing approaches with special consideration on information sharing among descriptors. Experimental results on titles and author keywords in the domain of economics underline the relevance of the fusion methodology, especially under concept drift. Fusion approaches outperformed non-fusion strategies on the tested data sets, which comprised shifts in priors of descriptors as well as covariates. These findings can help researchers and practitioners in digital libraries to choose appropriate methods for automatic subject indexing, as is finally shown by a recent case study

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Based on the information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS Think-tanks and workshops, this document reports on the state of the art related to multimedia content search from a technical and a socio-economic perspective. The technical perspective includes an up-to-date view on content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives to measure the performance of multimedia search engines. From a socio-economic perspective, we take inventory of the impact and legal consequences of these technical advances and point out future directions of research.