9 research outputs found

    Automated Patent Categorization and Guided Patent Search using IPC as Inspired by MeSH and PubMed

    The patent domain is a very important source of scientific information that is currently not used to its full potential. Searching for relevant patents is a complex task because the number of existing patents is very high and grows quickly, patent text is extremely complicated, and standard vocabulary is not used consistently or doesn’t even exist. As a consequence, pure keyword searches often fail to return satisfying results in the patent domain. Major companies employ patent professionals who are able to search patents effectively, but even they have to invest a lot of time and effort into their search. Academic scientists on the other hand do not have access to such resources and therefore often do not search patents at all, but they risk missing up-to-date information that will not be published in scientific publications until much later, if it is published at all. Document search on PubMed, the pre-eminent database for biomedical literature, relies on the annotation of its documents with relevant terms from the Medical Subject Headings ontology (MeSH) for improving recall through query expansion. Similarly, professional patent searches expand beyond keywords by including class codes from various patent classification systems. However, classification-based searches can only be performed effectively if the user has very detailed knowledge of the system, which is usually not the case for academic scientists. Consequently, we investigated methods to automatically identify relevant classes that can then be suggested to the user to expand their query. Since every patent is assigned at least one class code, it should be possible for these assignments to be used in a similar way as the MeSH annotations in PubMed. In order to develop a system for this task, it is necessary to have a good understanding of the properties of both classification systems. 
To gain such knowledge, we perform an in-depth comparative analysis of MeSH and the main patent classification system, the International Patent Classification (IPC). We investigate the hierarchical structures as well as the properties of the terms and classes, and we compare the assignment of IPC codes to patents with the annotation of PubMed documents with MeSH terms. Our analysis shows that the hierarchies are structurally similar, but terms and annotations differ significantly. The most important differences are the considerably higher complexity of the IPC class definitions compared to MeSH terms and the far lower number of classes assigned to the average patent compared to the number of MeSH terms assigned to PubMed documents. These differences cause problems for both inexperienced patent searchers and professionals. On the one hand, the complex term system makes it very difficult for members of the former group to find any IPC classes that are relevant to their search task. On the other hand, the low number of IPC classes per patent points to incomplete class assignments by the patent office, which limits the recall of the classification-based searches that are frequently performed by the latter group. We approach these problems from two directions: first, by automatically assigning additional patent classes to make up for the missing assignments, and second, by automatically retrieving relevant keywords and classes that are proposed to the user so they can expand their initial search. For the automated assignment of additional patent classes, we adapt to the patent domain an approach that was successfully used for the assignment of MeSH terms to PubMed abstracts. Each document is assigned a set of IPC classes by a large set of binary Maximum-Entropy classifiers. 
Our evaluation shows good performance by individual classifiers (precision/recall between 0.84 and 0.90), making the retrieval of additional relevant documents for specific IPC classes feasible. The assignment of additional classes to specific documents is more problematic, since the precision of our classifiers is not high enough to avoid false positives. However, we propose filtering methods that can help solve this problem. For the guided patent search, we demonstrate various methods to expand a user’s initial query. Our methods use both keywords and class codes that the user enters to retrieve additional relevant keywords and classes that are then suggested to the user. These additional query components are extracted from different sources such as patent text, IPC definitions, external vocabularies and co-occurrence data. The suggested expansions can help inexperienced users refine their queries with relevant IPC classes, and professionals can compose their complete query faster and more easily. We also present GoPatents, a patent retrieval prototype that incorporates some of our proposals and makes faceted browsing of a patent corpus possible.
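
The abstract above lists co-occurrence data as one source of query expansions. A minimal sketch of that idea, assuming each patent is reduced to its set of IPC subclass codes, might look as follows. The toy assignments, the function names and the ranking rule are illustrative assumptions, not the method implemented in the thesis or in GoPatents.

```python
from collections import Counter
from itertools import combinations

# Toy corpus: each patent is represented only by its set of IPC subclass codes.
# These assignments are invented for illustration, not taken from real patents.
PATENT_CLASSES = [
    {"C12N", "A61K"},
    {"C12N", "C07K"},
    {"C12N", "A61K", "C07K"},
    {"G06F", "C12N"},
    {"G06F"},
]


def build_cooccurrence(patents):
    """Count how often each ordered pair of classes appears on the same patent."""
    pair_counts = Counter()
    for classes in patents:
        for a, b in combinations(sorted(classes), 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    return pair_counts


def suggest_classes(query_classes, patents, top_k=3):
    """Rank classes that co-occur with the query classes, most frequent first."""
    pair_counts = build_cooccurrence(patents)
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a in query_classes and b not in query_classes:
            scores[b] += n
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


suggestions = suggest_classes({"C12N"}, PATENT_CLASSES)
print(suggestions)
```

A user who starts from one class code would be shown the top-ranked co-occurring classes as candidate additions to the query. A real system would likely normalise the raw counts, since very common classes co-occur with almost everything.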

    Selected papers from the 15th Annual Bio-Ontologies special interest group meeting

    © 2013 Soldatova et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Over the past 15 years, the Bio-Ontologies SIG at ISMB has provided a forum for discussion of the latest and most innovative research in bio-ontology development, its applications to biomedicine and, more generally, the organisation, presentation and dissemination of knowledge in biomedicine and the life sciences. The seven papers and the commentary selected for this supplement span a wide range of topics, including web-based querying over multiple ontologies, integration of data, annotating patent records, NCBO Web services, ontology developments for probabilistic reasoning and for physiological processes, and analysis of the progress of annotation and structural GO changes.

    The Light and Shade of Knowledge Recombination: A Systematic Look at the Bioinformatics Patent Scenario

    This research focuses on a special case of General Purpose Technology: Bioinformatics. It explores whether, and to what extent, Bioinformatics inventions build upon inherently diverse knowledge sources. Specifically, it investigates the role of scientific and technological diversity (measured with the Shannon-Wiener diversity index) as a driver of impactful Bioinformatics inventions (measured at different standard deviations of the forward citations distribution). To this end, we analysed both Non-Patent and Patent references cited in Bioinformatics patented inventions in the period 1976-2014. Results from a series of logistic regression models indicate that different degrees of impact require different degrees of knowledge diversity; at the same time, and importantly for practitioners and scholars, recombining diverse scientific and technological knowledge bases does not always lead to impactful inventions. In other words, the interplay of science and technology is not always the best option for obtaining impactful inventions.
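
The Shannon-Wiener index mentioned above is H = -Σ p_i ln p_i, where p_i is the share of items falling into category i. A minimal sketch follows, assuming the references cited by one patent have already been labelled with a category each. The labels, the function name and the choice of the natural logarithm are assumptions made for illustration, since the abstract does not say how references were grouped.

```python
import math
from collections import Counter


def shannon_diversity(categories):
    """Return H = -sum(p_i * ln(p_i)) over the category shares p_i."""
    counts = Counter(categories)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


# Hypothetical category labels for the references cited by two patents.
concentrated = ["genetics", "genetics", "genetics", "genetics"]
diverse = ["genetics", "computing", "chemistry", "statistics"]

h_concentrated = shannon_diversity(concentrated)
h_diverse = shannon_diversity(diverse)
print(h_concentrated, h_diverse)
```

A patent citing a single category scores zero, and the score rises as citations spread more evenly over more categories, reaching ln(k) when k categories are cited equally often.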

    Overview of BioASQ 2021-MESINESP track. Evaluation of advance hierarchical classification techniques for scientific literature, patents and clinical trials

    CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania. There is a pressing need to exploit recent advances in natural language processing technologies, in particular language models and deep learning approaches, to enable improved retrieval, classification and ultimately access to information contained in multiple, heterogeneous types of documents. This is particularly true for the field of biomedicine and clinical research, where medical experts and scientists need to carry out complex search queries against a variety of document collections, including literature, patents, clinical trials or other kinds of content such as EHRs. Indexing documents with structured controlled vocabularies used for semantic search engines and query expansion purposes is a critical task for enabling sophisticated user queries and even cross-language retrieval. Due to the complexity of the medical domain and the use of very large hierarchical indexing terminologies, implementing efficient automatic systems to aid manual indexing is extremely difficult. This paper provides a summary of the MESINESP task results on medical semantic indexing in Spanish (BioASQ/CLEF 2021 Challenge). MESINESP was carried out in direct collaboration with literature content databases and medical indexing experts using the DeCS vocabulary, a resource similar to MeSH. Seven participating teams used advanced technologies, including extreme multi-label classification and deep language models, to solve this challenge, which can be viewed as a multi-label classification problem. As MESINESP resources, we have released a Gold Standard collection of 243,000 documents with a total of 2179 manual annotations divided in train, development and test subsets covering literature, patents as well as clinical trial summaries, under a cross-genre training and data labeling scenario. 
Manual indexing of the evaluation subsets was carried out by three independent experts using a specially developed indexing interface called ASIT. Additionally, we have published a collection of large-scale automatic semantic annotations of these documents, based on NER systems, with mentions of drugs/medications (170,000), symptoms (137,000), diseases (840,000) and clinical procedures (415,000). In addition to a summary of the technologies used by the teams, this paper

    Scientometric assessment of R&D priority areas in South Africa : a comparison with other BRICS countries

    The study aimed to examine South Africa's priority areas in terms of technology development and the impact thereof. In terms of publications, a bibliometric analysis of selected research priority areas in South Africa was done using the Web of Science database for the period 2001-2015. The performance of the country in the areas of biotechnology, energy, astronomy and palaeontology in terms of publication output is compared using two classic scientometric indicators, the activity and attractivity indices. These are important priority areas as highlighted in various government policy documents, and the aim was to determine whether outputs in these fields correspond with government policy. The study also identifies leading institutions in the country in terms of publication output, while the performance is also benchmarked against that of the BRIC countries (Brazil, Russia, India and China), as well as Egypt. It is found that South Africa has a relatively high output in research areas in which it enjoys geographical advantage, such as astronomy and palaeontology, and compares favourably with comparator countries in all areas reviewed. In terms of the institutional profile, and based on publication outputs over the period considered, the University of Cape Town is a leader in energy, the University of Stellenbosch in biotechnology, the University of the Witwatersrand in palaeontology, and the National Research Foundation in the area of astronomy. The study then evaluated the priority areas in terms of patents. It was found that South Africa is the most prolific producer of patents on the African continent. This study assessed the inventive activity through patents registered by South African researchers worldwide, using the WIPO database. The focus of the study was on research priority areas documented in the South African government policy documents. 
The research priority areas considered were ICT, nanotechnology, biotechnology, climate change, energy and health. Patents in these areas were compared across the BRICS countries (Brazil, Russia, India, China and South Africa) and Egypt. The comparison was done using the revealed technological advantage, sometimes referred to as the specialisation index. It was found that two African countries have not increased their patent share significantly and are yet to find their specialisation. It was found that while South Africa is doing well in terms of patenting in general, with patents showing an upward trend, the profile of inventions being patented is not necessarily aligned with the priority areas as documented in government policy. Another question that remained was how South Africa is progressing in developing emerging technologies, with nanotechnology and nanoscience as a case study. This is one of the country’s priorities and a fast-growing scientific research area internationally, and is classified as an important emerging research area. In response to this, South African researchers and institutions have also increased their efforts in this area. A bibliometric study of articles, as indexed in the Web of Science, considered the development in this field, including the growth in literature, the collaboration profile and the research areas that are more within the country’s context. It also looked at public institutions that are more active in this arena, including government policy considerations as guided by the Nanoscience and Nanotechnology Strategy launched in 2005. The study found that the number of nanotechnology publications has shown remarkable growth ever since. The articles are spread across many journals, with Electrochimica Acta having the most articles, followed by the Journal of Nanoscience and Nanotechnology. These publications fall within the traditional domains of chemistry and physics. 
In terms of the institutional profile, and based on publication outputs over the period reviewed, the Council for Scientific and Industrial Research is a leading producer of publications in nanotechnology, followed by the University of the Witwatersrand; both institutions are based in Gauteng Province. There is a high level of international collaboration within this field, as measured through co-authorship: the most productive collaboration is with India, followed by the USA and then China. Finally, R&D efficiency, as expressed by the publication and patent outputs in scientific fields compared with the overall investment in R&D, was studied. The study focussed on two important fields in South Africa: nanotechnology and biotechnology. In addition to this, South Africa’s R&D efficiency in all scientific fields was compared to that of the other BRICS countries. Data on R&D expenditure was used as input in the R&D process to achieve this comparison. The study found that, within South Africa, nanotechnology has been doing well on both patents and publications produced per US dollar spent on research and development. The efficiency in terms of publications in this field started to fall slightly in 2013, to be equivalent to that of biotechnology. In the context of the BRICS countries, it was found that South Africa has the highest R&D efficiency as measured by both patents and publications. This may offer some lessons to its bigger BRICS partners in terms of best practice in keeping the cost low and productivity high despite a relatively small science system. Relevant literature reviewed in this research includes the use of bibliometric methods for science and technology studies. The priority areas and the country-specific issues are also discussed, with particular emphasis on challenges in developing countries. 
While the study focussed on developing countries, specifically the BRICS grouping, mainstream literature provided a useful background, especially with respect to designing the methodologies for the data collection. The conceptual models discussed in this study (the TENs and the Triple Helix) all emphasise the multi-agency approach to innovation, with the government being just one of the actors in the innovation ecosystem. The low level of industrial involvement in the development of the priority areas, as indicated in patenting and publication trends, indicates that one important player is missing from a system that should include all the players, namely academia, industry and government. Strategies should be put in place to incentivise private sector R&D investment to raise the GERD, which is currently very low when compared to other countries. Thesis (PhD)--University of Pretoria, 2018. Graduate School of Technology Management (GSTM).
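
The revealed technological advantage used in the abstract above is commonly defined as the share of a field in a country's patents divided by the share of that field in all patents under comparison, so that values above 1 suggest relative specialisation. The sketch below assumes that common definition; the thesis may use a variant. The country names, patent counts and function name are invented for illustration and are not data from the study.

```python
def revealed_technological_advantage(patents, country, field):
    """RTA = (field share in the country's patents) / (field share in all patents).

    `patents` maps country -> {field: patent count}. The sketch assumes the
    country total and the overall field total are non-zero.
    """
    country_total = sum(patents[country].values())
    field_total = sum(row.get(field, 0) for row in patents.values())
    grand_total = sum(sum(row.values()) for row in patents.values())
    country_share = patents[country].get(field, 0) / country_total
    overall_share = field_total / grand_total
    return country_share / overall_share


# Invented counts for two placeholder countries and two fields.
toy_patents = {
    "Country A": {"biotechnology": 30, "energy": 70},
    "Country B": {"biotechnology": 20, "energy": 80},
}

rta_a_bio = revealed_technological_advantage(toy_patents, "Country A", "biotechnology")
rta_b_bio = revealed_technological_advantage(toy_patents, "Country B", "biotechnology")
print(rta_a_bio, rta_b_bio)
```

In the toy data, biotechnology makes up 30% of Country A's patents but only 25% of all patents, so Country A's index is above 1 while Country B's is below 1.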

    Neue Indexingverfahren für die Ähnlichkeitssuche in metrischen Räumen über großen Datenmengen (New indexing methods for similarity search in metric spaces over large data sets)

    A topic of growing importance in computer science is the handling of similarity in multiple heterogeneous domains. Currently there is no common infrastructure to support similarity search in general metric spaces. The goal of this work is to lay the foundation for such an infrastructure, which could be integrated into classical database management systems. After an analysis of the state of the art, the M-Tree is identified as the most suitable base structure and is extended in multiple ways to the EM-Tree while retaining structural compatibility with the M-Tree. The query algorithms are optimized to reduce the number of necessary distance calculations. Based on a mathematical analysis of the relation between tree structure and query performance, degrees of freedom in the tree edit algorithms are used to build trees optimized for answering similarity queries with a minimal number of distance calculations.
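
The abstract above is concerned with reducing the number of distance calculations during similarity queries. The sketch below shows the underlying principle on a much simpler structure than the M-Tree or EM-Tree: a single precomputed pivot and the triangle inequality. It is a plain pivot filter chosen for illustration, not the thesis's algorithm, and the class name, function names and sample points are assumptions.

```python
import math


class CountingMetric:
    """Wrap a distance function and count how often it is evaluated."""

    def __init__(self, func):
        self.func = func
        self.calls = 0

    def __call__(self, a, b):
        self.calls += 1
        return self.func(a, b)


def build_pivot_table(points, pivot, dist):
    """Precompute the distance from every point to a single pivot."""
    return [(p, dist(p, pivot)) for p in points]


def range_query(table, pivot, query, radius, dist):
    """Return all points within `radius` of `query`, skipping points that
    the triangle inequality already rules out."""
    d_qp = dist(query, pivot)
    hits = []
    for point, d_op in table:
        # Triangle inequality: d(q, o) >= |d(q, p) - d(o, p)|.
        if abs(d_qp - d_op) > radius:
            continue  # cannot lie within the radius; no distance is computed
        if dist(query, point) <= radius:
            hits.append(point)
    return hits


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (9.0, 9.0), (10.0, 10.0)]
pivot = (0.0, 0.0)
metric = CountingMetric(math.dist)  # Euclidean distance, Python 3.8+
table = build_pivot_table(points, pivot, metric)
metric.calls = 0  # count only query-time evaluations
hits = range_query(table, pivot, (9.5, 9.5), 1.0, metric)
print(hits, metric.calls)
```

Here the query evaluates the metric three times (once against the pivot, twice against surviving candidates) instead of once per stored point. Tree structures such as the M-Tree apply the same bound hierarchically so that whole subtrees can be skipped.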