Multiple Retrieval Models and Regression Models for Prior Art Search
This paper presents PATATRAS (PATent and Article Tracking, Retrieval and
AnalysiS), the system developed for the IP track of CLEF 2009. Our approach
has three main characteristics: 1. The use of multiple retrieval models
(KL, Okapi) and term index definitions (lemma, phrase, concept) for the three
languages considered in the track (English, French, German), producing
ten different sets of ranked results. 2. The merging of the different results
based on multiple regression models, using an additional validation set created
from the patent collection. 3. The exploitation of patent metadata and of the
citation structures for creating restricted initial working sets of patents and
for producing a final re-ranking regression model. As we exploit the specific
metadata of the patent documents and the citation relations only at the
creation of the initial working sets and during the final post-ranking step, our
architecture remains generic and easy to extend.
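The merging step described above can be pictured as learned score fusion: each retrieval model produces a ranked run, the scores are normalized to a common scale, and a regression model learned on a validation set supplies the combination weights. The following is a minimal sketch of that idea; the document IDs and the fixed weights are illustrative stand-ins, not values from the paper.

```python
# Hedged sketch of score fusion across retrieval runs. In PATATRAS the
# combination weights come from regression models fit on a validation set;
# here they are hard-coded for illustration.

def minmax_normalize(scores):
    """Scale a {doc_id: score} map into [0, 1] so runs are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def merge_runs(runs, weights):
    """Linearly combine normalized runs; docs missing from a run contribute 0."""
    fused = {}
    for run, w in zip(runs, weights):
        for doc, s in minmax_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + w * s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Toy runs from two models (e.g. KL and Okapi) over hypothetical patent IDs.
kl_run = {"EP1": 3.2, "EP2": 1.1, "EP3": 0.4}
okapi_run = {"EP1": 12.0, "EP3": 9.5, "EP4": 2.0}

ranking = merge_runs([kl_run, okapi_run], weights=[0.4, 0.6])
print(ranking[0][0])  # document ranked highest by both runs
```

With ten runs, as in the paper, the same loop simply takes ten weight values; the normalization step is what makes scores from heterogeneous models (KL divergence vs. Okapi BM25) addable at all.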
LEVERAGING TEXT MINING FOR THE DESIGN OF A LEGAL KNOWLEDGE MANAGEMENT SYSTEM
In today’s globalized world, companies are faced with numerous and continuously changing legal requirements. To ensure that these companies are compliant with legal regulations, law and consulting firms use open legal data published by governments worldwide. With this data pool growing rapidly, the complexity of legal research is strongly increasing. Despite this fact, only a few research papers consider the application of information systems in the legal domain. Against this backdrop, we propose a knowledge management (KM) system that aims at supporting legal research processes. To this end, we leverage the potential of text mining techniques to extract valuable information from legal documents. This information is stored in a graph database, which enables us to capture the relationships between these documents and users of the system. These relationships and the information from the documents are then fed into a recommendation system which aims at facilitating knowledge transfer within companies. The prototypical implementation of the proposed KM system is based on 20,000 legal documents and is currently evaluated in cooperation with a Big 4 accounting company.
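The recommendation idea in the abstract above — exploit relationships between documents and users stored as a graph — can be sketched in a few lines. The data, the user names, and the co-reader counting rule below are invented for illustration; the paper's prototype uses a graph database and text-mined document relations rather than this toy in-memory structure.

```python
# Toy sketch of graph-based recommendation over user-document "read" edges.
# Documents unseen by a user are scored by how many of that user's
# co-readers (users sharing at least one document) have consulted them.
from collections import Counter

reads = {  # hypothetical user -> documents consulted
    "alice": {"GDPR-Art5", "GDPR-Art6", "HGB-238"},
    "bob": {"GDPR-Art6", "GDPR-Art17"},
    "carol": {"GDPR-Art5", "GDPR-Art17", "HGB-238"},
}

def recommend(user, reads, k=2):
    """Rank unseen documents by the number of co-readers who read them."""
    seen = reads[user]
    scores = Counter()
    for other, docs in reads.items():
        if other == user or not (docs & seen):
            continue  # not the same user, and must share at least one doc
        for d in docs - seen:
            scores[d] += 1
    return [d for d, _ in scores.most_common(k)]

print(recommend("alice", reads))
```

In a production system this neighborhood query is exactly what a graph database evaluates natively, which is one reason the paper stores documents and users as nodes rather than rows.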
Validation of scientific topic models using graph analysis and corpus metadata
Probabilistic topic modeling algorithms like Latent Dirichlet Allocation (LDA) have become powerful tools for the analysis of large collections of documents (such as papers, projects, or funding applications) in science, technology and innovation (STI) policy design and monitoring. However, selecting an appropriate and stable topic model for a specific application (by adjusting the hyperparameters of the algorithm) is not a trivial problem. Common validation metrics like coherence or perplexity, which are focused on the quality of topics, are not a good fit in applications where the quality of the document similarity relations inferred from the topic model is especially relevant. Relying on graph analysis techniques, the aim of our work is to state a new methodology for the selection of hyperparameters which is specifically oriented to optimize the similarity metrics emanating from the topic model. In order to do this, we propose two graph metrics: the first measures the variability of the similarity graphs that result from different runs of the algorithm for a fixed value of the hyperparameters, while the second metric measures the alignment between the graph derived from the LDA model and another obtained using metadata available for the corresponding corpus. Through experiments on various corpora related to STI, it is shown that the proposed metrics provide relevant indicators to select the number of topics and build persistent topic models that are consistent with the metadata. Their use, which can be extended to other topic models beyond LDA, could facilitate the systematic adoption of this kind of techniques in STI policy analysis and design.
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101004870 (IntelComp project), and has also been partially supported by FEDER/Spanish Ministry of Science, Innovation and Universities, State Agency of Research, project TEC2017-83838-R.
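The first metric proposed in the abstract above — variability of similarity graphs across runs — can be illustrated with a simple stand-in: build each run's graph from every document's top-k most similar neighbors and measure the Jaccard overlap of the resulting edge sets. This is an assumption-laden simplification of the authors' metric, shown only to make the idea concrete; the similarity values below are toy numbers.

```python
# Sketch: compare document similarity graphs from two topic-model runs.
# Each graph keeps only every document's single strongest edge (k=1);
# Jaccard overlap of the edge sets stands in for the paper's variability
# measure (illustrative, not the authors' exact formulation).

def topk_edges(sim, k=1):
    """sim: {doc: {other: similarity}}. Keep each doc's k strongest edges."""
    edges = set()
    for d, nbrs in sim.items():
        top = sorted(nbrs, key=nbrs.get, reverse=True)[:k]
        edges.update(frozenset((d, n)) for n in top)
    return edges

def graph_overlap(sim_a, sim_b, k=1):
    """Jaccard overlap of the two top-k neighbor graphs (1.0 = identical)."""
    a, b = topk_edges(sim_a, k), topk_edges(sim_b, k)
    return len(a & b) / len(a | b)

# Toy similarities from two runs of the same model with fixed hyperparameters.
run1 = {"d1": {"d2": .9, "d3": .2}, "d2": {"d1": .9, "d3": .5}, "d3": {"d2": .5, "d1": .2}}
run2 = {"d1": {"d2": .8, "d3": .3}, "d2": {"d1": .8, "d3": .1}, "d3": {"d1": .6, "d2": .1}}

print(round(graph_overlap(run1, run2), 2))  # low overlap signals instability
```

A stable hyperparameter setting should yield overlap near 1.0 across repeated runs; values well below that flag the kind of run-to-run variability the paper's metric is designed to penalize.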
Linking Textual Resources to Support Information Discovery
A vast amount of information is today stored in the form of textual documents, many of which are available online. These documents come from different sources and are of different types. They include newspaper articles, books, corporate reports, encyclopedia entries and research papers. At a semantic level, these documents contain knowledge, which was created by explicitly connecting information and expressing it in the form of a natural language. However, a significant amount of knowledge is not explicitly stated in a single document, yet can be derived or discovered by researching, i.e. accessing, comparing, contrasting and analysing, information from multiple documents. Carrying out this work using traditional search interfaces is tedious due to information overload and the difficulty of formulating queries that would help us to discover information we are not aware of.
In order to support this exploratory process, we need to be able to effectively navigate between related pieces of information across documents. While information can be connected using manually curated cross-document links, this approach not only does not scale, but cannot systematically assist us in the discovery of sometimes non-obvious (hidden) relationships. Consequently, there is a need for automatic approaches to link discovery.
This work studies how people link content, investigates the properties of different link types, presents new methods for automatic link discovery, and designs a system in which link discovery is applied to a collection of millions of documents to improve access to public knowledge.
Review of Semantic Importance and Role of using Ontologies in Web Information Retrieval Techniques
The Web contains an enormous amount of information, which is continuously accumulated, searched, and used by many users. The Web is multilingual and growing very fast, with data of a diverse nature, including unstructured and semi-structured content such as websites, texts, journals, and files. Obtaining relevant data from such a vast and diverse pool is a tedious and challenging task. Simple key-phrase retrieval systems rely heavily on statistics, and therefore suffer from a term-mismatch problem caused by a word's inevitable semantic and contextual variants. As a result, there is an urgent need to organize such colossal data systematically, so that relevant information can be quickly found, analyzed, and matched to users' needs in the relevant context. Over the years, ontologies have been widely used in the Semantic Web to organize unstructured information in a systematic and structured manner, and they have also significantly enhanced the efficiency of various information retrieval approaches. Ontology-based retrieval systems retrieve documents based on the semantic relation between the search request and the searchable information. This paper examines contemporary ontology-based information extraction techniques for text, interactive media, and multilingual data types. Moreover, the study compares and classifies the most significant developments used in search and retrieval techniques, together with their major advantages and disadvantages.
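The core mechanism the survey above describes — using semantic relations rather than exact keyword matches — is often realized as ontology-driven query expansion. The sketch below illustrates the idea with a hand-made two-entry "ontology"; a real system would draw synonyms and related concepts from a resource such as WordNet or a domain ontology.

```python
# Toy sketch of ontology-based query expansion: query terms are widened
# with related concepts before matching, so documents using a synonym
# (e.g. "automobile" for "car") are still retrieved. The ontology and
# document collection here are invented for illustration.

ontology = {
    "car": {"automobile", "vehicle"},
    "screen": {"display", "monitor"},
}

def expand(query_terms, ontology):
    """Union the query with every related concept the ontology knows."""
    expanded = set(query_terms)
    for t in query_terms:
        expanded |= ontology.get(t, set())
    return expanded

docs = {  # hypothetical doc -> bag of words
    "doc1": {"the", "automobile", "engine"},
    "doc2": {"a", "touch", "display"},
    "doc3": {"garden", "tools"},
}

def search(query_terms, docs, ontology):
    terms = expand(query_terms, ontology)
    return sorted(d for d, words in docs.items() if terms & words)

print(search({"car", "screen"}, docs, ontology))  # hits via synonyms only
```

A purely statistical keyword matcher would return nothing for this query, since neither "car" nor "screen" appears verbatim in any document; the expansion step is what bridges the term mismatch.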
Automating the search for a patent's prior art with a full text similarity search
More than ever, technical inventions are the symbol of our society's advance.
Patents guarantee their creators protection against infringement. For an
invention to be patentable, its novelty and inventiveness have to be assessed.
Therefore, a search for published work that describes similar inventions to a
given patent application needs to be performed. Currently, this so-called
search for prior art is executed with semi-automatically composed keyword
queries, which is not only time consuming, but also prone to errors. In
particular, errors may systematically arise by the fact that different keywords
for the same technical concepts may exist across disciplines. In this paper, a
novel approach is proposed, where the full text of a given patent application
is compared to existing patents using machine learning and natural language
processing techniques to automatically detect inventions that are similar to
the one described in the submitted document. Various state-of-the-art
approaches for feature extraction and document comparison are evaluated. In
addition, the quality of the current search process is assessed based on the
ratings of a domain expert. The evaluation results show that our automated
approach not only accelerates the search process, but also improves the
quality of the prior art search results.
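The full-text comparison described above can be illustrated with its simplest instantiation: represent each document as a TF-IDF vector and rank prior-art candidates by cosine similarity to the application. The corpus, tokenizer, and IDF variant below are toy stand-ins; the paper evaluates several more sophisticated feature-extraction methods on real patent text.

```python
# Minimal bag-of-words sketch of full-text prior art ranking:
# TF-IDF vectors + cosine similarity over a toy three-patent corpus.
import math
from collections import Counter

corpus = {  # hypothetical patent texts
    "P1": "rotor blade for a wind turbine with serrated trailing edge",
    "P2": "wind turbine rotor blade having a noise reducing trailing edge",
    "P3": "method for brewing coffee with a pressurized water chamber",
}
application = "serrated trailing edge of a wind turbine blade"

def tfidf(text, idf):
    """Term frequency weighted by inverse document frequency."""
    tf = Counter(text.split())
    return {w: c * idf.get(w, 0.0) for w, c in tf.items()}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Smoothed IDF computed over the candidate corpus.
n = len(corpus)
df = Counter(w for text in corpus.values() for w in set(text.split()))
idf = {w: math.log(n / d) + 1.0 for w, d in df.items()}

qvec = tfidf(application, idf)
ranked = sorted(corpus, key=lambda p: cosine(qvec, tfidf(corpus[p], idf)),
                reverse=True)
print(ranked[0])  # most similar candidate patent
```

Even this crude representation ranks the two wind-turbine patents above the coffee-brewing one without any keyword query being composed by hand, which is precisely the manual step the paper's approach seeks to eliminate.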
GRISP: A Massive Multilingual Terminological Database for Scientific and Technical Domains
The development of a multilingual terminology is a very long and costly process. We present the creation of a multilingual terminological database called GRISP, covering multiple technical and scientific fields, built from various open resources. A crucial aspect is the merging of the different resources, which in our proposal is based on the definition of a sound conceptual model, different domain mappings, and the use of structural constraints and machine learning techniques for controlling the fusion process. The result is a massive terminological database of several million terms, concepts, semantic relations, and definitions. This resource has allowed us to significantly improve the mean average precision of an information retrieval system applied to a large collection of multilingual and multidomain patent documents.
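At its simplest, the fusion step described above groups entries from different resources under a shared key and unions their attributes. The sketch below does only string normalization; it is a deliberately reduced stand-in for GRISP's actual pipeline, which additionally relies on a conceptual model, domain mappings, and learned structural constraints. Entries and sources are invented.

```python
# Toy illustration of terminological resource fusion: entries merge when
# their normalized term string and language match; domains and source
# resources accumulate on the merged concept entry.

def normalize(term):
    """Lowercase and collapse whitespace so surface variants collide."""
    return " ".join(term.lower().split())

resources = [  # hypothetical entries from two open resources, A and B
    {"term": "Neural Network", "lang": "en", "domain": "AI", "source": "A"},
    {"term": "neural  network", "lang": "en", "domain": "ML", "source": "B"},
    {"term": "réseau de neurones", "lang": "fr", "domain": "AI", "source": "B"},
]

merged = {}
for entry in resources:
    key = (normalize(entry["term"]), entry["lang"])
    slot = merged.setdefault(key, {"domains": set(), "sources": set()})
    slot["domains"].add(entry["domain"])
    slot["sources"].add(entry["source"])

print(len(merged))  # the two English variants collapse into one entry
```

Keying on language as well as the normalized string is what keeps the English and French entries for the same concept separate at this stage; linking them into one concept is where the conceptual model and cross-lingual mappings of the real system come in.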