Automating the search for a patent's prior art with a full text similarity search
More than ever, technical inventions are the symbol of our society's advance.
Patents guarantee their creators protection against infringement. For an
invention to be patentable, its novelty and inventiveness have to be assessed.
Therefore, a search for published work that describes similar inventions to a
given patent application needs to be performed. Currently, this so-called
search for prior art is executed with semi-automatically composed keyword
queries, which is not only time-consuming but also prone to errors. In
particular, errors may systematically arise from the fact that different keywords
for the same technical concepts may exist across disciplines. In this paper, a
novel approach is proposed, where the full text of a given patent application
is compared to existing patents using machine learning and natural language
processing techniques to automatically detect inventions that are similar to
the one described in the submitted document. Various state-of-the-art
approaches for feature extraction and document comparison are evaluated. In
addition to that, the quality of the current search process is assessed based
on ratings of a domain expert. The evaluation results show that our automated
approach, besides accelerating the search process, also improves the search
results for prior art with respect to their quality.
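A minimal sketch of what a full-text similarity search can look like, as opposed to a keyword query. It assumes TF-IDF features and cosine similarity, which is only one of many possible feature extraction and comparison choices and is not claimed to be the method the paper selects. The names `tokenize`, `tfidf_vectors`, `cosine` and `rank_prior_art` are hypothetical helpers introduced for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency (an assumed weighting scheme).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def rank_prior_art(application, corpus):
    """Rank corpus documents by full-text similarity to the application.

    Returns (score, corpus_index) pairs, most similar first.
    """
    vectors = tfidf_vectors([application] + corpus)
    query, candidates = vectors[0], vectors[1:]
    scores = [(cosine(query, v), i) for i, v in enumerate(candidates)]
    return sorted(scores, reverse=True)
```

Calling `rank_prior_art(application_text, list_of_patent_texts)` yields candidates ordered by similarity, so an examiner would review the top of the list first. Because the whole text is compared, no keyword query has to be composed by hand.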
Automatic Text Summarization Approaches to Speed up Topic Model Learning Process
The number of documents available on the Internet grows every day. For this
reason, processing this amount of information effectively and expressibly
becomes a major concern for companies and scientists. Methods that represent a
textual document by a topic representation are widely used in Information
Retrieval (IR) to process big data such as Wikipedia articles. One of the main
difficulties in using topic models on huge data collections is related to the
material resources (CPU time and memory) required for model estimation. To deal
with this issue, we propose to build topic spaces from summarized documents. In
this paper, we present a study of topic space representation in the context of
big data. The topic space representation behavior is analyzed on different
languages. Experiments show that topic spaces estimated from text summaries are
as relevant as those estimated from the complete documents. The real advantage
of such an approach is the processing time gain: we showed that the processing
time can be drastically reduced using summarized documents (more than 60% in
general). This study finally points out the differences between thematic
representations of documents depending on the targeted languages such as
English or Latin languages.
Comment: 16 pages, 4 tables, 8 figures
REST: A Thread Embedding Approach for Identifying and Classifying User-specified Information in Security Forums
How can we extract useful information from a security forum? We focus on
identifying threads of interest to a security professional: (a) alerts of
worrisome events, such as attacks, (b) offering of malicious services and
products, (c) hacking information to perform malicious acts, and (d) useful
security-related experiences. The analysis of security forums is in its infancy
despite several promising recent works. Novel approaches are needed to address
the challenges in this domain: (a) the difficulty in specifying the "topics" of
interest efficiently, and (b) the unstructured and informal nature of the text.
We propose REST, a systematic methodology to: (a) identify threads of interest
based on a, possibly incomplete, bag of words, and (b) classify them into one
of the four classes above. The key novelty of the work is a multi-step weighted
embedding approach: we project words, threads and classes in appropriate
embedding spaces and establish relevance and similarity there. We evaluate our
method with real data from three security forums with a total of 164K posts and
21K threads. First, REST is robust to the initial keyword selection: it can
extend the user-provided keyword set and thus recover from missing keywords.
Second, REST categorizes the threads into the classes of interest with superior
accuracy compared to five other methods: REST exhibits an accuracy between
63.3% and 76.9%. We see our approach as a first step for harnessing the wealth of
information of online forums in a user-friendly way, since the user can loosely
specify her keywords of interest.
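The idea of extending an incomplete keyword set and judging thread relevance in an embedding space can be illustrated with a toy sketch. This is not REST's multi-step weighted embedding. It assumes hand-made 2-d word vectors (`TOY_VECTORS`), plain averaging of word vectors to embed a thread, and a fixed cosine threshold. All names and values here are hypothetical.

```python
import math

# Hypothetical 2-d word vectors, hand-made for illustration only.
TOY_VECTORS = {
    "exploit": (0.9, 0.1),
    "malware": (0.8, 0.2),
    "ransomware": (0.85, 0.15),
    "recipe": (0.1, 0.9),
    "pasta": (0.05, 0.95),
}


def cosine(u, v):
    """Cosine similarity between two non-zero 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def expand_keywords(seeds, vectors, threshold=0.95):
    """Add every known word whose vector is close to some seed keyword."""
    known_seeds = [s for s in seeds if s in vectors]
    expanded = set(seeds)
    for word, vec in vectors.items():
        if any(cosine(vec, vectors[s]) >= threshold for s in known_seeds):
            expanded.add(word)
    return expanded


def embed_text(text, vectors):
    """Embed a text as the unweighted mean of its known word vectors."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return None
    return tuple(sum(c) / len(known) for c in zip(*known))


def is_relevant(thread_text, keywords, vectors, threshold=0.95):
    """Check whether a thread lies close to the keyword set in the space."""
    thread_vec = embed_text(thread_text, vectors)
    keyword_vec = embed_text(" ".join(keywords), vectors)
    if thread_vec is None or keyword_vec is None:
        return False
    return cosine(thread_vec, keyword_vec) >= threshold
```

With the single seed `malware`, `expand_keywords` also picks up `exploit` and `ransomware`, and a thread that never mentions `malware` can still be flagged as relevant. That is the sense in which a loosely specified keyword set can recover from missing keywords.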