PRIME: A System for Multi-lingual Patent Retrieval
Given the growing number of patents filed in multiple countries, users are
interested in retrieving patents across languages. We propose a multi-lingual
patent retrieval system, which translates a user query into the target
language, searches a multilingual database for patents relevant to the query,
and improves the browsing efficiency by way of machine translation and
clustering. Our system also extracts new translations from patent families
consisting of comparable patents, to enhance the translation dictionary.
In no uncertain terms: a dataset for monolingual and multilingual automatic term extraction from comparable corpora
Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in low inter-annotator agreement. There is a large need for well-documented, manually validated datasets, especially in the rising field of multilingual term extraction from comparable corpora, which presents a unique new set of challenges. In this paper, a new approach is presented for both monolingual and multilingual term annotation in comparable corpora. The detailed guidelines with different term labels, the domain- and language-independent methodology and the large volumes annotated in three different languages and four different domains make this a rich resource. The resulting datasets are not just suited for evaluation purposes but can also serve as a general source of information about terms and even as training data for supervised methods. Moreover, the gold standard for multilingual term extraction from comparable corpora contains information about term variants and translation equivalents, which allows an in-depth, nuanced evaluation.
Towards Bilingual Term Extraction in Comparable Patents
PACLIC 23 / City University of Hong Kong / 3-5 December 200
Designing multilingual information access to Tate Online
The Tate is Britain's premier national art gallery and includes content from internationally renowned artists such as Constable and Turner. Like most cultural heritage institutions, the Tate provides online access to a large amount of digitized material. Given the international importance of content provided by the Tate Gallery, multilingual access would seem an ideal way in which to increase accessibility to the collections, and thereby increase traffic to the website. In this short paper we propose using the Tate as a case study for cross-language research and evaluation, determining the gallery's requirements and the multilingual needs of their end-users.
Learning languages from parallel corpora
This work describes a blueprint for an application that generates language learning exercises from parallel corpora. Word alignment and parallel structures allow for the automatic assessment of sentence pairs in the source and target languages, while users of the application continuously improve the quality of the data with their interactions, thus crowdsourcing parallel language learning material. Through triangulation, their assessment can be transferred to language pairs other than the original ones if multiparallel corpora are used as a source.
Several challenges need to be addressed for such an application to work, and we will discuss three of them here. First, the question of how adequate learning material can be identified in corpora has received some attention in the last decade, and we will detail what the structure of parallel corpora implies for that selection. Second, we will consider which types of exercises can be generated automatically from parallel corpora such that they foster learning and keep learners motivated. Third, we will highlight the potential of employing users, that is, both teachers and learners, as crowdsourcers to help improve the material.
Toward higher effectiveness for recall-oriented information retrieval: A patent retrieval case study
Research in information retrieval (IR) has largely been directed towards tasks requiring high precision. Recently, other IR applications which can be described as recall-oriented IR tasks have received increased attention in the IR research domain. Prominent among these IR applications are patent search and legal search, where users are typically ready to check hundreds or possibly thousands of documents in order to find any possible relevant document. The main concerns in this kind of application are very different from those in standard precision-oriented IR tasks, where users tend to be focused on finding an answer to their information need that can typically be addressed by one or two relevant documents. For precision-oriented tasks, mean average precision continues to be used as the primary evaluation metric for almost all IR applications. For recall-oriented IR applications the nature of the search task, including objectives, users, queries, and document collections, is different from that of standard precision-oriented search tasks. In this research study, two dimensions in IR are explored for the recall-oriented patent search task. The study includes IR system evaluation and multilingual IR for patent search. In each of these dimensions, current IR techniques are studied and novel techniques developed especially for this kind of recall-oriented IR application are proposed and investigated experimentally in the context of patent retrieval. The techniques developed in this thesis provide a significant contribution toward evaluating the effectiveness of recall-oriented IR in general and particularly patent search, and improving the efficiency of multilingual search for this kind of task.
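The contrast between precision-oriented and recall-oriented evaluation drawn in this abstract can be illustrated with a small sketch. It uses the textbook definitions of average precision and recall at a cutoff, not the metrics proposed in the thesis, and the document identifiers and both rankings are invented for illustration.

```python
def average_precision(ranked, relevant):
    """Average of precision values at each rank where a relevant document appears."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


# Hypothetical patent IDs: p* are relevant, x* are not.
relevant = {"p1", "p2", "p3", "p4"}
run_a = ["p1", "p2", "x1", "x2", "x3", "x4", "x5"]  # two relevant hits at the top, then nothing
run_b = ["x1", "x2", "x3", "p1", "p2", "p3", "p4"]  # all four relevant hits, but lower down

ap_a = average_precision(run_a, relevant)
ap_b = average_precision(run_b, relevant)
rec_a = recall_at_k(run_a, relevant, 7)
rec_b = recall_at_k(run_b, relevant, 7)

print(f"run A: AP={ap_a:.3f} recall@7={rec_a:.2f}")
print(f"run B: AP={ap_b:.3f} recall@7={rec_b:.2f}")
```

In this toy case average precision prefers run A, while run B is the only one that retrieves every relevant document, which is the outcome a patent searcher would typically care about.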
ParaCrawl: Web-Scale Acquisition of Parallel Corpora
We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness for creating machine translation systems.
Semantic transference for enriching multilingual biomedical knowledge resources
Biomedical knowledge resources (KRs) are mainly expressed in English, and many applications using them suffer from the scarcity of knowledge in non-English languages. The goal of the present work is to make the most of existing multilingual biomedical KR lexicons to enrich their non-English counterparts. We propose to combine different automatic methods to generate pair-wise language alignments. More specifically, we use two well-known translation methods (GIZA++ and Moses), and we propose a new ad-hoc method specially devised for multilingual KRs. The resulting alignments are then used to transfer semantics between KRs across their languages. Transference quality is ensured by checking the semantic coherence of the generated alignments. Experiments have been carried out over the Spanish, French and German UMLS Metathesaurus counterparts. As a result, the enriched Spanish KR can grow up to 1,514,217 concepts (originally 286,659), the French KR up to 1,104,968 concepts (originally 83,119), and the German KR up to 1,136,020 concepts (originally 86,842).
Efficient Inference, Search and Evaluation for Latent Variable Models of Text with Applications to Information Retrieval and Machine Translation
Latent variable models of text, such as topic models, have been explored in many areas of natural language processing, information retrieval and machine translation to aid tasks such as exploratory data analysis, automated topic clustering and finding similar documents in mono- and multilingual collections. Many additional applications of these models, however, could be enabled by more efficient techniques for processing large datasets.
In this thesis, we introduce novel methods that offer efficient inference, search and evaluation for latent variable models of text. We present efficient, online inference for representing documents in several languages in a common topic space and fast approximations for finding near neighbors in the probability simplex representation of mono- and multilingual document collections. Empirical evaluations show that these methods are as accurate as, and significantly faster than, Gibbs sampling and brute-force all-pairs search, respectively. In addition, we present a new extrinsic evaluation metric that achieves very high correlation with common performance metrics while being more efficient to compute. We showcase the efficacy and efficiency of our new approaches on the problems of modeling and finding similar documents in a retrieval system for scientific papers, detecting document translation pairs, and extracting parallel sentences from large comparable corpora. This last task, in turn, allows us to efficiently train a translation model from comparable corpora that outperforms a model trained on parallel data.
Lastly, we improve the latent variable model representation of large documents in mono- and multilingual collections by introducing online inference for topic models with hierarchical Dirichlet prior structure over textual regions such as document sections. Modeling variations across textual regions using online inference offers a more effective and efficient document representation, beyond a bag of words, which is usually a handicap for the performance of these models on large documents.
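To make the idea of near-neighbor search in the probability simplex more concrete, here is a minimal sketch. The abstract does not say which distance the thesis uses, so the Hellinger distance here is an assumption. The example shows that taking element-wise square roots of a topic distribution turns Hellinger distance into ordinary Euclidean distance (up to a factor of 1/sqrt(2)), so standard Euclidean nearest-neighbor tools can be applied. The linear scan below is the brute-force baseline, not the fast approximation described in the thesis, and the topic vectors and document names are invented.

```python
import math


def hellinger_embed(p):
    """Map a probability vector to its element-wise square root."""
    return [math.sqrt(x) for x in p]


def hellinger_distance(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def euclidean(u, v):
    """Plain Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest(query, docs):
    """Brute-force nearest neighbor in the square-root-embedded space."""
    q = hellinger_embed(query)
    return min(docs, key=lambda name: euclidean(q, hellinger_embed(docs[name])))


# Hypothetical topic distributions over three shared topics.
docs = {
    "doc_en": [0.7, 0.2, 0.1],
    "doc_de": [0.1, 0.1, 0.8],
}
query = [0.6, 0.3, 0.1]

best = nearest(query, docs)
print(best)
```

Because the embedding preserves the ordering of distances, any Euclidean index built over the embedded vectors returns the same nearest neighbor as a direct Hellinger comparison.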