461 research outputs found

    Discovering missing Wikipedia inter-language links by means of cross-lingual word sense disambiguation

    Wikipedia is a very popular online multilingual encyclopedia that contains millions of articles covering most written languages. Wikipedia pages contain monolingual hypertext links to other pages, as well as inter-language links to the corresponding pages in other languages. These inter-language links, however, are not always complete. We present a prototype for a cross-lingual link discovery tool that discovers missing Wikipedia inter-language links to corresponding pages in other languages for ambiguous nouns. Although the framework of our approach is language-independent, we built a prototype for our application using Dutch as an input language and Spanish, Italian, English, French and German as target languages. The input for our system is a set of Dutch pages for a given ambiguous noun, and the output of the system is a set of links to the corresponding pages in our five target languages. Our link discovery application contains two submodules. In a first step, all pages that contain a translation (in our five target languages) of the ambiguous word in the page title are retrieved (Greedy crawler module). In a second step, all corresponding pages are linked between the focus language (Dutch in our case) and the five target languages (Cross-lingual web page linker module). We consider this second step a disambiguation task and apply a cross-lingual Word Sense Disambiguation framework to determine whether two pages refer to the same content or not.

    Evaluation of automatic hypernym extraction from technical corpora in English and Dutch

    In this research, we evaluate different approaches for the automatic extraction of hypernym relations from English and Dutch technical text. The detected hypernym relations should enable us to semantically structure automatically obtained term lists from domain- and user-specific data. We investigated three different hypernymy extraction approaches for Dutch and English: a lexico-syntactic pattern-based approach, a distributional model and a morpho-syntactic method. To test the performance of the different approaches on domain-specific data, we collected and manually annotated English and Dutch data from two technical domains, viz. the dredging and financial domains. The experimental results show that the morpho-syntactic approach in particular obtains good results for automatic hypernym extraction from technical and domain-specific texts.
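To make two of the three approach families concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' system: the abstract does not give the actual patterns or rules, so the single "such as" pattern, the head-word heuristic (a multiword term whose last word is itself a known term takes that word as its hypernym), and the function names `pattern_hypernyms` and `head_hypernyms` are all hypothetical. The example sentences and terms are invented for the dredging domain.

```python
import re


def pattern_hypernyms(sentence: str) -> list[tuple[str, str]]:
    """Lexico-syntactic sketch: read (hyponym, hypernym) pairs off
    an English 'X such as A, B and C' construction (single-word X only)."""
    match = re.search(r"(\w+) such as (.+)", sentence)
    if not match:
        return []
    hypernym = match.group(1).lower()
    # Split the enumeration on commas and on 'and' / 'or' (with optional comma).
    items = re.split(r",?\s+(?:and|or)\s+|,\s*", match.group(2).rstrip("."))
    return [(item.strip().lower(), hypernym) for item in items if item.strip()]


def head_hypernyms(terms: list[str]) -> list[tuple[str, str]]:
    """Morpho-syntactic sketch (assumed heuristic): a multiword term whose
    final word is itself in the term list is treated as a hyponym of it."""
    term_set = {t.lower() for t in terms}
    pairs = []
    for term in sorted(term_set):
        words = term.split()
        if len(words) > 1 and words[-1] in term_set:
            pairs.append((term, words[-1]))
    return pairs


if __name__ == "__main__":
    print(pattern_hypernyms("Vessels such as dredgers, barges and tugs."))
    print(head_hypernyms(["dredger", "suction dredger", "interest rate"]))
```

The head-word heuristic assumes English word order, where the head of a noun phrase is the last word; Dutch compounds are typically written as one word, so a Dutch version would need compound splitting first.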

    Normalization of Dutch user-generated content

    This paper describes a phrase-based machine translation approach to normalize Dutch user-generated content (UGC). We compiled a corpus of three different social media genres (text messages, message board posts and tweets) to have a sample of this recent domain. We describe the various characteristics of this noisy text material and explain how it has been manually normalized using newly developed guidelines. For the automatic normalization task we focus on text messages, and find that a cascaded SMT system where a token-based module is followed by a translation at the character level gives the best word error rate reduction. After these initial experiments, we investigate the system's robustness on the complete domain of UGC by testing it on the other two social media genres, and find that the cascaded approach performs best on these genres as well. To our knowledge, we deliver the first proof-of-concept system for Dutch UGC normalization, which can serve as a baseline for future work.
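The evaluation metric named above, word error rate (WER) reduction, can be sketched as follows. This is a minimal illustration, assuming the standard definition of WER as token-level edit distance divided by the number of reference tokens, and relative reduction as the fraction of the original WER that is removed; the abstract does not say exactly how the authors computed it. The function names and the Dutch example strings are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Token-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(wer_before: float, wer_after: float) -> float:
    """Fraction of the original WER removed by normalization."""
    if wer_before == 0:
        raise ValueError("wer_before must be non-zero")
    return (wer_before - wer_after) / wer_before


if __name__ == "__main__":
    gold = "ik weet het niet"        # manually normalized reference
    raw = "k weet t niet"            # invented noisy text message
    normalized = "ik weet t niet"    # invented system output
    before = word_error_rate(gold, raw)
    after = word_error_rate(gold, normalized)
    print(before, after, relative_reduction(before, after))
```

Scoring the raw input against the gold normalization gives the baseline WER; scoring the system output gives the WER after normalization, and the difference is what a "WER reduction" reports.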

    Monitoring the reduction in shrinkage cracking of mortars containing superabsorbent polymers

    Ultra-high performance concrete (UHPC) is characterized by a low water-to-cement ratio, leading to improved durability and mechanical properties. However, the risk of autogenous shrinkage and of cracking due to restrained shrinkage increases, which may affect the durability of UHPC, as cracks form pathways for the ingress of aggressive liquids and gases. These negative features can be prevented by the use of superabsorbent polymers (SAPs) in the mixture. SAPs reduce autogenous shrinkage by means of internal curing: they absorb water during the hydration process and release it again to the cementitious matrix when a water shortage arises. In this way, hydration can continue and shrinkage is diminished.

    Noise or music? Investigating the usefulness of normalisation for robust sentiment analysis on social media data

    In the past decade, sentiment analysis research has thrived, especially on social media. While this data genre is suitable for extracting opinions and sentiment, it is known to be noisy. Complex normalisation methods have been developed to transform noisy text into its standard form, but their effect on tasks like sentiment analysis remains underinvestigated. Sentiment analysis approaches mostly include spell checking or rule-based normalisation as preprocessing and rarely investigate its impact on the task performance. We present an optimised sentiment classifier and investigate to what extent its performance can be enhanced by integrating SMT-based normalisation as preprocessing. Experiments on a test set comprising a variety of user-generated content genres revealed that normalisation improves sentiment classification performance on tweets and blog posts, showing the model’s ability to generalise to other data genres.

    The development of a novel SNP genotyping assay to differentiate cacao clones

    In this study, a double-mismatch allele-specific (DMAS) qPCR SNP genotyping method has been designed, tested and validated specifically for cacao, using 65 well-annotated international cacao reference accessions retrieved from the Center for Forestry Research and Technology Transfer (CEFORTT) and the International Cocoa Quarantine Centre (ICQC). In total, 42 DMAS-qPCR SNP genotyping assays have been validated, with a 98.05% overall efficiency in calling the correct genotype. In addition, the test allowed for the identification of 15.38% off-types and two duplicates, highlighting the problem of mislabeling in cacao collections and the need for conclusive genotyping assays. The developed method showed on average a high genetic diversity (He = 0.416) and information index (I = 0.601), making it applicable to assess intra-population variation. Furthermore, only the 13 most informative markers were needed to achieve maximum differentiation. This simple, effective method provides robust and accurate genotypic data which allows for more efficient resource management (e.g. tackling mislabeling, conserving valuable genetic material, parentage analysis, genetic diversity studies), thus contributing to an increased knowledge of the genetic background of cacao worldwide. Notably, the described method can easily be integrated in other laboratories for a wide range of objectives and organisms.
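The two diversity statistics quoted above can be illustrated per marker with a short Python sketch. It assumes the commonly used definitions, expected heterozygosity He = 1 - Σ p_i² and Shannon's information index I = -Σ p_i ln p_i over the allele frequencies p_i at a locus; the abstract does not state which formulas or software the authors used, and the function names and example frequencies below are hypothetical.

```python
import math


def _check_frequencies(freqs: list[float]) -> None:
    """Allele frequencies at one locus must be non-negative and sum to 1."""
    if any(p < 0 for p in freqs) or not math.isclose(sum(freqs), 1.0):
        raise ValueError("allele frequencies must be >= 0 and sum to 1")


def expected_heterozygosity(freqs: list[float]) -> float:
    """He = 1 - sum(p_i^2): chance that two randomly drawn alleles differ."""
    _check_frequencies(freqs)
    return 1.0 - sum(p * p for p in freqs)


def shannon_index(freqs: list[float]) -> float:
    """I = -sum(p_i * ln p_i), skipping alleles with zero frequency."""
    _check_frequencies(freqs)
    return -sum(p * math.log(p) for p in freqs if p > 0)


if __name__ == "__main__":
    balanced = [0.5, 0.5]   # most informative biallelic SNP
    skewed = [0.9, 0.1]     # invented, less informative SNP
    for freqs in (balanced, skewed):
        print(freqs, expected_heterozygosity(freqs), shannon_index(freqs))
```

Under these definitions a biallelic SNP can reach at most He = 0.5 and I = ln 2 ≈ 0.693, both at equal allele frequencies, which gives a scale for reading the reported averages of 0.416 and 0.601.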

    Target enrichment using parallel nanoliter quantitative PCR amplification

    Background: Next-generation targeted resequencing is replacing Sanger sequencing at a high pace in routine genetic diagnosis. The need for well-validated, high-quality enrichment platforms to complement the bench-top next-generation sequencing devices is high. Results: We used the WaferGen Smartchip platform to perform highly parallelized PCR-based target enrichment for a set of known cancer genes in a well-characterized set of cancer cell lines from the NCI60 panel. Optimization of PCR assay design and cycling conditions resulted in a high enrichment efficiency. We provide proof of a high mutation rediscovery rate and have included technical replicates to enable SNP calling validation, demonstrating the high reproducibility of our enrichment platform. Conclusions: Here we present our custom-developed quantitative PCR-based target enrichment platform. Using highly parallel nanoliter singleplex PCR reactions makes this a flexible and efficient platform. The high mutation validation rate shows this platform’s promise as a targeted resequencing method for multi-gene routine sequencing diagnostics.