11 research outputs found
Yet Another Ranking Function for Automatic Multiword Term Extraction
Term extraction is an essential task in domain knowledge acquisition. We propose two new measures to extract multiword terms from a domain-specific text. The first measure is both linguistically and statistically based. The second measure is graph-based, allowing assessment of the importance of a multiword term of a domain. Existing measures often solve some (but not all) of the problems related to term extraction, e.g., noise, silence, low frequency, large corpora, and the complexity of the multiword term extraction process. Instead, we focus on managing the entire set of problems, e.g., detecting rare terms and overcoming the low frequency issue. We show that the two proposed measures outperform precision results previously reported for automatic multiword extraction by comparing them with the state-of-the-art reference measures.
Sports collectifs en ZEP : un exemple de 'format pédagogique'
One of the determining factors of the quality of Web search engines is the size of their index. In addition to its influence on search result quality, the size of the indexed Web can also tell us something about which parts of the WWW are directly accessible to the everyday user. We propose a novel method of estimating the size of a Web search engine’s index by extrapolating from document frequencies of words observed in a large static corpus of Web pages. In addition, we provide a unique longitudinal perspective on the size of Google and Bing’s indices over a nine-year period, from March 2006 until January 2015. We find that index size estimates of these two search engines tend to vary dramatically over time, with Google generally possessing a larger index than Bing. This result raises doubts about the reliability of previous one-off estimates of the size of the indexed Web. We find that much, if not all, of this variability can be explained by changes in the indexing and ranking infrastructure of Google and Bing. This casts further doubt on whether Web search engines can be used reliably for cross-sectional webometric studies.
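The extrapolation idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the formula, so the proportional scaling below, the use of a median to combine per-word estimates, and every name and number in the example are assumptions for illustration only, not the authors' implementation. The assumption is that a word occurring in a known fraction of a static reference corpus occurs in roughly the same fraction of the search engine's index.

```python
from statistics import median


def estimate_index_size(corpus_size, corpus_df, reported_hits):
    """Estimate a search engine's index size from word document frequencies.

    corpus_size   -- number of documents in the static reference corpus
    corpus_df     -- word -> number of corpus documents containing the word
    reported_hits -- word -> hit count the search engine reports for the word

    Assumption (not stated in the abstract): each word yields one estimate,
    hits * corpus_size / corpus_df, and the estimates are combined by median.
    """
    estimates = []
    for word, hits in reported_hits.items():
        df = corpus_df.get(word, 0)
        if df > 0:  # skip words that never occur in the reference corpus
            estimates.append(hits * corpus_size / df)
    if not estimates:
        raise ValueError("no word with a non-zero corpus document frequency")
    return median(estimates)


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
corpus_size = 1_000_000
corpus_df = {"the": 900_000, "search": 50_000}
reported_hits = {"the": 9_000_000_000, "search": 600_000_000}

estimate = estimate_index_size(corpus_size, corpus_df, reported_hits)
print(f"Estimated index size: {estimate:,.0f} documents")
```

Combining several words at different frequency levels, rather than relying on a single word, is a design choice here to dampen the effect of any one unreliable hit count.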
Supporting experts to handle tweet collections about significant events
We introduce Relevancer, a tool that processes a tweet set and enables the generation of an automatic classifier from it. Relevancer satisfies the information needs of experts during significant events. Enabling experts to combine automatic procedures with their expertise is the main contribution of our approach and the added value of the tool. Even a small amount of feedback enables the tool to distinguish between relevant and irrelevant information effectively. Thus, Relevancer facilitates the quick understanding of and proper reaction to events presented on Twitter.
Extracting meronomy relations from domain-specific, textual corporate databases
Various techniques for learning meronymy relationships from open-domain corpora exist. However, extracting meronymy relationships from domain-specific, textual corporate databases has been overlooked, despite numerous application opportunities, particularly in domains like product development and/or customer service. These domains also pose new scientific challenges, such as the absence of elaborate knowledge resources, which compromises the performance of supervised meronymy-learning algorithms. Furthermore, the domain-specific terminology of corporate texts makes it difficult to select appropriate seeds for minimally-supervised meronymy-learning algorithms. To address these issues, we develop and present a principled approach to extract accurate meronymy relationships from textual databases of product development and/or customer service organizations by leveraging reliable meronymy lexico-syntactic patterns harvested from an open-domain corpus. Evaluations on real-life corporate databases indicate that our technique extracts precise meronymy relationships that provide valuable operational insights on causes of product failures and customer dissatisfaction. Our results also reveal that the types of some of the domain-specific meronymy relationships extracted from the corporate data cannot be conclusively and unambiguously classified under well-known taxonomies of relationships.
Improving product quality and reliability with customer experience data
Advances in technology and the wide use of the World Wide Web have made it possible for new product development organizations to access multiple sources of data related to customer complaints. However, the number of customer complaints about highly innovative consumer electronic products is still increasing; that is, product quality and reliability are at risk. This article aims to understand why existing solutions from the literature as well as from industry for dealing with these increasingly complex multiple data sources are not able to manage product quality and reliability. Three case studies in industry are discussed. On the basis of the case study results, this article also identifies a new research agenda that is needed to improve product quality and reliability under these circumstances.