80 research outputs found

    Natural language processing

    Get PDF
    Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST Chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems - text summarization, information extraction, information retrieval, etc., including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of www and digital libraries ; and (iv) evaluation of NLP systems

    Duration modeling with semi-Markov Conditional Random Fields for keyphrase extraction

    Full text link
    Existing methods for keyphrase extraction need preprocessing to generate candidate phrase or post-processing to transform keyword into keyphrase. In this paper, we propose a novel approach called duration modeling with semi-Markov Conditional Random Fields (DM-SMCRFs) for keyphrase extraction. First of all, based on the property of semi-Markov chain, DM-SMCRFs can encode segment-level features and sequentially classify the phrase in the sentence as keyphrase or non-keyphrase. Second, by assuming the independence between state transition and state duration, DM-SMCRFs model the distribution of duration (length) of keyphrases to further explore state duration information, which can help identify the size of keyphrase. Based on the convexity of parametric duration feature derived from duration distribution, a constrained Viterbi algorithm is derived to improve the performance of decoding in DM-SMCRFs. We thoroughly evaluate the performance of DM-SMCRFs on the datasets from various domains. The experimental results demonstrate the effectiveness of proposed model

    A User-Centered Concept Mining System for Query and Document Understanding at Tencent

    Full text link
    Concepts embody the knowledge of the world and facilitate the cognitive processes of human beings. Mining concepts from web documents and constructing the corresponding taxonomy are core research problems in text understanding and support many downstream tasks such as query analysis, knowledge base construction, recommendation, and search. However, we argue that most prior studies extract formal and overly general concepts from Wikipedia or static web pages, which are not representing the user perspective. In this paper, we describe our experience of implementing and deploying ConcepT in Tencent QQ Browser. It discovers user-centered concepts at the right granularity conforming to user interests, by mining a large amount of user queries and interactive search click logs. The extracted concepts have the proper granularity, are consistent with user language styles and are dynamically updated. We further present our techniques to tag documents with user-centered concepts and to construct a topic-concept-instance taxonomy, which has helped to improve search as well as news feeds recommendation in Tencent QQ Browser. We performed extensive offline evaluation to demonstrate that our approach could extract concepts of higher quality compared to several other existing methods. Our system has been deployed in Tencent QQ Browser. Results from online A/B testing involving a large number of real users suggest that the Impression Efficiency of feeds users increased by 6.01% after incorporating the user-centered concepts into the recommendation framework of Tencent QQ Browser.Comment: Accepted by KDD 201

    Keyphrase Generation: A Multi-Aspect Survey

    Full text link
    Extractive keyphrase generation research has been around since the nineties, but the more advanced abstractive approach based on the encoder-decoder framework and sequence-to-sequence learning has been explored only recently. In fact, more than a dozen of abstractive methods have been proposed in the last three years, producing meaningful keyphrases and achieving state-of-the-art scores. In this survey, we examine various aspects of the extractive keyphrase generation methods and focus mostly on the more recent abstractive methods that are based on neural networks. We pay particular attention to the mechanisms that have driven the perfection of the later. A huge collection of scientific article metadata and the corresponding keyphrases is created and released for the research community. We also present various keyphrase generation and text summarization research patterns and trends of the last two decades.Comment: 10 pages, 5 tables. Published in proceedings of FRUCT 2019, the 25th Conference of the Open Innovations Association FRUCT, Helsinki, Finlan

    Constructing and modeling text-rich information networks: a phrase mining-based approach

    Get PDF
    A lot of digital ink has been spilled on "big data" over the past few years, which is often characterized by an explosion of information. Most of this surge owes its origin to the unstructured data in the wild like words, images and video as comparing to the structured information stored in fielded form in databases. The proliferation of text-heavy data is particularly overwhelming, reflected in everyone's daily life in forms of web documents, business reviews, news, social posts, etc. In the mean time, textual data and structured entities often come in intertwined, such as authors/posters, document categories and tags, and document-associated geo locations. With this background, a core research challenge presents itself as how to turn massive, (semi-)unstructured data into structured knowledge. One promising paradigm studied in this dissertation is to integrate structured and unstructured data, constructing an organized heterogeneous information network, and developing powerful modeling mechanisms on such organized network. We name it text-rich information network, since it is an integrated representation of both structured and unstructured textual data. To thoroughly develop the construction and modeling paradigm, this dissertation will focus on forming a scalable data-driven framework and propose a new line of techniques relying on the idea of phrase mining to bridge textual documents and structured entities. We will first introduce the phrase mining method named SegPhrase+ to globally discover semantically meaningful phrases from massive textual data, providing a high quality dictionary for text structuralization. Clearly distinct from previous works that mostly focused on raw statistics of string matching, SegPhrase+ looks into the phrase context and effectively rectifies raw statistics to significantly boost the performance. Next, a novel algorithm based on latent keyphrases is developed and adopted to largely eliminate irregularities in massive text via providing an consistent and interpretable document representation. As a critical process in constructing the network, it uses the quality phrases generated in the previous step as candidates. From them a set of keyphrases are extracted to represent a particular document with inferred strength through a statistical model. After this step, documents become more structured and are consistently represented in the form of a bipartite network connecting documents with quality keyphrases. A more heterogeneous text-rich information network can be constructed by incorporating different types of document-associated entities as additional nodes. Lastly, a general and scalable framework, Tensor2vec, are to be added to trational data minining machanism, as the latter cannot readily solve the problem when the organized heterogeneous network has nodes with different types. Tensor2vec is expected to elegantly handle relevance search, entity classification, summarization and recommendation problems, by making use of higher-order link information and projecting multi-typed nodes into a shared low-dimensional vectorial space such that node proximity can be easily computed and accurately predicted

    A tree based keyphrase extraction technique for academic literature

    Get PDF
    Automatic keyphrase extraction techniques aim to extract quality keyphrases to summarize a document at a higher level. Among the existing techniques some of them are domain-specific and require application domain knowledge, some of them are based on higher-order statistical methods and are computationally expensive, and some of them require large train data which are rare for many applications. Overcoming these issues, this thesis proposes a new unsupervised automatic keyphrase extraction technique, named TeKET or Tree-based Keyphrase Extraction Technique, which is domain-independent, employs limited statistical knowledge, and requires no train data. The proposed technique also introduces a new variant of the binary tree, called KeyPhrase Extraction (KePhEx) tree to extract final keyphrases from candidate keyphrases. Depending on the candidate keyphrases the KePhEx tree structure is either expanded or shrunk or maintained. In addition, a measure, called Cohesiveness Index or CI, is derived that denotes the degree of cohesiveness of a given node with respect to the root which is used in extracting final keyphrases from a resultant tree in a flexible manner and is utilized in ranking keyphrases alongside Term Frequency. The effectiveness of the proposed technique is evaluated using an experimental evaluation on a benchmark corpus, called SemEval-2010 with total 244 train and test articles, and compared with other relevant unsupervised techniques by taking the representatives from both statistical (such as Term Frequency-Inverse Document Frequency and YAKE) and graph-based techniques (PositionRank, CollabRank (SingleRank), TopicRank, and MultipartiteRank) into account. Three evaluation metrics, namely precision, recall and F1 score are taken into consideration during the experiments. The obtained results demonstrate the improved performance of the proposed technique over other similar techniques in terms of precision, recall, and F1 scores

    Topic extraction for ontology learning

    Full text link
    This chapter addresses the issue of topic extraction from text corpora for ontology learning. The first part provides an overview of some of the most significant solutions present today in the literature. These solutions deal mainly with the inferior layers of the Ontology Learning Layer Cake. They are related to the challenges of the Terms and Synonyms layers. The second part shows how these pieces can be bound together into an integrated system for extracting meaningful topics. While the extracted topics are not proper concepts as yet, they constitute a convincing approach towards concept building and therefore ontology learning. This chapter concludes by discussing the research undertaken for filling the gap between topics and concepts as well as perspectives that emerge today in the area of topic extraction. © 2011, IGI Global
    corecore