77 research outputs found

    A novel, Language-Independent Keyword Extraction method

    Obtaining the most representative set of words in a document is a significant task, since it characterizes the document and simplifies search and classification. This paper presents a novel method, called LIKE, that automatically extracts keywords from a document regardless of the language it is written in. To do so, it uses a three-stage process: the first stage identifies the most representative terms, the second builds a numeric representation appropriate for those terms, and the third uses a feed-forward neural network to obtain a predictive model. To measure the efficacy of LIKE, the articles published by the Workshop of Computer Science Researchers (WICC) over the last 14 years (1999-2012) were used. The results obtained show that LIKE outperforms KEA, one of the most widely cited methods in the literature on this topic. X Workshop on Databases and Data Mining. Red de Universidades con Carreras en Informática (RedUNCI).
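    The three-stage pipeline described above can be sketched in a few lines. This is only an illustrative approximation: the candidate filter, the two features (normalized frequency and first-occurrence position), and the linear scorer standing in for the trained feed-forward network are all assumptions of this sketch, not details taken from LIKE itself.

```python
import re
from collections import Counter

def candidate_terms(text, min_len=4):
    """Stage 1 (sketch): keep terms above a minimum length;
    no stoplist, so the step stays language-independent."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(w for w in words if len(w) >= min_len)
    return words, counts

def term_features(words, counts):
    """Stage 2 (sketch): numeric representation per term:
    (normalized frequency, closeness of first occurrence to the start)."""
    n = len(words)
    return {t: (c / n, 1.0 - words.index(t) / n) for t, c in counts.items()}

def score_terms(feats, w_freq=0.7, w_pos=0.3):
    """Stage 3 stand-in: a fixed linear score instead of the trained
    feed-forward neural network used by the actual method."""
    return sorted(feats, key=lambda t: -(w_freq * feats[t][0] + w_pos * feats[t][1]))

text = ("Keyword extraction finds representative words. "
        "Keyword extraction works across languages. Extraction is useful.")
words, counts = candidate_terms(text)
top = score_terms(term_features(words, counts))[:2]
print(top)  # most frequent/earliest terms rank first
```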

    Automatic Keyphrase Extraction: A Survey of the State of the Art


    A New Clustering Technique On Text In Sentence For Text Mining

    Clustering is a commonly considered data mining problem in the text domain. The problem finds numerous applications in customer segmentation, classification, collaborative filtering, visualization, document organization, and indexing. This paper surveys sentence-level clustering: the problems that arise when clustering at the sentence level and the solutions proposed to overcome them. It then presents a novel fuzzy clustering algorithm that operates on relational input data, i.e., data in the form of a square matrix of pairwise similarities between data objects. The Hierarchical Fuzzy Relational Eigenvector Centrality-based Clustering Algorithm (HFRECCA) is an extension of FRECCA used for clustering sentences. The contents of text documents have a hierarchical structure, and many terms in a document relate to more than one theme; hence HFRECCA is well suited to natural language documents. In this algorithm, a single object may belong to more than one cluster.
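    The "relational input data" mentioned above is just a square matrix of pairwise sentence similarities. A minimal sketch of building such a matrix, using Jaccard similarity over word sets (an illustrative choice; FRECCA/HFRECCA do not prescribe this particular measure):

```python
import re

def sentence_sim_matrix(sentences):
    """Build the square pairwise-similarity matrix that relational
    clustering algorithms such as FRECCA/HFRECCA consume as input.
    Jaccard similarity over word sets is used here for simplicity."""
    bags = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(bags)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            union = bags[i] | bags[j]
            sim[i][j] = len(bags[i] & bags[j]) / len(union) if union else 0.0
    return sim

sents = ["the cat sat", "the cat ran", "dogs bark loudly"]
m = sentence_sim_matrix(sents)
# m is symmetric with 1.0 on the diagonal; unrelated sentences score 0.0
```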

    Automatically Finding Significant Topical Terms from Documents

    With the spread of digital textual data, text mining is becoming more and more important for deriving competitive advantages. One factor in successful text mining applications is the ability to find significant topical terms for discovering interesting patterns or relationships. Document keyphrases are phrases carrying the most important topical concepts of a given document. In many applications, keyphrases as textual elements are better suited for text mining and can provide more discriminating power than single words. This paper describes an automatic keyphrase identification program (KIP). KIP's algorithm examines the composition of noun phrases and calculates their scores by looking them up in a domain-specific glossary database; the ones with higher scores are extracted as keyphrases. KIP's learning function can enrich its glossary database by automatically adding newly identified keyphrases. KIP's personalization feature allows the user to build a glossary database specifically suited to his/her area of interest.
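    The glossary-lookup scoring idea can be sketched as follows. The glossary contents and the weighting (per-word hits plus a bonus for a whole-phrase match) are illustrative assumptions of this sketch, not KIP's actual scoring formula.

```python
def score_phrases(phrases, glossary):
    """KIP-style sketch: score each candidate noun phrase by counting
    its words found in a domain glossary, with an extra bonus when the
    whole phrase is itself a glossary entry. Weights are illustrative."""
    scores = {}
    for p in phrases:
        words = p.lower().split()
        hits = sum(1 for w in words if w in glossary)
        whole = 2 if p.lower() in glossary else 0  # full-phrase match weighs more
        scores[p] = hits + whole
    return scores

glossary = {"neural", "network", "neural network", "learning"}
candidates = ["neural network", "machine learning", "office chair"]
s = score_phrases(candidates, glossary)
# higher-scoring phrases would be extracted as keyphrases; a learning
# step could then add them back into the glossary for future documents
```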

    A tree based keyphrase extraction technique for academic literature

    Automatic keyphrase extraction techniques aim to extract quality keyphrases that summarize a document at a higher level. Among the existing techniques, some are domain-specific and require application domain knowledge, some are based on higher-order statistical methods and are computationally expensive, and some require large training data, which is rare for many applications. To overcome these issues, this thesis proposes a new unsupervised automatic keyphrase extraction technique, named TeKET (Tree-based Keyphrase Extraction Technique), which is domain-independent, employs limited statistical knowledge, and requires no training data. The proposed technique also introduces a new variant of the binary tree, called the KeyPhrase Extraction (KePhEx) tree, to extract final keyphrases from candidate keyphrases. Depending on the candidate keyphrases, the KePhEx tree is expanded, shrunk, or left unchanged. In addition, a measure called the Cohesiveness Index (CI) is derived, denoting the degree of cohesiveness of a given node with respect to the root; it is used to extract final keyphrases from a resultant tree in a flexible manner and to rank keyphrases alongside term frequency. The effectiveness of the proposed technique is evaluated experimentally on a benchmark corpus, SemEval-2010, with a total of 244 training and test articles, and compared with other relevant unsupervised techniques, taking representatives from both statistical (Term Frequency-Inverse Document Frequency and YAKE) and graph-based techniques (PositionRank, CollabRank (SingleRank), TopicRank, and MultipartiteRank) into account. Three evaluation metrics, namely precision, recall, and F1 score, are considered in the experiments. The results demonstrate the improved performance of the proposed technique over similar techniques in terms of precision, recall, and F1 score.
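    The three evaluation metrics named above are standard set-overlap measures between extracted and gold keyphrases; a minimal implementation:

```python
def prf1(predicted, gold):
    """Precision, recall and F1 over extracted vs. gold keyphrases,
    the three metrics used in keyphrase-extraction evaluations."""
    pred, ref = set(predicted), set(gold)
    tp = len(pred & ref)                       # correctly extracted phrases
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# toy example: one of three extracted phrases matches the gold set
p, r, f = prf1(["tree", "keyphrase", "corpus"], ["keyphrase", "extraction"])
```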

    A comparison of feature and semantic-based summarization algorithms for Turkish

    Akyokuş, Selim (Dogus Author) -- Conference full title: International Symposium on Innovations in Intelligent Systems and Applications, 21-24 June 2010, Kayseri & Cappadocia, Turkey. In this paper we analyze the performance of one feature-based and two semantic-based text summarization algorithms on a new Turkish corpus. The feature-based algorithm uses statistical analysis of paragraphs, sentences, words, and formal clues found in documents, whereas the two semantic-based algorithms employ the Latent Semantic Analysis (LSA) approach, which enables the selection of the most important sentences in a semantic way. Performance evaluation is conducted by comparing automatically generated summaries with manual summaries produced by a human summarizer. This is the first study that applies LSA-based algorithms to Turkish text summarization, and its results are promising.
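    The core of LSA-based sentence selection can be sketched without any external libraries: build a term-by-sentence count matrix A and approximate its top right singular vector by power iteration on the Gram matrix A^T A; the sentence with the largest component is the most central. This is a simplification (a full LSA summarizer computes a truncated SVD and uses several singular vectors), and the example sentences are made up.

```python
import re
from collections import Counter

def lsa_top_sentence(sentences, iters=50):
    """Approximate the dominant right singular vector of the
    term-by-sentence matrix via power iteration on A^T A and
    return the index of the highest-scoring sentence."""
    tokenize = lambda s: re.findall(r"\w+", s.lower())
    vocab = sorted({w for s in sentences for w in tokenize(s)})
    counts = [Counter(tokenize(s)) for s in sentences]
    A = [[c[w] for c in counts] for w in vocab]          # terms x sentences
    n = len(sentences)
    G = [[sum(A[t][i] * A[t][j] for t in range(len(vocab)))
          for j in range(n)] for i in range(n)]          # G = A^T A
    v = [1.0] * n
    for _ in range(iters):                               # power iteration
        w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return max(range(n), key=lambda i: v[i])

sents = ["the cat sat on the mat",
         "the cat and the dog sat together",
         "quantum physics is hard"]
best = lsa_top_sentence(sents)
```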

    Aggregating skip bigrams into key phrase-based vector space model for web person disambiguation

    The 11th Conference on Natural Language Processing (KONVENS) was organized by ÖGAI and hosted on September 19-21, 2012 in Vienna.

    Document clustering for knowledge synthesis and project portfolio funding decision in R&D organizations

    The paper discusses a method of using document clustering for information/knowledge synthesis and decision facilitation in R&D organisations. The emerging methodologies of machine learning, artificial intelligence, and data science, in conjunction with fuzzy mathematics, can be exploited to catalyse the development of an information bank for research organisations. This knowledge ecosystem can be used by the proposed mechanism to accelerate and reinforce interdisciplinary research in R&D organisations and to empower them to make effective information-driven decisions on project portfolio selection and proposal funding.