
    Hierarchical clustering of a Finnish newspaper article collection with graded relevance assessments

    Search facilitated with agglomerative hierarchical clustering methods was studied in a collection of Finnish newspaper articles (N = 53,893). To allow quick experiments, clustering was applied to a sample (N = 5,000) that was reduced with principal components analysis. The dendrograms were heuristically cut to find an optimal partition, whose clusters were compared with each of the 30 queries to retrieve the best-matching cluster. The four-level relevance assessment was collapsed into a binary one by considering (A) all the relevant documents and (B) only the highly relevant documents as relevant. Single linkage (SL) was the worst method: it created many tiny clusters, and, consequently, searches enabled with it had high precision and low recall. The complete linkage (CL), average linkage (AL), and Ward's method (WM) returned reasonably sized clusters, typically of 18-32 documents. Their recall (A: 27-52%, B: 50-82%) was higher than that of the SL clusters, and their precision (A: 83-90%, B: 18-21%) was comparable to it. The AL and WM clustering had 1-8% better effectiveness than nearest neighbor searching (NN), and SL and CL were 1-9% less effective than NN. However, the differences were not statistically significant. When evaluated with the liberal assessment A, the results suggest that the AL and WM clustering offer better retrieval ability than NN. Under assessment B, the AL and WM clustering are better than NN when recall is considered more important than precision. The results imply that collections in highly inflectional and agglutinative languages, such as Finnish, may be clustered in the same way as collections in English, provided that the documents are appropriately preprocessed.
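The collapse of graded relevance into the binary assessments A and B, and the precision and recall of a retrieved cluster under each, can be sketched as follows. This is a minimal illustration, not the authors' code: the 0-3 grade scale, the function names, and the toy data are assumptions, since the abstract only states that there are four levels.

```python
from typing import Dict, Iterable, Tuple

# Assumed grade scale (not given in the abstract):
# 0 = non-relevant, 1 and 2 = relevant to some degree, 3 = highly relevant.
HIGHLY_RELEVANT = 3


def is_relevant(grade: int, assessment: str) -> bool:
    """Collapse a four-level grade into a binary judgment."""
    if assessment == "A":  # liberal: every relevant grade counts
        return grade >= 1
    if assessment == "B":  # strict: only highly relevant documents count
        return grade >= HIGHLY_RELEVANT
    raise ValueError("assessment must be 'A' or 'B'")


def precision_recall(
    retrieved: Iterable[str], grades: Dict[str, int], assessment: str
) -> Tuple[float, float]:
    """Precision and recall of one retrieved cluster for one query."""
    retrieved_set = set(retrieved)
    relevant = {doc for doc, g in grades.items() if is_relevant(g, assessment)}
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical graded judgments for one query and one best-matching cluster.
grades = {"d1": 3, "d2": 1, "d3": 0, "d4": 2, "d5": 3}
cluster = ["d1", "d2", "d3"]
result_a = precision_recall(cluster, grades, "A")
result_b = precision_recall(cluster, grades, "B")
```

In this toy case the same cluster scores higher precision under A than under B, because B counts fewer of the retrieved documents as relevant.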

    Text Classifiers for Automatic Articles Categorization


    Knowledge-Based Categorization of Scientific Articles for Similarity Predictions

    Staying aware of new approaches emerging within specific areas can be challenging for researchers, who have to follow many feeds such as journal articles and authors’ papers, often filtered only by basic keyword-based matching algorithms. Hence, this paper proposes an information retrieval process for scientific articles that aims to suggest semantically related articles using exclusively a knowledge base. The first step categorizes articles by disambiguating their keywords and identifying common categories within the knowledge base. Then, similar articles are identified using the information extracted from the categorization, such as synonyms. The experimental evaluation shows that the proposed approach significantly outperforms the well-known cosine similarity measure computed over word2vec embedding vectors. Indeed, there is a difference of 30% for P@k (k∈[1,100]) in favor of the proposed approach.
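The baseline and the metric named in this abstract can be illustrated with a short sketch: cosine similarity between two embedding vectors, and precision at k (P@k) for a ranked list of suggested articles. The function names and the sample identifiers are hypothetical, and the vectors stand in for word2vec embeddings that the abstract does not describe in detail.

```python
import math
from typing import Sequence, Set


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked items that are relevant.

    Divides by k even when fewer than k items were ranked.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked[:k]
    return sum(1 for doc in top_k if doc in relevant) / k


# Hypothetical ranking of suggested articles and the set judged related.
ranking = ["a", "b", "c", "d"]
related = {"a", "c"}
p_at_2 = precision_at_k(ranking, related, 2)
p_at_4 = precision_at_k(ranking, related, 4)
```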

    Maintaining Reading Experience Continuity Across E-Book Revisions

    Get PDF
    E-book readers let users create digital learning footprints in many forms, such as highlighted sentences or memos. Nowadays, they also allow instructors to update their e-books in the reader. However, users often have trouble finding the learning footprints they made when they open a new version of an e-book. As a result, the continuity of users’ reading experience across e-book revisions is hard to maintain, which is a shortcoming of the e-book system. In this paper, to maintain that continuity, we address the transfer of learning footprints such as markers, memos, and bookmarks across e-book revisions on an e-book reader in a coursework scenario. We first introduce the problem and review related work on it and on page similarity comparison. We then compare three page similarity comparison methods, which use similarity computing models to compute pairwise page similarity at the image level, the text level, and the image & text level. For each level, we analyze the performance of transferring learning footprints across e-book revisions and the optimal threshold for deciding that two pages are similar. We then report the performance of the three methods and present an error analysis that identifies the types of errors occurring in the results. Based on these results and the error analysis, we propose page image & text similarity comparison as the optimal method for automatically transferring learning footprints across e-book revisions. Finally, we present the discussion and conclusions.
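The idea of matching pages across revisions with a similarity threshold can be sketched at the text level only. The abstract does not name its similarity models, so this example swaps in token-set Jaccard similarity as a stand-in; the function names, the 0.5 threshold, and the sample pages are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Optional, Set


def tokens(text: str) -> Set[str]:
    """Lowercased word tokens of a page's text."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (a stand-in, not the paper's model)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty pages are treated as identical
    return len(ta & tb) / len(ta | tb)


def map_pages(
    old_pages: List[str], new_pages: List[str], threshold: float = 0.5
) -> Dict[int, Optional[int]]:
    """Map each old page index to its most similar new page index.

    A page maps to None when no new page reaches the threshold, meaning
    its footprints cannot be transferred automatically.
    """
    mapping: Dict[int, Optional[int]] = {}
    for i, old in enumerate(old_pages):
        best_j, best_score = None, 0.0
        for j, new in enumerate(new_pages):
            score = jaccard(old, new)
            if score > best_score:
                best_j, best_score = j, score
        mapping[i] = best_j if best_score >= threshold else None
    return mapping


# Hypothetical page texts from two revisions of the same e-book.
old_pages = [
    "clustering groups similar documents",
    "precision and recall measure retrieval",
]
new_pages = [
    "a new introduction page",
    "clustering groups similar documents together",
    "completely different content here",
]
page_map = map_pages(old_pages, new_pages)
```

Here the first old page moves to index 1 in the new revision, so a marker or memo on it would follow; the second old page has no counterpart above the threshold and is left unmapped.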