    Scholarly Big Data Quality Assessment: A Case Study of Document Linking and Conflation with S2ORC

    Recently, the Allen Institute for Artificial Intelligence released the Semantic Scholar Open Research Corpus (S2ORC), one of the largest open-access scholarly big datasets, with more than 130 million scholarly paper records. A significant portion of the S2ORC metadata is automatically generated, and its quality can affect downstream tasks such as citation analysis, citation prediction, and link analysis. In this project, we assess the document linking quality and estimate the document conflation rate for the S2ORC dataset. Using semi-automatically curated ground truth corpora, we estimated that the overall document linking quality is high, with 92.6% of documents correctly linked to six major databases, although the linking quality varies by subject domain. The document conflation rate is around 2.6%, meaning that about 97.4% of documents are unique. We further quantitatively compared three near-duplicate detection methods using the ground truth created from S2ORC. The experiments indicated that locality-sensitive hashing was the best method in terms of effectiveness and scalability, achieving high accuracy (F1 = 0.960) with a substantially reduced runtime. Our code and data are available at https://github.com/lamps-lab/docconflation
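
    A minimal, self-contained sketch of the kind of locality-sensitive hashing approach the abstract refers to is shown below. It is not the authors' released code: record titles are shingled into character 3-grams, MinHash signatures approximate Jaccard similarity, and banding turns near-identical records into candidate duplicate pairs. The document IDs and parameter values are illustrative.

```python
# Minimal sketch of near-duplicate detection with MinHash + LSH banding.
# Not the authors' released code; parameters and document IDs are illustrative.
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=3):
    """Character k-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(shingle_set, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over the shingles."""
    return [
        min(int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(),
                                           digest_size=8).digest(), "big")
            for s in shingle_set)
        for seed in range(num_hashes)
    ]


def lsh_candidate_pairs(docs, num_hashes=64, bands=16):
    """Band each signature; documents sharing any band bucket become candidates."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash_signature(shingles(text), num_hashes)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "rec1": "Scholarly Big Data Quality Assessment: Document Linking with S2ORC",
    "rec2": "Scholarly big data quality assessment: document linking with S2ORC.",
    "rec3": "The State-of-the-Art of Set Visualization",
}
print(lsh_candidate_pairs(docs))  # likely {('rec1', 'rec2')}
```

    With 64 hash functions split into 16 bands of 4 rows, pairs whose Jaccard similarity is well above roughly 0.5 are almost certain to collide in at least one band. This is what makes the approach scale: only colliding pairs need an exact comparison.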

    The State-of-the-Art of Set Visualization

    Sets are a generic data model that has been used in a variety of data analysis problems. Such problems involve analyzing and visualizing set relations between multiple sets defined over the same collection of elements. However, visualizing sets is a non-trivial problem due to the large number of possible relations between them. We provide a systematic overview of state-of-the-art techniques for visualizing different kinds of set relations. We classify these techniques into six main categories according to the visual representations they use and the tasks they support. We compare the categories to provide guidance for choosing an appropriate technique for a given problem. Finally, we identify challenges in this area that need further research and propose possible directions to address these challenges. Further resources on set visualization are available at http://www.setviz.net
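
    As a minimal illustration (not taken from the survey), the sketch below sets up the data model this overview is concerned with: several named sets over one element universe, reduced to pairwise intersection sizes and Jaccard overlaps, the kind of aggregate relation that matrix- and diagram-based set visualizations encode. The set names and elements are hypothetical.

```python
# Minimal illustration of the set-visualization data model: several sets over
# one element universe, reduced to pairwise overlap statistics. Hypothetical data.
from itertools import combinations

sets = {
    "read":     {"p1", "p2", "p3", "p5"},
    "cited":    {"p2", "p3", "p6"},
    "reviewed": {"p3", "p5", "p6", "p7"},
}

universe = set().union(*sets.values())
print(f"{len(sets)} sets over {len(universe)} elements")

# Pairwise relations: intersection size and Jaccard overlap, the kind of
# aggregate a matrix-based set visualization would display per cell.
for (name_a, a), (name_b, b) in combinations(sets.items(), 2):
    inter = len(a & b)
    jaccard = inter / len(a | b)
    print(f"{name_a} & {name_b}: {inter} shared elements (Jaccard {jaccard:.2f})")
```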

    Context dependent spectral unmixing.

    A hyperspectral unmixing algorithm that finds multiple sets of endmembers is proposed. The algorithm, called Context Dependent Spectral Unmixing (CDSU), is a local approach that adapts the unmixing to different regions of the spectral space. It is based on a novel joint objective function that combines context identification and unmixing, modeling contexts as compact clusters and using the linear mixing model as the basis for unmixing. Several variations of CDSU that provide additional desirable features are also proposed. First, Context Dependent Spectral Unmixing using the Mahalanobis Distance (CDSUM) offers the advantage of identifying non-spherical clusters in the high-dimensional spectral space. Second, the Cluster and Proportion Constrained Multi-Model Unmixing (CC-MMU and PC-MMU) algorithms use partial supervision, in the form of cluster or proportion constraints, to guide the search process and narrow the space of possible solutions. The supervision could be provided by an expert, generated by analyzing the consensus of multiple unmixing algorithms, or extracted from co-located data from a different sensor. Third, Robust Context Dependent Spectral Unmixing (RCDSU) introduces possibilistic memberships into the objective function to reduce the effect of noise and outliers in the data. Finally, the Unsupervised Robust Context Dependent Spectral Unmixing (U-RCDSU) algorithm learns the optimal number of contexts in an unsupervised way. The performance of each algorithm is evaluated using synthetic and real data. We show that the proposed methods can identify meaningful and coherent contexts, and appropriate endmembers within each context.

    The second main contribution of this thesis is consensus unmixing. This approach exploits the diversity and similarity of the large number of existing unmixing algorithms to identify an accurate and consistent set of endmembers in the data. We run multiple unmixing algorithms with different parameters and combine the resulting unmixing ensemble using consensus analysis. The extracted endmembers are those that reach a consensus across the multiple runs.

    The third main contribution consists of developing subpixel target detectors that rely on the proposed CDSU algorithms to adapt target detection algorithms to different contexts. A local detection statistic is computed for each context, and all scores are then combined to yield a final detection score. The context dependent unmixing provides a better background description and limits target leakage, which are two essential properties for target detection algorithms.
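
    The linear mixing model that CDSU and its variants build on can be sketched as follows. This is not the thesis implementation: a pixel spectrum x is modeled as E a plus noise, where the columns of E are endmember spectra and the abundance vector a is nonnegative and sums to one. Here nonnegative least squares with renormalization stands in for a fully constrained solver, and all spectra are synthetic.

```python
# Minimal sketch of the linear mixing model underlying CDSU (not the thesis code):
# a pixel spectrum x is modeled as E @ a + noise, with E holding endmember spectra
# in its columns and a holding nonnegative abundances that sum to one.
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)

bands, n_endmembers = 50, 3
E = np.abs(rng.normal(size=(bands, n_endmembers)))  # synthetic endmember spectra
a_true = np.array([0.6, 0.3, 0.1])                  # ground-truth abundances
x = E @ a_true + 0.01 * rng.normal(size=bands)      # noisy mixed pixel

# Nonnegative least squares plus renormalization as a cheap stand-in for the
# fully constrained (sum-to-one) least-squares solution.
a_hat, _ = nnls(E, x)
a_hat /= a_hat.sum()
print("estimated abundances:", np.round(a_hat, 3))
```

    A context-dependent variant would first assign the pixel to a compact cluster (context) in spectral space and then unmix it against that context's own endmember matrix rather than a single global one.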