
    BIP! NDR (NoDoiRefs): A Dataset of Citations From Papers Without DOIs in Computer Science Conferences and Workshops

    Full text link
    In the field of Computer Science, conference and workshop papers are important contributions and carry substantial weight in research assessment processes, in contrast to other disciplines. However, a considerable number of these papers are not assigned a Digital Object Identifier (DOI); as a result, their citations are not reported in widely used citation datasets like OpenCitations and Crossref, limiting citation analysis. While the Microsoft Academic Graph (MAG) previously addressed this issue by providing substantial coverage, its discontinuation has created a void in the available data. BIP! NDR aims to alleviate this issue and enhance research assessment processes within the field of Computer Science. To accomplish this, it leverages a workflow that identifies and retrieves Open Science papers lacking DOIs from the DBLP corpus and, by performing text analysis, extracts citation information directly from their full text. The current version of the dataset contains more than 510K citations made by approximately 60K open access Computer Science conference or workshop papers that, according to DBLP, do not have a DOI.
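
    To make the first step of such a workflow concrete, the following sketch streams DBLP conference records from a locally downloaded dblp.xml dump and keeps those whose electronic-edition links contain no DOI. This is only an illustration, not the actual BIP! NDR pipeline; the use of lxml, the "doi.org" heuristic, and the file name are assumptions.

        from lxml import etree

        def papers_without_doi(dblp_xml_path, record_tag="inproceedings"):
            """Yield (key, title) of DBLP records whose <ee> links contain no DOI URL."""
            for _, elem in etree.iterparse(dblp_xml_path, tag=record_tag, load_dtd=True):
                ee_links = [ee.text or "" for ee in elem.findall("ee")]
                if not any("doi.org" in link for link in ee_links):  # heuristic: no DOI listed
                    yield elem.get("key"), elem.findtext("title") or ""
                elem.clear()  # keep memory bounded while streaming the large dump

        # Hypothetical usage (requires dblp.xml and its DTD for entity resolution):
        # for key, title in papers_without_doi("dblp.xml"):
        #     print(key, title)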

    Piloting topic-aware research impact assessment features in BIP! Services

    Full text link
    Various research activities rely on citation-based impact indicators. However, these indicators are usually computed globally, hindering their proper interpretation in applications like research assessment and knowledge discovery. In this work, we advocate the use of topic-aware, categorical impact indicators to alleviate the aforementioned problem. In addition, we extend BIP! Services to support these indicators and showcase their benefits in real-world research activities. (Comment: 5 pages, 2 figures)

    ATRAPOS: Evaluating Metapath Query Workloads in Real Time

    Full text link
    Heterogeneous information networks (HINs) represent different types of entities and the relationships between them. Exploring, analysing, and extracting knowledge from such networks relies on metapath queries that identify pairs of entities connected by relationships of diverse semantics. While the real-time evaluation of metapath query workloads on large, web-scale HINs is computationally very demanding, current approaches do not exploit interrelationships among the queries. In this paper, we present ATRAPOS, a new approach for the real-time evaluation of metapath query workloads that leverages a combination of efficient sparse matrix multiplication and intermediate result caching. ATRAPOS selects intermediate results to cache and reuse by detecting frequent sub-metapaths among workload queries in real time, using a tailor-made data structure, the Overlap Tree, and an associated caching policy. Our experimental study on real data shows that ATRAPOS accelerates exploratory data analysis and mining on HINs, outperforming off-the-shelf caching approaches and state-of-the-art research prototypes in all examined scenarios. (Comment: 13 pages, 19 figures)
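
    The core idea can be sketched as follows: a metapath such as Author-Paper-Venue corresponds to a chain of sparse adjacency-matrix products, and shared sub-metapaths can be cached and reused. This is a minimal illustration under assumed toy matrices and a naive cache-everything policy, not the ATRAPOS implementation or its Overlap Tree.

        import numpy as np
        from scipy.sparse import random as sparse_random

        rng = np.random.default_rng(0)
        # Toy HIN: Author->Paper, Paper->Paper (citations), Paper->Venue adjacency matrices.
        mats = {
            "AP": sparse_random(1000, 5000, density=1e-3, format="csr", random_state=rng),
            "PP": sparse_random(5000, 5000, density=5e-4, format="csr", random_state=rng),
            "PV": sparse_random(5000, 200, density=2e-3, format="csr", random_state=rng),
        }

        cache = {}  # sub-metapath -> intermediate product (the role ATRAPOS assigns to its Overlap Tree)

        def evaluate(metapath):
            """Evaluate a metapath (tuple of relation names) by chained sparse products,
            starting from the longest cached prefix and caching every new prefix (naive policy)."""
            result, start = None, 0
            for k in range(len(metapath), 0, -1):
                if metapath[:k] in cache:
                    result, start = cache[metapath[:k]], k
                    break
            for i in range(start, len(metapath)):
                step = mats[metapath[i]]
                result = step if result is None else result @ step
                cache[metapath[:i + 1]] = result
            return result

        apv = evaluate(("AP", "PV"))         # Author-Paper-Venue
        appv = evaluate(("AP", "PP", "PV"))  # reuses the cached ("AP",) prefix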

    jmelot/SoftwareImpactHackathon2023_InstitutionalOSS: Post-hackathon cleanup 2023

    No full text
    Repo as of post-hackathon cleanup. Preliminary results available, but more evaluation and result filtering needed.

    BIP! DB: A Dataset of Impact Measures for Scientific Publications

    No full text
    This dataset contains citation-based impact indicators (a.k.a. "measures") for ~153M distinct DOIs that correspond to scientific articles. In particular, for each DOI, we have calculated the following indicators (organized in categories based on the semantics of the impact aspect that they best capture):

    Influence indicators (i.e., indicators of the "total" impact of each article; how established the article is in general):

    Citation Count: The total number of citations of the article; the most well-known influence indicator.

    PageRank score: An influence indicator based on PageRank [1], a popular network analysis method. PageRank estimates the influence of each article based on its centrality in the whole citation network. It alleviates some issues of the Citation Count indicator (e.g., two articles with the same number of citations can have significantly different PageRank scores if the aggregated influence of the articles citing them is very different; the article receiving citations from more influential articles will get a larger score).

    Popularity indicators (i.e., indicators of the "current" impact of each article; how popular the article is currently):

    RAM score: A popularity indicator based on the RAM method [2]. It is essentially a Citation Count where recent citations are considered more important. This type of "time awareness" alleviates problems of methods like PageRank, which are biased against recently published articles (new articles need time to receive a number of citations that can be indicative of their impact).

    AttRank score: A popularity indicator based on the AttRank method [3]. AttRank alleviates PageRank's bias against recently published papers by incorporating an attention-based mechanism, akin to a time-restricted version of preferential attachment, to explicitly capture a researcher's preference to read papers which received a lot of attention recently.

    Impulse indicators (i.e., indicators of the initial momentum that the article receives right after its publication):

    Incubation Citation Count (3-year CC): This impulse indicator is a time-restricted version of the Citation Count, where the time window length is fixed for all papers and depends on the publication date of the paper, i.e., only citations within 3 years of each paper's publication are counted.

    More details about the aforementioned impact indicators, the way they are calculated, and their interpretation can be found at https://bip.imsi.athenarc.gr/site/indicators and in the respective references (e.g., in [5]).

    From version 5.1 onward, the impact indicators are calculated at two levels: the DOI level (assuming that each DOI corresponds to a distinct scientific article) and the OpenAIRE-id level (leveraging DOI synonyms based on OpenAIRE's deduplication algorithm [4]; each distinct article has its own OpenAIRE id). Previous versions of the dataset only provided the scores at the DOI level.

    Also, from version 7 onward, for each article we also provide an impact class, which informs the user about the percentile into which the article's score falls compared to the impact scores of the other articles in the database. The impact classes are: C1 (top 0.01%), C2 (top 0.1%), C3 (top 1%), C4 (top 10%), and C5 (bottom 90%).

    Finally, before version 10, the calculation of the impact scores (and classes) was based on a citation network with one node for each article with a distinct DOI that we could find in our input data sources. From version 10 onward, the nodes are deduplicated using the most recent version of the OpenAIRE article deduplication algorithm (https://graph.openaire.eu/docs/graph-production-workflow/deduplication/research-products). This enabled a correction of the scores (more specifically, we avoid counting citation links multiple times when they are made by multiple versions of the same article). As a result, each node in the citation network we build is a deduplicated article with a distinct OpenAIRE id. We still report the scores at the DOI level (i.e., we assign a score to each of the versions/instances of the article); however, these DOI-level scores are just the scores of the respective deduplicated nodes propagated accordingly (i.e., all versions of the same deduplicated article receive the same scores). We have removed a small number of instances (having a DOI) that were erroneously assigned to multiple deduplicated records in the OpenAIRE Graph.

    For each calculation level (DOI / OpenAIRE-id) we provide five (5) compressed CSV files (one for each measure/score) where each line follows the format "identifier <tab> score <tab> class". The parameter setting of each measure is encoded in the corresponding filename. For more details on the different measures/scores see our extensive experimental study [5] and the configuration of AttRank in the original paper [3]. Files for the OpenAIRE-id case contain the keyword "openaire_ids" in the filename.

    From version 9 onward, we also provide topic-specific impact classes for DOI-identified publications. In particular, we associated those articles with 2nd-level concepts from OpenAlex (284 in total); we chose to keep only the three most dominant concepts for each publication, based on their confidence score, and only if this score was greater than 0.3. Then, for each publication and impact measure, we compute its class within its respective concepts. Finally, we provide the "topic_based_impact_classes.txt" file, where each line follows the format "identifier <tab> concept <tab> pagerank_class <tab> attrank_class <tab> 3-cc_class <tab> cc_class".

    The data used to produce the citation network on which we calculated the provided measures have been gathered from the OpenAIRE Graph v6.0.1, including data from (a) the OpenCitations COCI dataset (Jan-2023 version), (b) a MAG [6,7] snapshot from Dec-2021, and (c) a Crossref snapshot from May-2023 (before version 10, these sources were gathered independently). The union of all distinct citations that could be found in these sources has been considered. In addition, versions later than v.10 leverage the filtering rules described at https://graph.openaire.eu/docs/graph-production-workflow/aggregation/non-compatible-sources/doiboost/#crossref-filtering to remove from the dataset DOIs with problematic metadata.

    References:
    [1] L. Page, S. Brin, R. Motwani, and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.
    [2] Rumi Ghosh, Tsung-Ting Kuo, Chun-Nan Hsu, Shou-De Lin, and Kristina Lerman. 2011. Time-Aware Ranking in Dynamic Citation Networks. In Data Mining Workshops (ICDMW). 373–380.
    [3] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Ranking Papers by their Short-Term Scientific Impact. CoRR abs/2006.00951 (2020).
    [4] P. Manghi, C. Atzori, M. De Bonis, A. Bardi: Entity Deduplication in Big Data Graphs for Scholarly Communication. Data Technologies and Applications (2020).
    [5] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Impact-Based Ranking of Scientific Publications: A Survey and Experimental Evaluation. TKDE 2019 (early access).
    [6] Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW '15 Companion). ACM, New York, NY, USA, 243-246. DOI: 10.1145/2740908.2742839.
    [7] K. Wang et al.: A Review of Microsoft Academic Services for Science of Science Studies. Frontiers in Big Data, 2019. DOI: 10.3389/fdata.2019.00045.

    Our Academic Search Engine built on top of these data is available at https://bip.imsi.athenarc.gr/. Note that we also provide all calculated scores through the BIP! Finder API (https://bip-api.imsi.athenarc.gr/documentation).

    Terms of use: These data are provided "as is", without any warranties of any kind. The data are provided under the Creative Commons Attribution 4.0 International license.

    More details about BIP! DB can be found in our peer-reviewed publication, which we kindly request that any published research making use of BIP! DB cite: Thanasis Vergoulis, Ilias Kanellos, Claudio Atzori, Andrea Mannocci, Serafeim Chatzopoulos, Sandro La Bruzzo, Natalia Manola, Paolo Manghi: BIP! DB: A Dataset of Impact Measures for Scientific Publications. WWW (Companion Volume) 2021: 456-460.
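
    As a usage illustration, the sketch below reads one score file in the documented "identifier <tab> score <tab> class" layout and keeps the DOIs in the top impact classes. The file name is hypothetical (real filenames encode each measure's parameter setting), and the file is assumed to have been decompressed beforehand.

        import csv

        def load_scores(path):
            """Read one score file with lines of the form: identifier <tab> score <tab> class."""
            scores = {}
            with open(path, newline="", encoding="utf-8") as f:
                for identifier, score, impact_class in csv.reader(f, delimiter="\t"):
                    scores[identifier] = (float(score), impact_class)
            return scores

        pagerank = load_scores("pagerank_scores.txt")  # hypothetical file name
        top_dois = [doi for doi, (_, cls) in pagerank.items() if cls in ("C1", "C2")]  # top 0.1% and above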

    BIP! DB: A Dataset of Impact Measures for Research Products

    No full text
    This dataset contains citation-based impact indicators (a.k.a. "measures") for ~168.8M distinct PIDs (persistent identifiers) that correspond to research products (scientific publications, datasets, etc.). In particular, for each PID, we have calculated the following indicators (organized in categories based on the semantics of the impact aspect that they best capture):

    Influence indicators (i.e., indicators of the "total" impact of each research product; how established it is in general):

    Citation Count: The total number of citations of the product; the most well-known influence indicator.

    PageRank score: An influence indicator based on PageRank [1], a popular network analysis method. PageRank estimates the influence of each product based on its centrality in the whole citation network. It alleviates some issues of the Citation Count indicator (e.g., two products with the same number of citations can have significantly different PageRank scores if the aggregated influence of the products citing them is very different; the product receiving citations from more influential products will get a larger score).

    Popularity indicators (i.e., indicators of the "current" impact of each research product; how popular the product is currently):

    RAM score: A popularity indicator based on the RAM method [2]. It is essentially a Citation Count where recent citations are considered more important. This type of "time awareness" alleviates problems of methods like PageRank, which are biased against recently published products (new products need time to receive a number of citations that can be indicative of their impact).

    AttRank score: A popularity indicator based on the AttRank method [3]. AttRank alleviates PageRank's bias against recently published products by incorporating an attention-based mechanism, akin to a time-restricted version of preferential attachment, to explicitly capture a researcher's preference to examine products which received a lot of attention recently.

    Impulse indicators (i.e., indicators of the initial momentum that the research product received right after its publication):

    Incubation Citation Count (3-year CC): This impulse indicator is a time-restricted version of the Citation Count, where the time window length is fixed for all products and depends on the publication date of the product, i.e., only citations within 3 years of each product's publication are counted.

    More details about the aforementioned impact indicators, the way they are calculated, and their interpretation can be found at https://bip.imsi.athenarc.gr/site/indicators and in the respective references (e.g., in [5]).

    From version 5.1 onward, the impact indicators are calculated at two levels: the PID level (assuming that each PID corresponds to a distinct research product) and the OpenAIRE-id level (leveraging PID synonyms based on OpenAIRE's deduplication algorithm [4]; each distinct product has its own OpenAIRE id). Previous versions of the dataset only provided the scores at the PID level.

    From version 12 onward, two types of PIDs are included in the dataset: DOIs and PMIDs (before that version, only DOIs were included).

    Also, from version 7 onward, for each product we also provide an impact class, which informs the user about the percentile into which the product's score falls compared to the impact scores of the other products in the database. The impact classes are: C1 (top 0.01%), C2 (top 0.1%), C3 (top 1%), C4 (top 10%), and C5 (bottom 90%).

    Finally, before version 10, the calculation of the impact scores (and classes) was based on a citation network with one node for each product with a distinct PID that we could find in our input data sources. From version 10 onward, the nodes are deduplicated using the most recent version of the OpenAIRE article deduplication algorithm (https://graph.openaire.eu/docs/graph-production-workflow/deduplication/research-products). This enabled a correction of the scores (more specifically, we avoid counting citation links multiple times when they are made by multiple versions of the same product). As a result, each node in the citation network we build is a deduplicated product with a distinct OpenAIRE id. We still report the scores at the PID level (i.e., we assign a score to each of the versions/instances of the product); however, these PID-level scores are just the scores of the respective deduplicated nodes propagated accordingly (i.e., all versions of the same deduplicated product receive the same scores). We have removed a small number of instances (having a PID) that were erroneously assigned to multiple deduplicated records in the OpenAIRE Graph.

    For each calculation level (PID / OpenAIRE-id) we provide five (5) compressed CSV files (one for each measure/score) where each line follows the format "identifier <tab> score <tab> class". The parameter setting of each measure is encoded in the corresponding filename. For more details on the different measures/scores see our extensive experimental study [5] and the configuration of AttRank in the original paper [3]. Files for the OpenAIRE-id case contain the keyword "openaire_ids" in the filename.

    From version 9 onward, we also provide topic-specific impact classes for PID-identified products. In particular, we associated those products with 2nd-level concepts from OpenAlex; we chose to keep only the three most dominant concepts for each product, based on their confidence score, and only if this score was greater than 0.3. Then, for each product and impact measure, we compute its class within its respective concepts. Finally, we provide the "topic_based_impact_classes.txt" file, where each line follows the format "identifier <tab> concept <tab> pagerank_class <tab> attrank_class <tab> 3-cc_class <tab> cc_class".

    The data used to produce the citation network on which we calculated the provided measures have been gathered from the OpenAIRE Graph v7.0.0, including data from (a) the OpenCitations COCI & POCI datasets, (b) MAG [6,7], and (c) Crossref. The union of all distinct citations that could be found in these sources has been considered. In addition, versions later than v.10 leverage the filtering rules described at https://graph.openaire.eu/docs/graph-production-workflow/aggregation/non-compatible-sources/doiboost/#crossref-filtering to remove from the dataset PIDs with problematic metadata.

    References:
    [1] L. Page, S. Brin, R. Motwani, and T. Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report. Stanford InfoLab.
    [2] Rumi Ghosh, Tsung-Ting Kuo, Chun-Nan Hsu, Shou-De Lin, and Kristina Lerman. 2011. Time-Aware Ranking in Dynamic Citation Networks. In Data Mining Workshops (ICDMW). 373–380.
    [3] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Ranking Papers by their Short-Term Scientific Impact. CoRR abs/2006.00951 (2020).
    [4] P. Manghi, C. Atzori, M. De Bonis, A. Bardi: Entity Deduplication in Big Data Graphs for Scholarly Communication. Data Technologies and Applications (2020).
    [5] I. Kanellos, T. Vergoulis, D. Sacharidis, T. Dalamagas, Y. Vassiliou: Impact-Based Ranking of Scientific Publications: A Survey and Experimental Evaluation. TKDE 2019 (early access).
    [6] Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MAS) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW '15 Companion). ACM, New York, NY, USA, 243-246. DOI: 10.1145/2740908.2742839.
    [7] K. Wang et al.: A Review of Microsoft Academic Services for Science of Science Studies. Frontiers in Big Data, 2019. DOI: 10.3389/fdata.2019.00045.

    Our Academic Search Engine built on top of these data is available at https://bip.imsi.athenarc.gr/. Note that we also provide all calculated scores through the BIP! Finder API (https://bip-api.imsi.athenarc.gr/documentation).

    Terms of use: These data are provided "as is", without any warranties of any kind. The data are provided under the Creative Commons Attribution 4.0 International license.

    More details about BIP! DB can be found in our peer-reviewed publication, which we kindly request that any published research making use of BIP! DB cite: Thanasis Vergoulis, Ilias Kanellos, Claudio Atzori, Andrea Mannocci, Serafeim Chatzopoulos, Sandro La Bruzzo, Natalia Manola, Paolo Manghi: BIP! DB: A Dataset of Impact Measures for Scientific Publications. WWW (Companion Volume) 2021: 456-460.
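
    A companion sketch for the topic-based file described above, whose lines follow the documented six-column tab-separated layout: it groups products by OpenAlex concept and keeps those in the top citation-count classes. The file is assumed to be downloaded and decompressed; the choice of classes to keep is illustrative.

        import csv
        from collections import defaultdict

        def top_products_per_concept(path, wanted=("C1", "C2")):
            """Group PIDs by concept, keeping products whose citation-count class is in `wanted`.
            Expected columns: identifier, concept, pagerank_class, attrank_class, 3-cc_class, cc_class."""
            per_concept = defaultdict(list)
            with open(path, newline="", encoding="utf-8") as f:
                for pid, concept, pr_cls, att_cls, cc3_cls, cc_cls in csv.reader(f, delimiter="\t"):
                    if cc_cls in wanted:
                        per_concept[concept].append(pid)
            return per_concept

        groups = top_products_per_concept("topic_based_impact_classes.txt")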

    DIANA-mirExTra v2.0: Uncovering microRNAs and transcription factors with crucial roles in NGS expression data

    No full text
    Differential expression analysis (DEA) is one of the main instruments utilized for revealing molecular mechanisms in pathological and physiological conditions. DIANA-mirExTra v2.0 (http://www.microrna.gr/mirextrav2) performs a combined DEA of mRNAs and microRNAs (miRNAs) to uncover miRNAs and transcription factors (TFs) playing important regulatory roles between two investigated states. The web server takes as input miRNA/RNA-Seq read count data sets that can be uploaded for analysis. Users can combine their data with 350 small RNA-Seq and 65 RNA-Seq libraries analyzed in-house and provided by DIANA-mirExTra v2.0. The web server utilizes miRNA:mRNA, TF:mRNA, and TF:miRNA interactions derived from extensive experimental data sets. More than 450 000 miRNA interactions and 2 000 000 TF binding sites from specific or high-throughput techniques have been incorporated, while accurate miRNA TSS annotation is obtained from the microTSS experimental/in silico framework. These comprehensive data sets enable users to perform analyses based solely on experimentally supported information and to uncover central regulators within sequencing data: miRNAs controlling mRNAs and TFs regulating mRNA or miRNA expression. The server also supports predicted miRNA:gene interactions from DIANA-microT-CDS for 4 species (human, mouse, nematode, and fruit fly). DIANA-mirExTra v2.0 has an intuitive user interface and is freely available to all users without any login requirement.