
    Quantifying and suppressing ranking bias in a large citation network

    It is widely recognized that citation counts for papers from different fields cannot be directly compared because different scientific fields adopt different citation practices. Citation counts are also strongly biased by paper age, since older papers have had more time to attract citations. Various procedures aim at suppressing these biases and give rise to new normalized indicators, such as the relative citation count. We use a large citation dataset from Microsoft Academic Graph and a new statistical framework based on the Mahalanobis distance to show that the rankings by well-known indicators, including the relative citation count and Google's PageRank score, are significantly biased by paper field and age. Our statistical framework for assessing ranking bias allows us to quantify exactly the contribution of each individual field to the overall bias of a given ranking. We propose a general normalization procedure motivated by the z-score which produces much less biased rankings when applied to citation count and PageRank score.
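A minimal sketch of the z-score idea behind such a normalization: standardize each paper's score against the mean and standard deviation of its own field, so that papers from high-citation and low-citation fields become comparable. The paper IDs, fields, and counts below are hypothetical toy data, not from the study.

```python
from statistics import mean, stdev

# Hypothetical records: (paper_id, field, citation_count).
papers = [
    ("p1", "biology", 120), ("p2", "biology", 80), ("p3", "biology", 40),
    ("p4", "math", 12), ("p5", "math", 8), ("p6", "math", 4),
]

# Group citation counts by field.
by_field = {}
for pid, field, c in papers:
    by_field.setdefault(field, []).append(c)

# z-score: subtract the field mean and divide by the field standard
# deviation, so fields with different citation practices become comparable.
stats = {f: (mean(cs), stdev(cs)) for f, cs in by_field.items()}
z = {pid: (c - stats[f][0]) / stats[f][1] for pid, f, c in papers}

ranking = sorted(z, key=z.get, reverse=True)
```

Note that the top biology paper (120 citations) and the top math paper (12 citations) receive the same z-score here, which is exactly the field-debiasing effect the indicator is after. The paper's actual procedure is more elaborate (it also controls for age), so this is only the core intuition.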

    BiblioDAP'21: The 1st Workshop on Bibliographic Data Analysis and Processing

    Automatic processing of bibliographic data is becoming very important in digital libraries, data science, and machine learning: on one side it is needed to keep pace with the significant yearly increase in published papers, and on the other side it poses inherent challenges. This processing has several aspects, including but not limited to I) automatic extraction of references from PDF documents, II) building an accurate citation graph, and III) author name disambiguation. Bibliographic data is heterogeneous by nature and occurs in both structured (e.g. citation graphs) and unstructured (e.g. publications) formats; it therefore requires data science and machine learning techniques to be processed and analysed. Here we introduce BiblioDAP'21: The 1st Workshop on Bibliographic Data Analysis and Processing.

    Cybersecurity Deep Learning: Approaches, Attacks Dataset, and Comparative Study

    Cyber attacks are increasing rapidly due to the advanced digital technologies used by hackers, and cybercriminals keep conducting such attacks, making cybersecurity a rapidly growing field. Although machine learning techniques have worked well for large-scale cybersecurity problems, the emergence of deep learning (DL) during this period has prompted information security specialists to improve on those results. The DL techniques analyzed in this study are convolutional neural networks (CNNs), recurrent neural networks (RNNs), and deep neural networks in the context of cybersecurity. A framework is proposed, and a real-time laboratory setup is performed to capture network packets and examine the captured data using various DL techniques. A comparative interpretation of the DL techniques is presented in terms of essential parameters, particularly accuracy, false alarm rate, precision, and detection rate. The experimental results show that the DL techniques improve the performance of various real-time cybersecurity applications on a real-time dataset. The CNN model provides the highest accuracy of 98.64% with a precision of 98% for binary classification; the RNN model offers the second-highest accuracy of 97.75%; and the CNN model provides the highest accuracy of 98.42% for multiclass classification. The study shows that DL techniques can be used effectively in cybersecurity applications. Future research areas are elaborated, including potential topics for improving several DL methodologies for cybersecurity applications.
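The four evaluation parameters named in the abstract all derive from a binary confusion matrix. A minimal sketch, using hypothetical counts rather than the study's data:

```python
# Hypothetical confusion-matrix counts for binary intrusion detection:
# tp = attacks correctly flagged, fn = attacks missed,
# fp = benign traffic falsely flagged, tn = benign traffic passed.
tp, fn, fp, tn = 90, 10, 5, 95

accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of all decisions correct
precision = tp / (tp + fp)                   # flagged traffic that is truly attack
detection_rate = tp / (tp + fn)              # recall / true-positive rate
false_alarm_rate = fp / (fp + tn)            # false-positive rate on benign traffic

print(f"acc={accuracy:.3f} prec={precision:.3f} "
      f"dr={detection_rate:.3f} far={false_alarm_rate:.3f}")
```

High accuracy alone can be misleading on imbalanced traffic, which is why intrusion-detection studies such as this one report the false alarm rate and detection rate alongside it.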

    Exploring and Evaluating the Scalability and Efficiency of Apache Spark using Educational Datasets

    Research into the combination of data mining and machine learning technology with web-based education systems (known as educational data mining, or EDM) is becoming imperative in order to enhance the quality of education by moving beyond traditional methods. With the worldwide growth of Information and Communication Technology (ICT), data are becoming available at significantly large volume, high velocity, and extensive variety. In this thesis, four popular data mining methods are applied on Apache Spark, using large volumes of datasets from Online Cognitive Learning Systems, to explore the scalability and efficiency of Spark. Various volumes of datasets are tested on Spark MLlib with different running configurations and parameter tunings. The thesis presents useful strategies for allocating computing resources and tuning so as to take full advantage of the in-memory system of Apache Spark for data mining and machine learning tasks. Moreover, it offers insights that education experts and data scientists can use to manage and improve the quality of education, as well as to analyze and discover hidden knowledge in the era of big data.

    VERSE: Versatile Graph Embeddings from Similarity Measures

    Embedding a web-scale information network into a low-dimensional vector space facilitates tasks such as link prediction, classification, and visualization. Past research has addressed the problem of extracting such embeddings by adapting methods from word embedding to graphs, without defining a clearly comprehensible graph-related objective. Yet, as we show, the objectives used in past works implicitly utilize similarity measures among graph nodes. In this paper, we carry the similarity orientation of previous works to its logical conclusion: we propose VERtex Similarity Embeddings (VERSE), a simple, versatile, and memory-efficient method that derives graph embeddings explicitly calibrated to preserve the distributions of a selected vertex-to-vertex similarity measure. VERSE learns such embeddings by training a single-layer neural network. While its default, scalable version does so by sampling similarity information, we also develop a variant using the full information per vertex. Our experimental study on standard benchmarks and real-world datasets demonstrates that VERSE, instantiated with diverse similarity measures, outperforms state-of-the-art methods in terms of precision and recall in major data mining tasks and supersedes them in time and space efficiency, while the scalable sampling-based variant achieves equally good results as the non-scalable full variant.
    Comment: In WWW 2018: The Web Conference. 10 pages, 5 figures.
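To make the "vertex-to-vertex similarity measure" concrete, here is a minimal sketch of personalized PageRank computed by power iteration: a distribution over vertices, seen from one source vertex, of the kind VERSE can be calibrated to preserve. The toy graph and parameter values are hypothetical, not from the paper.

```python
# Hypothetical toy graph as adjacency lists.
graph = {
    0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3],
}

def personalized_pagerank(graph, source, alpha=0.85, iters=100):
    """Power iteration for personalized PageRank from `source`:
    with probability alpha the walk follows a random out-edge, and
    with probability 1 - alpha it teleports back to the source."""
    n = len(graph)
    p = [1.0 / n] * n
    for _ in range(iters):
        nxt = [0.0] * n
        for u, nbrs in graph.items():
            share = alpha * p[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        nxt[source] += 1.0 - alpha  # teleport mass returns to the source
        p = nxt
    return p

ppr = personalized_pagerank(graph, source=0)
```

Each source vertex yields its own similarity distribution over all vertices; an embedding method in the VERSE spirit trains vertex vectors so that their softmax-normalized dot products match these distributions.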