The Big Picture: A Holistic View of E-Book Acquisitions
The merging of two departments into the Acquisitions and Collection Development Department afforded Loyola Marymount University an opportunity to rethink existing workflows, with the acquisition of electronic books (e-books) identified as a critical task to review. Process mapping was used to show the complexity of the different tasks performed in the department and to provide a visualization mechanism for staff to see how their work fit into a sequence of actions within a larger workflow. The authors listed the types of acquisition models used for e-books at their library and constructed process maps for the following six major types: 1. Firm order e-books; 2. Firm order e-book collections; 3. Approval order e-books; 4. Demand-driven e-books; 5. Standing order e-books; and 6. Subscription e-book databases. The authors then merged the individual process maps into a single visualization to show the acquisition process in its entirety and how the different e-book acquisition models relate to and diverge from one another.
Scalable Bioinformatics via Workflow Conversion
Background Reproducibility is one of the tenets of the scientific method.
Scientific experiments often comprise complex data flows, selection of
adequate parameters, and analysis and visualization of intermediate and end
results. Breaking down the complexity of such experiments into the joint
collaboration of small, repeatable, well-defined tasks, each with well-defined
inputs, parameters, and outputs, offers immediate benefits such as identifying
bottlenecks and pinpointing sections that could benefit from parallelization.
Workflows rest upon this notion of splitting complex work into the joint
effort of several manageable tasks. Several engines give users the ability to
design and execute workflows. Each engine was created to address the problems
of a specific community, so each has its own advantages and shortcomings.
Furthermore, not all features of all workflow engines are royalty-free, an
aspect that could potentially drive away members of the scientific community.
Results We have developed a set of tools that
enables the scientific community to benefit from workflow interoperability. We
developed a platform-free, structured representation of the parameters, inputs,
and outputs of command-line tools in so-called Common Tool Descriptor documents.
We have also overcome the shortcomings and combined the features of two
royalty-free workflow engines with a substantial user community: the Konstanz
Information Miner, an engine which we see as a formidable workflow editor, and
the Grid and User Support Environment, a web-based framework able to interact
with several high-performance computing resources. We have thus created a free
and highly accessible way to design workflows on a desktop computer and
execute them on high-performance computing resources. Conclusions Our work
will not only reduce time spent on designing scientific workflows, but also
make executing workflows on remote high-performance computing resources more
accessible to technically inexperienced users. We strongly believe that our
efforts not only decrease the turnaround time to obtain scientific results but
also have a positive impact on reproducibility, thus elevating the quality of
the scientific results obtained.
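The decomposition this abstract describes — small, repeatable tasks with well-defined inputs, parameters, and outputs, chained into a workflow — can be sketched in a few lines of Python. The `Task` and `execute` names below are hypothetical illustrations of the idea, not the authors' tooling or the Common Tool Descriptor format:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """A small, repeatable unit of work with declared inputs and outputs."""
    name: str
    inputs: list
    outputs: list
    run: Callable[[dict], dict]

def execute(tasks, initial):
    """Run tasks in order, passing named artifacts from one task to the next."""
    artifacts = dict(initial)
    for t in tasks:
        missing = [i for i in t.inputs if i not in artifacts]
        if missing:
            raise ValueError(f"{t.name} is missing inputs: {missing}")
        # Each task sees only its declared inputs, making it easy to test alone.
        artifacts.update(t.run({k: artifacts[k] for k in t.inputs}))
    return artifacts

# Hypothetical two-step workflow: normalize raw values, then summarize them.
tasks = [
    Task("normalize", ["raw"], ["scaled"],
         lambda a: {"scaled": [x / max(a["raw"]) for x in a["raw"]]}),
    Task("summarize", ["scaled"], ["mean"],
         lambda a: {"mean": sum(a["scaled"]) / len(a["scaled"])}),
]
result = execute(tasks, {"raw": [2.0, 4.0, 8.0]})
```

Because each task declares its inputs and outputs explicitly, a missing artifact fails fast and any single task can be re-run or parallelized in isolation — the property the abstract credits for exposing bottlenecks.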
RekomGNN: Visualizing, Contextualizing and Evaluating Graph Neural Networks Recommendations
Content recommendation tasks increasingly use Graph Neural Networks, but it
remains challenging for machine learning experts to assess the quality of their
outputs. Visualization systems for GNNs that could support this interrogation
are few. Moreover, those that do exist focus primarily on exposing GNN
architectures for tuning and prediction tasks and do not address the challenges
of recommendation tasks. We developed RekomGNN, a visual analytics system that
supports ML experts in exploring GNN recommendations across several dimensions
and making annotations about their quality. RekomGNN straddles the design space
between neural network and recommender system visualization to arrive at a set
of encoding and interaction choices for recommendation tasks. We found that
RekomGNN helps experts make qualitative assessments of the GNN's results, which
they can use for model refinement. Overall, our contributions and findings add
to the growing understanding of visualizing GNNs for increasingly complex
tasks.
SQL for GPU Data Frames in RAPIDS: Accelerating end-to-end data science workflows using GPUs
In this work, we present BlazingSQL [2], a SQL engine built on RAPIDS open-source software, which allows us to query enterprise data lakes lightning fast with full interoperability with the RAPIDS stack. BlazingSQL makes it simple for data scientists to run SQL queries over raw files directly into GPU memory. RAPIDS can then take these results to continue machine learning, deep learning, and visualization workloads. We present two demo workflows using BlazingSQL and RAPIDS. Moreover, our solution is on average 20-100x faster than an identical query on a Spark cluster at price parity. This significant gain in speed allows us to evaluate the solution on a large, realistic, and challenging set of database use cases. The increasing availability of data has created a need for better techniques and methods to discover knowledge from massive volumes of complex data. For these challenges, CPUs impose performance limits on delivering such solutions; resorting to GPU programming is one approach to overcoming those limits.
GPUs in Machine Learning. CPUs can no longer handle the growing volume of data, and AI/ML is unable to keep up with the growth of the data being processed [3]. GPUs are well known for accelerating training and are able to scale to these new data demands: the bigger the dataset, the larger the training performance difference between CPU and GPU [4]. However, data preparation still happens on CPUs and cannot keep up with GPU-accelerated machine learning.
RAPIDS. RAPIDS [5] is an end-to-end analytics solution on GPUs. More precisely, RAPIDS is a set of open-source libraries for GPU-accelerated data preparation and machine learning, built by multiple contributors such as NVIDIA, Anaconda, and BlazingDB. It covers all the steps of the most common data science pipelines. It is composed of cuDF for data preparation, cuML for machine learning, and cuGraph for graph analytics, all under the standard specification of Apache Arrow [1] in GPU memory.
BlazingSQL and the RAPIDS Ecosystem. RAPIDS [5] allows data scientists to accelerate end-to-end data analytics solutions on GPUs. A fundamental part of RAPIDS is the GPU DataFrame (GDF), whose goal is to support interoperability between GPU applications and define a common GPU in-memory data layer. In this context, the CUDA DataFrame library (cuDF) from RAPIDS covers GPU data processing for GDFs, formed by GPU compute kernels and a pandas-like API [6]. BlazingSQL [2] provides a simple SQL interface to ETL massive datasets into GPU memory for AI and deep learning workloads. Furthermore, BlazingSQL can directly query files, such as CSV and Apache Parquet, on data lakes like HDFS and AWS S3, loading the results directly into GPU memory.
End-to-end workflows. Mortgage loan risk end-to-end processing: train a model to assess the risk of new mortgage loans based on Fannie Mae loan performance data. The BlazingSQL + XGBoost loan risk demo covers the end-to-end analytics workload:
• Data Lake → ETL/Feature Engineering → XGBoost Training
• We built two price-equivalent clusters on GCP, one for Apache Spark and another for BlazingSQL
• BlazingSQL ran the ETL phase of this workload 20x faster than Apache Spark
RAPIDS + BlazingSQL thus outperforms traditional CPU pipelines. Netflow analysis (ETL + visualization): BlazingSQL, the GPU SQL engine built on RAPIDS, worked with our partners at Graphistry to show how you can analyze log data over 100x faster than with Apache Spark at price parity, visually analyzing the VAST netflow data set inside Graphistry to quickly detect anomalous events. We took 65M rows of netflow data in Apache Parquet, and in less than a second our query built a table of nodes and edges to render a visual graph.
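The netflow demo boils down to a single aggregating SQL query that turns raw flow records into a weighted edge table for graph rendering. BlazingSQL executes such queries on GPUs over cuDF DataFrames; the sketch below uses Python's built-in sqlite3 purely as a CPU stand-in to show the shape of that ETL step, with a hypothetical schema (`src`, `dst`, `bytes`) and made-up rows rather than the actual VAST data:

```python
import sqlite3

# Hypothetical netflow records: (source IP, destination IP, bytes transferred).
flows = [
    ("10.0.0.1", "10.0.0.2", 500),
    ("10.0.0.1", "10.0.0.3", 1200),
    ("10.0.0.2", "10.0.0.3", 300),
    ("10.0.0.1", "10.0.0.2", 800),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE netflow (src TEXT, dst TEXT, bytes INTEGER)")
con.executemany("INSERT INTO netflow VALUES (?, ?, ?)", flows)

# One GROUP BY query collapses raw flows into weighted graph edges:
# each (src, dst) pair becomes an edge carrying flow count and total bytes.
edges = con.execute(
    """SELECT src, dst, COUNT(*) AS n_flows, SUM(bytes) AS total_bytes
       FROM netflow
       GROUP BY src, dst
       ORDER BY total_bytes DESC"""
).fetchall()
```

In the demo described above, the same aggregation runs over 65M Parquet rows directly in GPU memory, and the resulting node/edge table is handed to Graphistry for visual anomaly detection.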