    Methods to Improve Applicability and Efficiency of Distributed Data-Centric Compute Frameworks

    The success of modern applications depends on the insights they collect from their data repositories. Data repositories for such applications currently exceed exabytes and are rapidly increasing in size, as they collect data from varied sources: web applications, mobile phones, sensors and other connected devices. Distributed storage and data-centric compute frameworks have been invented to store and analyze these large datasets. This dissertation focuses on extending the applicability and improving the efficiency of distributed data-centric compute frameworks.

    Pregelix: Big(ger) Graph Analytics on A Dataflow Engine

    There is a growing need for distributed graph processing systems that are capable of gracefully scaling to very large graph datasets. Unfortunately, this challenge has not been easily met, due to the intense memory pressure imposed by the process-centric, message-passing designs that many graph processing systems follow. Pregelix is a new open source distributed graph processing system based on an iterative dataflow design that is better tuned to handle both in-memory and out-of-core workloads. As such, Pregelix offers improved performance characteristics and scaling properties over current open source systems (e.g., we have seen up to 15x speedup compared to Apache Giraph and up to 35x speedup compared to distributed GraphLab), and makes more effective use of available machine resources to support Big(ger) Graph Analytics.
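
    Pregelix executes programs written against the Pregel vertex-centric model: computation proceeds in supersteps, and in each superstep a vertex consumes the messages sent to it in the previous superstep, updates its value, and sends new messages to its out-neighbours. A minimal sketch of that model for PageRank (hypothetical Python names, not Pregelix's actual Java API):

```python
# Sketch of a Pregel-style vertex program (hypothetical API, not
# Pregelix's actual Java interface). A runtime would call compute()
# on every active vertex once per superstep until all vertices halt.

DAMPING = 0.85
MAX_SUPERSTEPS = 30

class PageRankVertex:
    def __init__(self, vertex_id, out_edges):
        self.id = vertex_id
        self.out_edges = out_edges  # ids of out-neighbours
        self.value = 1.0            # initial rank
        self.halted = False

    def compute(self, superstep, messages, send):
        if superstep > 0:
            # Recompute rank from incoming neighbour contributions.
            self.value = (1 - DAMPING) + DAMPING * sum(messages)
        if superstep < MAX_SUPERSTEPS and self.out_edges:
            share = self.value / len(self.out_edges)
            for target in self.out_edges:
                send(target, share)  # delivered in superstep + 1
        else:
            self.halted = True       # vote to halt
```

    The dataflow twist is that Pregelix evaluates this same logic as joins and group-bys over vertex and message datasets, which is what lets it spill to disk and handle out-of-core workloads that strain purely in-memory, process-centric designs.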

    New and Existing Approaches Reviewing of Big Data Analysis with Hadoop Tools

    Everybody is connected to social media (Facebook, Twitter, LinkedIn, Instagram, etc.), which generate quantities of data so large that traditional applications are inadequate to process them. Social media are regarded as an important platform for sharing the information, opinions, and knowledge of many subscribers. Beyond these basic attributes, big data also raises many issues, such as data collection, storage, transfer, updating, reviewing, posting, scanning, visualization, and data protection. To deal with all these problems, there is a need for an adequate system that not only prepares the data but also provides meaningful analysis useful in demanding situations, whether related to business, decision-making, health, social media, science, telecommunications, or the environment. From their reading of previous studies, the authors observe that various analyses, such as real-time sentiment analysis, have been carried out with Hadoop and its various tools. However, dealing with such big data remains a challenging task, and this type of analysis is practical at scale only through the Hadoop ecosystem. The purpose of this paper is to survey the literature on big data analysis of social media using the Hadoop framework, covering the analysis tools available under the Hadoop umbrella and their orientations, as well as their difficulties and the modern methods used to overcome the challenges of big data in both offline and real-time processing. Real-time analytics accelerates decision-making while providing access to business metrics and reporting. A comparison between Hadoop and Spark is also presented.
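
    To make the Hadoop-versus-Spark contrast concrete: a batch analysis such as counting hashtags needs separate mapper and reducer classes and HDFS round-trips between stages in classic Hadoop MapReduce, whereas Spark keeps intermediate results in memory and expresses the whole pipeline in a few lines. A minimal PySpark sketch (the input path and the one-tweet-per-line format are placeholders, not from the paper):

```python
from pyspark.sql import SparkSession

# Count hashtags in a tweet corpus; the HDFS path and the
# one-tweet's-text-per-line format are placeholder assumptions.
spark = SparkSession.builder.appName("hashtag-count").getOrCreate()
tweets = spark.sparkContext.textFile("hdfs:///data/tweets.txt")

counts = (tweets
          .flatMap(lambda line: line.split())
          .filter(lambda word: word.startswith("#"))
          .map(lambda tag: (tag.lower(), 1))
          .reduceByKey(lambda a, b: a + b)  # intermediate data stays in memory
          .sortBy(lambda pair: pair[1], ascending=False))

for tag, n in counts.take(10):
    print(tag, n)

spark.stop()
```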

    LiteMat: a scalable, cost-efficient inference encoding scheme for large RDF graphs

    The number of linked data sources and the size of the linked open data graph keep growing every day. As a consequence, semantic RDF services are increasingly confronted with various "big data" problems, and query processing in the presence of inferences is one of them. For instance, to complete the answer set of SPARQL queries, RDF database systems evaluate semantic RDFS relationships (subPropertyOf, subClassOf) through time-consuming query rewriting algorithms or space-consuming data materialization solutions. To reduce the memory footprint and ease the exchange of large datasets, these systems generally apply a dictionary approach, compressing triples by replacing resource identifiers (IRIs), blank nodes and literals with integer values. In this article, we present a structured resource identification scheme that uses a clever encoding of concept and property hierarchies to evaluate the main common RDFS entailment rules efficiently while minimizing triple materialization and query rewriting. We show how this encoding can be computed by a scalable parallel algorithm and implemented directly over the Apache Spark framework. The efficiency of our encoding scheme is demonstrated by an evaluation conducted over both synthetic and real-world datasets. (8 pages, 1 figure.)
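
    The core idea is to make identifiers carry the hierarchy: if every class id begins with the bit pattern of its superclass's id, then an rdfs:subClassOf test becomes an integer prefix check and a class's whole subtree maps to a contiguous id interval, so neither query rewriting nor triple materialization is needed for subsumption. A small Python sketch of this style of encoding (illustrative of the general idea, not the paper's exact algorithm):

```python
# Prefix-based hierarchy encoding in the spirit of LiteMat
# (illustrative sketch, not the paper's exact algorithm).
# A child's id extends its parent's bit pattern, so a subClassOf
# test reduces to a shift and an equality comparison.

def encode_hierarchy(children, root):
    """Assign (bits, bit_length) codes; child codes extend the parent's."""
    codes = {root: (0b1, 1)}
    stack = [root]
    while stack:
        node = stack.pop()
        bits, length = codes[node]
        kids = children.get(node, [])
        width = max(1, len(kids).bit_length())  # bits for child slots 1..n
        for slot, kid in enumerate(kids, start=1):
            codes[kid] = ((bits << width) | slot, length + width)
            stack.append(kid)
    return codes

def is_subclass(codes, sub, sup):
    sbits, slen = codes[sub]
    pbits, plen = codes[sup]
    # sup is an ancestor (or sub itself) iff its code prefixes sub's code.
    return slen >= plen and (sbits >> (slen - plen)) == pbits

children = {"Agent": ["Person", "Organization"], "Person": ["Student"]}
codes = encode_hierarchy(children, "Agent")
assert is_subclass(codes, "Student", "Agent")        # transitive, one shift
assert not is_subclass(codes, "Organization", "Person")
```

    The same trick applies to property hierarchies, and because each node's code depends only on its parent's, the assignment can proceed level by level in parallel, which is one reason an Apache Spark implementation is natural.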

    Mapping Large Scale Research Metadata to Linked Data: A Performance Comparison of HBase, CSV and XML

    OpenAIRE, the Open Access Infrastructure for Research in Europe, comprises a database of all EC FP7 and H2020 funded research projects, including metadata of their results (publications and datasets). These data are stored in an HBase NoSQL database, post-processed, and exposed as HTML for human consumption and as XML through a web service interface; as an intermediate format to facilitate statistical computations, CSV is generated internally. To interlink the OpenAIRE data with related data on the Web, we aim at exporting them as Linked Open Data (LOD). The LOD export must integrate into the overall data processing workflow, in which derived data are regenerated from the base data every day. We thus faced the challenge of identifying the best-performing conversion approach. We evaluated the performance of creating LOD by a MapReduce job on top of HBase, by mapping the intermediate CSV files, and by mapping the XML output. (Accepted at the Metadata and Semantics Research Conference.)
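
    Whatever the source format, the target of each conversion is the same: RDF triples serialized for the LOD export. A hedged sketch of the CSV branch (the column names and URI patterns below are invented for illustration, not OpenAIRE's actual schema), mapping rows to N-Triples:

```python
import csv

# Illustrative CSV-to-RDF mapping (invented columns and URI patterns,
# not OpenAIRE's actual schema): each row describing a funded project
# becomes a few N-Triples lines.

def escape_literal(text):
    return text.replace("\\", "\\\\").replace('"', '\\"')

def row_to_ntriples(row):
    subject = f"<http://example.org/project/{row['project_id']}>"
    yield f'{subject} <http://purl.org/dc/terms/title> "{escape_literal(row["title"])}" .'
    yield f'{subject} <http://example.org/ontology/fundedBy> "{escape_literal(row["funder"])}" .'

with open("projects.csv", newline="", encoding="utf-8") as src, \
     open("projects.nt", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        for triple in row_to_ntriples(row):
            dst.write(triple + "\n")
```

    The engineering question the paper answers is not this mapping itself but where to run it most efficiently: as a MapReduce job reading HBase directly, or over the already-derived CSV or XML representations.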

    Distributed Semantic Web Data Management in HBase and MySQL Cluster

    Various computing and data resources on the Web are being enhanced with machine-interpretable semantic descriptions to facilitate better search, discovery and integration. This interconnected metadata constitutes the Semantic Web, whose volume can potentially grow to the scale of the Web itself. Efficient management of Semantic Web data, expressed using the W3C's Resource Description Framework (RDF), is crucial for supporting new data-intensive, semantics-enabled applications. In this work, we study and compare two approaches to distributed RDF data management, one based on emerging cloud computing technologies and the other on traditional relational database clustering technologies. In particular, we design distributed RDF data storage and querying schemes for HBase and MySQL Cluster and conduct an empirical comparison of these approaches on a cluster of commodity machines, using datasets and queries from the Third Provenance Challenge and the Lehigh University Benchmark. Our study reveals interesting patterns in query evaluation, shows that our algorithms are promising, and suggests that cloud computing has great potential for scalable Semantic Web data management. (In Proc. of the 4th IEEE International Conference on Cloud Computing, CLOUD'11.)
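
    A common design for triple stores over wide-column systems, and a useful mental model for an HBase scheme of this kind (the layout below is illustrative, not necessarily the authors' exact design), is to write each triple into several index tables keyed by different permutations of subject, predicate and object, so that any triple pattern with at least one bound term becomes a row-key prefix scan:

```python
# Illustrative permuted-index layout for RDF triples over an
# HBase-style store (not necessarily the paper's exact scheme).

SEP = "\x00"  # unambiguous separator inside composite row keys

def index_rows(s, p, o):
    """Row keys under which one triple is stored, one per index table."""
    return {
        "T_spo": SEP.join((s, p, o)),
        "T_pos": SEP.join((p, o, s)),
        "T_osp": SEP.join((o, s, p)),
    }

def scan_prefix_for(s=None, p=None, o=None):
    """Pick the index table and row-key prefix for a triple pattern;
    None marks an unbound position."""
    if s and p:
        return "T_spo", SEP.join([s, p] + ([o] if o else []))
    if s and o:
        return "T_osp", SEP.join([o, s])
    if s:
        return "T_spo", s
    if p:
        return "T_pos", SEP.join([p] + ([o] if o else []))
    if o:
        return "T_osp", o
    return "T_spo", ""  # unconstrained pattern: full scan

print(scan_prefix_for(s="ex:alice"))                 # ('T_spo', 'ex:alice')
print(scan_prefix_for(p="rdf:type", o="ex:Person"))  # ('T_pos', 'rdf:type\x00ex:Person')
```

    A plausible relational counterpart is a single triple table with composite B-tree indexes over the same attribute orders; the paper's empirical question is how these two storage styles compare under real query workloads.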
