Nested TripleGroup Algebra – An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on MapReduce Platforms

Presented at: 8th Extended Semantic Web Conference (ESWC2011)

by Padmashree Ravindra, Hyeongsik Kim, Kemafor Anyanwu

Processing a join operation using MapReduce platforms such as Hadoop incurs communication and I/O costs due to data transfer between the Map and Reduce phases. This cost is prohibitive for RDF graph patterns which typically involve several joins. Existing approaches on optimizing RDF graph pattern matching exploit the existence of star-join structures in RDF graph patterns and propose the use of bushy query execution plans instead of linear ones. However, some Hadoop-based data processing systems such as Yahoo’s Pig only support linear execution plans, and require significant modifications to the scheduler to support bushy plans. In this paper, we propose an approach for “sneaking in” bushy query execution plans into Hadoop by interpreting star joins as “groups of triples” known as TripleGroups. We present an alternative intermediate TripleGroup-based algebra called the Nested Triple Group Algebra (NTGA) for rewriting star-join queries as TripleGroup queries. We also propose a data representation format (RDFMap) that more efficiently supports TripleGroup-based processing than the existing relational tuple infrastructure. We present a comparative performance evaluation of the traditional Pig approach and RAPID+ (Pig extended with NTGA) for graph pattern matching queries on the BSBM benchmark dataset. Results show over 60% performance of our approach over traditional Pig for some tasks.

Keywords: MapReduce, Optimization Techniques, RDF Graph Pattern Matching

Resource URI on the dog food server:

Explore this resource elsewhere: