Details

      Description

      Add support for "star" joins. This is a join pattern where many attributes are extracted for a common key prefix. For such joins, we can replace N attribute joins with a single star join extracting those N attribute values.

      This join has edge cases for scale-out which need to be handled specially. When the prefix runs across the left or right boundary of the shard, the attributes are not all available locally. These cases could be handled either using an alternative execution plan (a subrule which does the normal attribute joins) or by deliberately fetching the necessary rows from the right or left sibling shard so as to complete the star join locally.

        Activity

        Hide
        bryanthompson bryanthompson added a comment -

        This is trivially implemented for the SPO or SPOC index if we constrain the choice of the separator key for index partition splits such that it does not split the statements for a given subject (this is not wise for the POS index, but is fine for SPO or SPOC).

        The join will consist of the constraint to be satisfied (predicate and object constraints) plus a list of the additional predicates whose objects should be bound. The predicate position could be a variable in which case all predicates would be matched.

        The original query would be rewritten to specify the star join for the attributes of some subject. When executed, the star join would override the join task on the shard. If the constraint is satisfied, then we can reposition the tuple cursor to the start of the subject (which will typically be on the same leaf) and then scan to pick up the desired predicate and value bindings. When a list of predicates is given, we could also advance the tuple cursor to each desired predicate in turn and then advance it to the start of the next possible subject when the last statement of interest was processed for that subject.

        Unlike a normal join task, the star join can bind many variables in the object position at once. Optionals should be supported for predicates. When there is more than one attribute value for a given predicate, we need to output multiple solutions. The easiest way to do all this is to buffer the matched tuples and then generate and emit the necessary solutions.

        This is a great place to start for pluggable join algorithms.

        Show
        bryanthompson bryanthompson added a comment - This is trivially implemented for the SPO or SPOC index if we constrain the choice of the separator key for index partition splits such that it does not split the statements for a given subject (this is not wise for the POS index, but is fine for SPO or SPOC). The join will consist of the constraint to be satisfied (predicate and object constraints) plus a list of the additional predicates whose objects should be bound. The predicate position could be a variable in which case all predicates would be matched. The original query would be rewritten to specify the star join for the attributes of some subject. When executed, the star join would override the join task on the shard. If the constraint is satisfied, then we can reposition the tuple cursor to the start of the subject (which will typically be on the same leaf) and then scan to pick up the desired predicate and value bindings. When a list of predicates is given, we could also advance the tuple cursor to each desired predicate in turn and then advance it to the start of the next possible subject when the last statement of interest was processed for that subject. Unlike a normal join task, the star join can bind many variables in the object position at once. Optionals should be supported for predicates. When there is more than one attribute value for a given predicate, we need to output multiple solutions. The easiest way to do all this is to buffer the matched tuples and then generate and emit the necessary solutions. This is a great place to start for pluggable join algorithms.
        Hide
        bryanthompson bryanthompson added a comment -

        One more point, the "join" algorithm choice is currently per rule. It needs to be per join dimension. The "join" (the AND) is currently implicit in the IRule. It needs to be made explicit and generalized to handle things like star joins and md-joins.

        Show
        bryanthompson bryanthompson added a comment - One more point, the "join" algorithm choice is currently per rule. It needs to be per join dimension. The "join" (the AND) is currently implicit in the IRule. It needs to be made explicit and generalized to handle things like star joins and md-joins.
        Hide
        bryanthompson bryanthompson added a comment -

        Matt,

        Can you put together some examples of queries which should be reduced to a single star join or to a mixture of star joins and joins of star joins? In particular, I would like to make certain that we capture the full set of constraints which we would like to be able to impose on the star join when considering multiple tuples clustered on the appropriate statement index (CSPO or SPO(C), and we need to specify which) and how the application can request attributes which will be materialized by the star join (e.g., all, all for some property, all for some set of properties).

        I see the star join really as an alternative access path which accepts a binding set from the upstream join dimension and emits binding sets rather than elements.

        It would probably be a good idea for us to look at inlining simple data type values at this time and exploiting them for FILTERs translated into the STAR-JOIN. [1]

        Bryan

        [1] https://sourceforge.net/apps/trac/bigdata/ticket/59

        Show
        bryanthompson bryanthompson added a comment - Matt, Can you put together some examples of queries which should be reduced to a single star join or to a mixture of star joins and joins of star joins? In particular, I would like to make certain that we capture the full set of constraints which we would like to be able to impose on the star join when considering multiple tuples clustered on the appropriate statement index (CSPO or SPO(C), and we need to specify which) and how the application can request attributes which will be materialized by the star join (e.g., all, all for some property, all for some set of properties). I see the star join really as an alternative access path which accepts a binding set from the upstream join dimension and emits binding sets rather than elements. It would probably be a good idea for us to look at inlining simple data type values at this time and exploiting them for FILTERs translated into the STAR-JOIN. [1] Bryan [1] https://sourceforge.net/apps/trac/bigdata/ticket/59
        Hide
        mroycsi mroycsi added a comment -

        One example query that is run often contains multiple star joins:

        PREFIX glitter: <http://openanzo.org/glitter/builtin/functions#>
        PREFIX fn: <http://www.w3.org/2006/sparql-functions#>
        PREFIX owl: <http://www.w3.org/2002/07/owl#>
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        PREFIX dc: <http://purl.org/dc/elements/1.1/>
        PREFIX frame: <http://cambridgesemantics.com/ontologies/2008/07/OntologyService#>

        SELECT ?class ?value

        FROM NAMED <http://cambridgesemantics.com/ontologies/2009/08/Film>
        FROM NAMED <http://cambridgesemantics.com/ontologies/2009/08/Film_frame>

        WHERE {
        GRAPH ?frameOnt {
        ?frameOnt frame:ontology ?ont .
        ?frameClass a frame:FrameClass .
        ?frameClass frame:ontologyClass ?class .
        ?frameClass frame:frameProperty ?frameProp .
        ?frameProp frame:ontologyProperty ?value .
        ?frameProp frame:multiValued ?multiValued1 .
        OPTIONAL { ?frameProp frame:propertyRange ?range1 . } .
        }
        GRAPH ?ont {
        ?value a ?type1 .
        { ?value rdfs:label ?title1 . ?title1 <http://openanzo.org/ontologies/2008/07/Anzo#textmatch>"""name*""" . } UNION
        { ?value rdfs:comment ?description1. ?description1 <http://openanzo.org/ontologies/2008/07/Anzo#textmatch> """name*""" . }
        OPTIONAL { ?value rdfs:label ?title1 } .
        OPTIONAL { ?value rdfs:comment ?description1 } .
        OPTIONAL { ?class rdfs:label ?classTitle1 } .
        }
        }
        GROUP BY ?class ?value
        ORDER BY ?classTitle ?title ?value
        OFFSET 0 LIMIT 20

        Show
        mroycsi mroycsi added a comment - One example query that is run often contains multiple star joins: PREFIX glitter: < http://openanzo.org/glitter/builtin/functions# > PREFIX fn: < http://www.w3.org/2006/sparql-functions# > PREFIX owl: < http://www.w3.org/2002/07/owl# > PREFIX rdfs: < http://www.w3.org/2000/01/rdf-schema# > PREFIX dc: < http://purl.org/dc/elements/1.1/ > PREFIX frame: < http://cambridgesemantics.com/ontologies/2008/07/OntologyService# > SELECT ?class ?value FROM NAMED < http://cambridgesemantics.com/ontologies/2009/08/Film > FROM NAMED < http://cambridgesemantics.com/ontologies/2009/08/Film_frame > WHERE { GRAPH ?frameOnt { ?frameOnt frame:ontology ?ont . ?frameClass a frame:FrameClass . ?frameClass frame:ontologyClass ?class . ?frameClass frame:frameProperty ?frameProp . ?frameProp frame:ontologyProperty ?value . ?frameProp frame:multiValued ?multiValued1 . OPTIONAL { ?frameProp frame:propertyRange ?range1 . } . } GRAPH ?ont { ?value a ?type1 . { ?value rdfs:label ?title1 . ?title1 < http://openanzo.org/ontologies/2008/07/Anzo#textmatch >"""name*""" . } UNION { ?value rdfs:comment ?description1. ?description1 < http://openanzo.org/ontologies/2008/07/Anzo#textmatch > """name*""" . } OPTIONAL { ?value rdfs:label ?title1 } . OPTIONAL { ?value rdfs:comment ?description1 } . OPTIONAL { ?class rdfs:label ?classTitle1 } . } } GROUP BY ?class ?value ORDER BY ?classTitle ?title ?value OFFSET 0 LIMIT 20
        Hide
        bryanthompson bryanthompson added a comment -

        Star joins have been implemented and are passing the test suites when only one AccessPathTask at a time runs in parallel. However, star joins show a decreased throughput with only one access path task. BryanT has modified the JoinTask to support parallel access path tasks, but no improvement has been observed there and some inconsistencies have been observed on LUBM U50 with star joins and more than one concurrent AccessPathTask. MikeP will look into the star-join operator further and see if the source of this inconsistency can be identified and whether he can identify any issues with the implementation which might explain why the star-join is not showing the expected performance boost.

        Show
        bryanthompson bryanthompson added a comment - Star joins have been implemented and are passing the test suites when only one AccessPathTask at a time runs in parallel. However, star joins show a decreased throughput with only one access path task. BryanT has modified the JoinTask to support parallel access path tasks, but no improvement has been observed there and some inconsistencies have been observed on LUBM U50 with star joins and more than one concurrent AccessPathTask. MikeP will look into the star-join operator further and see if the source of this inconsistency can be identified and whether he can identify any issues with the implementation which might explain why the star-join is not showing the expected performance boost.
        Hide
        mikepersonick mikepersonick added a comment -

        Implemented. Did not produce any benefit.

        Show
        mikepersonick mikepersonick added a comment - Implemented. Did not produce any benefit.

          People

          • Assignee:
            mikepersonick mikepersonick
            Reporter:
            bryanthompson bryanthompson
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: