Details

      Description

      We have received a request to add support for virtual graphs. A virtual graph would be a managed collection of named graphs and would be available in the QUADS mode, though it MIGHT be possible to define virtual graphs in the SIDs/PROVENANCE mode of the database as well.

      A virtual graph could be referenced in a FROM or FROM NAMED clause of a query. As an initial implementation, the reference would be translated into the collection of named graphs in the virtual graph. Query evaluation would be unchanged since this would be only syntactic sugar.

      Virtual graphs would be managed through assertions associating the URI of the virtual graph with the URIs of the named graphs in that virtual graph. One possibility is to express the relationship that a graph is composed of a collection of other graphs using the SPARQL 1.1 Service Description model [1].

      A wiki page exists to document this feature [2].

      [1] http://www.w3.org/TR/sparql11-service-description/#sd-GraphCollection
      [2] https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=VirtualGraphs&action=submit (Virtual Graphs wiki page)

        Activity

        bryanthompson added a comment -

        Our current sequence as it relates to a virtual graphs feature and the parser step in which we perform the resolution of URIs to IVs (internal values resolved against the indices) is as follows. I have left out some steps which do not really pertain to this issue:

        // Run the parser.
        final ASTQueryContainer qc = SyntaxTreeBuilder.parseQuery(queryStr);
        
        // Resolve BigdataValues with their associated IVs.
        new BatchRDFValueResolver(context).process(qc);
        
        // Build the bigdata AST (our syntax oriented abstraction) from the parse tree.
        final QueryRoot queryRoot = buildQueryModel(qc, context);
        
        // Extract the dataset. IVs are already resolved.
        final Dataset dataset = DatasetDeclProcessor.process(qc);
                    
        // Attach the Bigdata specific data set model to the query model.
        queryRoot.setDataset(new DatasetNode(dataset));
        

        It seems that FROM/FROM NAMED or FROM DATASET / FROM NAMED DATASET should be handled by a new step immediately after the BatchRDFValueResolver. This "batch" resolution step helps to avoid overhead on a cluster by performing a single scattered read for all URIs (and Literals) which need to be resolved against the Value => IV index mapping (our TERM2ID index).
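
        To illustrate why the batch step matters, here is a minimal sketch of resolving all terms of a query in one read rather than one read per term. The names BatchResolverSketch, TermIndex and resolveAll are hypothetical stand-ins invented for this example; they are not the bigdata classes, and plain strings and longs stand in for Values and IVs.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration only; these are not bigdata classes. */
final class BatchResolverSketch {

    /** Stand-in for the Value => IV index. Counts how many reads it serves. */
    static final class TermIndex {
        private final Map<String, Long> ids;
        int reads = 0;

        TermIndex(Map<String, Long> ids) {
            this.ids = new HashMap<>(ids);
        }

        /** One read that resolves every requested term; unknown terms are omitted. */
        Map<String, Long> resolveAll(Set<String> terms) {
            reads++;
            Map<String, Long> out = new HashMap<>();
            for (String term : terms) {
                Long id = ids.get(term);
                if (id != null) {
                    out.put(term, id);
                }
            }
            return out;
        }
    }

    /** Collects the distinct terms used by a query and resolves them in a single call. */
    static Map<String, Long> resolve(TermIndex index, List<String> termsInQuery) {
        Set<String> distinct = new LinkedHashSet<>(termsInQuery);
        return index.resolveAll(distinct);
    }

    public static void main(String[] args) {
        Map<String, Long> known = new HashMap<>();
        known.put("http://example.org/a", 1L);
        known.put("http://example.org/b", 2L);
        TermIndex index = new TermIndex(known);
        Map<String, Long> resolved = resolve(index, Arrays.asList(
                "http://example.org/a", "http://example.org/b", "http://example.org/a"));
        System.out.println(resolved.size() + " terms resolved in " + index.reads + " read(s)");
    }
}
```

        A virtual graph lookup placed after this step could reuse the already resolved IVs instead of issuing further reads against the lexicon.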

        That new step would pre-process the DataSet declarations, looking up on the SPO(C) index with (uriIV, sd:namedGraph, ?iv). If this access pattern is non-empty, then we would either replace the uriIV with the contents of the access pattern -or- add the contents of the access pattern to the data set. The former would mean that any pre-existing named graph would be "lost" in some sense if it were made into a virtual graph. The latter would mean that a virtual graph included the union of the named graph for that URI and the collection of named graphs for that URI.
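
        The two options above (replace versus add) can be sketched as follows. This is only an illustration under assumptions: the membership map stands in for the result of the (uriIV, sd:namedGraph, ?iv) access pattern, and VirtualGraphResolutionSketch, Policy and resolve are hypothetical names.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of the two resolution policies; not bigdata code. */
final class VirtualGraphResolutionSketch {

    enum Policy {
        /** The URI is replaced by its member graphs. */
        REPLACE,
        /** The member graphs are added alongside the graph named by the URI. */
        UNION
    }

    /**
     * @param graph   the graph URI found in the dataset declaration
     * @param members stand-in for the access pattern: virtual graph URI => member graph URIs
     */
    static Set<String> resolve(String graph, Map<String, Set<String>> members, Policy policy) {
        Set<String> found = members.getOrDefault(graph, Collections.<String>emptySet());
        Set<String> out = new LinkedHashSet<>();
        if (found.isEmpty()) {
            // Empty access pattern: an ordinary named graph, left untouched.
            out.add(graph);
            return out;
        }
        if (policy == Policy.UNION) {
            out.add(graph);
        }
        out.addAll(found);
        return out;
    }
}
```

        Under REPLACE, any data stored in a named graph with the same URI as the virtual graph is no longer reachable through that URI; under UNION it remains part of the dataset.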

        Are you thinking that we would use the sd:defaultGraph predicate to model virtual graphs for FROM or FROM DATASET? That means that collections which comprise a virtual "defaultGraph" would be independent of collections which comprise a virtual named graph.

        In the absence of any experience with the service description stuff, I am inclined toward a syntax extension in order to clearly differentiate the use of virtual graphs from named graphs. This also makes it possible for us to avoid the additional resolution step(s) against the statement indices for the sd:namedGraph predicate.

        Presumably we should also expose access to the service descriptions via a bare GET at the HTTP SPARQL endpoint [1]. However, I am concerned that such service descriptions could be extremely large if they are required to report all graphs in the default graph or in the available named graphs. It would also seem that the service description as reported by a bare GET would not necessarily provide a means to report on the virtual graphs available from the end point.

        An alternative would be to manage the virtual graphs as distinct HTTP end points and use a RESTful API to create those end points, associating the end point with some set of named graphs. With that approach there would be no SPARQL QUERY syntax extension and each end point could be described, but I still have reservations about sending back very large named graph collections in a service description.

        [1] http://www.w3.org/TR/sparql11-service-description/#accessing

        bryanthompson added a comment -

        I've been talking with DavidBooth regarding virtual graphs support. A summary of my current position follows:

        I woke up today feeling like I have a decision on this. It seems that you have a concern with regard to data in the user space causing accidental merging of graphs within virtual graphs which are then accidentally queried. On the other hand, I have the concerns that (a) there is an added cost to the step resolving a virtual graph to its component graphs; and (b) without introducing a new index, we have no place outside of the "user data" to store those virtual graph => component graph associations.

        Given both of our concerns, I think that it makes sense to do something like Anzo using an explicit syntax to address the virtual graphs and using a bigdata specific predicate to create those associations.

        I believe the Anzo syntax is:

        FROM DATASET
        FROM NAMED DATASET

        Or we could use

        FROM VIRTUAL GRAPH
        FROM NAMED VIRTUAL GRAPH

        Personally, I prefer the latter as it more directly conveys the concept.

        In terms of modeling the virtual graph relationships:

        :vg bd:virtualGraph :vgx

        Would establish a membership relationship indicating that :vgx is a member of the virtual graph :vg. The bd: namespace would be the namespace for bigdata specific extensions.

        As far as I can see, there is no distinction when modeling virtual graph membership between named graphs and the default graph. That is simply a question of whether you reference the virtual graph in the FROM VIRTUAL GRAPH or FROM NAMED VIRTUAL GRAPH clause of the query.

        When a graph identifier is encountered in a FROM (NAMED) VIRTUAL GRAPH clause, it will be resolved to the member graphs and the data set will be rewritten to use the member graphs instead of the specified graph.
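
        As a rough sketch of that rewrite, assuming the bd:virtualGraph assertions have already been read into a membership map: the same membership is used for both clauses, and only the clause decides whether the members land in the default graph or the named graph collection. DatasetRewriteSketch, Kind, Clause and Dataset are hypothetical names for this example, not the bigdata AST classes.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical illustration of the dataset rewrite; not the bigdata implementation. */
final class DatasetRewriteSketch {

    enum Kind { FROM, FROM_NAMED, FROM_VIRTUAL_GRAPH, FROM_NAMED_VIRTUAL_GRAPH }

    /** One dataset clause of the query. */
    static final class Clause {
        final Kind kind;
        final String uri;

        Clause(Kind kind, String uri) {
            this.kind = kind;
            this.uri = uri;
        }
    }

    /** The rewritten dataset: only real graphs, never virtual graph identifiers. */
    static final class Dataset {
        final Set<String> defaultGraphs = new LinkedHashSet<>();
        final Set<String> namedGraphs = new LinkedHashSet<>();
    }

    /**
     * @param membership stand-in for the bd:virtualGraph assertions:
     *                   virtual graph URI => member graph URIs
     */
    static Dataset rewrite(List<Clause> clauses, Map<String, Set<String>> membership) {
        Dataset ds = new Dataset();
        for (Clause c : clauses) {
            Set<String> members = membership.getOrDefault(c.uri, Collections.<String>emptySet());
            switch (c.kind) {
                case FROM:
                    ds.defaultGraphs.add(c.uri);
                    break;
                case FROM_NAMED:
                    ds.namedGraphs.add(c.uri);
                    break;
                case FROM_VIRTUAL_GRAPH:
                    ds.defaultGraphs.addAll(members);
                    break;
                case FROM_NAMED_VIRTUAL_GRAPH:
                    ds.namedGraphs.addAll(members);
                    break;
            }
        }
        return ds;
    }
}
```

        With :vg bd:virtualGraph :vgx asserted, a query using FROM VIRTUAL GRAPH :vg would end up with :vgx in its default graph collection and :vg itself absent from the dataset.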

        I am agnostic as to where those bd:virtualGraph assertions should live. I presume that we will use the SPOC index to locate them. In that case, the assertions will be clustered in the index and can be quickly resolved whether they are in the same graph or different graphs.
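
        The clustering argument can be illustrated with a sorted set ordered by (S, P, O, C): all entries sharing the (virtual graph, bd:virtualGraph) prefix are contiguous, so one range scan finds the members regardless of the context in which each assertion lives. This is a sketch under assumptions; real bigdata keys are built from IVs rather than strings, and SpocScanSketch, Quad and members are hypothetical names.

```java
import java.util.LinkedHashSet;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of a prefix scan on an SPOC-ordered index. */
final class SpocScanSketch {

    /** A statement with its context, compared in S, P, O, C order. */
    static final class Quad implements Comparable<Quad> {
        final String s, p, o, c;

        Quad(String s, String p, String o, String c) {
            this.s = s;
            this.p = p;
            this.o = o;
            this.c = c;
        }

        @Override
        public int compareTo(Quad other) {
            int cmp = s.compareTo(other.s);
            if (cmp == 0) cmp = p.compareTo(other.p);
            if (cmp == 0) cmp = o.compareTo(other.o);
            if (cmp == 0) cmp = c.compareTo(other.c);
            return cmp;
        }
    }

    /** Returns the distinct objects of (vg, predicate, ?o, ?c) using a single range scan. */
    static Set<String> members(NavigableSet<Quad> spoc, String vg, String predicate) {
        // Lower bound: the smallest possible key with this (S, P) prefix.
        Quad from = new Quad(vg, predicate, "", "");
        // Upper bound (exclusive): the smallest key whose predicate sorts after this one.
        Quad to = new Quad(vg, predicate + "\u0000", "", "");
        Set<String> out = new LinkedHashSet<>();
        for (Quad q : spoc.subSet(from, true, to, false)) {
            out.add(q.o);
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableSet<Quad> spoc = new TreeSet<>();
        spoc.add(new Quad(":vg", "bd:virtualGraph", ":g1", ":ctxA"));
        spoc.add(new Quad(":vg", "bd:virtualGraph", ":g2", ":ctxB"));
        spoc.add(new Quad(":vg", "rdfs:label", "example", ":ctxA"));
        System.out.println(members(spoc, ":vg", "bd:virtualGraph"));
    }
}
```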

        I do not believe the presence of the bd:virtualGraph assertions will cause trouble with user data, given the open world nature of RDF and the combination of the bigdata specific assertion and the deliberate use of a syntactic means to indicate that the virtual graph is to be addressed.

        I have added a wiki page to document this feature [1].

        [1] https://sourceforge.net/apps/mediawiki/bigdata/index.php?title=VirtualGraphs&action=submit (Virtual Graphs wiki page)

        bryanthompson added a comment -

        Implemented support for VIRTUAL GRAPHS.

        Modified BatchRDFValueResolver to cache the Value on the IV for
        resolved IVs.

        Added unit test for VIRTUAL GRAPH extension to SPARQL at the AST layer.

        Added unit test for evaluation of VIRTUAL GRAPH extension. (In fact,
        this is hardly required since the translation is identical to the
        translation of FROM and FROM NAMED).

        There is some tricky business concerning the absence of a named graph
        or default graph data set. I followed the notes in DataSetSummary on
        this point:

                /*
                 * Note: Per DAWG tests graph-02 and graph-04, a query against an empty
                 * default graph collection or an empty named graph collection should
                 * be constrained to NO graphs.  This is different from the case where
                 * the dataset is simply not specified, which is interpreted as having
                 * no constraint on the visited graphs.  If you uncomment the next two
                 * lines, both graph-02 and graph-04 in the TCK will fail.
                 */
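
        The distinction drawn in that note can be made concrete with a small sketch. It assumes, purely for illustration, that an unspecified dataset is modeled as null and a declared dataset as a (possibly empty) set; DatasetConstraintSketch and isVisible are hypothetical names and do not reflect how DataSetSummary is written.

```java
import java.util.Collections;
import java.util.Set;

/** Hypothetical illustration of "empty collection" versus "dataset not specified". */
final class DatasetConstraintSketch {

    /**
     * @param graph    the graph being visited
     * @param declared null when the query declares no dataset; otherwise the
     *                 declared (possibly empty) collection of graphs
     * @return whether the query may visit the graph
     */
    static boolean isVisible(String graph, Set<String> declared) {
        if (declared == null) {
            // Dataset not specified: no constraint on the visited graphs.
            return true;
        }
        // Declared collection, even if empty: only its members are visible.
        return declared.contains(graph);
    }

    public static void main(String[] args) {
        System.out.println(isVisible(":g1", null));
        System.out.println(isVisible(":g1", Collections.<String>emptySet()));
    }
}
```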
        

        Committed revision r6059.


          People

          • Assignee:
            bryanthompson bryanthompson
            Reporter:
            bryanthompson bryanthompson