Blazegraph (by SYSTAP) / BLZG-110

Inline predeclared URIs and namespaces in 2-3 bytes

    Details

      Description

      The current design provides for pre-declared vocabularies. These are used mainly when writing the inference rules, which rely on the ability to obtain the IV for a given URI from the Vocabulary class.

              IV rdfType = vocab.get(RDF.TYPE);
      

      The Vocabulary is declared to the AbstractTripleStore when it is created and cannot change thereafter (this constraint arises because we guarantee a single representation for a given RDF Value).

      IVs are now much more flexible than the original long termId representation. That flexibility can be exploited to encode pre-declared URIs (but not literals) in 2-3 bytes, that is, one flags byte followed by a 1-2 byte index, as follows. (This cannot be done for literals because the DTE of a Literal is already interpreted as the natural datatype of the Literal.)

      A Vocabulary having no more than 256 distinct URIs would be represented fully inline within 2 bytes. The first byte would be coded as:

      [VTE:=URI; Inline:=true; Extension:=false; DTE:=XSDByte]
      

      The second byte would be the index of the URI in the Vocabulary class. A new method Vocabulary#get(int):URI would be used to retrieve the URI for that byte code.

      Exactly the same mechanism could be used for a Vocabulary having up to 64k distinct URIs. The DTE would be set to XSDShort. The short value would be passed to Vocabulary#get(int) to decode the URI from its inline IV.
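The flags byte described above could be packed and unpacked roughly as follows. This is only a sketch: the class name `FlagsByteSketch`, the 2+1+1+4 bit layout, and the numeric codes for VTE and DTE are assumptions made for illustration and are not taken from this issue.

```java
/** Hypothetical layout of the one-byte IV header; bit positions and codes are assumptions. */
final class FlagsByteSketch {

    // Assumed 2-bit VTE (value type) code for a URI.
    static final int VTE_URI = 0b00;
    // Assumed 4-bit DTE (datatype) codes.
    static final int DTE_XSD_BYTE = 0b0001;
    static final int DTE_XSD_SHORT = 0b0010;

    /** Packs [VTE | Inline | Extension | DTE] into 2 + 1 + 1 + 4 bits. */
    static byte pack(int vte, boolean inline, boolean extension, int dte) {
        if ((vte & ~0x3) != 0 || (dte & ~0xF) != 0) {
            throw new IllegalArgumentException("vte must fit in 2 bits and dte in 4 bits");
        }
        int bits = (vte << 6) | ((inline ? 1 : 0) << 5) | ((extension ? 1 : 0) << 4) | dte;
        return (byte) bits;
    }

    /** Extracts the 2-bit VTE code (mask first to undo sign extension). */
    static int vte(byte flags) {
        return (flags & 0xFF) >> 6;
    }

    static boolean isInline(byte flags) {
        return (flags & 0x20) != 0;
    }

    static boolean isExtension(byte flags) {
        return (flags & 0x10) != 0;
    }

    /** Extracts the 4-bit DTE code. */
    static int dte(byte flags) {
        return flags & 0x0F;
    }
}
```

Under these assumptions, a reader of the key would inspect `dte(flags)` to decide whether one or two index bytes follow, and then pass the resulting index to Vocabulary#get(int).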

      From a practical standpoint, it is good practice to provide the namespace of an ontology as well as the distinct URIs within that ontology when creating a Vocabulary class. This is because the Vocabulary class cannot be "upgraded" later for a given AbstractTripleStore instance (since the IV to Value mapping must be one to one and stable). Therefore, if the ontology is extended in the future, the new URIs can be represented using the inline (2-3 byte) representation of the namespace followed by the inline representation of the localName of each new ontology URI.
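To make the namespace case concrete, here is a minimal sketch of how a URI that was not pre-declared might be split into a declared namespace plus a local name, and how large the inline form would be. The class `NamespaceInlineSketch` and its methods are hypothetical, and UTF-8 is used only as a stand-in for whatever compressed encoding of the local name is actually applied.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: estimates the inline size of namespace + localName. */
final class NamespaceInlineSketch {

    private final List<String> namespaces;

    NamespaceInlineSketch(List<String> declaredNamespaces) {
        // Copy so the index of each namespace stays stable.
        this.namespaces = new ArrayList<>(declaredNamespaces);
    }

    /** Returns the index of the longest declared namespace that prefixes the URI, or -1. */
    int namespaceIndexOf(String uri) {
        int best = -1;
        for (int i = 0; i < namespaces.size(); i++) {
            String ns = namespaces.get(i);
            if (uri.startsWith(ns)
                    && (best < 0 || ns.length() > namespaces.get(best).length())) {
                best = i;
            }
        }
        return best;
    }

    /** Returns the estimated inline size in bytes, or -1 if no declared namespace matches. */
    int inlineSize(String uri) {
        int index = namespaceIndexOf(uri);
        if (index < 0) {
            return -1;
        }
        // 2 bytes (flags + byte index) or 3 bytes (flags + short index) for the namespace.
        int namespaceBytes = index <= 0xFF ? 2 : 3;
        String localName = uri.substring(namespaces.get(index).length());
        // UTF-8 length is an assumption standing in for the real local-name encoding.
        return namespaceBytes + localName.getBytes(StandardCharsets.UTF_8).length;
    }
}
```

For example, with `http://example.org/onto#` declared as a namespace, a later URI such as `http://example.org/onto#NewClass` would cost the 2-byte namespace code plus the bytes of `NewClass` in this sketch.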

        Activity

        bryanthompson bryanthompson added a comment -

        I have implemented URIByteIV and URIShortIV per the description above. These classes pass their test suites and will provide for very compact encoding of pre-declared vocabulary Values.

        The next step will be to extend the existing Vocabulary interface to support the resolution of an IV, where the IV wraps either a short or byte code using URIByteIV or URIShortIV, and to drop the BaseVocabulary class and its triple store integration in favor of stateless Vocabulary implementations.

        bryanthompson bryanthompson added a comment -

        I've modified the Vocabulary interface and the BaseVocabulary implementation per the above design. However, it still serializes the Values and IVs against the sparse row store. I want to think more about how we might version vocabulary class files before changing over to a (more or less) stateless BaseVocabulary class.

        The new Vocabulary mechanism is passing the rdf and sail test suites.

        I am going to leave this issue open until we resolve the questions about serialization of the vocabulary class.

        Before closing this issue, we should also define a default vocabulary class which has all URIs (and the base namespaces for those vocabularies for purposes of efficient representation of vocabulary extensions introduced later) from at least the rdf, rdfs, owl, skos and other common vocabularies. The bigdata namespace should be included, and probably the namespaces for various benchmarks (lubm, bsbm, etc). All of these URIs will be packed inline within the statement indices in only 2-3 bytes. Instance data for the namespaces (or URIs defined after the vocabulary class) will be packed inline within the statement indices in 2-3 bytes plus enough bytes to represent the Unicode compressed local name of the URI. Quite a savings!

        bryanthompson bryanthompson added a comment -

        I've factored out a VocabularyDecl interface so we can make this more modular. There are now VocabularyDecls for RDF, RDFS, XMLSchema, and OWL.

        That is a bare-bones list. We could easily extend the basic vocabulary coverage to Dublin Core, SKOS, etc. People working with a specific ontology will also want to provide a vocabulary declaration for that ontology to take advantage of this (much) more compact representation.
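As a rough illustration of the modular approach, the sketch below combines several declarations into one vocabulary with a stable index per URI. `SketchVocabularyDecl` and `SketchVocabulary` are simplified stand-ins invented for this example (they use plain strings for URIs) and do not reproduce the actual VocabularyDecl or Vocabulary signatures.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for a vocabulary declaration: a fixed, ordered list of URIs. */
interface SketchVocabularyDecl {
    List<String> uris();
}

/** Hypothetical vocabulary assembled from declarations; indices follow declaration order. */
final class SketchVocabulary {

    private final List<String> byIndex = new ArrayList<>();
    private final Map<String, Integer> byUri = new HashMap<>();

    SketchVocabulary(List<SketchVocabularyDecl> decls) {
        for (SketchVocabularyDecl decl : decls) {
            for (String uri : decl.uris()) {
                if (byUri.containsKey(uri)) {
                    continue; // keep the URI-to-index mapping one to one
                }
                byUri.put(uri, byIndex.size());
                byIndex.add(uri);
            }
        }
    }

    /** Resolves an inline index back to its URI. */
    String get(int index) {
        return byIndex.get(index);
    }

    /** Returns the index for a declared URI, or -1 if it was not declared. */
    int indexOf(String uri) {
        Integer index = byUri.get(uri);
        return index == null ? -1 : index;
    }

    int size() {
        return byIndex.size();
    }
}
```

Because indices depend on declaration order in this sketch, the same declarations must always be supplied in the same order, which mirrors the requirement that a vocabulary cannot change once the triple store has been created.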

        I've trimmed down the serialized representation and added several consistency checks when a vocabulary is rebuilt following deserialization, and added a unit test for the RDFSVocabulary (in addition to the integration test with the AbstractTripleStore).

        This issue is code complete. Follow up work should:

        1. provide additional vocabulary declarations; and

        2. update our sample code and properties files to remove the references to NoVocabulary and RDFSVocabulary.

        bryanthompson bryanthompson added a comment -

        Added vocabulary declarations for DCTERMS, DCELEMENTS, SKOS, and FOAF. Also added declarations for LUBM and BSBM for use with those benchmarks.

        The last work item for this issue is (2) above.

        bryanthompson bryanthompson added a comment -

        I am going to extend the vocabulary support such that we inline the 1st 255 URIs in 2 bytes and the rest of the vocabulary in 3 bytes. That will provide a nice added boost.

        bryanthompson bryanthompson added a comment -

        Raised priority: we need to resolve this before we start benchmarking the TERMS branch.

        bryanthompson bryanthompson added a comment -

        Now supports up to 256 2-byte declarations, with the next 64k declarations in 3 bytes.

        Committed revision r5723.
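The tiered scheme in this comment can be sketched as below. This is an illustration under stated assumptions, not the committed code: the class `InlineVocabCodec`, the two flag-byte values, and the choice to store the raw index in the 3-byte form (rather than an offset from 256) are all hypothetical.

```java
import java.util.Arrays;

/** Hypothetical sketch: packs a vocabulary index into 2 or 3 bytes. */
final class InlineVocabCodec {

    // Assumed flag bytes; the real encoding of VTE/Inline/Extension/DTE is not shown here.
    static final byte FLAGS_URI_BYTE = 0x21;  // stands for a URI with DTE:=XSDByte
    static final byte FLAGS_URI_SHORT = 0x22; // stands for a URI with DTE:=XSDShort

    /** Encodes an index in [0, 65535]: 2 bytes if it fits in one byte, otherwise 3. */
    static byte[] encode(int index) {
        if (index < 0 || index > 0xFFFF) {
            throw new IllegalArgumentException("index out of range: " + index);
        }
        if (index <= 0xFF) {
            return new byte[] { FLAGS_URI_BYTE, (byte) index };
        }
        return new byte[] { FLAGS_URI_SHORT, (byte) (index >>> 8), (byte) index };
    }

    /** Decodes the index that would then be passed to Vocabulary#get(int). */
    static int decode(byte[] key) {
        if (key.length == 2 && key[0] == FLAGS_URI_BYTE) {
            return key[1] & 0xFF;
        }
        if (key.length == 3 && key[0] == FLAGS_URI_SHORT) {
            return ((key[1] & 0xFF) << 8) | (key[2] & 0xFF);
        }
        throw new IllegalArgumentException("not an inline vocabulary key: " + Arrays.toString(key));
    }
}
```

Placing the most frequently used URIs (for example rdf:type) among the first 256 declarations would give them the shorter form in this sketch.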


          People

          • Assignee:
            bryanthompson bryanthompson
            Reporter:
            bryanthompson bryanthompson