Blazegraph (by SYSTAP) / BLZG-2017

RDF errors in HA test suite when validation is enabled for invalid nquads data

    Details

      Description

      There appear to be some RDF 1.1 related errors in recent db-enterprise test suite runs.

      See https://ci.blazegraph.com/job/db-enterprise/1645/testReport/

      java.util.concurrent.ExecutionException: java.util.concurrent.ExecutionException: org.openrdf.query.UpdateExecutionException: java.lang.RuntimeException: Could not load: url=file:/var/jenkins/workspace/db-enterprise/bigdata-ha-test/src/test/resources/data/foaf/data-2.nq.gz, cause=org.openrdf.rio.RDFParseException: node16s114au1x1 [line 9671]
       at java.util.concurrent.FutureTask.report(FutureTask.java:122)
       at java.util.concurrent.FutureTask.get(FutureTask.java:188)
       at com.bigdata.rdf.sail.webapp.BigdataServlet.submitApiTask(BigdataServlet.java:281)
       at com.bigdata.rdf.sail.webapp.QueryServlet.doSparqlUpdate(QueryServlet.java:458)
       at com.bigdata.rdf.sail.webapp.QueryServlet.doPost(QueryServlet.java:239)
       at com.bigdata.rdf.sail.webapp.RESTServlet.doPost(RESTServlet.java:269)
       at com.bigdata.rdf.sail.webapp.MultiTenancyServlet.doPost(MultiTenancyServlet.java:192)
      
      

        Activity

        bryanthompson added a comment:

        The root cause appears to be illegal IRIs in the source data, which comes from a FOAF crawl made several years ago. N-Quads requires that IRIs be absolute. Unlike Turtle, it does not automatically complete relative IRIs against the base IRI of the source document. Hence the following does not produce valid IRIs when interpreted as N-Quads:

        <a> <b> <c> <d> .
        

        With a recent change, RDF validation is now enabled by default, so the parser rejects source data that contains these invalid IRIs and the tests fail.
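
        The difference between Turtle-style base resolution and N-Quads can be illustrated with a small sketch. It uses java.net.URI from the JDK as a stand-in; this is not the parser's own logic, and the base IRI http://example.org/base/ is a hypothetical value chosen only for the example.

```java
import java.net.URI;

class BaseResolutionDemo {
    // True when the reference carries its own scheme (e.g. "http:").
    static boolean isAbsolute(String ref) {
        return URI.create(ref).isAbsolute();
    }

    // Turtle-style completion: resolve a relative reference against a base IRI.
    // N-Quads has no such step, so a relative reference stays relative.
    static String resolveAgainst(String base, String ref) {
        return URI.create(base).resolve(ref).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.org/base/"; // hypothetical base IRI
        System.out.println("absolute? " + isAbsolute("a"));
        System.out.println("resolved: " + resolveAgainst(base, "a"));
    }
}
```

        Note that java.net.URI follows the URI grammar rather than the full IRI grammar, so this only approximates what an RDF parser would check.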

        The following data files are used in 3 locations:

        • data-0.nq.gz
        • data-1.nq.gz
        • data-2.nq.gz
        • data-3.nq.gz

        The data files need to be fixed and then replicated into each of these locations:

        • bigdata-rdf-test/src/test/resources/data/foaf
        • bigdata-jini-test/src/test/resources/data/foaf
        • bigdata-ha-test/src/test/resources/data/foaf

        michaelschmidt added a comment:

        At first glance, this doesn't look like an RDF 1.1 problem. The problem seems to be that the DataLoader is now stricter when processing inputs. Here's the relevant input from the nquads file:

        <http://www.lassila.org/ora.rdf#me> <http://xmlns.com/foaf/0.1/knows> <node16s114au1x1> <http://www.lassila.org/ora.rdf> .
        <node16s114au1x1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> <http://www.lassila.org/ora.rdf> .
        <node16s114au1x1> <http://xmlns.com/foaf/0.1/name> "Deepali Khushraj" <http://www.lassila.org/ora.rdf> .
        <node16s114au1x1> <http://xmlns.com/foaf/0.1/mbox> <mailto:deepali.khushraj@nokia.com> <http://www.lassila.org/ora.rdf> .
        

        According to the N-Quads standard, <node16s114au1x1> is not a valid URI (no implicit prefix is defined for N-Quads, unlike Turtle with its @base directive, for instance):

        IRIs may be written only as absolute IRIs. IRIs are enclosed in '<' and '>' and may contain numeric escape sequences (described below). For example <http://example.org/#green-goblin>. 
        

        Here's what I plan to do:

        • Write a correct rejection test for such cases, in order to make sure that's indeed the root cause of the problem
        • Fix the input files for the tests (transforming the broken URIs into valid ones) and get the tests through again
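
        One possible way to repair such lines is sketched below. It is not necessarily how the test files were actually fixed: the class name, the BASE prefix, and the choice to prepend a prefix to scheme-less IRIs are all assumptions made for illustration. The sketch rewrites a single N-Quads line and leaves text inside quoted literals untouched; it does not handle '#' comments.

```java
class NQuadsIriFixer {
    // Hypothetical prefix used only for illustration; not taken from the issue.
    static final String BASE = "http://example.org/foaf-crawl/";

    // True if the text starts with an RFC 3987 scheme followed by ':'.
    static boolean hasScheme(String iri) {
        return iri.matches("[A-Za-z][A-Za-z0-9+.-]*:.*");
    }

    // Rewrites every <...> term outside of quoted literals that lacks a scheme.
    static String fixLine(String line) {
        StringBuilder out = new StringBuilder(line.length() + 32);
        boolean inLiteral = false;
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (inLiteral) {
                out.append(c);
                if (c == '\\' && i + 1 < line.length()) {
                    out.append(line.charAt(i + 1)); // copy the escaped character verbatim
                    i += 2;
                    continue;
                }
                if (c == '"') {
                    inLiteral = false; // closing quote of the literal
                }
                i++;
            } else if (c == '"') {
                inLiteral = true; // opening quote of a literal
                out.append(c);
                i++;
            } else if (c == '<') {
                int end = line.indexOf('>', i);
                if (end < 0) { // unterminated term: copy the rest unchanged
                    out.append(line.substring(i));
                    break;
                }
                String iri = line.substring(i + 1, end);
                out.append('<').append(hasScheme(iri) ? iri : BASE + iri).append('>');
                i = end + 1;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String line = "<node16s114au1x1> <http://xmlns.com/foaf/0.1/name> "
                + "\"Deepali Khushraj\" <http://www.lassila.org/ora.rdf> .";
        System.out.println(fixLine(line));
    }
}
```

        For the .nq.gz files, the same function could be applied line by line to a stream wrapped in java.util.zip.GZIPInputStream and written back through GZIPOutputStream.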

        igorkim added a comment (edited):

        According to the RDF 1.1 IRI definition, IRIs should conform to the syntax defined in RFC 3987, which defines the following grammar:

        IRI            = scheme ":" ihier-part [ "?" iquery ] [ "#" ifragment ]
        scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
        

        RFC 3986 defines a similar grammar for URIs as well.

        So a scheme of at least one ALPHA followed by a colon is mandatory, and the IRI <node16s114au1x1> is therefore rejected in org.openrdf.rio.helpers.RDFParserBase.createURI().
        Note, however, that the rejection does not come from the Sesame code itself: it uses BigdataValueFactoryImpl, which calls
        com.bigdata.rdf.model.BigdataURIImpl.BigdataURIImpl(BigdataValueFactory, String), and the latter has thrown IllegalArgumentException since commit 4a2155e91ca7d6159c612011520ccbc6aea73792 (thompsonbry, 2008-04-16 22:59:01).
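
        The scheme rule quoted above can be expressed as a short check. This is a minimal sketch that tests only the mandatory scheme prefix, not full RFC 3987 conformance, and it is not the code in BigdataURIImpl; the class and method names are made up for the example.

```java
import java.util.regex.Pattern;

class IriSchemeCheck {
    // scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":"
    private static final Pattern SCHEME_PREFIX =
            Pattern.compile("^[A-Za-z][A-Za-z0-9+.-]*:");

    // True if the string begins with a syntactically valid scheme and a colon.
    static boolean hasScheme(String iri) {
        return SCHEME_PREFIX.matcher(iri).find();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.lassila.org/ora.rdf",
            "mailto:deepali.khushraj@nokia.com",
            "node16s114au1x1"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + (hasScheme(s) ? "has scheme" : "no scheme"));
        }
    }
}
```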

        michaelschmidt added a comment:

        Added a correct rejection test for invalid URIs and fixed the input files. Here are the two PRs:

        • https://github.com/SYSTAP/bigdata/pull/447
        • https://github.com/SYSTAP/db-enterprise/pull/31

        Note that of the four files listed above, only data-2.nq and data-3.nq were affected. I replaced them in all three locations. Triggered CI and will merge down if it goes through.

        bryanthompson

        michaelschmidt added a comment:

        Bryan and I merged down both branches. Closing issue.


          People

          • Assignee: michaelschmidt
          • Reporter: bryanthompson
          • Votes: 0
          • Watchers: 5
