Blazegraph (by SYSTAP) / BLZG-1895

DataLoader ignores errors in data when used with default config

    Details

    • Type: Improvement
    • Status: Done
    • Priority: Medium
    • Resolution: Done
    • Affects Version/s: BLAZEGRAPH_2_1_0
    • Fix Version/s: BLAZEGRAPH_2_2_0
    • Component/s: DataLoader
    • Labels:
      None

      Description

      As of now, the DataLoader uses the default com.bigdata.rdf.rio.RDFParserOptions.stopAtFirstError=false. When loading a large number of files, this means there is no way to figure out (even from the logs) which files were loaded successfully, and parts of the data might simply be missing. I'd therefore strongly recommend changing the default.
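For illustration, here is a minimal sketch of setting the option explicitly instead of relying on the default. The property key is the one named in the description. The class and method names are hypothetical, and how the resulting Properties object is handed to the DataLoader is an assumption that is not shown here.

```java
import java.util.Properties;

/** Hypothetical helper; only the property key comes from the ticket. */
class StopAtFirstErrorConfig {

    // Key as quoted in the description above.
    static final String STOP_AT_FIRST_ERROR =
            "com.bigdata.rdf.rio.RDFParserOptions.stopAtFirstError";

    /** Builds properties that ask the parser to halt on the first error. */
    static Properties strictParserProperties() {
        Properties props = new Properties();
        props.setProperty(STOP_AT_FIRST_ERROR, "true");
        return props;
    }

    public static void main(String[] args) {
        // Print the resulting key/value pair for inspection.
        Properties props = strictParserProperties();
        System.out.println(STOP_AT_FIRST_ERROR + "=" + props.getProperty(STOP_AT_FIRST_ERROR));
    }
}
```

Setting the value explicitly keeps a load strict regardless of which default a given release ships with.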

        Activity

        michaelschmidt michaelschmidt added a comment -

        Here's an NT file which is rejected iff stopAtFirstError=true (could be used as a test case):

        <http://s> <http://p> <http://o> .
        < http://my.uri> a <http://URI> .
        

        (note the whitespace at the beginning of the second triple's subject).
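To make the difference between the two settings concrete, here is a toy sketch that runs the two lines above through a line-by-line loader. It is not Blazegraph or Sesame code: the class, the `load` method and the crude `looksValid` check are all hypothetical and only mimic the behaviour described in this ticket (abort on the first bad statement versus silently dropping it).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy illustration only; not the actual parser implementation. */
class ToyTripleLoader {

    /** Hypothetical, deliberately crude check: no whitespace directly after '<'. */
    static boolean looksValid(String line) {
        return !line.contains("< ");
    }

    /**
     * Returns the accepted lines. With stopAtFirstError=true the first bad
     * line aborts the load; with false the bad line is skipped.
     */
    static List<String> load(List<String> lines, boolean stopAtFirstError) {
        List<String> accepted = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (looksValid(line)) {
                accepted.add(line);
            } else if (stopAtFirstError) {
                throw new IllegalArgumentException(
                        "line " + (i + 1) + ": malformed IRI: " + line);
            }
            // Otherwise the line is dropped and the caller gets no signal.
        }
        return accepted;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "<http://s> <http://p> <http://o> .",
                "< http://my.uri> a <http://URI> .");
        System.out.println(load(sample, false).size() + " of " + sample.size() + " lines loaded");
        try {
            load(sample, true);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In the lenient mode the call returns normally with only the first statement, which is exactly the situation where data goes missing without any trace.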

        bryanthompson bryanthompson added a comment -

        I recommend that we simply change RDFParserOptions.DEFAULT_STOP_AT_FIRST_ERROR to true.

        [~michaelschmidt] Please document the cases where you assert that this option does not halt if there is an error in the RDF data. Note that bad URLs will always stop the load.

        michaelschmidt michaelschmidt added a comment -

        It seems that I forgot to ticket that up. Looking at the triple above, what we observed when executing with stopAtFirstError=true is that

        a.) the n3 parser correctly rejects the document and
        b.) the nq parser (you may need to add a quads component) does not throw an error (I believe it just skips invalid triples).

        I'm not sure how much we can do there, but it might be worth a look and setting up an example & test case + reporting it back to Sesame if there's nothing we can fix about that.

        bryanthompson bryanthompson added a comment -

        nquads does not believe in the existence of errors. This is due to A. Harth in the original implementation. Not sure that there is an interest in changing this.

        maria.krokhaleva maria.krokhaleva added a comment -

        PR: https://github.com/SYSTAP/bigdata/pull/440

        igorkim igorkim added a comment -

        Merged to master: https://github.com/SYSTAP/bigdata/pull/440#event-716865741

          People

          • Assignee:
            igorkim igorkim
            Reporter:
            michaelschmidt michaelschmidt