Details

    • Type: Sub-task
    • Status: Done
    • Priority: Medium
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: BLAZEGRAPH_2_2_0
    • Component/s: None
    • Labels: None

      Activity

      michaelschmidt added a comment -

      Started benchmarks on the benchmark server.
      michaelschmidt added a comment - edited

      Query-performance-wise, the benchmark results look good, very similar to the recent 2.1.1 benchmarks. The following are comparisons to 2.1.0:

      BSBM: https://docs.google.com/spreadsheets/d/1i-JnEy_W5Pt4AWg87oxg564GYkz3zaxxmIS9H4OKssE/edit#gid=949537558
      All benchmarks except BSBM EXPLORE+UPDATE64 are in the range of +/- 0.5%; the latter exhibits the same small regression that we observed for 2.1.1 (caused by the INSERT+DELETE fix).

      LUBM: https://docs.google.com/spreadsheets/d/12rbe77GOqnRmi4yFjWE1D1hXnd2jB-WPUV2uVJZEmwE/edit#gid=569702611
      +4% speedup, which compensates for the unexplained regressions that we observed in 2.1.0, so this is back to normal now.

      GOVTRACK: https://docs.google.com/spreadsheets/d/1MFF3kQmzQv7LzBFjbNX7bhc1eLgCVPQH0vQA6SxeHuQ/edit#gid=369313628
      -0.01% -> stable.

      SP2Bench: https://docs.google.com/spreadsheets/d/1xtOf9C-SycmynC8tZg6aDkjVWU0dcsj-l_da4o0gjtw/edit#gid=1547237788
      +0.03% -> stable.

      The only real variance that we observe is in loading times:
      a.) For govtrack: -4% loading performance compared to 2.1.0 (and even a little more compared to 2.1.1)
      b.) For lubm: -9% loading performance compared to 2.1.0 (and even a little more compared to 2.1.1)
      c.) For BSBM: ~10% regression

      igorkim bryanthompson Do you have any explanation for the loading performance regressions?
      bryanthompson added a comment -

      1. Was master pulled up into this branch?

      2. I suggest that we examine how plain text literals are being stored in the ID2TERM index. Let's verify that the xsd:string datatype is being stripped when storing the literal and then restored when the literal is materialized. Igor, can you please point to the specific locations / code paths along which this is happening?
      igorkim added a comment -

      The XSD.string datatype is stripped from the byte array in com.bigdata.rdf.lexicon.LexiconKeyBuilder.value2Key(Value)#218.
      The RDF.langString datatype is never added to the byte array in the first place in com.bigdata.rdf.lexicon.LexiconKeyBuilder.languageCodeLiteral2key(String, String).

      These datatypes are added back during deserialization in
      com.bigdata.rdf.model.BigdataValueSerializer.deserialize(DataInputBuffer, StringBuilder),
      which supplies the corresponding datatypes (string, langString) via:
      com.bigdata.rdf.model.BigdataValueFactoryImpl.createLiteral(String label)
      com.bigdata.rdf.model.BigdataValueFactoryImpl.createLiteral(String label, String language)
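The strip-on-store / restore-on-materialize pattern described in the comment above can be sketched as follows. This is an illustrative toy, not the actual Blazegraph code: the class, the `^^` string encoding, and both method names are hypothetical stand-ins for the real key-builder and value-factory paths.

```java
// Hypothetical sketch of the pattern described above: xsd:string is
// dropped when building the stored form of a literal and re-attached
// when the literal is materialized. Not the actual Blazegraph code.
public class LiteralDatatypeRoundTrip {

    static final String XSD_STRING = "http://www.w3.org/2001/XMLSchema#string";

    /** Store side: a plain xsd:string literal is stored without its datatype. */
    static String toStoredForm(String label, String datatype) {
        if (XSD_STRING.equals(datatype)) {
            return label;                 // datatype stripped
        }
        return label + "^^" + datatype;   // other datatypes kept explicitly
    }

    /** Materialization side: a bare label is restored as xsd:string. */
    static String[] fromStoredForm(String stored) {
        int i = stored.indexOf("^^");
        if (i < 0) {
            return new String[] { stored, XSD_STRING }; // datatype restored
        }
        return new String[] { stored.substring(0, i), stored.substring(i + 2) };
    }

    public static void main(String[] args) {
        String[] restored = fromStoredForm(toStoredForm("hello", XSD_STRING));
        System.out.println(restored[0] + " " + restored[1]);
    }
}
```

The point of the round trip is that, per RDF 1.1, a plain literal and an xsd:string-typed literal are the same term, so the datatype can be elided in the index without losing information.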
      igorkim added a comment -

      michaelschmidt, bryanthompson: RDFParser implementations were changed in Sesame 2.8:
      https://jira.blazegraph.com/secure/attachment/10433/Changelog%20-%20Sesame%202.8.0%20to%202.8.10.txt

      • RDF 1.1 support, including updates to all parsers and writers.
      • [SES-1893] - Update N-Quads parser and writer to W3C Candidate Recommendation
      • [SES-1894] - Update N-Triples parser and writer to W3C Candidate Recommendation
      • [SES-1952] - Update TriG parser and writer to W3C Proposed Recommendation

      I've checked loading the BSBM 10M dataset (created using sh ./generate -fc -pc 28482 -fn td_10m/dataset -dir td_10m/td_data)
      with the Sesame NTriplesParser from versions 2.7.12 and 2.8.10.
      The results are as follows:
      Sesame 2.7.12: Loaded 10095723 triples in 161.037895183 s, speed = 62691.6 triples/s
      Sesame 2.8.10: Loaded 10095723 triples in 176.546756829 s, speed = 57184.4 triples/s

      Performance differs from run to run, but on average, NTriplesParser 2.8.10 is about 8% slower than 2.7.12.
      There is additional code in this class that checks for unescaped spaces in URIs and handles escapes:
      https://bitbucket.org/openrdf/sesame/diff/core/rio/ntriples/src/main/java/org/openrdf/rio/ntriples/NTriplesParser.java?diff1=3b3d8283942dede052d4a2d6eeafd5cd4f1d652d&diff2=3a4001559765&at=2.9.x#Lcore/rio/ntriples/src/main/java/org/openrdf/rio/ntriples/NTriplesParser.javaT445
      which might be slowing down parsing and the overall load speed.
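The throughput figures above are simply triple count divided by wall-clock seconds; the relative slowdown follows from the two speeds. A minimal sketch of that arithmetic (the parser invocation itself is elided, and the class and method names here are illustrative, not from the attached benchmark source):

```java
// Illustrative sketch of how the triples/s and slowdown figures above
// can be derived. The actual parsing step is elided; this only shows
// the measurement arithmetic using the numbers reported in the comment.
public class LoadThroughput {

    static double triplesPerSecond(long triples, double elapsedSeconds) {
        return triples / elapsedSeconds;
    }

    /** Relative slowdown of speedB versus speedA, as a fraction. */
    static double slowdown(double speedA, double speedB) {
        return (speedA - speedB) / speedA;
    }

    public static void main(String[] args) {
        double v2712 = triplesPerSecond(10095723L, 161.037895183);
        double v2810 = triplesPerSecond(10095723L, 176.546756829);
        System.out.printf("2.7.12: %.1f triples/s%n", v2712);
        System.out.printf("2.8.10: %.1f triples/s%n", v2810);
        System.out.printf("slowdown: %.1f%%%n", 100 * slowdown(v2712, v2810));
    }
}
```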
      bryanthompson added a comment -

      Let's merge this PR. Based on the discussion, the regression is from the parser changes made to support the updated standards.
      igorkim added a comment -

      Attached the source code used to measure the performance change between Sesame 2.7 and 2.8.

  People

    • Assignee: michaelschmidt
    • Reporter: beebs (Brad Bebee)
    • Votes: 0
    • Watchers: 4

  Dates

    • Created:
    • Updated:
    • Resolved: