  1. Blazegraph (by SYSTAP)
  2. BLZG-533

Vector the query engine on the native heap

    Details

      Description

      Improve vectoring, heap management and throughput of the query engine and pipeline operators.

      1. Rationalize the various kinds of vectored / chunked iterator mechanisms.

      2. Simplify to remove the IAsynchronousIterator construct. This was developed back in the day when we used asynchronous iterators to move data between nodes.

      3. Efficiently represent solutions flowing through the query engine using a technique similar to that developed for vectoring the HTree [1]. In particular, solutions were encoded as IV[]s and a separate cache was maintained to map IVs to materialized BigdataValues. However, this effort will need to go further and probably use a lower level access to the encoded IV[] data for efficient operations WITHOUT deserializing solutions and without creating huge numbers of iterators for the IBindingSet lists.

      4. Use the same vectoring mechanism to efficiently represent data for wire transfers.

      @see https://sourceforge.net/apps/trac/bigdata/ticket/395 (HTree hash join performance)
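
As an illustration of item 3, here is a minimal sketch of keeping solutions as rows of encoded ids in one flat array, with a separate dictionary used only when a value must be materialized. Every name in it (EncodedSolutions, encode, materialize, countMatching) is hypothetical, and plain long ids stand in for IVs; it is not the Blazegraph representation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: solutions stored as fixed-width rows of encoded ids. */
final class EncodedSolutions {
    // Separate cache: maps between materialized values and compact ids.
    private final Map<String, Long> idByValue = new HashMap<>();
    private final List<String> valueById = new ArrayList<>();

    private final int width;          // number of variables per solution
    private long[] rows = new long[16]; // all solutions, row after row
    private int size = 0;             // number of solutions stored

    EncodedSolutions(int width) {
        this.width = width;
    }

    /** Assigns (or reuses) a compact id for a materialized value. */
    long encode(String value) {
        Long id = idByValue.get(value);
        if (id == null) {
            id = (long) valueById.size();
            idByValue.put(value, id);
            valueById.add(value);
        }
        return id;
    }

    /** Looks the value up in the cache; only needed when results are rendered. */
    String materialize(long id) {
        return valueById.get((int) id);
    }

    /** Appends one solution, encoding each binding on the way in. */
    void add(String... bindings) {
        if (bindings.length != width) {
            throw new IllegalArgumentException("expected " + width + " bindings");
        }
        while ((size + 1) * width > rows.length) {
            rows = Arrays.copyOf(rows, rows.length * 2);
        }
        for (int i = 0; i < width; i++) {
            rows[size * width + i] = encode(bindings[i]);
        }
        size++;
    }

    /** Scans one column comparing encoded ids only; nothing is deserialized. */
    int countMatching(int column, long id) {
        int n = 0;
        for (int r = 0; r < size; r++) {
            if (rows[r * width + column] == id) {
                n++;
            }
        }
        return n;
    }

    int size() {
        return size;
    }
}
```

The scan in countMatching touches only the flat array, which is the point of the technique: no per-solution objects or iterators are created until a value is actually materialized.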

          Activity

          bryanthompson bryanthompson added a comment -

          We can easily write out the intermediate solutions queued in front of an operator using a SolutionSetStream backed by the MemStore for the query. This is a low overhead sequential encoding of the solutions with optional compression.
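
To make "sequential encoding with optional compression" concrete, here is a rough sketch that writes solutions one after another into a byte stream and optionally wraps the stream in GZIP. The class and method names are hypothetical, solutions are simplified to arrays of long ids, and a byte array stands in for the MemStore; the actual SolutionSetStream format is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch of a sequential solution encoding with optional compression. */
final class SolutionStreamSketch {

    /** Writes a count, then each solution as a length followed by its ids. */
    static byte[] write(long[][] solutions, boolean compress) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try {
            OutputStream sink = compress ? new GZIPOutputStream(bytes) : bytes;
            try (DataOutputStream out = new DataOutputStream(sink)) {
                out.writeInt(solutions.length);
                for (long[] solution : solutions) {
                    out.writeInt(solution.length);
                    for (long id : solution) {
                        out.writeLong(id);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the solutions back in the order they were written. */
    static long[][] read(byte[] data, boolean compressed) {
        try {
            InputStream source = new ByteArrayInputStream(data);
            if (compressed) {
                source = new GZIPInputStream(source);
            }
            try (DataInputStream in = new DataInputStream(source)) {
                long[][] solutions = new long[in.readInt()][];
                for (int i = 0; i < solutions.length; i++) {
                    long[] solution = new long[in.readInt()];
                    for (int j = 0; j < solution.length; j++) {
                        solution[j] = in.readLong();
                    }
                    solutions[i] = solution;
                }
                return solutions;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the encoding is purely sequential, a reader can consume solutions in order without an index, which keeps the overhead low for queues that are written once and drained once.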

          bryanthompson bryanthompson added a comment -

          Proposed approach: Refactor the BlockingBuffer pattern to allow the use of the native heap (SolutionSetStream backed by MemStore) when the analytic mode is specified.

          bryanthompson bryanthompson added a comment - - edited

          I have created a new IChunkHandler implementation based on a SolutionSetStream. QueryHints now defines a new field (QUERY_ENGINE_CHUNK_HANDLER) whose value can be controlled by the system property "queryEngineChunkHandler".

          PR: https://github.com/SYSTAP/bigdata/pull/468

          The default (in the PR) is to use the native heap to store all intermediate solutions. This may have a performance impact on lightweight queries. The policy could be made adaptive easily enough by declaring a different IChunkHandler which looks at the number of solutions in the chunk to decide whether or not to move them onto the native heap. Also note that the IRunningQuery is available. This can be used to access the QueryEngine and from there the QueryEngineCounters. Counters defined and tracked there could be used to define a dynamic policy based on the total number of solutions on the managed object heap or the native heap.
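
A size-based adaptive policy of the kind described could look roughly like the sketch below. It does not implement the real IChunkHandler interface: AdaptiveChunkPolicySketch, StoredChunk and the threshold parameter are hypothetical, solutions are simplified to long ids, and a direct ByteBuffer stands in for the MemStore-backed native heap.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: small chunks stay on the managed heap, large ones go off-heap. */
final class AdaptiveChunkPolicySketch {

    /** Stand-in for whatever handle the real handler would return. */
    interface StoredChunk {
        boolean isNative();

        long[] solutions();
    }

    private final int nativeThreshold; // chunks with at least this many solutions go native

    AdaptiveChunkPolicySketch(int nativeThreshold) {
        this.nativeThreshold = nativeThreshold;
    }

    StoredChunk store(long[] chunk) {
        if (chunk.length < nativeThreshold) {
            // Small chunk: keep it as an ordinary array on the managed object heap.
            final long[] copy = chunk.clone();
            return new StoredChunk() {
                public boolean isNative() {
                    return false;
                }

                public long[] solutions() {
                    return copy.clone();
                }
            };
        }
        // Large chunk: copy it into memory outside the managed object heap.
        final ByteBuffer direct = ByteBuffer.allocateDirect(chunk.length * Long.BYTES);
        direct.asLongBuffer().put(chunk);
        return new StoredChunk() {
            public boolean isNative() {
                return true;
            }

            public long[] solutions() {
                long[] out = new long[direct.capacity() / Long.BYTES];
                direct.asLongBuffer().get(out);
                return out;
            }
        };
    }
}
```

The threshold trades copy cost against GC pressure: lightweight queries never pay for the off-heap copy, while large intermediate results stop contributing to the managed heap.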

          This PR does not address BLZG-1968, which would put the output solutions of the top-level queries onto the native heap.

              /**
               * Controls where the intermediate solutions output by operators will be
               * stored. Options include the managed object heap, the native heap, or
               * potentially some policy which stores things dynamically depending on the
               * size of the chunk or the total memory burden on the query engine.
               * <p>
               * The effective value of this property is determined by the effective value of
               * the system property {@value #QUERY_ENGINE_CHUNK_HANDLER}.
               * 
               * @see BLZG-533 Vector query engine on native heap.
               */
              String QUERY_ENGINE_CHUNK_HANDLER = "queryEngineChunkHandler";
          
              IChunkHandler DEFAULT_QUERY_ENGINE_CHUNK_HANDLER = 
                      ClassPathUtil.classForName(//
                              System.getProperty(QUERY_ENGINE_CHUNK_HANDLER,
          //                          com.bigdata.bop.engine.ManagedHeapStandloneChunkHandler.class.getName()
                                      com.bigdata.bop.engine.NativeHeapStandloneChunkHandler.class.getName()
                                      ), // preferredClassName,
                              null, // defaultClass,
                              IChunkHandler.class, // sharedInterface,
                              IChunkHandler.class.getClassLoader() // classLoader
                        );
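
For readers unfamiliar with the pattern above, the following standalone sketch shows the same idea using only the JDK: a system property names the implementation class, with a compiled-in default when the property is unset. It does not use ClassPathUtil, and the interface, handler classes, and property name ("sketch.chunkHandler") are all hypothetical.

```java
/** Hypothetical sketch: pick an implementation class by system property. */
final class HandlerLoaderSketch {

    /** Stand-in for the shared interface that all handlers implement. */
    interface Handler {
        String name();
    }

    static final class ManagedHandler implements Handler {
        public String name() {
            return "managed";
        }
    }

    static final class NativeHandler implements Handler {
        public String name() {
            return "native";
        }
    }

    /** Illustrative property name; not the one used by the project. */
    static final String PROPERTY = "sketch.chunkHandler";

    /** Instantiates the class named by the property, or the default if unset. */
    static Handler load() {
        String className = System.getProperty(PROPERTY, ManagedHandler.class.getName());
        try {
            return Class.forName(className)
                    .asSubclass(Handler.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalStateException("Cannot load handler: " + className, e);
        }
    }
}
```

With this shape the default can be overridden at launch, for example with -Dsketch.chunkHandler=<fully qualified class name>, without recompiling.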
          

          TODO:

          • done. Track bytes on the native heap (QueryEngineCounters) or remove that counter (not updated at this point).
          • done. Fix CI errors (run com.bigdata.rdf.sparql.ast.TestAll to see errors).
            • done. Headless value factory should not be asked for its namespace
            • done. java.lang.ClassCastException: java.lang.Integer cannot be cast to com.bigdata.rdf.internal.IV (pretty sure this is a bug in deferred IV resolution or Constant handling).
          • done. Must enable native heap usage for the chunk messages automatically for the analytic mode (if analytic, then default is the native heap for the chunk messages).
          • done. Query hint to enable/disable just this feature (hint:queryEngineChunkHandler with values of "native", "managed", etc.)
            • Brad Bebee The query hint "queryEngineChunkHandler" with possible values "Managed" and "Native" needs to be added to https://wiki.blazegraph.com/wiki/index.php/QueryHints. Scope is "Query". Since 2.2.0. Purpose is to select either the managed object heap or the native heap for the storage of intermediate solutions in the input queues for the operators for the query engine.
          • Benchmark branch to assess performance impact and examine alternative / dynamic policies

          igorkim michaelschmidt

          bryanthompson bryanthompson added a comment -

          PR: https://github.com/SYSTAP/bigdata/pull/468 (vector query engine on the native heap). Also includes:

          • BLZG-2051 (SolutionSetStream incorrectly decodes VTE of MockIVs)
          • BLZG-2052 (XSDBooleanIV MUST NOT share the (true|false) instances as constants)

          Brad Bebee Ready for merge to master and 2.1.x on clean CI (the CI pass in which the managed object heap behavior is restored by default please!)

          bryanthompson bryanthompson added a comment -

          michaelschmidt Ready for benchmarking on clean CI. Please note: we need to benchmark the impact on the analytic query mode for this branch. You can force the default behavior in QueryHints.java (or via the environment). Please make sure that you override both DEFAULT_ANALYTIC and DEFAULT_QUERY_ENGINE_CHUNK_HANDLER.

          Bryan

          bryanthompson bryanthompson added a comment -

          CI is clean on this branch.

          michaelschmidt Can you please benchmark this branch? You can do a normal benchmark run, which would simply verify that there is no performance regression against master.

          michaelschmidt Secondly, I am interested in assessing the performance impact (+/-) of using native memory to store the intermediate solutions. We should perhaps discuss how to best achieve this goal, but it would not block the merge to master or release of the feature.

          bryanthompson bryanthompson added a comment -

          NOTE: I am merging this into master before benchmarks are available. I do not anticipate any impact on the benchmarks for this branch UNLESS it is run in the analytic mode (e.g., govtrack) since the default is to use the managed object heap for the intermediate solutions when not in the analytic mode.

          bryanthompson bryanthompson added a comment -

          Note: we could easily define an adaptive policy for the intermediate solutions which stores them on the native heap once there are more than X solutions on the object heap. The only issue is that we cannot measure the bytes occupied by the solutions on the managed object heap. We could also make the policy adaptive based on GC overhead (GC OH).
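
One way a GC-overhead signal could be obtained is through the standard java.lang.management API, as in the sketch below. The class name, the threshold parameter, and the decision rule are assumptions made for illustration; nothing here is taken from the Blazegraph code base.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical sketch: derive a GC-overhead figure to drive an adaptive policy. */
final class GcOverheadSketch {

    /** Sums the accumulated collection time (ms) over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Fraction of a wall-clock interval spent in GC, clamped to [0, 1]. */
    static double overhead(long gcMillisDelta, long wallMillisDelta) {
        if (wallMillisDelta <= 0) {
            return 0.0;
        }
        double fraction = (double) gcMillisDelta / wallMillisDelta;
        return Math.max(0.0, Math.min(1.0, fraction));
    }

    /** Assumed rule: move new chunks off-heap once GC overhead exceeds the threshold. */
    static boolean preferNativeHeap(double overhead, double threshold) {
        return overhead > threshold;
    }
}
```

A caller would sample totalGcMillis() and the wall clock periodically, pass the two deltas to overhead(), and consult preferNativeHeap() when deciding where to store the next chunk.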


            People

            • Assignee:
              bryanthompson bryanthompson
              Reporter:
              bryanthompson bryanthompson