This ticket would provide a capability for concurrent unisolated write operations against distinct KBs on the same RWStore or MemStore Journal. In effect, this amounts to using pessimistic locking, at KB or index level granularity, rather than optimistic concurrency control. The advantage of pessimistic locking in this case is more scalable updates (when compared to transactional isolation based on the use of a B+Tree to buffer the write set) and the opportunity for concurrent writers against distinct triple/quad store instances.
The current approach for transactional isolation uses a B+Tree to buffer writes for each index on which the transaction writes. This is an MVCC strategy. When the transaction prepares, the write set is validated against the ground state from which the transaction was reading. Write-write conflicts are detected through the use of revision timestamps. Certain kinds of write-write conflicts can be reconciled. If the write set has conflicts that cannot be reconciled, then the transaction is aborted. Otherwise it commits. The commit protocol involves copying the write set onto the corresponding unisolated index. For historical reasons, the write set is buffered on a temporary store associated with the transaction (this prevented the WORM store from "leaking" storage associated with the write set).
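As a rough illustration of the validation step described above, the sketch below models the unisolated index as a map from key to (value, revision timestamp) and a transaction as a start time plus a buffered write set. All names here (WriteSetValidationSketch, Versioned, Tx, prepareAndCommit) are hypothetical and are not the actual ITx or Journal API. A HashMap stands in for the B+Tree, and conflict reconciliation is omitted, so any write-write conflict simply aborts.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of write-set validation using revision timestamps. */
class WriteSetValidationSketch {

    /** A value tagged with the revision (commit time) that last wrote it. */
    static final class Versioned {
        final String value;
        final long revision;
        Versioned(String value, long revision) {
            this.value = value;
            this.revision = revision;
        }
    }

    /** Stands in for the unisolated index: key -> latest committed version. */
    final Map<String, Versioned> unisolated = new HashMap<>();

    /** One transaction: its ground state (start time) and its buffered writes. */
    static final class Tx {
        final long startTime;
        final Map<String, String> writeSet = new HashMap<>();
        Tx(long startTime) {
            this.startTime = startTime;
        }
    }

    /**
     * Validates the write set against the committed state, then copies it onto
     * the unisolated index. Returns false (abort) if any key in the write set
     * was committed after the transaction started.
     */
    boolean prepareAndCommit(Tx tx, long commitTime) {
        for (String key : tx.writeSet.keySet()) {
            Versioned current = unisolated.get(key);
            if (current != null && current.revision > tx.startTime) {
                return false; // write-write conflict, not reconciled in this sketch
            }
        }
        for (Map.Entry<String, String> e : tx.writeSet.entrySet()) {
            unisolated.put(e.getKey(), new Versioned(e.getValue(), commitTime));
        }
        return true;
    }
}
```

Two transactions that start from the same ground state and write the same key illustrate the behaviour: the first to prepare commits, and the second fails validation because the key now carries a revision later than its start time.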
The proposed approach would use the allocation context mechanisms provided by the RWStore and the MemStore to keep allocations within the context of the transaction. It would NOT use a B+Tree to buffer the write set. Instead, the combination of a transaction-local allocation context and the copy-on-write semantics of the B+Tree or HTree would provide isolation. In order to prevent concurrent writers from touching the same indices, a lock would be established. The lock could be on the namespace of the triple store, the namespaces of the relations, or the names of the indices to be accessed. We have code to support 2PL, but if the lock(s) are simply declared when the transaction is created and then acquired in sorted order, the potential for deadlocks can be avoided. The commit protocol would flush the dirty indices to the backing store, grab the semaphore for the Journal which controls "global" write access, and then do a Journal level commit. Finally, the allocation context associated with the transaction would be released on either abort() or commit().
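The locking and commit steps described above might look roughly like the following. This is a minimal sketch, not the proposed ITx implementation: the class name PessimisticTxSketch, the LOCKS table, the JOURNAL_WRITE semaphore and the COMMIT_LOG list are all hypothetical stand-ins, and the allocation context and index flush are represented only by comments. The point it illustrates is that sorting the declared resource names before acquiring their locks gives every transaction the same acquisition order, which rules out lock-ordering deadlocks.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch: locks declared up front, acquired in sorted order. */
class PessimisticTxSketch {

    /** One lock per declared resource name (namespace or index name). */
    static final ConcurrentHashMap<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();

    /** Stands in for the Journal semaphore controlling "global" write access. */
    static final Semaphore JOURNAL_WRITE = new Semaphore(1);

    /** Records committed transactions; used only to observe the sketch. */
    static final List<String> COMMIT_LOG = Collections.synchronizedList(new ArrayList<>());

    private final List<String> sortedNames;
    private final List<ReentrantLock> held = new ArrayList<>();

    /** Declares all resources at creation; blocks until every lock is held. */
    PessimisticTxSketch(Collection<String> resourceNames) {
        // TreeSet removes duplicates and imposes one global acquisition order.
        this.sortedNames = new ArrayList<>(new TreeSet<>(resourceNames));
        for (String name : sortedNames) {
            ReentrantLock lock = LOCKS.computeIfAbsent(name, n -> new ReentrantLock());
            lock.lock();
            held.add(lock);
        }
    }

    List<String> lockOrder() {
        return sortedNames;
    }

    /** Must be called by the thread that created the transaction. */
    void commit(String label) {
        try {
            // (flush dirty indices to the backing store here)
            JOURNAL_WRITE.acquireUninterruptibly();
            try {
                COMMIT_LOG.add(label); // stands in for the Journal level commit
            } finally {
                JOURNAL_WRITE.release();
            }
        } finally {
            release(); // (the allocation context would also be released here)
        }
    }

    /** Must be called by the thread that created the transaction. */
    void abort() {
        release();
    }

    private void release() {
        for (int i = held.size() - 1; i >= 0; i--) {
            held.get(i).unlock();
        }
        held.clear();
    }
}
```

Because ReentrantLock is owned by a thread, this sketch assumes a transaction is created and committed (or aborted) on the same thread. A design in which a transaction migrates between threads would need a different lock primitive.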
Another use case for this pessimistic locking mechanism is SPARQL UPDATE of named solution sets. Right now, there is only one writer. If we used 2PL then we could allow concurrent SPARQL UPDATE operations which wrote on different named solution sets. These operations would not block unless they attempted to write on the same resource, i.e., on the same triple store, quad store, named index, or named solution set. That could significantly increase the parallelism for SPARQL UPDATE operations which are modifying the named solution sets without causing changes to the triple / quad store itself.
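To make the blocking rule concrete, here is a small sketch of a lock table keyed by resource name. NamedResourceLocksSketch and the resource names used with it are hypothetical, and the table is non-blocking (tryLock either succeeds or reports that the resource is busy), whereas a real implementation would presumably queue the second writer. It shows only that writers on different named solution sets proceed independently while a second writer on the same name is refused.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: one exclusive writer per named resource. */
class NamedResourceLocksSketch {

    /** Names of resources that currently have a writer. */
    private final Set<String> held = ConcurrentHashMap.newKeySet();

    /** Returns true if the caller became the writer for this resource. */
    boolean tryLock(String resource) {
        return held.add(resource); // atomic: false if already present
    }

    /** Releases the resource so that another writer may take it. */
    void unlock(String resource) {
        held.remove(resource);
    }
}
```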
One other twist is that we use unisolated views to write on the lexicon indices, but we use an eventually consistent design which avoids the need for locks. However, the proposed locking approach would be per-KB, so it would not be possible for there to be 2 writers on the same lexicon indices. Therefore, this design does not appear to be an issue for the proposed approach.
Martyn and I discussed possible implementation strategies. Probably the best approach would be to provide a different ITx implementation based on pessimistic locking. This would be used for transactions when the BigdataSail did not enable isolatable indices. No change should be required at the application layer.
A workaround for a highly scalable architecture is to break up the workload into a "load" cluster (which can also do inference) and a "read" cluster (which handles queries). Durable queues can be used to present updates to the load cluster. The change log mechanism can be used to extract the delta (including inferences that are asserted or retracted) and then drop that delta onto a durable queue for the "read" cluster. An efficient low-level task can use the existing job-centric concurrency and group commit logic to bulk load the deltas into the "read" cluster. This approach has the significant advantage that the inference workload is removed from the cluster that is servicing the queries. If the inferences are performed using per-tenant journals, then the inference throughput can be scaled independently of the query workload using a pool of machines dedicated to computing the inferences for each tenant. The one open issue is how to handle low-latency (versus high-throughput) updates when also using inference.
See BLZG-461 (AbstractTask uses one TemporaryStoreFactory per read-only or read/write tx task)
See BLZG-14 (HA doLocalAbort() should interrupt NSS requests and AbstractTasks)
See BLZG-1036 (Name2Addr.indexNameScan(prefix) uses scan + filter)
See BLZG-688 (GIST - Generalized Indices)
See BLZG-1152 (SPARQL UPDATE warning with jetty client)
See BLZG-1167 (Adapt blueprints integration to support group commit?)
See BLZG-192 (Enable group commit by default (release 1.5.2))
See BLZG-1171 (NPE in Leaf.hasDeleteMarkers in HA test suite with group commit)
See BLZG-1172 (ClocksNotSynchronizedException not visible to HA client when GROUP COMMIT is enabled)
See BLZG-1173 (DELETE-WITH-QUERY and UPDATE-WITH-QUERY (GROUP COMMIT))
See BLZG-1174 (GlobalRowStoreHelper can hold hard reference to GSR index (GROUP COMMIT))
See BLZG-193 (Code review on "instanceof Journal")