When running a soak test (characteristics described below), I very rarely get an error that is very hard to understand, under conditions I have failed to replicate outside my own test harness (which tests my own code, not bigdata).
My code interacts with bigdata only through the http interface to the NSS. Enabling logging on bigdata makes the problem vanish, but I can reproduce the problem with logging in my code. In particular enabling the ASTEvalHelper log makes the problem disappear.
Attached are 6 logs:
- a log of the NSS, which is basically its stdout, showing two stack traces; the one at approx. 14:00:29 is the one for which I had the other logging enabled
- five logs from my code, one for each of the five namespaces in use in the period 21:00:28 to 21:00:29 (note the 7-hour time zone difference)
The update that failed is in sparqlu2iDMc.log.part
I note that the error concerns a boolean(true) but there are no such values in any of the logs. There were some boolean(true)s being used in earlier completed queries and updates; and I would expect some boolean(true) values to be in the triple store.
The version of bigdata I was running is 1.3.2 + five additional commits and patches as agreed with Systap, in particular a patch fixing 1026.
The test itself has the following characteristics.
Every forty minutes there is a new round of tests.
There are five concurrent parts to the test, each of which is identical.
Each part creates a namespace, does some operations, maybe taking 15 minutes over a 35 minute period, and then deletes the namespace.
The namespace names are reused, not on every round, but ... I have 15 namespace names, and at any time 5 are in use. Each part logs all the queries and updates it is sending in several separate log files.
Typically each query involves resources that start with a URI http://localsyapsehost:NNNNN/ where the number NNNNN is assigned by jenkins differently to each of the five parts, hence it is easy to tell the queries from each namespace apart.
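For anyone reading the attached logs, the port-based separation described above can be checked mechanically. The following is only a sketch: the helper name `ports_in`, the sample update text and the port number 41234 are made up for illustration and do not come from the logs.

```python
import re

# Matches the per-part URI prefix described above and captures the port.
# The host name is taken from the report; everything else is illustrative.
PORT_RE = re.compile(r"http://localsyapsehost:(\d+)/")


def ports_in(text):
    """Return the set of port numbers referenced in a logged query or update."""
    return set(PORT_RE.findall(text))


# Hypothetical update, not copied from any of the attached logs.
sample = (
    "INSERT DATA { <http://localsyapsehost:41234/a> "
    "<http://localsyapsehost:41234/p> true }"
)

print(ports_in(sample))
```

A logged request that yields more than one port would mean that resources from two parts appear in the same query, which is one quick way to look for cross-namespace mixing in the client-side logs.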
I have not seen the problem in the first round of testing (i.e. the first 40 minutes), and I believe it requires the reuse of namespace names to be seen; on the other hand, reusing a namespace name does not guarantee an issue. It typically takes 3 or 4 hours to get a single fault. Staggering the parts by one minute also seems to make the problem go away.
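To make the reuse pattern concrete, here is a rough model of the schedule. It is a sketch under a loud assumption: the namespace names are handed out round-robin, five per round, from a pool of fifteen. The harness's actual rotation is not described here, and the names `ns00` to `ns14` are placeholders.

```python
ROUND_MINUTES = 40   # a new round starts every forty minutes
PARTS = 5            # five concurrent parts per round
# Placeholder names; the real namespace names are not given in this report.
NAMES = [f"ns{i:02d}" for i in range(15)]


def names_for_round(r):
    """Names used in round r, ASSUMING a simple round-robin rotation."""
    start = (r * PARTS) % len(NAMES)
    return NAMES[start:start + PARTS]


def first_reuse_round():
    """Index of the first round that reuses a name seen in an earlier round."""
    seen = set()
    r = 0
    while True:
        current = set(names_for_round(r))
        if current & seen:
            return r
        seen |= current
        r += 1


r = first_reuse_round()
print(f"first reuse in round {r}, starting at minute {r * ROUND_MINUTES}")
```

Under that assumed rotation no name is reused before minute 120, which would fit with never seeing the fault in the first round, but it is only a model and proves nothing about the cause.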
I have seen the error report occur in SELECT queries as well as UPDATEs (this particular instance is an update).
The error always seems to occur in only two parts of my test suite; this is one of them.
I have longer logs but not complete logs of all the operations since the beginning of the journal file.