  1. Blazegraph (by SYSTAP)
  2. BLZG-1643

DELETE does not delete properly with Wikidata Query service

    Details

      Description

      We have a problem in the Wikidata Query Service that appears to be caused by DELETE not removing all the triples it should during an update. The result is duplicated data, for example as described here:

      https://phabricator.wikimedia.org/T116622

      As an example, this query:

      PREFIX wd: <http://www.wikidata.org/entity/>
      PREFIX wdt: <http://www.wikidata.org/prop/direct/>
      PREFIX wikibase: <http://wikiba.se/ontology#>
      PREFIX p: <http://www.wikidata.org/prop/>
      PREFIX v: <http://www.wikidata.org/prop/statement/>
      PREFIX q: <http://www.wikidata.org/prop/qualifier/>
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      
      SELECT * WHERE {
        wd:Q1163717 rdfs:label ?x .
        FILTER (lang(?x) = "en")
      }
      

      returns two labels, "Schadenfreude" and "schadenfreude", even though the entity never has more than one at a time. This is because the old label was, for some reason, not deleted during the update. It cannot be a skipped update or a caching issue, since the new data is there, and the new data always contains only one label.
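
      To gauge how widespread the label duplication is, a query along the following lines could list entities that currently have more than one English label. This is only a sketch: it assumes every entity should have at most one label per language, the variable names and the LIMIT are arbitrary, and on the full data set it may need further restriction to finish within the query timeout.

      ```sparql
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

      # List entities that carry more than one English rdfs:label.
      # Assumption: at most one label per language is expected per entity.
      SELECT ?entity (COUNT(?label) AS ?labelCount) WHERE {
        ?entity rdfs:label ?label .
        FILTER (lang(?label) = "en")
      }
      GROUP BY ?entity
      HAVING (COUNT(?label) > 1)
      LIMIT 100
      ```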

      This also happens with other updates, and is reproducible with edits to Q4115189, as described in the bug above. For example, if I run

      prefix wds: <http://www.wikidata.org/entity/statement/> 
      DESCRIBE wds:Q4115189-cba17f58-4bc5-7155-f0dc-efd77847efaf
      

      I now see two references (wdref:), even though https://www.wikidata.org/wiki/Q4115189 (or https://www.wikidata.org/wiki/Special:EntityData/Q4115189.ttl?flavor=dump&nocache=11123) shows only one reference for wds:Q4115189-cba17f58-4bc5-7155-f0dc-efd77847efaf (which is the "child: Hägar the Horrible" claim). So somehow the old data is not deleted. Since this happens for both statements and labels, it does not seem to be limited to cases where we have custom vocabulary. It also does not happen all the time; in many cases deletion works just fine.

      Another consequence of this shows up in the following query:

      PREFIX wd: <http://www.wikidata.org/entity/>
      PREFIX p: <http://www.wikidata.org/prop/>
      PREFIX wikibase: <http://wikiba.se/ontology#>
      
      SELECT (count(?doid) as ?c) WHERE {
         ?doid wikibase:rank wikibase:NormalRank .
         ?doid wikibase:rank wikibase:DeprecatedRank .
      }
      

      This should return 0, since a statement cannot have two ranks. However, it returns over 4000 results, probably because the rank changed but the old data was not deleted properly.
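
      To see which statements are affected rather than only how many, the count query can be turned into a listing. The following is a sketch under the same assumption (one rank per statement); it matches any pair of distinct ranks, not just NormalRank and DeprecatedRank, and the variable names and LIMIT are illustrative.

      ```sparql
      PREFIX wikibase: <http://wikiba.se/ontology#>

      # List statements that have two different ranks at the same time.
      # The STR() comparison reports each conflicting pair once, not twice.
      SELECT ?statement ?rank1 ?rank2 WHERE {
        ?statement wikibase:rank ?rank1 .
        ?statement wikibase:rank ?rank2 .
        FILTER (STR(?rank1) < STR(?rank2))
      }
      LIMIT 100
      ```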

      Also, if I manually update any of the affected entities, the problem goes away for that entity. So it is not persistent: it happens intermittently while the batch updates are running, but I have never been able to reproduce it through an individual update to an entity.

      Any help with this is highly appreciated, as data quality suffers a lot because of this issue.

        Attachments

        1. log.gz
          186 kB
        2. q.txt
          29 kB

              People

              Assignee:
              michaelschmidt
              Reporter:
              stasmalyshev
