The basic tradeoff is between two evaluation strategies:
(1) Run one sub-query per source solution. Since EXISTS only tests whether at least one solution exists, each sub-query can use LIMIT 1.
(2) Run a single sub-plan: use all source solutions to build a hash index, flood the solutions from the hash index into the sub-plan, hash join the solutions surviving the sub-plan back against the source solutions, and pass on the surviving source solutions (sketched below).
Based on earlier experience, we know that (2) tends to be 100x faster than (1).
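To make the comparison concrete, here is a minimal sketch of strategy (2), assuming solutions are modeled as variable-to-value maps and the sub-plan is a function over a batch of solutions. The names (joinKey, filterExists, subPlan) are illustrative, not the engine's actual API.

```java
import java.util.*;
import java.util.function.Function;
import java.util.stream.Collectors;

public final class VectoredExistsSketch {

    /** Project a solution onto the join variables to form a hash key. */
    static List<Object> joinKey(Map<String, Object> solution, List<String> joinVars) {
        List<Object> key = new ArrayList<>(joinVars.size());
        for (String v : joinVars) {
            key.add(solution.get(v)); // null for unbound variables
        }
        return key;
    }

    /**
     * Strategy (2): evaluate the EXISTS sub-plan once over the whole batch
     * of source solutions instead of once per solution.
     */
    static List<Map<String, Object>> filterExists(
            List<Map<String, Object>> sourceSolutions,
            List<String> joinVars,
            Function<List<Map<String, Object>>, List<Map<String, Object>>> subPlan) {

        // 1. Build a hash index over the source solutions, keyed on the
        //    variables shared with the sub-plan.
        Map<List<Object>, List<Map<String, Object>>> hashIndex = sourceSolutions
                .stream()
                .collect(Collectors.groupingBy(s -> joinKey(s, joinVars)));

        // 2. Flood the solutions (the same ones held by the hash index)
        //    into the sub-plan in one pass.
        List<Map<String, Object>> survivors = subPlan.apply(sourceSolutions);

        // 3. Hash join back: a source solution passes the FILTER EXISTS
        //    iff its key produced at least one surviving sub-plan solution.
        Set<List<Object>> survivingKeys = survivors.stream()
                .map(s -> joinKey(s, joinVars))
                .collect(Collectors.toSet());

        return survivingKeys.stream()
                .flatMap(k -> hashIndex.getOrDefault(k, List.of()).stream())
                .collect(Collectors.toList());
    }
}
```

Strategy (1) would instead invoke the sub-plan once per source solution with a LIMIT 1 cutoff; the vectored form amortizes the sub-plan's fixed costs across the whole batch, which is consistent with the 100x observation above.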
(A) To reduce the work in (2), we might improve the query analysis to identify the distinct (or reduced) set of solutions based on the variables that need to be in scope in the sub-plan. This could reduce the number of solutions flowing into the sub-plan and hence the work done by the sub-plan. The basic analysis is: find the minimum set of projected variables required by the sub-plan (corresponding to the FILTER EXISTS), then apply a DISTINCT filter on that projection of the source solutions as the first operation in the sub-plan. Nothing additional should be needed at the hash join of the sub-plan with the solutions in the hash index, since no bindings from the FILTER EXISTS sub-plan are visible in the outer query. A sketch of the distinct projection follows.
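This reuses the hypothetical solution representation from the sketch above; inScopeVars stands for the minimum variable set that the query analysis would compute.

```java
import java.util.*;
import java.util.stream.Collectors;

public final class DistinctProjectionSketch {

    /** Restrict a solution to the variables in scope for the sub-plan. */
    static Map<String, Object> project(Map<String, Object> solution,
                                       Set<String> inScopeVars) {
        Map<String, Object> projected = new LinkedHashMap<>();
        for (String v : inScopeVars) {
            if (solution.containsKey(v)) {
                projected.put(v, solution.get(v));
            }
        }
        return projected;
    }

    /**
     * First operation of the sub-plan under (A): flood only the distinct
     * projections, so the work in the sub-plan scales with the number of
     * distinct bindings on the in-scope variables rather than with the
     * total number of source solutions.
     */
    static List<Map<String, Object>> distinctProjection(
            List<Map<String, Object>> sourceSolutions,
            Set<String> inScopeVars) {
        return sourceSolutions.stream()
                .map(s -> project(s, inScopeVars))
                .distinct() // value-based dedup via Map.equals()
                .collect(Collectors.toList());
    }
}
```

Because the projected variables are exactly the variables used to join back against the hash index, the hash join itself is unchanged.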
(B) Like (A), but somehow use sideways information passing to reduce the effort in the sub-plan. I have no concrete suggestions here.
(C) Improve the query analysis to identify edge cases where either (i) bottom-up evaluation of the FILTER EXISTS would be more efficient, or (ii) issuing one sub-query per source solution for the FILTER EXISTS would be more efficient. A simple cost-model sketch for case (ii) follows.
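One way the analysis for case (ii) could work is a comparison of rough cost estimates; the inputs here are illustrative placeholders (presumably they would come from the optimizer's cardinality estimation), and the bottom-up case (i) would need a similar test of its own.

```java
public final class ExistsStrategyChooser {

    enum Strategy { PER_SOLUTION_SUBQUERY, VECTORED_HASH_JOIN }

    /**
     * Choose between strategies (1) and (2) from rough cost estimates.
     * All parameters are assumed inputs, not measured constants.
     */
    static Strategy choose(long sourceSolutions,
                           double costPerSubQuery,       // one LIMIT 1 probe
                           double vectoredFixedCost,     // hash index + sub-plan setup
                           double vectoredCostPerSolution) {
        double perSolutionTotal = sourceSolutions * costPerSubQuery;
        double vectoredTotal =
                vectoredFixedCost + sourceSolutions * vectoredCostPerSolution;
        return perSolutionTotal <= vectoredTotal
                ? Strategy.PER_SOLUTION_SUBQUERY
                : Strategy.VECTORED_HASH_JOIN;
    }
}
```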
Of these, I would rank (A) as the most interesting option. Let me know if you want to try this approach or discuss what would be involved in more depth.