Commit fa5de85

fix(dedupe): prevent duplicate test processing in batch dedupe command
Finding.Meta.ordering includes multiple columns (numerical_severity, date, title, epss_score, epss_percentile). When Django generates SELECT DISTINCT test_id ... ORDER BY those columns, PostgreSQL requires them in the SELECT list, so Django silently adds them. The DISTINCT then operates on the full tuple instead of test_id alone, causing the same test to appear multiple times in the iterator and be processed repeatedly.

Fix by calling .order_by("test_id") before .values_list().distinct() to override the model-level ordering, so the query stays SELECT DISTINCT test_id ORDER BY test_id.
1 parent 876ff9c commit fa5de85
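
The effect described in the commit message can be reproduced outside Django. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for PostgreSQL. The finding table, its reduced column set, and the sample rows are hypothetical, and the SQL strings only approximate what Django would generate. It shows that once ordering columns join the SELECT list, DISTINCT no longer guarantees unique test_id values.

```python
import sqlite3

# Hypothetical, reduced stand-in for the findings table (not the real schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE finding ("
    "id INTEGER PRIMARY KEY, test_id INTEGER, numerical_severity TEXT, title TEXT)"
)
conn.executemany(
    "INSERT INTO finding (test_id, numerical_severity, title) VALUES (?, ?, ?)",
    [
        (1, "S0", "SQL injection"),
        (1, "S2", "Verbose error"),
        (1, "S2", "Open redirect"),
        (2, "S1", "XSS"),
    ],
)

# Approximation of the query built with the model-level ordering still active:
# the ordering columns are part of the SELECT list, so DISTINCT compares
# whole (test_id, numerical_severity, title) tuples.
with_default_ordering = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT test_id, numerical_severity, title FROM finding "
        "ORDER BY numerical_severity, title"
    )
]

# Approximation of the query after .order_by("test_id"): only test_id is
# selected, so DISTINCT deduplicates on test_id alone.
with_explicit_ordering = [
    row[0]
    for row in conn.execute("SELECT DISTINCT test_id FROM finding ORDER BY test_id")
]

print(with_default_ordering)   # test 1 shows up three times
print(with_explicit_ordering)  # each test shows up once
```

In the first result, test 1 appears once per distinct (severity, title) pair, which mirrors how the batch command ended up processing the same test repeatedly.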

1 file changed

Lines changed: 5 additions & 1 deletion

dojo/management/commands/dedupe.py

@@ -171,7 +171,11 @@ def _dedupe_batch_mode(self, findings_queryset, *, dedupe_sync: bool = True):
         logger.info(f"Processing {total_findings} findings in batches of max {batch_max_size} per test ({mode_str})")
 
         # Group findings by test_id to process them in batches per test
-        test_ids = findings_queryset.values_list("test_id", flat=True).distinct()
+        # Use order_by("test_id") to override the Finding model's default ordering
+        # (numerical_severity, date, title, ...). Without this, Django includes those
+        # ordering columns in the SELECT for DISTINCT, making test_ids non-unique and
+        # causing the same test to be processed multiple times.
+        test_ids = findings_queryset.order_by("test_id").values_list("test_id", flat=True).distinct()
         total_tests = len(test_ids)
         total_processed = 0
 
