I’m new to digiKam. I’m using version 8.5 on Windows 11 with the MySQL Internal database.
My machine is a Ryzen 9 7900X3D with 64 GB RAM and an Nvidia 4090, and I’m using Western Digital Black SN850X 2 TB and 4 TB NVMe drives. The image collection is on its own physical drive.
I have 286,583 images that I need to organize, and some images have many, many copies. I’m attempting to search for duplicates with the similarity range set from 90 to 100.
The problem I’m having is that it took 20 hours for the duplicate search to get to 4% complete.
In the Windows Task Manager, memory use is under 2 GB, the CPU is under 10%, and the disk is under 0.5% while I’m searching for duplicates.
I’ve tried a different computer, different drives, reinstalling, changing the database type, deleting the database and starting fresh (a few times) and updating the fingerprints repeatedly.
The odd thing is, when I right-click on an image and choose “Find Similar” with the range set from 50 to 100, it only takes a few seconds.
Can anyone give me any suggestions as to what I can do to speed things up?
*UPDATE 1:
Duplicate search for 1 hr at 100 to 100 similarity has reached 1%.
*UPDATE 2:
Duplicate search for 9.5 hrs at 100 to 100 similarity has reached 26%.
*UPDATE 3:
The entire program crashed at some point and I’ve had to start over.
I have recently investigated the “Find Duplicates” functionality in digiKam. It uses some advanced techniques to find “similar” but not necessarily identical images, which is a very hard problem to solve if you think about it. digiKam calculates a Haar matrix for every image. I can’t begin to explain the math, since I barely understand it myself, but think of it as turning the image into a small set of numbers in a pretty concise way that ignores tiny details.

Now you can compare two images by comparing those numbers. The trouble is that you don’t know in advance which images to compare, so I think digiKam ends up comparing every image to every other image (!), which is very slow. On a constrained set of images this is quick and works well, but it isn’t really designed to run on an entire image collection. Let me know if this helps; I don’t work on the project, but I coincidentally got interested in how it works and looked through the source code a bit.
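To put numbers on the “every image to every other image” part: with 286,583 images, an all-pairs search is n(n−1)/2 ≈ 41 billion fingerprint comparisons, so even at a million comparisons per second that alone is over 11 hours, before any database round-trips. Here is a minimal Python sketch of the general idea as I understand it (my own simplification, not digiKam’s actual code; the fingerprint size, the 0.90 threshold, and all the names are made up for illustration):

```python
# Minimal sketch (my own assumptions, not digiKam's actual code): each
# image is condensed into a small binary fingerprint via a Haar-style
# averaging step, and the duplicate search then compares every
# fingerprint against every other one.
import itertools
import numpy as np

def haar_fingerprint(gray: np.ndarray, size: int = 16) -> np.ndarray:
    """Condense a 2-D grayscale image into a short bit signature."""
    h, w = gray.shape
    # Crop so the image divides evenly, then block-average to size x size.
    g = gray[:h - h % size, :w - w % size].astype(float)
    g = g.reshape(size, g.shape[0] // size, size, g.shape[1] // size).mean(axis=(1, 3))
    # One Haar averaging step per axis: keep only the low-frequency
    # averages, which is what makes tiny details drop out.
    g = (g[0::2, :] + g[1::2, :]) / 2.0
    g = (g[:, 0::2] + g[:, 1::2]) / 2.0
    # Binarize against the median so fingerprints are cheap to compare.
    return (g > np.median(g)).ravel()

def similarity(fp_a: np.ndarray, fp_b: np.ndarray) -> float:
    """Fraction of matching bits, loosely like the 0-100 similarity scale."""
    return float(np.mean(fp_a == fp_b))

# Toy collection: random arrays stand in for decoded image files.
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(480, 640)) for _ in range(200)]
fingerprints = [haar_fingerprint(img) for img in images]

# The slow part: every image against every other one, n*(n-1)/2 pairs.
pairs = 0
for i, j in itertools.combinations(range(len(fingerprints)), 2):
    if similarity(fingerprints[i], fingerprints[j]) >= 0.90:
        pass  # would be reported as a duplicate candidate
    pairs += 1
print(f"{pairs} comparisons for {len(fingerprints)} images")
```

For 200 images that is already 19,900 comparisons, and the pair count grows quadratically, which would explain why the full-collection search crawls while a single “Find Similar” (one image against all the others, only n comparisons) finishes in seconds.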
Hello.
I’m new to digiKam too, but maybe I have a clue to help you.
Do you run the search for duplicates on the real full-size source images, or on the digiKam-generated thumbnails?
IMHO, if you run it on the source images, that may explain why it takes so long to “recognize” similarity, since the inspected source/original pictures may be large files. So maybe try the search on the generated thumbnails instead? Thumbnails are much smaller than the originals (so they may be quicker to inspect for the recognition process, even with a high accuracy requirement), and each thumbnail is linked to its original full-size image. Result: when you erase a thumbnail, it removes the linked original file too. You see what I mean?
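To illustrate the size argument with a rough sketch (my own illustration, not a digiKam benchmark; I don’t actually know which pixel data digiKam reads when it fingerprints): the reduction work scales with the number of pixels, so a 256×256 thumbnail is a couple of orders of magnitude cheaper to process than a 24-megapixel original:

```python
# Rough illustration (my own, not a digiKam benchmark): the same
# block-averaging step on a 24 MP "original" versus a 256x256 "thumbnail".
import time
import numpy as np

def reduce_to_grid(gray: np.ndarray, size: int = 16) -> np.ndarray:
    """Block-average an image down to a size x size grid of means."""
    h, w = gray.shape
    g = gray[:h - h % size, :w - w % size].astype(float)
    return g.reshape(size, g.shape[0] // size, size, g.shape[1] // size).mean(axis=(1, 3))

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(4000, 6000))   # stand-in for a 24 MP original
thumbnail = rng.integers(0, 256, size=(256, 256))    # stand-in for a thumbnail

for name, img in [("original", original), ("thumbnail", thumbnail)]:
    t0 = time.perf_counter()
    reduce_to_grid(img)
    print(f"{name}: {(time.perf_counter() - t0) * 1000:.1f} ms")
```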