I’ve found digiKam’s auto-tag assignment to be pretty good at what it does, but only when an image’s content fits well into the predefined categories. Immich has addressed this problem by using CLIP models to implement free-text image search, and I would really love to be able to do the same thing in digiKam.
In short, CLIP models project both text and images into the same semantic embedding space, so an arbitrary text query can be used to retrieve the images whose embeddings are closest to it.
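To make that concrete, here’s a rough sketch of just the retrieval step (not actual digiKam code; the names are mine, and it assumes the CLIP text and image embeddings have already been computed by some inference backend). It simply ranks the stored image embeddings by cosine similarity against the query embedding, whether that query came from text or from another image:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Cosine similarity between two embedding vectors of equal length.
float cosineSimilarity(const std::vector<float>& a, const std::vector<float>& b)
{
    float dot = 0.0f, normA = 0.0f, normB = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        dot   += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (std::sqrt(normA) * std::sqrt(normB) + 1e-12f);
}

// Rank stored image embeddings against a query embedding (from the text
// encoder or the image encoder) and return the indices of the topK matches.
std::vector<std::size_t> rankByEmbedding(const std::vector<float>& query,
                                         const std::vector<std::vector<float>>& imageEmbeddings,
                                         std::size_t topK)
{
    std::vector<std::pair<float, std::size_t>> scored;
    scored.reserve(imageEmbeddings.size());

    for (std::size_t i = 0; i < imageEmbeddings.size(); ++i)
        scored.emplace_back(cosineSimilarity(query, imageEmbeddings[i]), i);

    const std::size_t k = std::min(topK, scored.size());
    std::partial_sort(scored.begin(), scored.begin() + k, scored.end(),
                      [](const auto& x, const auto& y) { return x.first > y.first; });

    std::vector<std::size_t> result;
    for (std::size_t i = 0; i < k; ++i)
        result.push_back(scored[i].second);

    return result;
}
```

A real implementation would presumably want an indexed store rather than a linear scan over every embedding, but the core comparison really is this small.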
Furthermore, CLIP embeddings could easily supplement the existing visual-similarity search: comparing a query image’s embedding against a database of stored embeddings, alongside the existing fingerprint comparison, would find pictures with similar content in addition to similar pixels. I can’t say how this would compare to the existing fingerprint-based method, but it could make some aspects of organizing images much easier.
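As a very rough illustration of what “supplementing” might look like, the semantic score from the sketch above could simply be blended with whatever score the current fingerprint search already produces; the weight and the assumption that both scores are normalised to [0, 1] are entirely mine:

```cpp
// Hypothetical blend of the existing pixel-level similarity score
// (e.g. the fingerprint result, assumed normalised to [0, 1]) with a
// semantic score such as the cosine similarity above. The 0.5 default
// weight is an arbitrary illustration, not a tuned value.
float blendedSimilarity(float pixelScore, float semanticScore, float semanticWeight = 0.5f)
{
    return (1.0f - semanticWeight) * pixelScore + semanticWeight * semanticScore;
}
```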
I would love to try to implement this myself, but I haven’t worked on large, complex C++ projects before, and becoming familiar enough with the code base to do it well is rather intimidating. With luck, the existing frameworks for visual similarity and auto-tagging may reduce the effort it would take to implement this. If anyone familiar with the project could point me in the right direction, I’d be glad to give it a try.
If anyone has opinions on this, I’d love to hear them!