Baloo: huge index file

I have always disabled baloo as it caused high memory and CPU usage with no visible benefit.

Now I decided to give it a try again, maybe if I set it up right, it speeds up file searches to be instant?


I have 300.000 files indexed and the index is 12,5 GB.

That is about 41,7 kB per file, which seems huge?

If I write a text file with 300 characters, it is 300 bytes in size. That should be more than enough to store filename, location, mimetype, size and whatever else is needed?
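
The per-file figure can be double-checked with plain shell arithmetic. This is only a rough sketch: the function name `bytes_per_file` is made up for illustration, and decimal units (1 GB = 1.000.000.000 bytes) are an assumption.

```shell
# Average index bytes per indexed file, using integer shell arithmetic.
# ASSUMPTION: decimal units, so 12,5 GB is written as 12500000000 bytes.
bytes_per_file() {
    index_bytes=$1
    file_count=$2
    echo $(( index_bytes / file_count ))
}

bytes_per_file 12500000000 300000   # prints 41666, i.e. roughly 41,7 kB per file
```

Integer division truncates the result, which is accurate enough for napkin math like this.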

Now this is pretty strange too:

$ balooctl6 indexSize
File Size: 12,92 GiB
Used:      2,66 GiB

           PostingDB:       1,82 GiB    68.370 %
          PositionDB:       3,93 GiB   147.712 %
            DocTerms:     846,71 MiB    31.062 %
    DocFilenameTerms:      23,95 MiB     0.879 %
       DocXattrTerms:            0 B     0.000 %
              IdTree:       5,87 MiB     0.215 %
          IdFileName:      27,20 MiB     0.998 %
             DocTime:      13,19 MiB     0.484 %
             DocData:      11,07 MiB     0.406 %
   ContentIndexingDB:     800,00 KiB     0.029 %
         FailedIdsDB:            0 B     0.000 %
             MTimeDB:       2,95 MiB     0.108 %

In its current state, baloo still does not seem really usable…

Fedora 41, KDE Plasma 6.3

I tend to have this problem too, actually. I blend work and life: I have a very large /work directory that I sync to an external disk, and it holds a lot of files for customers and projects I’ve worked on, including code, documentation, configurations, etc., which always seem to cause baloo issues.

I always need to turn baloo off. It’d be nice if it behaved better, or offered more value for the pain it causes.

At one point I thought about using it for RAG data for ollama, but it didn’t seem particularly good for that either, which all in all made me question what it is good at these days to begin with.

To expand a bit: for what it’s doing, which I assume is some indexing of files, it seems to do more harm than good beyond a certain size or complexity of the data it’s scanning. Mine tended to freak out and cause weird memory and CPU issues, if I remember correctly, and killing it didn’t seem to affect me except to mostly improve my life.

I’m not a fan of anything Java, like OpenSearch, but for what baloo does, it seems it would be worth having a more versatile indexer like OpenSearch instead, to add relevance ranking to documentation. For RAG it’s more necessary, though I’ve yet to try it; it may be worth the cost if one can find a happy medium for JVM memory and GC issues.

Why not just exclude that folder?
For example, I exclude my Downloads folder because it changes frequently and holds files that I don’t keep for long - I see no point in them being indexed. I also occasionally dabble in learning to code, and I don’t include the folder where I store those files either.

You can exclude directories and that kinda seems to work

Well, if anything, I DO want it to index that folder for interesting content used in “recent” and similar data in KDE. Only it sucks at doing so and usually misbehaves, and I never really figure out what is causing it before I just kill it.

It appears that you’re indexing file contents, too. That’s not the cause of your woes, I’m just mentioning it so that you can have appropriate expectations. Likewise:

Should only occur when the machine has been idle for some time.

It’s really only using 2.66 GiB… But yeah, I’m not sure it’s meant to do this. It seems widely reported, and someone has even submitted code to add a command to balooctl to deal with it, so I’m sure the devs know this is a thing. Still, if you want a real fix, best to ask them.

In the meantime, nuke the index and reindex (using balooctl6: disable, then purge, then enable, then check); the rebuilt index should hopefully show a smaller difference between those two sizes.
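
Spelled out as commands, that sequence might look like the sketch below. The wrapper function `baloo_rebuild` is a hypothetical name, and passing the tool name as a parameter is just a convenience so the sequence can be dry-run with `echo` before touching the real index.

```shell
# Rebuild the Baloo index from scratch: disable, purge, enable, check.
# The tool name is a parameter so the sequence can be dry-run with "echo".
baloo_rebuild() {
    ctl=${1:-balooctl6}
    "$ctl" disable &&   # stop the indexer
    "$ctl" purge &&     # delete the existing index database
    "$ctl" enable &&    # start indexing again
    "$ctl" check        # look for unindexed files and index them
}

baloo_rebuild echo      # dry run: prints the four subcommands, one per line
# baloo_rebuild         # real run, assuming balooctl6 is on the PATH
```

Chaining with `&&` stops the sequence at the first step that fails, so a failed purge is not followed by an enable.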

This may be of interest, depending on your search requirements: Use `ripgrep-all` / `ripgrep` to improve search in Dolphin - KDE Blogs
TL;DR: install ripgrep and Dolphin will use it when baloo is disabled
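
For anyone who wants the same kind of content search from a terminal, here is a minimal sketch. The function name `find_by_content` is made up for illustration, and the fallback to `grep` when `rg` is not installed is my own assumption, not something Dolphin is known to do.

```shell
# List files under a directory whose contents match a pattern, ignoring case.
# Uses ripgrep when available, otherwise falls back to a recursive grep.
find_by_content() {
    pattern=$1
    dir=${2:-.}
    if command -v rg >/dev/null 2>&1; then
        rg --files-with-matches --ignore-case -- "$pattern" "$dir"
    else
        grep -rli -- "$pattern" "$dir"
    fi
}

# Example: find_by_content "invoice" "$HOME/Documents"
```

Unlike an index lookup, this reads every file on each search, so it gets slower as the directory grows.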

That piece of junk always does that, which is why the first thing I do is make certain it is disabled. Any time I have left it enabled, the index has ended up around the size you found.

Are you short of disk space? It’s certainly larger than my baloo index, but why is the size of the index a factor in whether or not to use it?

I used to turn it off, as it would frequently cause CPU-related fan noise, but with the improvements in recent years I now enable it (with file indexing), and after the initial index run it never bothers me again.

Thanks. Yes, it indexes content by default, which is kinda bad.

I changed that setting, but:

  • balooctl has no stop argument?
  • the setting does not seem to be applied

So if really nuking it and doing it again is the solution… at least changing the preset to not scan file contents would be needed.


But cool, Dolphin uses ripgrep now? This is really nice!

My main purpose would be to speed up kfind though, so I will probably use both?

That’s incredible.

❯ balooctl6 indexSize
File Size: 637.13 MB
Used:      383.17 MB

           PostingDB:      107.33 MB    28.011 %
          PositionDB:      169.83 MB    44.323 %
            DocTerms:       58.14 MB    15.174 %
    DocFilenameTerms:       14.20 MB     3.707 %
       DocXattrTerms:       65.54 kB     0.017 %
              IdTree:        2.26 MB     0.589 %
          IdFileName:       15.24 MB     3.978 %
             DocTime:        8.41 MB     2.195 %
             DocData:        6.52 MB     1.701 %
   ContentIndexingDB:            0 B     0.000 %
         FailedIdsDB:            0 B     0.000 %
             MTimeDB:        1.17 MB     0.306 %

I think you did more than simply enable it. Are you trying to use it to completely replace any search?

Keep it simple, or you’ll suffer. Baloo is great, completely invisible and completely working if you don’t get ambitious.

Ah, I remember now - I think that (and the fact that I tried to include a couple of directories which also had rather too many junk files in them - I forget now).

There are 2 kinds of folks using Plasma.

There is the vast majority, who have no issues with baloo and don’t even remember it’s running for years until someone pops up in the forum.

Then there are the ones who create issues by meddling with it - possibly by including too many, or the wrong kinds of, directories… Indexing contents is also something worth considering disabling unless it’s really necessary.

8 years with Plasma: I had issues with Baloo (off and on) during the first year… the next 7 years, no issues at all… and I can instantly pull up nearly all the media (music/TV/movies/books/images) on my main SSD and two main media storage HDDs.

the defaults in kubuntu are that baloo indexing is off.

but i’ve always turned it on so i can index my music collection and see the duration info in dolphin.

to that end i also include content but not hidden files/folders

what works best is to only add the directories you want to have indexed and possibly even exclude ones you want to skip

for instance mine is set to include ~/Documents and ~/Pictures, but not to index ~/snap or any hidden folders where it might start indexing its own index and spiral.

I think we can agree that file content indexing should be disabled by default. I am rescanning currently, after disabling and purging it, as there seems to be no other way

Baloo File Indexer is running
Indexer state: Idle
Total files indexed: 181,278
Files waiting for content indexing: 0
Files failed to index: 0
Current size of index is 135.62 MiB

File Size: 135.61 MiB
Used:      93.84 MiB

           PostingDB:      17.50 MiB    18.645 %
          PositionDB:      18.38 MiB    19.590 %
            DocTerms:      12.71 MiB    13.545 %
    DocFilenameTerms:      14.86 MiB    15.831 %
       DocXattrTerms:            0 B     0.000 %
              IdTree:       3.80 MiB     4.050 %
          IdFileName:      17.20 MiB    18.332 %
             DocTime:       7.92 MiB     8.442 %
             DocData:            0 B     0.000 %
   ContentIndexingDB:            0 B     0.000 %
         FailedIdsDB:            0 B     0.000 %
             MTimeDB:       1.47 MiB     1.565 %

this makes WAY more sense

opened a bug report on this. if someone knows where the default value for “index file content” is set, this could be an easy fix

I wouldn’t think so. Having content indexed returns the greatest benefit, as searching content is by far slower without the index. It makes sense to have it on, as it has such a great benefit and (ignoring the problem in this thread, so most of the time) costs very little: just a small amount of cheap disk space and some idle CPU cycles.

If the user has some unusual data being indexed and it’s resulting in more disk space or idle CPU time taken than they might like, they can just exclude that data from the indexing, and there’s a nice GUI for that.

To clarify my post above:

Naturally, indexing content will result in a larger index size than not indexing it. However, the problem here was not the ‘real’ size of the index, it was the large amount of ‘empty’ space in that file, as seen in the difference between ‘File Size:’ and ‘Used:’.

This large amount of wasted space was not caused by indexing content. That should be expected to increase the ‘real’ size but not the ‘empty’ space. There’s no problem seen in this thread with indexing content, except that you thought that you weren’t, and that gave you an inappropriate expectation of the (‘real’) size of the index. I mentioned it exclusively for the purpose of informing that ‘kB per file’ napkin math you showed. It’s really off-topic with regards to the bug seen here.
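
To put a number on that ‘empty’ space, the two figures from `balooctl6 indexSize` can be subtracted. This is just a sketch: the helper name `index_slack` is hypothetical, and both values have to be typed in by hand, in the same unit and with a decimal point rather than a comma.

```shell
# Unused space in the index file = "File Size" minus "Used".
# Both arguments must be in the same unit and use a decimal point.
index_slack() {
    LC_ALL=C awk -v total="$1" -v used="$2" 'BEGIN {
        printf "%.2f unused (%.1f%% of file)\n", total - used, 100 * (total - used) / total
    }'
}

index_slack 12.92 2.66     # first report in this thread, values in GiB
index_slack 135.61 93.84   # after the rebuild, values in MiB
```

With the figures from this thread, the first call reports 10.26 GiB unused (79.4% of the file) and the second 41.77 MiB unused (30.8%).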

As expected from the reports we saw elsewhere, re-indexing seems to have that sorted out though, so, you probably could turn on content indexing if you want it. There’s a chance you’ve discovered some new peculiarity with certain content which triggers this behaviour from baloo… Which is to say, maybe your files are weird and they broke it, and that would be a new bug, or at least a new way to reproduce the old bug. But hopefully not. Only one way to find out! :smiley:

If I remember rightly, quite a few years back now, I included hidden folders which also included one of my Plex folders which has a massive tree of essentially junk files - this also caused my first timeshift failures as I messed around with folder settings there - it was trying to parse several million files for about 20 minutes, then failed to restore.

As usual, most Plasma errors are PEBCAK errors - sometimes blamed on Plasma because there aren’t any guard rails against innocently f@rking it up with a few bad choices.

I have 100,000 files indexed (with contents) and my index file size is 5.5 GB, so a worse index size relative to number of files. (sorry we use different thousand separators and decimal indicators :slightly_smiling_face: )

So if really nuking it and doing it again is the solution…

Did that reduce disk space? Do you have any problem beyond not liking the disk space baloo takes up? If it’s just disk space then would you, for example, want System Settings > File Search to show the size of its index?

at least changing the preset to not scan file contents would be needed.

That would be a loss of functionality for people using Baloo without configuring it; it’s hard to know how many such users it has.

Baloo has bugs and limitations (some of which I document on the KDE Community wiki) because it’s a problem with a lot of moving parts, and I wish Baloo and its dependencies like kfilemetadata and catdoc got more developer love, but I would struggle without it. Baloo quickly finds files by content for me where a ripgrep search would fail (rg(1) won’t search inside PDFs and Microsoft/LibreOffice files unless you monkey with --pre scripts and command-line xxtotext utilities). Baloo finds files by type (--type=image :heart:) for me where a command line using locate(1), find(1), or its Rust replacement fd(1) would be fiddly to set up.

I like Baloo, but you need to spend some time configuring it appropriately.
This is an old article I wrote, just in case:

I must say that as a long-time user of Plasma’s content indexing, sometimes heavily, with a good-sized amount of different files across multiple drives, I have not had any issues for some years, until the past few weeks.

A purge and reindex seems to have fixed whatever issue I had with CPU usage on my older i5-7500T system.

This is nothing like the early days, into the less early days.

Hi, yes, I disabled baloo, removed the index, disabled content scanning and enabled it again. The index is pretty small now.
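
For reference, the content-scanning switch can probably also be flipped from a terminal. Treat this as a sketch under assumptions: the key name `only basic indexing` in the `General` group of `baloofilerc` is what I believe the File Search settings page writes, so check your own `~/.config/baloofilerc` first. The wrapper name `baloo_names_only` is made up.

```shell
# Switch Baloo to names-only indexing by writing one key to baloofilerc.
# ASSUMPTION: key "only basic indexing" in group "General"; verify locally.
baloo_names_only() {
    writer=${1:-kwriteconfig6}
    "$writer" --file baloofilerc --group General --key "only basic indexing" true
}

baloo_names_only echo    # dry run: prints the arguments instead of writing
# baloo_names_only       # real run, assuming kwriteconfig6 is installed
```

In this thread the change only took effect after disabling baloo, purging the index and enabling it again.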

Yes for sure the index filesize should be shown. And the default should be on names only, as it is waaay easier on the OS

The issue was that my CPU load was high for more than an hour and would likely have stayed that way.

Interesting that it can find so many files! This is absolutely useful, but I would like to configure it. I have quite a few junk files, web archives, PDF collections etc. that I don’t want scanned.

Until then, I think not indexing file content is best