Okular exports unreadable text from PDF

Hi,

I’m trying to export/save a PDF as text. It works fine for most PDFs, but some PDFs produce unreadable text. I’ve googled and found that it could be because the embedded font in the PDF doesn’t have a correct ROS (Registry-Ordering-Supplement). Does anyone here know how to handle this?

you could try opening the document in a PDF editor like LibreOffice Draw and changing the font(s).

there doesn’t seem to be any kind of global font selection tho, so you might be down to changing the font of each text box individually (tedious).

or you could try scanning it and using the scanner OCR to extract the text… tho, you may get similar gibberish or worse.
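For anyone wanting to try the OCR route without a physical scanner, here is a minimal sketch that renders one page to an image and runs OCR on it. It assumes `pdftoppm` (from poppler-utils, the same package as `pdffonts`) and `tesseract` with the Swedish language data (`swe`) are installed. The function names, the `ocr_tmp` working directory and the 300 dpi default are made up for illustration, not anything from this thread.

```python
import subprocess
from pathlib import Path


def ocr_commands(pdf, page, workdir="ocr_tmp", lang="swe", dpi=300):
    """Build the two command lines: render one page to PNG, then OCR it."""
    prefix = str(Path(workdir) / f"page-{page}")
    # -singlefile makes pdftoppm write exactly "<prefix>.png" with no page digits.
    render = ["pdftoppm", "-f", str(page), "-l", str(page),
              "-r", str(dpi), "-png", "-singlefile", pdf, prefix]
    # "stdout" as the output base tells tesseract to print the recognized text.
    recognize = ["tesseract", prefix + ".png", "stdout", "-l", lang]
    return render, recognize


def ocr_page(pdf, page, **kwargs):
    """Run both commands and return the recognized text for one page."""
    render, recognize = ocr_commands(pdf, page, **kwargs)
    Path(render[-1]).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(render, check=True)
    result = subprocess.run(recognize, check=True, capture_output=True, text=True)
    return result.stdout


# Example (needs both tools installed): print(ocr_page("test.pdf", 1))
```

Because OCR works from the rendered pixels, it does not depend on the font's character mapping at all, though recognition errors are still possible.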

Thanks for the suggestion, but I’ve tried masterpdfeditor4 before, and when I tried to change the font the text became unreadable due to the missing mapping (probably because of the missing ROS). I’ve also tried FontForge. FontForge can import fonts from a PDF, but it has the same problem. Another problem is that each page in such a document has its own fonts (several), half of them missing ROS.

$ pdffonts test.pdf | wc -l 
2409
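Note that `pdffonts` prints a two-line header (column names and a row of dashes), so `wc -l` counts those as well and the 2409 lines correspond to 2407 font entries. The `uni` column of that output reports whether each font carries a ToUnicode map, which is what text extraction usually relies on. A minimal sketch for counting the fonts that lack one follows; the function name is made up for illustration, and it assumes the usual column layout where the last five fields are `emb`, `sub`, `uni` and the two-part object ID.

```python
def count_fonts_without_tounicode(pdffonts_output):
    """Return (total_fonts, fonts_with_uni_no) from the text output of pdffonts."""
    total = 0
    missing = 0
    # Skip the column header and the row of dashes.
    for line in pdffonts_output.splitlines()[2:]:
        fields = line.split()
        if len(fields) < 6:
            continue  # blank or unexpected line
        # Font types such as "CID TrueType" contain spaces, so count from the
        # right: [... emb sub uni object_number generation]
        uni = fields[-3]
        total += 1
        if uni == "no":
            missing += 1
    return total, missing


# Example (needs pdffonts installed):
#   import subprocess
#   out = subprocess.run(["pdffonts", "test.pdf"], capture_output=True,
#                        text=True, check=True).stdout
#   print(count_fonts_without_tounicode(out))
```

Comparing the second number against the pages that produce gibberish would show whether the missing maps line up with the unreadable text.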

that’s a lot of fonts.

perhaps you could target the font replacement efforts to the areas of the document that are most likely to generate gibberish.

i don’t understand what this means… when you select the text and change the font it changes to different letters than what you see on the pdf?

perhaps some screenshots would help.

Yes. The character mapping is completely weird. The text looks correct when rendered, but the underlying character codes don’t map to anything readable when decoded.

Like this: first Okular showing an example in Swedish, meaning “conversations of interest”, and below it what the clipboard decodes the characters as. It’s the same in masterpdfeditor4, or any other application for that matter.
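To illustrate why text can render correctly and still copy as gibberish, here is a toy model in Python. It is not how any particular PDF producer or viewer works: it just assumes a subset font whose character codes were handed out in order of first use, and an extractor that, lacking a code-to-Unicode map, falls back to guessing. The sample string and the ASCII-offset fallback are made up for illustration.

```python
def simulate(text):
    """Toy model of a subset font with arbitrary character codes."""
    # The producer assigns codes 1, 2, 3, ... in order of first appearance.
    code_for = {}
    for ch in text:
        code_for.setdefault(ch, len(code_for) + 1)
    codes = [code_for[ch] for ch in text]  # what the page content stores

    # The embedded font maps each code to a glyph shape, so rendering is fine.
    glyph_for = {code: ch for ch, code in code_for.items()}
    rendered = "".join(glyph_for[c] for c in codes)

    # With a code-to-Unicode map, extraction recovers the text.
    to_unicode = dict(glyph_for)
    with_map = "".join(to_unicode[c] for c in codes)

    # Without one, the extractor has to guess; here it treats each code as an
    # offset into printable ASCII (a hypothetical fallback).
    without_map = "".join(chr(0x20 + c) for c in codes)
    return rendered, with_map, without_map


rendered, with_map, without_map = simulate("hello world")
print(rendered)      # looks right on screen
print(with_map)      # copies correctly
print(without_map)   # copies as unrelated symbols
```

In this model nothing is wrong with the glyphs themselves; only the information needed to turn codes back into characters is missing, which matches text that displays fine but pastes as nonsense.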

99.99% sure the file was generated incorrectly and there’s nothing we can do.

Yes, I agree. I’ll have to find some discussion group about PDFs and fonts. I’m just adding more PDF information so that those who find this discussion can get something out of it. Some text has been redacted due to sensitive information.

$ pdfinfo test.pdf
Title:           <redacted> 
Author:          <redacted> 
Creator:         Runner L-03.00.03 
Producer:        3-Heights(TM) Document Converter 6.25.0.4 (www.pdf-tools.com); modified using iText® Core 8.0.1 (production version) ©2000-2023 Apryse Group NV, <redacted>
CreationDate:    <redacted>
ModDate:         <redacted>
Custom Metadata: no 
Metadata Stream: yes 
Tagged:          no 
UserProperties:  no 
Suspects:        no 
Form:            none 
JavaScript:      no 
Pages:           1332 
Encrypted:       no 
Page size:       595 x 842 pts (A4) 
Page rot:        0 
File size:       159294977 bytes 
Optimized:       no 
PDF version:     1.4

that looks like whoever created that document used a different keyboard layout than what your system is expecting, maybe not even an HID-compliant keyboard.