I’m trying to export/save a PDF as text. It works fine for most PDFs, but some PDFs produce unreadable text. I’ve googled and found that this could be because an embedded font in the PDF doesn’t have a correct ROS (Registry-Ordering-Supplement). Does anyone here know how to handle this?
Thanks for the suggestion, but I’ve tried masterpdfeditor4 before, and when I tried to change the font the text became unreadable because of the missing mapping (probably due to the lack of a ROS). I’ve also tried FontForge, which can import fonts from a PDF, but it has the same problem. A further complication is that each page in such a document has its own set of several fonts, half of them missing a ROS.
Yes. The character mapping is completely weird. The text looks correct when rendered, but the underlying character codes don’t map to anything readable once decoded.
Like this. First, Okular showing an example in Swedish that means “conversations of interest”, and below it what the clipboard decodes the characters as. It’s the same in masterpdfeditor4, or in any other application for that matter.
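For anyone who wants to spot the affected pages automatically, here is a minimal sketch of a check on text that has already been extracted. It assumes the broken mapping shows up as private-use, unassigned, control, or replacement characters; the function names and the 0.3 threshold are hypothetical, chosen only for illustration. It will not catch the case where glyphs are mapped to ordinary but wrong letters.

```python
import unicodedata

# Unicode general categories that rarely appear in readable text:
# Co = private use, Cn = unassigned, Cc = control, Cs = surrogate.
SUSPICIOUS_CATEGORIES = {"Co", "Cn", "Cc", "Cs"}


def suspicious_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look undecodable."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1
        for c in chars
        # U+FFFD (replacement character) is category So, so check it explicitly.
        if unicodedata.category(c) in SUSPICIOUS_CATEGORIES or c == "\ufffd"
    )
    return bad / len(chars)


def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Heuristic only: the threshold is an assumed value, not a standard one."""
    return suspicious_ratio(text) > threshold
```

Running `looks_garbled` on the text of each page would at least tell you which pages need different handling, since the fonts (and the missing mappings) differ from page to page.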
Yes, I agree. I’ll have to find some discussion group about PDFs and fonts. I’m just adding more PDF information so that those who find this discussion can get something out of it. Some text has been redacted due to sensitive information.
That looks like whoever created that document used a different keyboard layout than what your system is expecting, maybe not even an HID-compliant keyboard.