Do PDFs created under Linux differ from PDFs created under Windows?

I have noticed that PDFs created under Linux seem to be somehow different from ones created under Windows. I’m keeping it a bit vague on purpose, as I am looking for pointers to research so I can maybe avoid problems in the future.

What I experience:

  • Sometimes I cannot upload a PDF to a web service because it cannot be “parsed” by the service
  • Some AI tools like LM Studio or GPT4All fail to parse the PDF correctly as well

I’m mostly using LibreOffice to generate PDFs, but I noticed similar behavior with PDFs generated by Kate as well.

I assume that most services are optimized for PDFs created by Adobe products, and I have noticed differences/problems with PDFs created by other tools under Windows (PDF24, Foxit, others) too.

What I would like to find out is whether there are settings that would help with producing PDFs that get parsed correctly.

I hope someone who works more with PDFs knows some tricks in that regard :wink:
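One way to get a first idea of what a parser-side tool actually sees (as opposed to what a viewer renders) is to open the file with a PDF library and look at the producer/creator metadata and whether it opens cleanly at all. A minimal sketch, assuming the pypdf library is installed and `application.pdf` is a stand-in name for the file in question:

```python
# Minimal sketch: print who produced the PDF and confirm it can be opened/read.
# Assumes pypdf is installed; "application.pdf" is a hypothetical file name.
from pypdf import PdfReader

reader = PdfReader("application.pdf")
print("Pages:", len(reader.pages))

info = reader.metadata  # may be None if the document has no info dictionary
if info is not None:
    print("Producer:", info.producer)
    print("Creator: ", info.creator)
```

If a library like this already struggles with the file, a web service or AI tool will most likely stumble over it too; comparing the Producer/Creator strings of a file that works with one that gets rejected is often a useful first clue.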

It would depend on the program or library in question. Some programs format/structure the PDF better than others.

Yeah, this ain’t easy to figure out, as sometimes I just notice that an upload or email attachment fails, but the “error message” hardly ever describes the problem. That’s why I’m kind of looking for someone who is an expert on the matter. It’s a bit like trying to design an email that looks the same in all possible clients. :wink:

PDF is an open spec. I would suggest going through the spec guides Adobe publishes, in addition to asking around here.

It feels a bit like most “scans” are optimized only against Adobe PDFs, and if your PDF differs from that, it automagically gets flagged as malicious. For the AI tools I can’t even guess what’s happening there; some PDFs get parsed into completely different Unicode glyphs, looking like a Thai font or similar. :-/

If a PDF is rasterized, it’s more difficult to search for text than if it isn’t. Also, there are multiple ways text can be rendered in a PDF document. It’s an extremely convoluted spec, so the programs in question, like you said, are only tested against how Adobe exports them.
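A rough heuristic for telling the two cases apart is to check whether any text can be extracted per page. A sketch along those lines, again assuming pypdf and using a hypothetical `scan.pdf`:

```python
# Heuristic sketch: a page that renders fine in a viewer but yields no extractable
# text usually has no text layer (rasterized/scanned) and would need OCR.
# Assumes pypdf is installed; "scan.pdf" is a hypothetical file name.
from pypdf import PdfReader

reader = PdfReader("scan.pdf")
for number, page in enumerate(reader.pages, start=1):
    text = (page.extract_text() or "").strip()
    if text:
        print(f"page {number}: {len(text)} extractable characters")
    else:
        print(f"page {number}: no text layer found (probably rasterized)")
```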

I highly doubt that Adobe itself works correctly with respect to those specs; the last time I tried to use Adobe Reader it took me about three minutes to close all the advertisement pop-ups for new super hyper features, functions and subscription offers before I could see the content. :wink:

It’s really kind of a monster format; the animated 3D view part alone is a monster. I do doubt that text gets rasterized by default under Linux, though. But that’s just guessing on my part, I’m more of a typography and graphics guy than a coder. :-/

I would be surprised if the operating system had any influence on PDF creation.

LibreOffice, for example, should be generating the same PDF on Linux and Windows.
Same for Kate.

If their respective variants do indeed generate different output depending on the platform, it might be possible to use a diff tool to get at some of the differences.
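For the diff itself, a plain text diff is useless on compressed PDFs, but qpdf can rewrite them into its QDF form, which is uncompressed and consistently ordered. A small sketch, assuming the qpdf command-line tool is installed and `linux.pdf` / `windows.pdf` are hypothetical names for the two exports of the same document:

```python
# Sketch: normalize two PDFs with qpdf's QDF mode so a text diff becomes usable.
# Assumes the qpdf binary is on PATH; file names are placeholders.
import subprocess

for name in ("linux.pdf", "windows.pdf"):
    subprocess.run(
        ["qpdf", "--qdf", "--object-streams=disable", name, name + ".qdf"],
        check=True,
    )

# Afterwards, any text diff tool works, e.g.: diff linux.pdf.qdf windows.pdf.qdf
```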

You could, alternatively, jump into the realm of nroff, HTML, or even LaTeX.

Nowadays I would rather improve my skillz in Markdown, but it’s the same kind of issue there: there are a ton of “flavours” but no robust underlying standard that behaves the same in every one of them. :frowning:

There is a standard called CommonMark; the real issue is that GitHub, Pandoc, and others have their own extensions to the spec, such as fenced code blocks or raw LaTeX, for example.

I think you are right about LibreOffice and Kate producing the same PDF on both platforms; what I suspect is that the PDF gets interpreted differently under different OSes, depending on installed language packs, OCR methods, UTF or other text handling, and so on, depending on what the OS prefers or has installed.

Yes, on a daily basis I find it more and more difficult to sort through all of that, while I just wish for a way to produce something that looks and behaves the same no matter how and with what you look at it. I’m really getting way too old for this stuff, haha.

But… PDF was the format that was meant to provide exactly that: a universal WYSIWYG format that looks the same everywhere. Together with print on demand it pretty much killed the old typesetting and printing industry, yet by now it’s broken by feature overkill.

Maybe we should give XPS a chance. coughs and runs :wink:

I am a bit confused now.
Didn’t you write that you had gotten different PDFs on the two platforms?

Maybe, but I would still be surprised if, say, Okular handled PDFs differently on Windows and Linux.

Perhaps if the PDF refers to a font that is not present on the system and not embedded in the PDF either. However, parsing/understanding the PDF should be fairly platform agnostic.
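That is fairly easy to check, by the way: a font is embedded if its descriptor carries a /FontFile, /FontFile2 or /FontFile3 entry. A hedged sketch with pypdf, using a hypothetical `document.pdf`:

```python
# Sketch: walk each page's font resources and report whether the font program is
# embedded or only referenced by name. Assumes pypdf; "document.pdf" is a placeholder.
from pypdf import PdfReader

reader = PdfReader("document.pdf")
for number, page in enumerate(reader.pages, start=1):
    resources = page.get("/Resources")
    if resources is None:
        continue
    fonts = resources.get_object().get("/Font")
    if fonts is None:
        continue
    for font_ref in fonts.get_object().values():
        font = font_ref.get_object()
        if font.get("/Subtype") == "/Type0" and "/DescendantFonts" in font:
            # composite fonts keep their descriptor on the descendant font
            font = font["/DescendantFonts"].get_object()[0].get_object()
        descriptor = font.get("/FontDescriptor")
        embedded = descriptor is not None and any(
            key in descriptor.get_object()
            for key in ("/FontFile", "/FontFile2", "/FontFile3")
        )
        print(f"page {number}: {font.get('/BaseFont')} embedded={embedded}")
```

Anything reported as not embedded gets substituted from whatever the viewing system has installed, which is one of the few places where the platform really can change what you see.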

I am not sure about “killing”, but I think PDF is now the format used even in industrial printing. At least every printing company I’ve done business with had no issues with the PDFs I had sent them.

I tried to express that PDFs get interpreted differently depending on how you produce them. My main problem is that PDFs produced with non-Adobe tools, or outside Windows, get rejected or parsed incorrectly by various “services” or AI tools. For example, I have had job applications rejected because the software scanning the incoming PDF considered it malicious, apparently because it was not created by Adobe tools.

PDF becoming that standard format is exactly what killed a major part of the printing industry in the 90s, that’s what I’m referring to. :wink: I’m talking about preprint production in PageMaker, QuarkXPress and the like. :wink:

I adore Donald Knuth for what he achieved with TeX/LaTeX, but have you ever tried to produce a modern-looking, state-of-the-art job application with those tools? :wink: