Why PDFs Are Not the Problem for AI (Poor Ingestion Is)

We have processed a lot of PDF documents at Instro. Technical manuals, operating procedures, compliance records, product specifications, all the kinds of documents that manufacturing and engineering businesses have been building up for years, often decades. And the honest answer is that PDFs are not really the problem. What you do with them is.
Where we started
When we first started consuming PDF documents for AI systems, the approach was simple enough: convert the PDF to Markdown, split it into chunks, push it into a vector database. That is broadly how most tools handle it, and for clean, well-structured documents it works reasonably well.
But real-world documents are not clean. We learned that quickly.
Split a chunk in the wrong place and you can sever a sentence in two, leading an AI to either miss the point or, worse, return something misleading. Chunking decisions that sound trivial in theory turn out to matter a lot in practice when someone is relying on the output for a technical decision.
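To make the chunking point concrete, here is a minimal sketch of splitting on sentence boundaries rather than at a fixed character count. It is an illustration only, not the Instro pipeline: the function names, the `max_chars` parameter and the regex-based sentence splitter are all assumptions, and a real document set would need a more robust tokenizer.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    # Abbreviations, decimals and part numbers will confuse it.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_by_sentence(text: str, max_chars: int = 200) -> list[str]:
    """Pack whole sentences into chunks of roughly max_chars, never cutting a sentence."""
    chunks: list[str] = []
    current = ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the chunk first.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

A sentence longer than `max_chars` is kept whole in its own chunk, on the view that an oversized chunk is less harmful than half an instruction.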
Then there are the images. A naive conversion ignores them entirely. That is fine until you are dealing with a technical manual where a diagram is doing half the explanatory work. Ignore the diagram and you have a gap in the knowledge base that the AI cannot tell you about: it simply does not know what it does not know.
And then there are the PDFs that are not really text at all. Scanned pages from old manuals, essentially large images of paper. Conversions from other formats that carry encoding noise the original author never intended. Documents with structural quirks that produce pages of white space or stray characters when processed without care.
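One way to catch these cases early is a cheap per-page check before any conversion is attempted. The sketch below is hypothetical: the `PageInfo` fields, the labels and the thresholds are assumptions chosen for illustration, and the page text and image count would come from whichever PDF library is in use.

```python
from dataclasses import dataclass


@dataclass
class PageInfo:
    text: str         # text layer extracted by a PDF library (assumed input)
    image_count: int  # number of embedded images on the page (assumed input)


def noise_ratio(text: str) -> float:
    """Share of characters that are replacement characters or non-printable."""
    if not text:
        return 0.0
    bad = sum(
        1
        for ch in text
        if ch == "\ufffd" or (not ch.isprintable() and not ch.isspace())
    )
    return bad / len(text)


def triage_page(page: PageInfo, min_chars: int = 40, max_noise: float = 0.1) -> str:
    # Thresholds are illustrative guesses, not tuned values.
    stripped = page.text.strip()
    if len(stripped) < min_chars:
        # Almost no text: either a scanned image of a page, or genuinely empty.
        return "needs_ocr" if page.image_count > 0 else "blank"
    if noise_ratio(stripped) > max_noise:
        return "encoding_noise"
    return "text"
```

Each label can then route the page to different handling, for example OCR for scanned pages and a clean-up pass for pages flagged as noisy.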
What we built
Over the course of multiple trials with manufacturers and engineering businesses, we kept running into variations of the same problem. Each time we adapted the pipeline. Each time we learned something new. Eventually it became clear that what we actually needed was a single, purpose-built tool that could handle all of this. Not by trying to reproduce the original document, but by holistically extracting the knowledge from documents in a form that an AI can actually use.
That tool became known within Instro as the AI Data Ingestion Tool, and it is now one of the core components of the Instro platform.
The processing happens in two phases. The first, and by far the most complex, is conversion. Documents are analysed before processing, almost like a triage, so that the right handling can be applied based on what is actually in the document. Images are extracted and passed through an AI classification step that determines what kind of image each one is: perhaps a technical diagram, a photograph of a component, or simply a decorative element with no bearing on the content. Decorative images are typically discarded. Technical images are processed properly, with descriptions generated that capture what is in them, detailed enough that when the system answers a question and a relevant diagram exists, it can surface it with context. Not just "there is an image here" but "this diagram shows the component you are asking about, and here it is."
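The image handling described above can be sketched as a simple routing step. This is an assumed shape rather than the actual tool: `classify` and `describe` stand in for the AI classification and description calls, and the type names and the `"decorative"` label are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ExtractedImage:
    image_id: str
    page: int
    data: bytes


@dataclass
class ImageRecord:
    image_id: str
    page: int
    kind: str
    description: str


def process_images(
    images: Iterable[ExtractedImage],
    classify: Callable[[ExtractedImage], str],  # stand-in for the AI classifier
    describe: Callable[[ExtractedImage], str],  # stand-in for description generation
) -> list[ImageRecord]:
    records: list[ImageRecord] = []
    for img in images:
        kind = classify(img)
        if kind == "decorative":
            # Decorative images carry no content, so they are dropped here.
            continue
        # Keep the id and page so the image can be surfaced alongside an answer.
        records.append(ImageRecord(img.image_id, img.page, kind, describe(img)))
    return records
```

Storing the description next to the image id and page number is what would let a retrieval step return the diagram itself along with the text that explains it.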
The second phase handles the preparation and ingestion of that cleaned, structured content into vector storage, the mechanism that allows the AI to retrieve the right information in response to a question, even across a large and varied body of documents.
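For readers unfamiliar with vector storage, the toy example below shows the basic idea: each chunk is stored as a vector, and a question retrieves the chunks whose vectors are most similar to its own. Everything here is a simplification for illustration. The bag-of-words `embed` function stands in for a real embedding model, `TinyVectorStore` is a hypothetical in-memory class, and the sample chunks and file names are invented.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Stand-in for a real embedding model (assumption): bag-of-words counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class TinyVectorStore:
    def __init__(self) -> None:
        self._items: list[tuple[Counter, str, str]] = []

    def add(self, chunk: str, source: str) -> None:
        # Keep the source alongside the vector so answers can cite a document.
        self._items.append((embed(chunk), chunk, source))

    def search(self, query: str, k: int = 1) -> list[tuple[str, str]]:
        q = embed(query)
        ranked = sorted(self._items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [(chunk, source) for _, chunk, source in ranked[:k]]


store = TinyVectorStore()
store.add("Torque the flange bolts to the specified value", "manual_a.pdf")
store.add("Replace the filter every six months", "procedure_b.pdf")
top = store.search("how often to replace the filter")
```

Because each stored chunk keeps its source, the same lookup works across many documents at once, which is the property the second phase relies on.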
Why it matters
There is a tendency in AI to focus almost entirely on the model. The model is important. But in most real business applications, the quality of the answer depends at least as much on the quality of what went in. The old adage “garbage in, garbage out” may be truer in this context than in any other.
If the source content is poorly structured, if images have been discarded, if chunks have been split carelessly, or if the ingestion process had no way of handling the specific quirks of your document estate, then even a powerful model will return results you cannot fully rely on.
That is the problem we set out to solve. Not because PDFs are going away (they are not, and for most businesses they should not have to), but because the organisations we work with have years of valuable knowledge locked inside them, and getting AI to use that knowledge reliably is where the real work lies.
How Instro helps
If your business has valuable knowledge stored in manuals, technical procedures, specifications or compliance documentation, and you want AI to be able to use that knowledge dependably rather than approximately, that is exactly what we have built for.