Read PDF?


xlsior(Posted 2008) [#1]
I know that there are some libraries that will allow BlitzMax to create PDF files, but... Does anyone know of any library that enables you to read PDF files?

More specifically, I'm looking for a way to programmatically extract the text from plain-vanilla PDFs. No images or fancy layouts, just text.


Canardian(Posted 2008) [#2]
I would use the OpenOffice API. This example saves a PDF, but going the other way should work along similar lines: http://codesnippets.services.openoffice.org/Writer/Writer.StoreWriterAsPDF.snip


daaan(Posted 2008) [#3]
Look up the file format and then write a loader that only extracts the text.


xlsior(Posted 2008) [#4]
Look up the file format and then write a loader that only extracts the text


I looked at it, and found that there isn't enough voodoo in the world for me to make enough sense of it to write a loader from scratch. :-)

Unfortunately, depending on OpenOffice isn't really a solution either, so I guess I'll keep looking...


Arowx(Posted 2008) [#5]
I found some C++ code that is aimed at extracting the text...



It looks like the main problem is uncompressing the encoded text. Hope this helps!
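
As a rough illustration of that decompression step (this is not the C++ code mentioned above, just a minimal Python sketch), the snippet below inflates every stream that zlib accepts and collects the strings passed to the Tj and TJ text operators. It assumes Flate-compressed content streams and fonts with a standard single-byte encoding; the function names are made up for the example, and octal escapes, nested parentheses, hex strings and custom font encodings are not handled.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords of an object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Text-showing operators: "(string) Tj" or "[(str) num (str)] TJ".
TEXT_OP_RE = re.compile(
    rb"\((?P<tj>(?:\\.|[^\\()])*)\)\s*Tj"
    rb"|\[(?P<tjarr>(?:\\.|[^\]\\])*)\]\s*TJ",
    re.DOTALL,
)
# A literal string without nested, unescaped parentheses.
LITERAL_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)", re.DOTALL)
# Backslash-escaped parentheses and backslashes inside a literal string.
UNESCAPE_RE = re.compile(rb"\\([()\\])")


def inflate_streams(pdf_bytes):
    """Yield the inflated contents of every stream zlib can decompress."""
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the trailing end-of-line bytes.
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (uncompressed, ASCII85, image, ...)


def text_from_content(content):
    """Return the strings shown by Tj/TJ operators, in stream order."""
    parts = []
    for op in TEXT_OP_RE.finditer(content):
        if op.group("tj") is not None:
            raw = [op.group("tj")]
        else:
            raw = LITERAL_RE.findall(op.group("tjarr"))
        joined = b"".join(UNESCAPE_RE.sub(rb"\1", s) for s in raw)
        # Assumption: bytes map directly to characters (Latin-1).
        parts.append(joined.decode("latin-1"))
    return parts


def extract_text(pdf_bytes):
    """Concatenate the text found in all inflatable streams, file order."""
    lines = []
    for content in inflate_streams(pdf_bytes):
        lines.extend(text_from_content(content))
    return "\n".join(lines)
```

To try it, read a file in binary mode and pass the bytes in, e.g. extract_text(open("sample.pdf", "rb").read()), where sample.pdf is a placeholder name. Text comes out in the order the streams appear in the file.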


Foolish(Posted 2008) [#6]
Developing your own code to extract text from PDF files isn't trivial. I would suggest you look into creating a wrapper for an existing library.

You can find developer information at www.pdfzone.com

The example above (extractPDFText.aspx) assumes only one kind of filter/encoding; ASCII85 encoding may also be used. PDF files are not required to store data in page order, so this code extracts text in the order it appears in the file, which may not match the logical page order. If the file has been linearized or incrementally updated, then you may also find "legacy" objects that aren't even used by the file anymore.
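
To make the filter point concrete, here is a small Python sketch of picking decoders from a stream dictionary's /Filter entry rather than assuming one encoding. It only covers ASCII85Decode and FlateDecode, the helper names are invented for the example, and the dictionary is scanned with a regular expression instead of being parsed properly, so treat it as an assumption-laden outline and not a complete decoder.

```python
import base64
import re
import zlib

# "/Filter /Name" or "/Filter [/Name1 /Name2]" inside a stream dictionary.
FILTER_RE = re.compile(rb"/Filter\s*(\[[^\]]*\]|/\w+)")
NAME_RE = re.compile(rb"/(\w+)")


def ascii85_decode(data):
    """Decode ASCII85 data, dropping the '~>' end marker if present."""
    end = data.find(b"~>")
    if end != -1:
        data = data[:end]
    if data.startswith(b"<~"):
        data = data[2:]
    return base64.a85decode(data)  # whitespace is ignored by default


def flate_decode(data):
    """Inflate zlib data, ignoring any trailing end-of-line bytes."""
    return zlib.decompressobj().decompress(data)


# Only two filters are covered here; real files may use others.
DECODERS = {b"ASCII85Decode": ascii85_decode, b"FlateDecode": flate_decode}


def filters_for(stream_dict):
    """Return the filter names listed in the dictionary, in order."""
    match = FILTER_RE.search(stream_dict)
    if match is None:
        return []
    return NAME_RE.findall(match.group(1))


def decode_stream(stream_dict, raw):
    """Apply each listed filter in turn to recover the stream contents."""
    for name in filters_for(stream_dict):
        decoder = DECODERS.get(name)
        if decoder is None:
            raise ValueError("unsupported filter: " + name.decode("ascii"))
        raw = decoder(raw)
    return raw
```

Filters in an array are applied in the order listed, so a stream marked [/ASCII85Decode /FlateDecode] is ASCII85-decoded first and inflated second. This does nothing about page order or superseded objects; those require walking the document's page tree and cross-reference data.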

PDF is a bear of a format to work with when it comes to dealing with pre-existing files. There are more libraries to generate PDFs because it's easier.


xlsior(Posted 2008) [#7]
Merx: Thanks, I'll take a look at that

The example above (extractPDFText.aspx) assumes only one kind of filter/encoding; ASCII85 encoding may also be used. PDF files are not required to store data in page order, so this code extracts text in the order it appears in the file, which may not match the logical page order. If the file has been linearized or incrementally updated, then you may also find "legacy" objects that aren't even used by the file anymore.


The PDFs I'm trying to read are pretty plain vanilla, so I doubt that I'll run into out-of-order issues... But thanks for the warning.