How to parse PDF Forms

In most cases texts come within page binary content streams and can be extracted as in Extracting texts and How to parse PDF texts.

There is one more place where text data can be found: page forms. Form is a special subtype of XObject which is a part of page resources, and can be referenced from page by do command.

You may think of Form as of “small subpage” that is stored aside main content.

Have a look at one PDF form.

Let’s open the document and get the 1st page.

>>> from pdfreader import SimplePDFViewer
>>> fd = open(file_name, "rb")
>>> viewer = SimplePDFViewer(fd)

And now, let’s try to locate a string, located under the section B.3 SOC (ONET/OES) occupation title

../_images/example-parse-form.png
>>> viewer.render()
>>> plain_text = "".join(viewer.canvas.strings)
>>> "Farmworkers and Laborers" in plain_text
False

Apparently, the texts typed into the form are in some other place. They are in Form XObjects, listed under page resources. The viewer puts them on canvas:

>>> sorted(list(viewer.canvas.forms.keys()))
['Fm1', 'Fm10', 'Fm11', 'Fm12', 'Fm13', 'Fm14',...]

As Form is a kind of “sub-document” every entry in viewer.canvas.forms dictionary maps to SimpleCanvas instance:

>>> form9_canvas = viewer.canvas.forms['Fm9']
>>> "".join(form9_canvas.strings)
'Farmworkers and Laborers, Crop, Nursery, and Greenhouse'

Here we are!

More on PDF Form objects: see sec. 8.10