How to parse PDF Forms
There is one more place where text data can be found: page forms. A Form is a special subtype of XObject that is part of the page resources and can be referenced from the page content by the Do operator.
You may think of a Form as a “small subpage” that is stored apart from the main content.
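To make the Do reference concrete, here is a minimal sketch that lists the XObject names a content stream invokes. It does not use pdfreader at all: the `xobjects_invoked` helper and the sample stream are made up for illustration, and the regular expression is deliberately naive (it would also match `/Name Do` text that happens to sit inside a string literal). Note that Do invokes any XObject, so a name found this way may refer to an image rather than a Form.

```python
import re

# A PDF name (/Fm9) followed by whitespace and the Do operator.
# Naive on purpose: it does not tokenize the stream properly.
DO_CALL = re.compile(rb"/([^\s/\[\]()<>{}%]+)\s+Do\b")


def xobjects_invoked(content_stream):
    """Return the XObject names referenced by Do, in order of appearance."""
    return [name.decode("ascii") for name in DO_CALL.findall(content_stream)]


# Hypothetical content stream fragment: two XObject invocations
# wrapped around one ordinary text-showing block.
sample = b"q 1 0 0 1 72 700 cm /Fm9 Do Q BT /F1 12 Tf (Title) Tj ET q /Fm10 Do Q"
names = xobjects_invoked(sample)
print(names)
```

The font reference `/F1 12 Tf` is skipped because the name is not directly followed by Do.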
Have a look at one
Let’s open the document and get the 1st page.
>>> from pdfreader import SimplePDFViewer
>>> fd = open(file_name, "rb")
>>> viewer = SimplePDFViewer(fd)
Now let’s try to find a string that appears under section B.3 SOC (ONET/OES) occupation title:
>>> viewer.render()
>>> plain_text = "".join(viewer.canvas.strings)
>>> "Farmworkers and Laborers" in plain_text
False
Apparently, the text typed into the form is stored somewhere else: in Form XObjects, listed under the page resources. The viewer puts them on the canvas:
>>> sorted(list(viewer.canvas.forms.keys()))
['Fm1', 'Fm10', 'Fm11', 'Fm12', 'Fm13', 'Fm14',...]
As a Form is a kind of “sub-document”, every entry in the viewer.canvas.forms dictionary maps to its own canvas object, which has its own strings:
>>> form9_canvas = viewer.canvas.forms['Fm9']
>>> "".join(form9_canvas.strings)
'Farmworkers and Laborers, Crop, Nursery, and Greenhouse'
Here we are!
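When you do not know in advance which form holds the text, the lookup above can be generalized into a small search over all form canvases. The sketch below is an illustration, not part of pdfreader: `find_in_forms` and `StubCanvas` are hypothetical names, and the stub only mimics the `strings` and `forms` attributes used in this tutorial so the example runs without a PDF file. It also assumes that a form canvas may itself carry a `forms` mapping; if it does not, the recursion simply stops there.

```python
def find_in_forms(canvas, needle, path=()):
    """Yield (path, text) for every canvas whose joined strings contain needle.

    `path` is the tuple of form names leading to the canvas; () is the page.
    """
    text = "".join(canvas.strings)
    if needle in text:
        yield path, text
    # Assumption: nested canvases expose `forms` too; default to no children.
    for name, sub_canvas in sorted(getattr(canvas, "forms", {}).items()):
        yield from find_in_forms(sub_canvas, needle, path + (name,))


class StubCanvas:
    """Hypothetical stand-in exposing only `strings` and `forms`."""

    def __init__(self, strings=(), forms=None):
        self.strings = list(strings)
        self.forms = forms or {}


# Made-up placeholder data shaped like the example above.
page_canvas = StubCanvas(
    strings=["B.3 SOC (ONET/OES) occupation title"],
    forms={
        "Fm1": StubCanvas(["placeholder text"]),
        "Fm9": StubCanvas(["Farmworkers and Laborers, ",
                           "Crop, Nursery, and Greenhouse"]),
    },
)

hits = list(find_in_forms(page_canvas, "Farmworkers and Laborers"))
print(hits)
```

With a real document you would pass `viewer.canvas` after `viewer.render()` instead of the stub.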
More on PDF Form objects: see sec. 8.10 (Form XObjects) of the PDF specification.