How to browse PDF objects

There could be a reason when you need to access raw PDF objects as they are in the document. Or even get an object by its number and generation, which is also possible. Let’s see several examples with PDFDocument.

Accessing document objects

Let’s take a sample file from How to access Document Catalog tutorial. We already discussed there how to locate document catalog.

>>> from pdfreader import PDFDocument
>>> fd = open(file_name, "rb")
>>> doc = PDFDocument(fd)
>>> catalog = doc.root

To walk through the document you need to know object attributes and possible values. It can be found on PDF-1.7 specification. Then simply use attribute names in your python code.

>>> catalog.Type
'Catalog'
>>> catalog.Metadata.Type
'Metadata'
>>> catalog.Metadata.Subtype
'XML'
>>> pages_tree_root = catalog.Pages
>>> pages_tree_root.Type
'Pages'

Attribute names are cases sensitive. Missing or non-existing attributes have value of None

>>> catalog.type is None
True
>>> catalog.Metadata.subType is None
True
>>> catalog.Metadata.UnkNown_AttriBute is None
True

If object is an array, access its items by index:

>>> first_page = pages_tree_root.Kids[0]
>>> first_page.Type
'Page'
>>> first_page.Contents.Length
3890

If object is a stream, you can get either raw data (deflated in this example):

>>> raw_data = first_page.Contents.stream
>>> first_page.Contents.Length == len(raw_data)
True
>>> first_page.Contents.Filter
'FlateDecode'

or decoded content:

>>> decoded_content = first_page.Contents.filtered
>>> len(decoded_content)
18428
>>> decoded_content.startswith(b'BT\n0 0 0 rg\n/GS0 gs')
True

All object reads are lazy. pdfreader reads an object when you access it for the first time.

Locate objects by number and generation

On the file structure level all objects have unique number an generation to identify them. To get an object by number and generation (for example to track object changes if incremental updates took place on file), just run:

>>> num, gen = 2, 0
>>> raw_obj = doc.locate_object(num, gen)
>>> obj = doc.build(raw_obj)
>>> obj.Type
'Catalog'