Have a look at the sample file. In this tutorial we will learn simple ways to open it, navigate its pages, and extract images and text.


Before we start, let’s make sure that you have the pdfreader distribution installed. In the Python shell, the following should run without raising an exception:

>>> import pdfreader
>>> from pdfreader import PDFDocument, SimplePDFViewer

How to start

The first step when working with pdfreader is to create a PDFDocument instance from a binary file. Doing so is easy:

>>> fd = open(file_name, "rb")
>>> doc = PDFDocument(fd)

pdfreader implements lazy PDF reading (it never reads more from the file than you ask for), so it’s important to keep the file open while you are working with the document. Make sure you don’t close it until you’re done.
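
Because the file must outlive every lazy read, it helps to tie its lifetime to a with block. The helper below is a minimal sketch, not part of pdfreader: open_document and its factory parameter are hypothetical names, and factory is assumed to be any callable that accepts a binary file object, such as PDFDocument.

```python
from contextlib import contextmanager


@contextmanager
def open_document(file_name, factory):
    """Open file_name in binary mode and yield factory(fd).

    Hypothetical helper: `factory` is any callable that takes a binary
    file object, for example pdfreader.PDFDocument.
    """
    with open(file_name, "rb") as fd:
        # The file stays open for the whole body of the caller's `with`
        # block, so lazy reads made there still have a file to read from.
        yield factory(fd)
    # The file is closed here; the yielded object should not be used afterwards.


# Usage (assumes pdfreader is installed and file_name points to a PDF):
#     with open_document(file_name, PDFDocument) as doc:
#         print(doc.header.version)
```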

It is also possible to use a binary file-like object to create an instance, for example:

>>> from io import BytesIO
>>> with open(file_name, "rb") as f:
...     stream = BytesIO(f.read())
>>> doc2 = PDFDocument(stream)

Let’s check the PDF version of the document:

>>> doc.header.version

Now we can move on to the document catalog and walk through the pages.

How to access Document Catalog

The catalog (aka Document Root) contains all you need to know to start working with the document: metadata, a reference to the pages tree, layout, outlines, etc.

>>> doc.root.Type
>>> doc.root.Metadata.Subtype
>>> doc.root.Outlines.First['Title']
b'Start of Document'

For the full list of document root attributes see PDF-1.7 specification section 7.7.2

How to browse document pages

There is a generator pages() to browse the pages one by one. It yields Page instances.

>>> page_one = next(doc.pages())

You may read all the pages at once:

>>> all_pages = [p for p in doc.pages()]
>>> len(all_pages)

Now we know how many pages there are!
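
If only the number of pages matters, the list is not needed. The following is a minimal sketch: count_pages is a hypothetical helper, not a pdfreader function. It consumes any iterator, such as the one returned by doc.pages(), and counts its items. A stand-in generator is used here so the example runs without a PDF file.

```python
def count_pages(pages):
    """Count the items of an iterator without storing them in a list."""
    return sum(1 for _ in pages)


# Stand-in generator for illustration; with pdfreader you would pass doc.pages().
fake_pages = (f"page {n}" for n in range(1, 4))
print(count_pages(fake_pages))  # prints 3
```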

You may want to get a specific page, especially if your document contains hundreds or thousands of them. This is just a little bit trickier: to get the 6th page you need to walk through the previous five.

>>> from itertools import islice
>>> page_six = next(islice(doc.pages(), 5, 6))
>>> page_five = next(islice(doc.pages(), 4, 5))

Don’t forget that PDF viewers number pages starting from 1, whereas Python lists are indexed from 0.

>>> page_eight = all_pages[7]
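
The off-by-one between page numbers and indexes is easy to get wrong, so it can be wrapped once. This is a sketch under assumptions: get_page is a hypothetical helper, and pages is any iterator of pages such as doc.pages(). It takes the page number as a PDF viewer would show it.

```python
from itertools import islice

_MISSING = object()  # sentinel: distinguishes "no such page" from any real value


def get_page(pages, number):
    """Return the item at 1-based position `number` from an iterator of pages."""
    if number < 1:
        raise ValueError("page numbers start at 1")
    # Skip the first number - 1 items, then take the next one (if any).
    page = next(islice(pages, number - 1, number), _MISSING)
    if page is _MISSING:
        raise IndexError(f"there is no page {number}")
    return page


# Usage (assumes doc is a PDFDocument as created earlier):
#     page_six = get_page(doc.pages(), 6)
```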

Now we can access all page attributes:

>>> page_six.MediaBox
[0, 0, 612, 792]
>>> page_six.Annots[0].Subj
b'Text Box'

It’s possible to access the parent Pages Tree Node of the page, which is a PageTreeNode instance, and all its kids:

>>> page_six.Parent.Type
>>> page_six.Parent.Count
>>> len(page_six.Parent.Kids)

Our example contains only one Pages Tree Node. That is not always the case.
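
In larger documents, page tree nodes can be nested: the Kids of a node may be pages or further nodes. The sketch below shows the idea on plain stand-in objects rather than on pdfreader classes. It assumes each node exposes Type ('Pages' for tree nodes, 'Page' for leaves, as in the PDF specification) and Kids as attributes; iter_leaf_pages and make_leaf are hypothetical names. For real documents, doc.pages() remains the way to browse pages.

```python
from types import SimpleNamespace


def iter_leaf_pages(node):
    """Yield the leaf pages below a page tree node, depth first."""
    if node.Type == "Pages":
        for kid in node.Kids:
            yield from iter_leaf_pages(kid)
    else:
        yield node


def make_leaf(name):
    """Build a stand-in for a single page (illustration only)."""
    return SimpleNamespace(Type="Page", name=name)


# Stand-in tree: a root node holding one page and one nested node with two pages.
inner = SimpleNamespace(Type="Pages", Kids=[make_leaf("p2"), make_leaf("p3")])
root = SimpleNamespace(Type="Pages", Kids=[make_leaf("p1"), inner])
print([page.name for page in iter_leaf_pages(root)])  # prints ['p1', 'p2', 'p3']
```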

For the complete list of Page and Pages attributes see PDF-1.7 specification sections

How to start extracting PDF content

It’s possible to extract raw data with a PDFDocument instance, but it only represents the raw document structure. It can’t interpret PDF content operators, which makes content extraction hard.

Fortunately there is SimplePDFViewer, which understands a lot. It is a simple PDF interpreter which can “display” (whatever this means) a page on a SimpleCanvas.

>>> fd = open(file_name, "rb")
>>> viewer = SimplePDFViewer(fd)

The viewer instance gets the content you see in your Adobe Acrobat Reader. Just navigate to a page with navigate() and call render():

>>> viewer.navigate(8)
>>> viewer.render()

The viewer extracts:
  • page images (XObject)
  • page inline images (BI/ID/EI operators)
  • page forms (XObject)
  • decoded page strings (PDF encodings & CMap support)
  • human (and robot) readable page markdown - original PDF commands containing decoded strings.

Extracting Page Images

There are 2 kinds of images in PDF documents:
  • XObject images
  • inline images

Each kind is represented by its own class (Image and InlineImage).

Let’s extract some pictures now! They are accessible through the canvas attribute. Have a look at page 8 of the sample document. It contains a fax message, which is available in the inline_images list.

>>> len(viewer.canvas.inline_images)
>>> fax_image = viewer.canvas.inline_images[0]
>>> fax_image.Filter
>>> fax_image.Width, fax_image.Height
(1800, 3113)

This would mean little if you couldn’t see the image itself :-) Now let’s convert it to a Pillow/PIL Image object and save it!

>>> pil_image = fax_image.to_Pillow()
>>> pil_image.save('fax-from-p8.png')

Voila! Enjoy opening it in your favorite editor!
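
A page may carry more than one image, so the convert-and-save step is worth looping. This is a minimal sketch: save_images, its directory and prefix parameters, and the file naming scheme are all hypothetical. It only assumes that each item offers to_Pillow(), as fax_image above does.

```python
import os


def save_images(images, directory, prefix="image"):
    """Save each image as a PNG file and return the list of written paths.

    Each item must provide to_Pillow(), returning an object with save(path).
    """
    paths = []
    for index, image in enumerate(images, start=1):
        path = os.path.join(directory, f"{prefix}-{index}.png")
        image.to_Pillow().save(path)
        paths.append(path)
    return paths


# Usage (assumes the viewer has rendered page 8 as shown above):
#     save_images(viewer.canvas.inline_images, ".", prefix="p8-inline")
```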

Check the complete list of Image (sec. 8.9.5) and InlineImage (sec. 8.9.7) attributes.

Extracting texts

Getting text from a page is super easy. It is available on the strings and text_content attributes of the canvas.

Let’s go to the previous page (#7) and extract some data.

>>> viewer.prev()

Remember, when you navigate to another page the viewer resets the canvas.

>>> viewer.canvas.inline_images == []

Let’s render the page and see the texts.
  • Decoded plain text strings are on strings (piece by piece, in the order they appear on the page)
  • Decoded strings with PDF markdown are on text_content

>>> viewer.render()
>>> viewer.canvas.strings
['P', 'E', 'R', 'S', 'O', 'N', 'A', 'L', ... '2', '0', '1', '7']

As you can see, every character comes as an individual string in this page’s content stream, which is unusual.
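
When a page delivers its text one character at a time, joining the pieces gives readable text back. A minimal sketch follows; join_strings is a hypothetical helper, and the sample list is the beginning of the output above. An empty separator suits per-character output like this, while other pages may need a different one.

```python
def join_strings(strings, sep=""):
    """Concatenate decoded string pieces in the order they were extracted."""
    return sep.join(strings)


pieces = ['P', 'E', 'R', 'S', 'O', 'N', 'A', 'L']
print(join_strings(pieces))  # prints PERSONAL

# Usage with a rendered page:
#     text = join_strings(viewer.canvas.strings)
```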

Let’s go to the very first page:

>>> viewer.navigate(1)
>>> viewer.render()
>>> viewer.canvas.strings
[' ', 'P', 'l', 'a', 'i', 'nt', 'i', 'f', 'f', ... '10/28/2019 1:49 PM', '19CV47031']

PDF markdown is also available.

>>> viewer.canvas.text_content
"\n BT\n0 0 0 rg\n/GS0 gs... ET"

And the strings are decoded properly. Have a look at the file:

>>> with open("tutorial-sample-content-stream-p1.txt", "w") as f:
...     f.write(viewer.canvas.text_content)

pdfreader takes care of decoding binary streams, character encodings, CMap, fonts, etc. In the end you have human-readable content sources and markdown.
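
Putting the steps together, the same sequence of navigate, render and read can be repeated for several pages. The function below is a sketch under assumptions: collect_strings is a hypothetical name, and viewer is expected to behave like the SimplePDFViewer used above, offering navigate(), render() and canvas.strings.

```python
def collect_strings(viewer, page_numbers):
    """Return {page_number: list of decoded strings} for the given pages."""
    result = {}
    for number in page_numbers:
        viewer.navigate(number)  # page numbers start from 1
        viewer.render()          # fills viewer.canvas for the current page
        # Copy the list, since the canvas is reset on the next navigation.
        result[number] = list(viewer.canvas.strings)
    return result


# Usage (assumes viewer = SimplePDFViewer(fd) with fd still open):
#     strings_by_page = collect_strings(viewer, [1, 7])
```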