Have a look at the
In this tutorial we will learn simple methods for:
- opening a PDF document
- navigating its pages
- extracting images and text.
Before we start, let’s make sure that you have the pdfreader distribution installed. In the Python shell, the following should run without raising an exception:
>>> import pdfreader
>>> from pdfreader import PDFDocument, SimplePDFViewer
How to start
The first step when working with pdfreader is to create a
PDFDocument instance from a binary file. Doing so is easy:
>>> fd = open(file_name, "rb")
>>> doc = PDFDocument(fd)
pdfreader implements lazy PDF reading (it never reads more from the file than you ask for), so it’s important to keep the file open while you are working with the document. Make sure you don’t close it until you’re done.
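To see why the file has to stay open, here is a minimal sketch of lazy reading that uses only the standard library. `LazyRecord` is a hypothetical stand-in, not part of pdfreader; it only illustrates the general pattern of deferring a read until an attribute is requested.

```python
from io import BytesIO


class LazyRecord:
    """Hypothetical stand-in for a lazily read document (not pdfreader's code)."""

    def __init__(self, stream):
        # Nothing is read here; only the stream reference is stored.
        self._stream = stream

    @property
    def header(self):
        # The read happens only when the attribute is accessed.
        self._stream.seek(0)
        return self._stream.read(8)


stream = BytesIO(b"%PDF-1.6\n...rest of the file...")
record = LazyRecord(stream)
first = record.header          # works: the stream is still open

stream.close()
try:
    record.header              # the deferred read now hits a closed stream
    failed = False
except ValueError:
    failed = True
```

The second access fails only because the read was postponed until after the stream was closed, which is the situation the advice above helps you avoid.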
It is also possible to use a binary file-like object to create an instance, for example:
>>> from io import BytesIO
>>> with open(file_name, "rb") as f:
...     stream = BytesIO(f.read())
>>> doc2 = PDFDocument(stream)
Let’s check the PDF version of the document:
>>> doc.header.version
'1.6'
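For context, a PDF file starts with a header line of the form `%PDF-x.y`, and that is presumably where the version shown above comes from. The sketch below reads that header with the standard library only; `read_pdf_version` is a hypothetical helper written for illustration, not a pdfreader function, and the sample bytes are made up.

```python
import re
from io import BytesIO


def read_pdf_version(stream):
    """Return the version from a '%PDF-x.y' header, or None if it is absent."""
    stream.seek(0)
    first_bytes = stream.read(16)
    match = re.match(rb"%PDF-(\d+\.\d+)", first_bytes)
    return match.group(1).decode("ascii") if match else None


# Made-up sample: a header line followed by the usual binary comment line.
sample = BytesIO(b"%PDF-1.6\n%\xe2\xe3\xcf\xd3\n")
version = read_pdf_version(sample)
not_a_pdf = read_pdf_version(BytesIO(b"plain text"))
```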
Now we can move on to the document catalog and walk through the pages.
How to access Document Catalog
The Catalog (aka Document Root) contains all you need to know to start working with
the document: metadata, a reference to the pages tree, layout, outlines, etc.
>>> doc.root.Type
'Catalog'
>>> doc.root.Metadata.Subtype
'XML'
>>> doc.root.Outlines.First['Title']
b'Start of Document'
For the full list of document root attributes see PDF-1.7 specification section 7.7.2
How to browse document pages
>>> page_one = next(doc.pages())
You may read all the pages at once:
>>> all_pages = [p for p in doc.pages()]
>>> len(all_pages)
15
Now we know how many pages there are!
You may wish to get a specific page if your document contains hundreds or thousands of them. Doing this is just a little bit trickier. To get the 6th page you need to walk through the previous five.
>>> from itertools import islice
>>> page_six = next(islice(doc.pages(), 5, 6))
>>> page_five = next(islice(doc.pages(), 4, 5))
Don’t forget that all PDF viewers start page numbering from 1, whereas Python lists start indexing from 0.
>>> page_eight = all_pages[7]
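The off-by-one between viewer page numbers and Python indexes is easy to get wrong, so it can help to hide it in a small helper. The sketch below is one way to do that; `get_page` and `fake_pages` are hypothetical names, and the generator only stands in for `doc.pages()` so the example runs without a PDF file.

```python
from itertools import islice


def get_page(pages, number):
    """Return page `number` (1-based) from an iterator of pages."""
    if number < 1:
        raise ValueError("page numbers start at 1")
    try:
        # islice skips the first number - 1 items and yields the next one.
        return next(islice(pages, number - 1, number))
    except StopIteration:
        raise IndexError(f"no page {number}") from None


def fake_pages(count):
    # Stand-in generator; with pdfreader you would pass doc.pages() instead.
    for n in range(1, count + 1):
        yield f"page {n}"


page_six = get_page(fake_pages(15), 6)
page_one = get_page(fake_pages(15), 1)
```

Because the helper consumes the iterator, each lookup needs a fresh one, just as each `islice` call in the tutorial calls `doc.pages()` again.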
Now we can access all page attributes:
>>> page_six.MediaBox
[0, 0, 612, 792]
>>> page_six.Annots[0].Subj
b'Text Box'
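A MediaBox is a rectangle given as lower-left and upper-right corners in default user space units, which are 1/72 inch unless the page overrides them. As a small worked example, the sketch below converts the box shown above to inches; `media_box_size_inches` is a hypothetical helper, and it assumes the default unit applies.

```python
POINTS_PER_INCH = 72  # default PDF user space unit is 1/72 inch


def media_box_size_inches(media_box):
    """Convert [llx, lly, urx, ury] in points to (width, height) in inches."""
    llx, lly, urx, ury = (float(v) for v in media_box)
    return (urx - llx) / POINTS_PER_INCH, (ury - lly) / POINTS_PER_INCH


size = media_box_size_inches([0, 0, 612, 792])  # 8.5 x 11 inches, US Letter
```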
It’s possible to access the parent Pages Tree Node of the page
and all its kids:
>>> page_six.Parent.Type
'Pages'
>>> page_six.Parent.Count
15
>>> len(page_six.Parent.Kids)
15
Our example contains only one Pages Tree Node. That is not always the case.
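When a document has several Pages Tree Nodes, the kids of a node can themselves be Pages nodes, so collecting the leaf pages requires a recursive walk. The sketch below shows the idea on plain dictionaries; the tree, the `Label` key and `iter_leaf_pages` are all hypothetical, chosen only so the example runs without a PDF file.

```python
def iter_leaf_pages(node):
    """Yield leaf Page dictionaries below a Pages tree node, depth first."""
    for kid in node["Kids"]:
        if kid["Type"] == "Pages":
            # Intermediate node: descend into its own kids.
            yield from iter_leaf_pages(kid)
        else:
            yield kid


# Hypothetical two-level tree built from plain dicts for illustration.
tree = {
    "Type": "Pages",
    "Count": 3,
    "Kids": [
        {"Type": "Page", "Label": "p1"},
        {
            "Type": "Pages",
            "Count": 2,
            "Kids": [
                {"Type": "Page", "Label": "p2"},
                {"Type": "Page", "Label": "p3"},
            ],
        },
    ],
}

labels = [page["Label"] for page in iter_leaf_pages(tree)]
```

Note that `Count` on a Pages node is the number of leaf pages below it, not the number of direct kids, which is why the root here has two kids but a count of three.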
For the complete list of Page and Pages attributes see PDF-1.7 specification sections 7.7.3.2-7.7.3.3.
How to start extracting PDF content
It’s possible to extract raw data with a
PDFDocument instance, but it only represents the raw
document structure. It can’t interpret PDF content operators, which is why extracting content this way can be hard. SimplePDFViewer can interpret them:
>>> fd = open(file_name, "rb")
>>> viewer = SimplePDFViewer(fd)
>>> viewer.navigate(8)
>>> viewer.render()
The viewer extracts:
- page images (XObject)
- page inline images (BI/ID/EI operators)
- page forms (XObject)
- decoded page strings (PDF encodings & CMap support)
- human (and robot) readable page markdown - original PDF commands containing decoded strings.
Extracting Page Images
There are 2 kinds of images in PDF documents:
- XObject images
- inline images
>>> len(viewer.canvas.inline_images)
1
>>> fax_image = viewer.canvas.inline_images[0]
>>> fax_image.Filter
'CCITTFaxDecode'
>>> fax_image.Width, fax_image.Height
(1800, 3113)
This would be of little use if you couldn’t see the image itself :-) Now let’s convert it to a Pillow/PIL Image object and save it!
>>> pil_image = fax_image.to_Pillow()
>>> pil_image.save('fax-from-p8.png')
Voila! Enjoy opening it in your favorite editor!
Let’s go to the previous page (#7) and extract some data.
>>> viewer.navigate(7)
Remember, when you navigate to another page the viewer resets the canvas.
>>> viewer.canvas.inline_images == []
True
Let’s render the page and see the texts.
>>> viewer.render()
>>> viewer.canvas.strings
['P', 'E', 'R', 'S', 'O', 'N', 'A', 'L', ... '2', '0', '1', '7']
As you can see, every character comes as an individual string in this page’s content stream, which is unusual.
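When a page delivers one character per string like this, the readable text has to be reassembled by the caller. The simplest possible sketch is a plain join, shown here on the first few strings from the output above; it assumes the characters arrive in reading order.

```python
# First characters of the strings list shown above.
strings = ['P', 'E', 'R', 'S', 'O', 'N', 'A', 'L']

# Concatenate the single-character strings back into a word.
word = "".join(strings)
```

A plain join only works if spaces are emitted as strings too; where word gaps are produced by positioning operators instead, the boundaries would be lost and the page markdown is the better source.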
Let’s go to the very first page:
>>> viewer.navigate(1)
>>> viewer.render()
>>> viewer.canvas.strings
[' ', 'P', 'l', 'a', 'i', 'nt', 'i', 'f', 'f', ... '10/28/2019 1:49 PM', '19CV47031']
PDF markdown is also available.
>>> viewer.canvas.text_content
"\n BT\n0 0 0 rg\n/GS0 gs... ET"
And the strings are decoded properly. Have a look at the file:
>>> with open("tutorial-sample-content-stream-p1.txt", "w") as f:
...     f.write(viewer.canvas.text_content)
19339
pdfreader takes care of decoding binary streams, character encodings, CMap, fonts etc. So finally you have human-readable content sources and markdown.
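Since the markdown keeps the original PDF commands, the decoded strings can also be pulled back out of it with ordinary text processing. The sketch below assumes the strings appear as literal operands of the `Tj` (show text) operator, as in `(Hello) Tj`; the sample content and the `shown_strings` helper are hypothetical, and the pattern deliberately ignores `TJ` arrays and nested parentheses.

```python
import re

# A literal string operand followed by the Tj operator.
# Escaped characters such as \( and \) are allowed inside the string.
TJ_PATTERN = re.compile(r"\(((?:\\.|[^\\()])*)\)\s*Tj")


def shown_strings(text_content):
    """Return the literal strings shown by Tj operators, in order."""
    return TJ_PATTERN.findall(text_content)


# Hypothetical content stream text, not taken from the sample document.
sample = "BT\n/F1 12 Tf\n(Hello) Tj\n0 -14 Td\n(World) Tj\nET"
found = shown_strings(sample)
```

For anything beyond a quick check, `viewer.canvas.strings` is the more direct way to get the decoded strings.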