pdfreader 0.1.4 Documentation¶
Overview¶
pdfreader is a Pythonic API to PDF documents which follows PDF-1.7 specification.
It allows to parse documents, extract texts, images, fonts, CMaps, and other data; access different objects within PDF documents.
Features:
- Extracts texts (plain and formatted)
- Extracts forms data (plain and formatted)
- Extracts images and image masks as Pillow/PIL Images
- Supports all PDF encodings, CMap, predefined cmaps.
- Browse any document objects, resources and extract any data you need (fonts, annotations, metadata, multimedia, etc.)
- Document history access and access to previous document versions if incremental updates are in place.
- Follows PDF-1.7 specification
- Fast document processing due to lazy objects access
- Installing / Upgrading
- Instructions on how to get and install the distribution.
- Tutorial
- A quick overview on how to start.
- Examples and HowTos
- Examples of how to perform specific tasks.
- pdfreader API
- API documentation, organized by module.
Issues, Support and Feature Requests¶
If you’re having trouble, have questions about pdfreader, or need some features the best place to ask is the Github issue tracker. Once you get an answer, it’d be great if you could work it back into this documentation and contribute!
Contributing¶
pdfreader is an open source project. You’re welcome to contribute:
- Code patches
- Bug reports
- Patch reviews
- Introduce new features
- Documentation improvements
pdfreader uses GitHub issues to keep track of bugs, feature requests, etc.
See project sources
About This Documentation¶
This documentation is generated using the Sphinx documentation generator. The source files for the documentation are located in the doc/ directory of the pdfreader distribution. To generate the docs locally run the following command from the root directory of the pdfreader source:
$ python setup.py doc
Table of Contents¶
Installing / Upgrading¶
pdfreader is in the Python Package Index.
Installing with pip¶
We recommend using pip to install pdfreader on all platforms:
$ python -m pip install pdfreader
To get a specific version of pdfreader:
$ python -m pip install pdfreader==0.1.2
To upgrade using pip:
$ python -m pip install --upgrade pdfreader
Installing with easy_install¶
To install with easy_install
from
setuptools do:
$ python -m easy_install pdfreader
Installing from source¶
You can also download the project source and do:
$ git clone git://github.com/maxpmaxp/pdfreader.git pdfreader
$ cd pdfreader/
$ python setup.py install
Python versions support¶
pdfreader supports Python 3.6+. It might work on 3.4 and 3.5 but was never tested.
It is not compatible with Python 2.
Tutorial¶
Have a look at the sample file
.
In this tutorial we will learn simple methods on
- how to open it
- navigate pages
- exract images and texts.
Prerequisites¶
Before we start, let’s make sure that you have the pdfreader distribution installed. In the Python shell, the following should run without raising an exception:
>>> import pdfreader
>>> from pdfreader import PDFDocument, SimplePDFViewer
How to start¶
The first step when working with pdfreader is to create a
PDFDocument
instance from a binary file. Doing so is easy:
>>> fd = open(file_name, "rb")
>>> doc = PDFDocument(fd)
As pdfreader implements lazy PDF reading (it never reads more then you ask from the file), so it’s important to keep the file opened while you are working with the document. Make sure you don’t close it until you’re done.
It is also possible to use a binary file-like object to create an instance, for example:
>>> from io import BytesIO
>>> with open(file_name, "rb") as f:
... stream = BytesIO(f.read())
>>> doc2 = PDFDocument(stream)
Let’s check the PDF version of the document
>>> doc.header.version
'1.6'
Now we can go ahead to the document catalog and walking through pages.
How to access Document Catalog¶
Catalog
(aka Document Root) contains all you need to know to start working with
the document: metadata, reference to pages tree, layout, outlines etc.
>>> doc.root.Type
'Catalog'
>>> doc.root.Metadata.Subtype
'XML'
>>> doc.root.Outlines.First['Title']
b'Start of Document'
For the full list of document root attributes see PDF-1.7 specification section 7.7.2
How to browse document pages¶
There is a generator pages()
to browse the pages one by one.
It yields Page
instances.
>>> page_one = next(doc.pages())
You may read all the pages at once
>>> all_pages = [p for p in doc.pages()]
>>> len(all_pages)
15
Now we know how many pages are there!
You may wish to get some specific page if your document contains hundreds and thousands. Doing this is just a little bit trickier. To get the 6th page you need to walk through the previous five.
>>> from itertools import islice
>>> page_six = next(islice(doc.pages(), 5, 6))
>>> page_five = next(islice(doc.pages(), 4, 5))
Don’t forget, that all PDF viewers start page numbering from 1, however Python lists start their indexes from 0.
>>> page_eight = all_pages[7]
Now we can access all page attributes:
>>> page_six.MediaBox
[0, 0, 612, 792]
>>> page_six.Annots[0].Subj
b'Text Box'
It’s possible to access parent Pages Tree Node for the page, which is PageTreeNode
instance, and all it’s kids:
>>> page_six.Parent.Type
'Pages'
>>> page_six.Parent.Count
15
>>> len(page_six.Parent.Kids)
15
Our example contains the only one Pages Tree Node. That is not always true.
For the complete list Page and Pages attributes see PDF-1.7 specification sections 7.7.3.2-7.7.3.3
How to start extracting PDF content¶
It’s possible to extract raw data with PDFDocument
instance but it just represents raw
document structure. It can’t interpret PDF content operators, that’s why it might be hard.
Fortunately there is SimplePDFViewer
, which understands a lot.
It is a simple PDF interpreter which can “display” (whatever this means)
a page on SimpleCanvas
.
>>> fd = open(file_name, "rb")
>>> viewer = SimplePDFViewer(fd)
The viewer instance gets content you see in your Adobe Acrobat Reader. Just navigate a page with
navigate()
and call render()
>>> viewer.navigate(8)
>>> viewer.render()
- The viewer extracts:
- page images (XObject)
- page inline images (BI/ID/EI operators)
- page forms (XObject)
- decoded page strings (PDF encodings & CMap support)
- human (and robot) readable page markdown - original PDF commands containing decoded strings.
Extracting Page Images¶
- There are 2 kinds of images in PDF documents:
- XObject images
- inline images
Every one is represented by its own class
(Image
and InlineImage
)
Let’s extract some pictures now! They are accessible through canvas
attribute. Have a look at page 8
of the sample document. It contains a fax message, and is is available
on inline_images
list.
>>> len(viewer.canvas.inline_images)
1
>>> fax_image = viewer.canvas.inline_images[0]
>>> fax_image.Filter
'CCITTFaxDecode'
>>> fax_image.Width, fax_image.Height
(1800, 3113)
This would be nothing if you can’t see the image itself :-) Now let’s convert it to a Pillow/PIL Image object and save!
>>> pil_image = fax_image.to_Pillow()
>>> pil_image.save('fax-from-p8.png')
Voila! Enjoy opening it in your favorite editor!
Check the complete list of Image (sec. 8.9.5) and InlineImage (sec. 8.9.7) attributes.
Extracting texts¶
Getting texts from a page is super easy. They are available on strings
and
text_content
attributes.
Let’s go to the previous page (#7) and extract some data.
>>> viewer.prev()
Remember, when you navigate another page the viewer resets the canvas.
>>> viewer.canvas.inline_images == []
True
- Let’s render the page and see the texts.
- Decoded plain text strings are on
strings
(by pieces and in order they come on the page) - Decoded strings with PDF markdown are on
text_content
- Decoded plain text strings are on
>>> viewer.render()
>>> viewer.canvas.strings
['P', 'E', 'R', 'S', 'O', 'N', 'A', 'L', ... '2', '0', '1', '7']
As you see every character comes as an individual string in the page content stream here. Which is not usual.
Let’s go to the very first page
>>> viewer.navigate(1)
>>> viewer.render()
>>> viewer.canvas.strings
[' ', 'P', 'l', 'a', 'i', 'nt', 'i', 'f', 'f', ... '10/28/2019 1:49 PM', '19CV47031']
PDF markdown is also available.
>>> viewer.canvas.text_content
"\n BT\n0 0 0 rg\n/GS0 gs... ET"
And the strings are decoded properly. Have a look at
the file
:
>>> with open("tutorial-sample-content-stream-p1.txt", "w") as f:
... f.write(viewer.canvas.text_content)
19339
pdfreader takes care of decoding binary streams, character encodings, CMap, fonts etc. So finally you have human-readable content sources and markdown.
Examples and HowTos¶
Advanced PDF data extraction techniques with real-life examples.
PDFDocument vs. SimplePDFViewer
What is the difference? The usecases.
How to extract XObject or Inline Images, Image Masks
Instructions on how to extract different image types for further manipulations.
Advanced text objects access methods for further parsing.
Instructions on how to extract text data from PDF Forms.
How to extract Font data from PDF
It’s possible to extract an embedded font. Let’s read how to do that.
How to extract CMap for a font from PDF
What if you need to see font’s CMap?
Instructions on how to navigate PDF documents and access it’s objects. Advanced techniques.
PDFDocument vs. SimplePDFViewer¶
- pdfreader provides 2 different interfaces for PDFs:
What is the difference?
PDFDocument
:- knows nothing about interpretation of content-level PDF operators
- knows all about PDF file and document structure (types, objects, indirect objects, references etc.)
- can be used to access any document object: XRef table, DocumentCatalog, page tree nodes (aka Pages), binary streams like Font, CMap, Form, Page etc.
- can be used to access raw objects content (raw page content stream for example)
- has no graphical state
SimplePDFViewer
:- uses
PDFDocument
as document navigation engine - can render document content properly decoding it and interpreting PDF operators
- has graphical state
- uses
Use PDFDocument
to navigate document and access raw data.
Use SimplePDFViewer
to extract content you see in your favorite viewer
(Adobe Acrobat Reader, hehe :-).
Let’s see several usecases.
How to extract XObject or Inline Images, Image Masks¶
Extracting Inline Images is discussed in tutorial Extracting Page Images, so let’s focus on XObject Images and Image Masks.
Extracting XObject Image¶
Open a sample document.
>>> from pdfreader import PDFDocument
>>> fd = open(file_name, "rb")
>>> doc = PDFDocument(fd)
Have a look at the sample file sample file
.
There is a logo on the first page. Let’s extract it.
>>> page = next(doc.pages())
Let’s check a dictionary of XObject resources for the page:
>>> page.Resources.XObject
{'img0': <IndirectReference:n=11,g=0>}
This stands for an XObject named img0, and referenced under number 11 and generation 0. The object has not been read by pdfreader still. We are lazy readers. We read objects only when we need them. Let’s see what the object is.
>>> xobj = page.Resources.XObject['img0']
We just read the object (__getitem__ does this implicitly) and now we may access its attributes.
>>> xobj.Type, xobj.Subtype
('XObject', 'Image')
Wow! It’s really an image. Should we care about it’s internal PDF representation? Of course no, let’s just convert it to a Pillow/PIL Image and save.
>>> pil_image = xobj.to_Pillow()
>>> pil_image.save("extract-logo.png")
And here we are!

Try to open it and see any differences. It’s absolutely the same as in the document.
Now you can manipulate pil_image with usual PIL methods: rotate, convert, blur, split, inverse, merge and so on, so on, so on.
Extracting Images: a very simple way¶
A very simple way also exisits.
Use SimplePDFViewer
:
>>> from pdfreader import SimplePDFViewer
>>> fd = open(file_name, "rb")
>>> viewer = SimplePDFViewer(fd)
>>> viewer.render()
After rendering all 1st page images are on the canvas
>>> all_page_images = viewer.canvas.images
>>> all_page_inline_images = viewer.canvas.inline_images
>>> img = all_page_images['img0']
>>> img.Type, img.Subtype
('XObject', 'Image')
Now you can convert it with magic to_Pillow()
method, save or do whatever you want!
Extracting Image Masks¶
Image Mask is just a specific kind of image actually. Except it is not always visible directly in your PDF Viewer. Nevertheless it can be accessed absolutely the same way.
Let’s have a look at the example
from Extracting Page Images,
and see what image masks it contains.
>>> from pdfreader import SimplePDFViewer
>>> fd = open(pdf_file_name, "rb")
>>> viewer = SimplePDFViewer(fd)
We use Image.ImageMask attribute to filter image masks from another images. Let’s go to the 5th page and take the first image mask:
>>> viewer.navigate(5)
>>> viewer.render()
>>> inline_images = viewer.canvas.inline_images
>>> image_mask = next(img for img in inline_images if img.ImageMask)
Now convert it to Pillow object and save:
>>> pil_img = image_mask.to_Pillow()
>>> pil_img.save("mask.png")
Have a look! What a beautiful QR-code!

Useful links¶
- You find the complete list of PDF image attributes in the specification:
How to parse PDF texts¶
Simple ways of getting plain texts and formatted texts from documents are discussed in the tutorial Extracting texts, so let’s focus on advanced techniques.
In this example we build a parser for traffic crash reports
,
that extracts:
- local report number
- reporting agency name
- crash severity
from the first page. The parser can be applied to all crash reports like that.

Let’s open the document and render the first page:
>>> from pdfreader import SimplePDFViewer
>>> fd = open(file_name, "rb")
>>> viewer = SimplePDFViewer(fd)
>>> viewer.render()
Every PDF page has one or more binary content streams associated with it. Streams may contain inline images, text blocks, text formatting instructions, display device operators etc. In this example we stay focused on text blocks.
Every text block in a stream is surrounded by BT/ET instructions and usually tricky encoded. Fortunately the viewer understands lot of PDF operators and encoding methods, so after rendering we may access human-readable PDF markup containing decoded strings.
>>> markdown = viewer.canvas.text_content
>>> markdown
"... BT\n/F3 6.0 Tf\n0 0 0 rg\n314.172 TL\n168.624 759.384 Td\n(LOCAL INFORMATION) Tj\n ..."
This text block contains instructions for a viewer (font, positioning etc.) and one string surrounded by brackets.
>>> viewer.canvas.strings
['LOCAL INFORMATION', 'P19010300000457', ...]
Text-related SimpleCanvas
attributes are:
text_content
- contains all data within a single BT/ET block: commands and text strings. All text strings are surrounded by brackets and decoded according to the current graphical state (q, Q, gs, Tf and few other commands). The value can be used to parse text content by PDF markdown.strings
- list of all strings as they come in text blocks. Just decoded plain text. No PDF markdown here.
How to parse PDF markdown¶
At this point markdown contains all texts with PDF markdown from the page.
>>> isinstance(markdown, str)
True
Let’s save it as a text file and analyze how can we extract the data we need.
>>> with open("example-crash-markdown.txt", "w") as f:
... f.write(markdown)
52643
Open your favorite editor and have a look at the file
.
Now we may use any text processing tools like regular expressions, grep, custom parsers to extract the data.
>>> reporting_agency = markdown.split('(REPORTING AGENCY NAME *)', 1)[1].split('(', 1)[1].split(')',1)[0]
>>> reporting_agency
'Ohio State Highway Patrol'
>>> local_report_number = markdown.split('(LOCAL REPORT NUMBER *)', 1)[1].split('(', 1)[1].split(')',1)[0]
>>> local_report_number
'02-0005-02'
>>> crash_severity = markdown.split('( ERROR)', 1)[1].split('(', 1)[1].split(')',1)[0]
>>> crash_severity
'1'
Here we are!
Useful links¶
- Detailed description of PDF texts is here (see sec. 9)
- Conforming reader graphical state reading is here (see sec. 8.4)
How to parse PDF Forms¶
In most cases texts come within page binary content streams and can be extracted as in Extracting texts and How to parse PDF texts.
There is one more place where text data can be found: page forms. Form is a special subtype of XObject which is a part of page resources, and can be referenced from page by do command.
You may think of Form as of “small subpage” that is stored aside main content.
Have a look at one PDF form
.
Let’s open the document and get the 1st page.
>>> from pdfreader import SimplePDFViewer
>>> fd = open(file_name, "rb")
>>> viewer = SimplePDFViewer(fd)
And now, let’s try to locate a string, located under the section B.3 SOC (ONET/OES) occupation title

>>> viewer.render()
>>> plain_text = "".join(viewer.canvas.strings)
>>> "Farmworkers and Laborers" in plain_text
False
Apparently, the texts typed into the form are in some other place. They are in Form XObjects, listed under page resources. The viewer puts them on canvas:
>>> sorted(list(viewer.canvas.forms.keys()))
['Fm1', 'Fm10', 'Fm11', 'Fm12', 'Fm13', 'Fm14',...]
As Form is a kind of “sub-document” every entry in viewer.canvas.forms dictionary maps to
SimpleCanvas
instance:
>>> form9_canvas = viewer.canvas.forms['Fm9']
>>> "".join(form9_canvas.strings)
'Farmworkers and Laborers, Crop, Nursery, and Greenhouse'
Here we are!
More on PDF Form objects: see sec. 8.10
How to extract CMap for a font from PDF¶
In this example we extract CMap data for a font from PDF file.
CMaps (Character Maps) are text files used in PDF to map character codes to character glyphs in CID fonts. They come to PDF from PostScript.
Let’s open a sample document
.
>>> from pdfreader import PDFDocument
>>> fd = open(file_name, "rb")
>>> doc = PDFDocument(fd)
Now let’s navigate to the 3rd page:
>>> from itertools import islice
>>> page = next(islice(doc.pages(), 2, 3))
and check page’s fonts.
>>> page.Resources.Font
{'R11': <IndirectReference:n=153,g=0>, ... 'R9': <IndirectReference:n=152,g=0>}
>>> len(page.Resources.Font)
9
We see 9 different font resources. As pdfreader is a lazy reader the font data has not been read yet. We just see the names and the references to the objects.
Let’s have a look at font named R26.
>>> font = page.Resources.Font['R26']
>>> font.Subtype, bool(font.ToUnicode)
('Type1', True)
It is PostScript Type1 font, and texts use CMap provided by ToUnicode attribute. Font’s ToUnicode attribute contains a reference to the CMap file data stream:
>>> cmap = font.ToUnicode
Cmap file is a StreamBasedObject
instance containing flate encoded binary stream.
>>> type(cmap)
<class 'pdfreader.types.objects.StreamBasedObject'>
>>> cmap.Filter
'FlateDecode'
that can be decoded by accessing filtered
:
>>> data = cmap.filtered
>>> data
b'/CIDInit /ProcSet findresource ... end\n'
>>> with open("sample-cmap.txt", "wb") as f:
... f.write(data)
229
Voila! 229 bytes written :-)
As it is a text file you can open it
with your favorite text editor.
How to extract Font data from PDF¶
In this example we extract font data from a PDF file.
Let’s open a sample document
.
>>> from pdfreader import PDFDocument
>>> fd = open(file_name, "rb")
>>> doc = PDFDocument(fd)
Now let’s see what fonts the very first page uses:
>>> page = next(doc.pages())
>>> sorted(page.Resources.Font.keys())
['T1_0', 'T1_1', 'T1_2', 'TT0', 'TT1']
We see 5 fonts named T1_0, T1_1, T1_2, TT0 and TT1. As pdfreader is a lazy reader the font data has not been read yet. We just have the names and the references to the objects.
Let’s have a look at font T1_0.
>>> font = page.Resources.Font['T1_0']
>>> font.Subtype, font.BaseFont, font.Encoding
('Type1', 'SCMYNU+TimesNewRomanPSMT', 'WinAnsiEncoding')
It is PostScript Type1 font, based on TimesNewRomanPSMT. Texts use WinAnsiEncoding, which is almost like python’s cp1252.
Font’s FontDescriptor contains a reference to the font file data stream:
>>> font_file = font.FontDescriptor.FontFile
The font file is a flate encoded binary stream StreamBasedObject
>>> type(font_file)
<class 'pdfreader.types.objects.StreamBasedObject'>
>>> font_file.Filter
['FlateDecode']
which can be decoded by accessing filtered
>>> data = font_file.filtered
>>> with open("sample-font.type1", "wb") as f:
... f.write(data)
16831
Voila! 16831 bytes written :-)
How to browse PDF objects¶
There could be a reason when you need to access raw PDF objects as they are in the document.
Or even get an object by its number and generation, which is also possible.
Let’s see several examples with PDFDocument
.
Accessing document objects¶
Let’s take a sample file from How to access Document Catalog tutorial. We already discussed there how to locate document catalog.
>>> from pdfreader import PDFDocument
>>> fd = open(file_name, "rb")
>>> doc = PDFDocument(fd)
>>> catalog = doc.root
To walk through the document you need to know object attributes and possible values. It can be found on PDF-1.7 specification. Then simply use attribute names in your python code.
>>> catalog.Type
'Catalog'
>>> catalog.Metadata.Type
'Metadata'
>>> catalog.Metadata.Subtype
'XML'
>>> pages_tree_root = catalog.Pages
>>> pages_tree_root.Type
'Pages'
Attribute names are cases sensitive. Missing or non-existing attributes have value of None
>>> catalog.type is None
True
>>> catalog.Metadata.subType is None
True
>>> catalog.Metadata.UnkNown_AttriBute is None
True
If object is an array, access its items by index:
>>> first_page = pages_tree_root.Kids[0]
>>> first_page.Type
'Page'
>>> first_page.Contents.Length
3890
If object is a stream, you can get either raw data (deflated in this example):
>>> raw_data = first_page.Contents.stream
>>> first_page.Contents.Length == len(raw_data)
True
>>> first_page.Contents.Filter
'FlateDecode'
or decoded content:
>>> decoded_content = first_page.Contents.filtered
>>> len(decoded_content)
18428
>>> decoded_content.startswith(b'BT\n0 0 0 rg\n/GS0 gs')
True
All object reads are lazy. pdfreader reads an object when you access it for the first time.
Locate objects by number and generation¶
On the file structure level all objects have unique number an generation to identify them. To get an object by number and generation (for example to track object changes if incremental updates took place on file), just run:
>>> num, gen = 2, 0
>>> raw_obj = doc.locate_object(num, gen)
>>> obj = doc.build(raw_obj)
>>> obj.Type
'Catalog'
pdfreader API¶
pdfreader.document submodule¶
-
class
pdfreader.document.
PDFDocument
(fobj)¶ Constructor method
-
root
= None¶ references to document’s Catalog instance
-
header
= None¶ contains PDF file header data
-
trailer
= None¶ contains PDF file trailer data
-
build
(obj, visited=None, lazy=True)¶ Resolves all indirect references for the object.
Parameters: - obj (one of supported PDF types) – an object from the document
- lazy (bool) – don’t resolve subsequent indirect references if True (default).
- visited – Shouldn’t be used. Internal param containing already resolved objects to not fall into infinite loops
-
locate_object
(num, gen)¶
-
pdfreader.viewer submodule¶
- class
pdfreader.viewer.
SimplePDFViewer
(*args, **kwargs)¶
- Simple PDF document interpreter (viewer).
- uses
PDFDocument
as document navigation engine- renders document page content onto
SimpleCanvas
- has graphical state
On initialization automatically navigates to the 1st page.
Parameters: fobj – file-like object: binary file descriptor, BytesIO stream etc.
current_page_number
¶Contains current page number
gss
¶Reflects current graphical state.
GraphicsStateStack
instance.
canvas
¶Current page canvas -
SimpleCanvas
instance
render
()¶Renders current page onto current canvas by interpreting content stream(s) commands. Charnges: graphical state, canvas.
Navigates viewer to n-th page of the document. Side-effects: clears canvas, resets page resources, resets graphics state
Parameters: n – page number. The very first page has number 1 Raises: PageDoesNotExist – if there is no n-th page
next
()¶Navigates viewer to the next page of the document. Side-effects: clears canvas, resets page resources, resets graphics state
Raises: PageDoesNotExist – if there is no next page
prev
()¶Navigates viewer to the previous page of the document. Side-effects: clears canvas, resets page resources, resets graphics state
Raises: PageDoesNotExist – if there is no previous page
- class
pdfreader.viewer.
SimpleCanvas
¶Very simple canvas for PDF viewer: can contain page images (inline and XObject), strings, forms and text content.
text_content
¶Shall be a meaningful string representation of page content for further usage (decoded strings + markdown for example)
strings
¶Shall be list of
InlineImage
objects as they appear on page stream (BI/ID/EI operators)
images
¶Shall be dict of name ->
SimpleCanvas
built from Form XObjects displayed with do command
inline_images
¶
forms
¶
- class
pdfreader.viewer.
GraphicsState
(**kwargs)¶Viewer’s graphics state. See PDF 1.7 specification
sec. 9.3 - Text State Parameters and Operators
Parameters: kwargs – dict of attributes to set
CTM
¶current transformation matrix
LW
¶line width
LC
¶line cap
LJ
¶line join style
ML
¶miter limit
D
¶line dash
RI
¶color rendering intent
I
¶flatness tolerance
Font
[font_name, font_size]¶shall be a list if exists - [font_name, font_size] (Tf operator)
Tc
¶char spacing
Tw
¶word spacing
Tz
¶horizontlal scaling
TL
¶text leading
Tr
¶text rendering mode
Ts
¶text rise
- class
pdfreader.viewer.
GraphicsStateStack
¶Graphics state stack. See PDF 1.7 specification sec. 8.4.2 - Graphics State Stack
save_state
()¶Copies current state and puts it on the top
restore_state
()¶Restore previously saved state from the top
- class
pdfreader.viewer.
Resources
(**kwargs)¶Page resources. See sec 7.8.3 Resource Dictionaries
- class
pdfreader.viewer.
PageDoesNotExist
¶Exception. Supposed to be raised by PDF viewers on navigation to non-existing pages.
pdfreader.types submodule¶
- class
pdfreader.types.objects.
DictBasedObject
(doc, *args, **kwargs)¶Dictionary-based object. Automatically resolves indirect references on attributes/items access
- class
pdfreader.types.objects.
StreamBasedObject
(doc, stream)¶Stream-based object. Automatically resolves indirect references on attributes access
- class
pdfreader.types.objects.
ArrayBasedObject
(doc, lst)¶Array-based object. Automatically resolves indirect references on items access
- class
pdfreader.types.objects.
Catalog
(doc, *args, **kwargs)¶Dictionary based object. (Type = Catalog) See PDF 1.7 specification sec. 7.7.2 - DocumentCatalog
- class
pdfreader.types.objects.
PageTreeNode
(doc, *args, **kwargs)¶Dictionary based object. (Type = Pages) See PDF 1.7 specification sec. 7.7.3.2 - Page Tree Nodes
- class
pdfreader.types.objects.
Page
(doc, *args, **kwargs)¶Dictionary based Page object. (Type = Page) See PDF 1.7 specification sec. 7.7.3.3 - Page Objects
- class
pdfreader.types.objects.
Image
(doc, stream)¶Stream based XObject object. (Type = XObject, Subtype = Image) See PDF 1.7 specification sec. 8.9 - Images
to_Pillow
()¶Converts image into PIL.Image object.
Returns: PIL.Image instance
- class
pdfreader.types.objects.
Form
(doc, stream)¶Stream based XObject object. (Type = XObject, Subtype = Form) See PDF 1.7 specification sec. 8.10 - Form XObjects
- class
pdfreader.types.objects.
XObject
(doc, stream)¶Stream based XObject object. (Type = XObject) See PDF 1.7 specification sec. 8.8 - External Objects
- class
pdfreader.types.content.
InlineImage
(entries, data)¶BI/ID/EI operators content.
Inline image looks like a stream-based object but really it is not. We just follow Stream interface to have an option to interact with InlineImage the same way as with XObject/Image
dictionary
¶key-value image properties
data
¶bytes, encoded image stream
to_Pillow
()¶Converts image into PIL.Image object.
Returns: PIL.Image instance
-
pdfreader.
version
= '0.1.4'¶ package version
PDF data extraction, browsing objects:
-
pdfreader.
PDFDocument
¶ Alias for
pdfreader.document.PDFDocument
-
pdfreader.
SimplePDFViewer
¶ Alias for
pdfreader.viewer.SimplePDFViewer
Major classes used by pdfreader.viewer.SimplePDFViewer
Major classes and types:
Objects you face up in page/form content streams: