pdfreader 0.1.15 Documentation


pdfreader is a Pythonic API for PDF documents that follows the PDF-1.7 specification.

It lets you parse documents; extract text, images, fonts, CMaps, and other data; and access other objects within PDF documents.


  • Extracts text (plain and formatted)

  • Extracts form data (plain and formatted)

  • Extracts images and image masks as Pillow/PIL Images

  • Supports all PDF encodings, CMaps, and predefined CMaps

  • Lets you browse any document objects and resources and extract the data you need (fonts, annotations, metadata, multimedia, etc.)

  • Provides access to document history and to previous document versions when incremental updates are in place

  • Follows PDF-1.7 specification

  • Processes documents quickly thanks to lazy object access
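
As a rough illustration of the text extraction feature listed above, the sketch below reads the strings found on a single page. It assumes that the SimplePDFViewer class and its navigate(), render(), and canvas.strings members behave as described in the pdfreader tutorial. The helper name extract_page_strings is made up for this example and is not part of the library.

```python
def extract_page_strings(path, page_number=1):
    """Return the text strings found on one page of a PDF file.

    Hypothetical helper for illustration only; it is not part of pdfreader.
    """
    # Imported inside the function so that defining the helper does not
    # require pdfreader to be installed.
    from pdfreader import SimplePDFViewer

    with open(path, "rb") as fd:
        viewer = SimplePDFViewer(fd)
        viewer.navigate(page_number)  # move the viewer to the requested page
        viewer.render()               # interpret the page's content stream
        # canvas.strings holds the decoded text strings of the rendered page
        return list(viewer.canvas.strings)
```

Calling extract_page_strings("example.pdf", 2), where example.pdf stands for any PDF file you have at hand, should return the strings of the second page as a list. Rendering one page at a time keeps the work proportional to what you actually read.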

Installing / Upgrading

Instructions on how to get and install the distribution.


A quick overview of how to get started.

Examples and HowTos

Examples of how to perform specific tasks.

pdfreader API

API documentation, organized by module.

Issues, Support and Feature Requests

If you’re having trouble, have questions about pdfreader, or need a feature, the best place to ask is the GitHub issue tracker. Once you get an answer, it’d be great if you could work it back into this documentation and contribute!


pdfreader is an open source project. You’re welcome to contribute:

  • Code patches

  • Bug reports

  • Patch reviews

  • New features

  • Documentation improvements

pdfreader uses GitHub issues to keep track of bugs, feature requests, etc.

See project sources


If this project is helpful, you can treat me to coffee :-)


About This Documentation

This documentation is generated using the Sphinx documentation generator. The source files for the documentation are located in the doc/ directory of the pdfreader distribution. To generate the docs locally, run the following command from the root directory of the pdfreader source:

$ python setup.py doc

Table of Contents