pdfreader 0.1.4 Documentation
pdfreader is a Pythonic API to PDF documents that follows the PDF-1.7 specification.
It lets you parse documents, extract text, images, fonts, CMaps, and other data, and access the individual objects within a PDF document.
- Extracts text (plain and formatted)
- Extracts form data (plain and formatted)
- Extracts images and image masks as Pillow/PIL Images
- Supports all PDF encodings, CMaps, and predefined CMaps
- Browses any document objects and resources, and extracts any data you need (fonts, annotations, metadata, multimedia, etc.)
- Provides access to document history and to previous document versions when incremental updates are in place
- Follows the PDF-1.7 specification
- Processes documents quickly thanks to lazy object access
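As a rough illustration of the text-extraction feature listed above, the sketch below pulls the plain-text strings from a single page. It assumes the SimplePDFViewer interface (navigate(), render(), and canvas.strings) as described in the tutorial pages of this documentation; the helper names page_strings and read_first_page, and the path argument, are hypothetical and not part of pdfreader.

```python
def page_strings(viewer, page_number=1):
    """Return the plain-text strings decoded from one page.

    `viewer` is assumed to behave like pdfreader's SimplePDFViewer:
    navigate(n) selects a page, render() interprets its content stream,
    and canvas.strings then holds the decoded strings.
    """
    viewer.navigate(page_number)
    viewer.render()
    return list(viewer.canvas.strings)


def read_first_page(path):
    """Open the PDF at `path` (hypothetical) and return its first page's strings."""
    # Imported lazily so page_strings stays usable without pdfreader installed.
    from pdfreader import SimplePDFViewer

    with open(path, "rb") as fd:
        return page_strings(SimplePDFViewer(fd), page_number=1)
```

Keeping the page logic separate from file handling makes it easy to reuse one viewer across several pages instead of reopening the file for each one.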
Issues, Support and Feature Requests
If you’re having trouble, have questions about pdfreader, or need a feature, the best place to ask is the GitHub issue tracker. Once you get an answer, it’d be great if you could work it back into this documentation and contribute!
pdfreader is an open source project. You’re welcome to contribute:
- Code patches
- Bug reports
- Patch reviews
- New features
- Documentation improvements
pdfreader uses GitHub issues to keep track of bugs, feature requests, etc.
See project sources
If this project is helpful, you can treat me to coffee :-)
About This Documentation
This documentation is generated using the Sphinx documentation generator. The source files for the documentation are located in the doc/ directory of the pdfreader distribution. To generate the docs locally, run the following command from the root directory of the pdfreader source:
$ python setup.py doc
Table of Contents
- Installing / Upgrading
- Examples and HowTos
- PDFDocument vs. SimplePDFViewer
- How to extract XObject or Inline Images, Image Masks
- How to parse PDF texts
- How to parse PDF Forms
- How to extract CMap for a font from PDF
- How to extract Font data from PDF
- How to browse PDF objects
- pdfreader API