Welcome to OCR utils’s documentation!

OCR utils

Python tools for interacting with Tesseract


Features

  • Detects tables in PDFs and images and performs OCR on each cell

  • Performs OCR on a PDF and generates an SVG image

Quick Start

from ocr_utils import pdf_to_svg

pdf_to_svg(
    input_filename='in.pdf',
    output_filename='out.svg',
    detect_tables=True,
    lang='eng',
)
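To run the same conversion over several files, a small wrapper can be enough. The sketch below is not part of the library: `svg_path_for` and `convert_all` are hypothetical helper names, and the converter is passed in as an argument so that the keyword arguments shown in the Quick Start (`input_filename`, `output_filename`, `detect_tables`, `lang`) are the only assumptions made about `pdf_to_svg`.

```python
from pathlib import Path
from typing import Callable, Iterable, List


def svg_path_for(pdf_path: str) -> str:
    """Return the given path with its extension replaced by .svg."""
    return str(Path(pdf_path).with_suffix(".svg"))


def convert_all(
    pdf_paths: Iterable[str],
    convert: Callable[..., object],
    lang: str = "eng",
) -> List[str]:
    """Call `convert` once per PDF and return the output paths.

    `convert` is expected to accept the same keyword arguments as the
    Quick Start call to pdf_to_svg (an assumption of this sketch).
    """
    outputs = []
    for pdf_path in pdf_paths:
        output = svg_path_for(pdf_path)
        convert(
            input_filename=pdf_path,
            output_filename=output,
            detect_tables=True,
            lang=lang,
        )
        outputs.append(output)
    return outputs
```

With the package installed, this would be called as `convert_all(['a.pdf', 'b.pdf'], pdf_to_svg)`. Passing the converter in keeps the helper easy to test without Tesseract available.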

Execution example

[Figure: input PDF]

[Figure: output SVG]

Installation

Stable Release: pip install tesseract_ocr_utils
Development Head: pip install git+https://github.com/envinorma/ocr_utils.git

This library is built upon pytesseract and pdf2image, which have non-pip requirements. See each library's installation page for instructions on installing these dependencies.

For example, on Ubuntu, the following packages need to be installed:

apt-get install libarchive13
apt-get install tesseract-ocr
apt-get install poppler-utils
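After installing the system packages, it can help to confirm that the command-line tools are actually on the `PATH` before running any OCR. The following is a minimal sketch using only the standard library. `find_missing_tools` is a hypothetical helper, and the default names assume that `tesseract-ocr` provides the `tesseract` binary and `poppler-utils` provides `pdftoppm`, as they do on Ubuntu.

```python
import shutil
from typing import Iterable, List


def find_missing_tools(
    tools: Iterable[str] = ("tesseract", "pdftoppm"),
) -> List[str]:
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("All system dependencies found.")
```

An empty list means both binaries were found; anything else names the tool that still needs to be installed.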

Documentation

For full package documentation please visit envinorma.github.io/ocr_utils.

Development

See CONTRIBUTING.md for information related to developing the code.

MIT license
