Welcome to OCR utils’s documentation!

OCR utils

Python tools for interacting with Tesseract

Features

Detects tables in PDFs and images and runs OCR on each cell
Runs OCR on a PDF and generates an SVG image

Quick Start

from ocr_utils import pdf_to_svg

pdf_to_svg(
    input_filename='in.pdf',
    output_filename='out.svg',
    detect_tables=True,
    lang='eng',
)
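The same call can be applied to a whole folder of PDFs. The following is a minimal sketch, not part of the library: `convert_directory` is a hypothetical helper, and it assumes only that the converter accepts the keyword arguments shown in the Quick Start above. The converter is passed in as an argument so the helper does not rely on anything beyond that call.

```python
from pathlib import Path
from typing import Callable, List, Union

PathLike = Union[str, Path]


def convert_directory(
    input_dir: PathLike,
    output_dir: PathLike,
    convert: Callable[..., object],
    lang: str = "eng",
    detect_tables: bool = True,
) -> List[Path]:
    """Call `convert` once per *.pdf file in input_dir (hypothetical helper).

    `convert` is assumed to take the same keyword arguments as the
    Quick Start call: input_filename, output_filename, detect_tables, lang.
    Returns the list of output paths that were requested, in sorted order.
    """
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written: List[Path] = []
    for pdf_path in sorted(Path(input_dir).glob("*.pdf")):
        # Keep the file name, swap the extension for .svg.
        target = out_dir / (pdf_path.stem + ".svg")
        convert(
            input_filename=str(pdf_path),
            output_filename=str(target),
            detect_tables=detect_tables,
            lang=lang,
        )
        written.append(target)
    return written
```

With the package installed, this would be used as `convert_directory('scans', 'svgs', pdf_to_svg)`, where `scans` and `svgs` are placeholder directory names.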

Execution example

[Figures: input PDF and the corresponding output SVG]

Installation

Stable Release: pip install tesseract_ocr_utils
Development Head: pip install git+https://github.com/envinorma/ocr_utils.git

This library is built on pytesseract and pdf2image, both of which have non-pip requirements. See each library's installation page for how to install those dependencies.

For example, on Ubuntu, the following packages need to be installed:

apt-get install libarchive13
apt-get install tesseract-ocr
apt-get install poppler-utils
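
Before running the library, it can help to confirm that the system tools are reachable. The following is a minimal sketch, not part of the package: `missing_binaries` is a hypothetical helper, and the default names assume the usual executables, `tesseract` (from tesseract-ocr, called by pytesseract) and `pdftoppm` (from poppler-utils, called by pdf2image).

```python
import shutil
from typing import Iterable, List


def missing_binaries(names: Iterable[str] = ("tesseract", "pdftoppm")) -> List[str]:
    """Return the executables from `names` that are not found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Report which of the assumed system dependencies are unavailable.
missing = missing_binaries()
if missing:
    print("Missing system dependencies: " + ", ".join(missing))
else:
    print("All required system binaries were found on PATH.")
```

An empty result only shows that the executables are on PATH; it does not verify that the Tesseract language data for a given `lang` is installed.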

Documentation

For full package documentation please visit envinorma.github.io/ocr_utils.

Development

See CONTRIBUTING.md for information related to developing the code.

MIT license