ocr_utils package

Submodules

ocr_utils.alto_to_svg module

class ocr_utils.alto_to_svg.FontSize(default: int, guess: bool, max_value: int)

Bases: object

default: int

guess: bool

max_value: int

class ocr_utils.alto_to_svg.ForeignObject(text: str, x: int, y: int, width: int, height: int, **extra)

Bases: svgwrite.base.BaseElement

Parameters

extra – extra SVG attributes (keyword arguments):

add a trailing ‘_’ to reserved keywords: 'class_', 'from_'
replace inner ‘-’ with ‘_’: 'stroke_width'

SVG attribute names are checked if debug is True.

Workaround for the attribs parameter removed in version 0.2.2:

# replace
element = BaseElement(attribs=adict)

# by
element = BaseElement()
element.update(adict)

elementname = 'foreignObject'

get_xml()

Get the XML representation as ElementTree object.

Returns: XML ElementTree of this object and all its subelements

class ocr_utils.alto_to_svg.Text(content: str, hpos: float, vpos: float)

Bases: object

content: str

hpos: float

vpos: float

ocr_utils.alto_to_svg.alto_pages_and_cells_to_svg(alto_xml_strings: List[str], pages_cells: List[List[ocr_utils.table.DetectedCell]], default_font_size: int = 40, guess_font_size: bool = True, max_font_size: int = 50) → svgwrite.drawing.Drawing

Generates an SVG image made of concatenated pages from ALTO XML strings and detected table cells.

Parameters

alto_xml_strings – ALTO XML strings
pages_cells – detected cells on each page
default_font_size (int) – font size used in the output SVG
guess_font_size (bool) – if True, the font size is deduced automatically from the block width when possible (to handle varying font sizes)
max_font_size (int) – upper bound on the font size when guess_font_size is True (to avoid huge fonts in edge cases)

Returns

SVG drawing; it can be written to a file with its saveas method

Return type

svgwrite.Drawing

ocr_utils.alto_to_svg.alto_to_svg(input_filename: str, output_filename: str, default_font_size: int = 40, guess_font_size: bool = True, max_font_size: int = 50) → None

Loads an ALTO XML file and generates an SVG image made of concatenated pages.

Parameters

input_filename (str) – Path of the ALTO XML file
output_filename (str) – Path of the output SVG image
default_font_size (int) – font size used in the output SVG
guess_font_size (bool) – if True, the font size is deduced automatically from the block width when possible (to handle varying font sizes)
max_font_size (int) – upper bound on the font size when guess_font_size is True (to avoid huge fonts in edge cases)
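
The interplay between default_font_size, guess_font_size and max_font_size can be sketched as follows. This is only an illustration under stated assumptions, not the package's implementation: the FontSize fields mirror the class documented above, while the helper pick_font_size and the glyph_aspect width heuristic are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class FontSize:
    # Mirrors the documented fields of ocr_utils.alto_to_svg.FontSize.
    default: int
    guess: bool
    max_value: int


def pick_font_size(text: str, block_width: float, font_size: FontSize,
                   glyph_aspect: float = 0.5) -> int:
    """Hypothetical helper: choose a font size for one text block.

    Assumes an average glyph is glyph_aspect times as wide as the font
    size, so n characters at size s are roughly n * s * glyph_aspect wide.
    """
    if not font_size.guess or not text or block_width <= 0:
        # Guessing is disabled or impossible: fall back to the default.
        return font_size.default
    guessed = block_width / (len(text) * glyph_aspect)
    # Cap the guess so short strings in wide blocks do not get huge fonts,
    # and never return a size below 1.
    return max(1, min(int(guessed), font_size.max_value))


settings = FontSize(default=40, guess=True, max_value=50)
print(pick_font_size("Invoice total", 234.0, settings))  # guessed from width
print(pick_font_size("A", 600.0, settings))              # capped at max_value
print(pick_font_size("", 600.0, settings))               # falls back to default
```

The cap matters mostly for very short strings: a single character in a wide block would otherwise yield an oversized font.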

ocr_utils.commons module

ocr_utils.commons.assert_one_page_and_get_it(file_: alto.Alto) → alto.Page

ocr_utils.pdf_to_svg module

ocr_utils.pdf_to_svg.pdf_to_svg(input_filename: str, output_filename: str, detect_tables: bool, lang: str) → None

ocr_utils.table module

class ocr_utils.table.Cell(content: ~T, colspan: int = 1, rowspan: int = 1)

Bases: Generic[ocr_utils.table.T]

colspan: int = 1

content: T

classmethod from_dict(dict_: Dict, factory: Optional[Callable[[Dict], T]] = None) → ocr_utils.table.Cell

rowspan: int = 1
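
As a rough illustration of how a from_dict classmethod with an optional factory can build a generic cell, here is a minimal sketch. It is not the package's code: the class name CellSketch is hypothetical, and the dictionary keys ("content", "colspan", "rowspan") are assumed to match the field names, which the documentation above does not confirm.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass
class CellSketch(Generic[T]):
    # Same fields and defaults as the documented Cell signature.
    content: T
    colspan: int = 1
    rowspan: int = 1

    @classmethod
    def from_dict(cls, dict_: Dict[str, Any],
                  factory: Optional[Callable[[Dict], T]] = None) -> "CellSketch[T]":
        # Assumed layout: one key per field, spans optional.
        raw = dict_["content"]
        # When a factory is given, it turns the raw content dict into a T;
        # otherwise the raw value is stored unchanged.
        content = factory(raw) if factory is not None else raw
        return cls(content=content,
                   colspan=dict_.get("colspan", 1),
                   rowspan=dict_.get("rowspan", 1))


cell = CellSketch.from_dict({"content": {"text": "Total"}, "colspan": 2},
                            factory=lambda d: d["text"])
print(cell)
```

Passing the factory as a parameter keeps the container classes independent of the content type, which is presumably why the same optional argument appears on Row, Table and LocatedTable.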

class ocr_utils.table.Contour(x_0: int, x_1: int, y_0: int, y_1: int)

Bases: object

x_0: int

x_1: int

y_0: int

y_1: int

class ocr_utils.table.DetectedCell(text: str, contour: ocr_utils.table.Contour, lines: List[alto.TextLine] = <factory>)

Bases: object

contour: ocr_utils.table.Contour

lines: List[alto.TextLine]

text: str

class ocr_utils.table.LocatedTable(table: ocr_utils.table.Table[~T], h_pos: int, v_pos: int, height: int, width: int)

Bases: Generic[ocr_utils.table.T]

classmethod from_dict(dict_: Dict[str, Any], factory: Optional[Callable[[Dict], T]] = None) → ocr_utils.table.LocatedTable

h_pos: int

height: int

table: ocr_utils.table.Table[T]

to_dict() → Dict[str, Any]

v_pos: int

width: int

class ocr_utils.table.Row(cells: List[ocr_utils.table.Cell[~T]])

Bases: Generic[ocr_utils.table.T]

cells: List[ocr_utils.table.Cell[T]]

classmethod from_dict(dict_: Dict, factory: Optional[Callable[[Dict], T]] = None) → ocr_utils.table.Row

class ocr_utils.table.Table(headers: List[ocr_utils.table.Row[~T]], rows: List[ocr_utils.table.Row[~T]])

Bases: Generic[ocr_utils.table.T]

classmethod from_dict(dict_: Dict, factory: Optional[Callable[[Dict], T]] = None) → ocr_utils.table.Table

headers: List[ocr_utils.table.Row[T]]

rows: List[ocr_utils.table.Row[T]]

to_dict() → Dict[str, Any]

ocr_utils.table.extract_and_hide_cells(image_filename: str, output_filename: str, lang: str) → List[ocr_utils.table.DetectedCell]

Detects cells and returns all detected cells with their parsed content. Saves the image with the detected cells covered by a blank rectangle. Uses opencv for structure detection and pytesseract for cell content detection.

Parameters

image_filename (str) – Path of the input image.
output_filename (str) – Location of the output image (input image with detected cells covered by blank rectangle).
lang (str) – Language to use when performing OCR.

Returns

cells – List of detected cells

Return type

List[DetectedCell]
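
The "covered by a blank rectangle" step can be pictured with the sketch below. It is an illustration under assumptions, not the package's implementation: the Contour fields mirror the class documented above, but the helper hide_contours, the white fill value, and the convention that x indexes columns, y indexes rows and the upper bounds are exclusive are all assumptions. Nested lists stand in for the image array so the example has no dependencies; with a NumPy array the inner loops would reduce to a slice assignment such as image[y_0:y_1, x_0:x_1] = 255.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Contour:
    # Mirrors the documented fields of ocr_utils.table.Contour.
    x_0: int
    x_1: int
    y_0: int
    y_1: int


WHITE = 255  # assumed "blank" value for an 8-bit grayscale image


def hide_contours(image: List[List[int]], contours: List[Contour]) -> List[List[int]]:
    """Hypothetical helper: return a copy of image with each contour filled with WHITE."""
    result = [row[:] for row in image]  # copy rows so the input stays untouched
    height = len(result)
    width = len(result[0]) if result else 0
    for contour in contours:
        # Clip the rectangle to the image so out-of-range contours are safe.
        for y in range(max(contour.y_0, 0), min(contour.y_1, height)):
            for x in range(max(contour.x_0, 0), min(contour.x_1, width)):
                result[y][x] = WHITE
    return result


page = [[0] * 6 for _ in range(4)]  # 4 rows x 6 columns, all black
hidden = hide_contours(page, [Contour(x_0=1, x_1=4, y_0=1, y_1=3)])
for row in hidden:
    print(row)
```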

ocr_utils.table.extract_and_hide_tables(image_filename: str, output_filename: str, lang: str) → List[ocr_utils.table.LocatedTable]

Detects and returns the tables in an image. Saves the image with the detected tables covered by a blank rectangle. Uses opencv for structure detection and pytesseract for cell content detection.

Parameters

image_filename (str) – Path of the input image.
output_filename (str) – Location of the output image (input image with detected tables covered by blank rectangle).
lang (str) – Language to use when performing OCR.

Returns

tables – List of tables with their position in the original image

Return type

List[LocatedTable]

ocr_utils.table.extract_and_hide_tables_from_image(image: numpy.ndarray, lang: str) → Tuple[numpy.ndarray, List[ocr_utils.table.LocatedTable]]

Detects and returns the tables in an image, using opencv for structure detection and pytesseract for cell content detection. Then hides the detected tables from the original image.

Parameters

image (np.ndarray) – Input image as an array of pixels (output of cv2.imread(image_filename, 0))
lang (str) – Language to use when performing OCR

Returns

image (np.ndarray) – Output image as an array of pixels with a blank rectangle over each detected table
tables (List[LocatedTable]) – List of tables with their position in the original image

ocr_utils.table.extract_tables(image_filename: str, lang: str) → List[ocr_utils.table.LocatedTable]

Detects and returns the tables in an image, using opencv for structure detection and pytesseract for cell content detection.

Parameters

image_filename (str) – Path of the input image.
lang (str) – Language to use when performing OCR.

Returns

tables – List of tables with their position in the original image

Return type

List[LocatedTable]

ocr_utils.table.extract_tables_from_image(image: numpy.ndarray, lang: str) → List[ocr_utils.table.LocatedTable]

Detects and returns the tables in an image, using opencv for structure detection and pytesseract for cell content detection.

Parameters

image (np.ndarray) – Input image as an array of pixels (output of cv2.imread(image_filename, 0))
lang (str) – Language to use when performing OCR

Returns

tables – List of tables with their position in the original image

Return type

List[LocatedTable]

ocr_utils.table.group_by_proximity(elements: List[T], are_neighbors: Callable[[T, T], bool]) → List[List[T]]
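
This function is documented by its signature only. One plausible reading, sketched below, is that elements end up in the same group when they are neighbors directly or through a chain of neighbors (connected components of the neighbor relation). That reading is an assumption, and the name group_by_proximity_sketch is hypothetical; the package's actual behaviour, including the ordering of groups, may differ.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def group_by_proximity_sketch(elements: List[T],
                              are_neighbors: Callable[[T, T], bool]) -> List[List[T]]:
    """Group elements that are linked, directly or transitively, by are_neighbors."""
    groups: List[List[T]] = []
    visited = [False] * len(elements)
    for start in range(len(elements)):
        if visited[start]:
            continue
        # Depth-first traversal over the neighbor relation, starting here.
        visited[start] = True
        group_indices = [start]
        stack = [start]
        while stack:
            current = stack.pop()
            for other in range(len(elements)):
                if not visited[other] and are_neighbors(elements[current], elements[other]):
                    visited[other] = True
                    group_indices.append(other)
                    stack.append(other)
        # Keep the original input order inside each group.
        groups.append([elements[i] for i in sorted(group_indices)])
    return groups


values = [1, 2, 10, 3, 11, 30]
print(group_by_proximity_sketch(values, lambda a, b: abs(a - b) <= 1))
# prints [[1, 2, 3], [10, 11], [30]]
```

Note that 1 and 3 share a group even though they are not neighbors themselves, because 2 links them.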

Module contents

Top-level package for OCR utils.

ocr_utils.get_module_version()