ocr_utils package

Submodules

ocr_utils.alto_to_svg module

class ocr_utils.alto_to_svg.FontSize(default: int, guess: bool, max_value: int)[source]

Bases: object

default: int
guess: bool
max_value: int
class ocr_utils.alto_to_svg.ForeignObject(text: str, x: int, y: int, width: int, height: int, **extra)[source]

Bases: svgwrite.base.BaseElement

Parameters

extra

extra SVG attributes (keyword arguments)

  • add trailing '_' to reserved keywords: 'class_', 'from_'

  • replace inner '-' by '_': 'stroke_width'

SVG attribute names will be checked, if debug is True.

Workaround for the attribs parameter removed in version 0.2.2:

# replace
element = BaseElement(attribs=adict)

# by
element = BaseElement()
element.update(adict)
elementname = 'foreignObject'
get_xml()[source]

Get the XML representation as an ElementTree object.

Returns

XML ElementTree of this object and all its subelements
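Purely as an illustration of what get_xml() returns conceptually, the following builds a similar element with the standard library; the attribute names are assumed from the constructor parameters, not taken from the library source:

```python
import xml.etree.ElementTree as ET

# Hypothetical sketch of the element a ForeignObject("Hello", x=10, y=20,
# width=100, height=30) might serialize to.
elem = ET.Element(
    "foreignObject",
    {"x": "10", "y": "20", "width": "100", "height": "30"},
)
elem.text = "Hello"
xml_string = ET.tostring(elem, encoding="unicode")
print(xml_string)  # → <foreignObject x="10" y="20" width="100" height="30">Hello</foreignObject>
```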

class ocr_utils.alto_to_svg.Text(content: str, hpos: float, vpos: float)[source]

Bases: object

content: str
hpos: float
vpos: float
ocr_utils.alto_to_svg.alto_pages_and_cells_to_svg(alto_xml_strings: List[str], pages_cells: List[List[ocr_utils.table.DetectedCell]], default_font_size: int = 40, guess_font_size: bool = True, max_font_size: int = 50) → svgwrite.drawing.Drawing[source]

Generates an SVG image made of concatenated pages from ALTO XML strings and table cells.

Parameters
  • alto_xml_strings – ALTO XML strings

  • pages_cells – detected cells on each page

  • default_font_size (int) – font size in the output SVG

  • guess_font_size (bool) – if True, the font size is deduced from the block width when possible (to handle varying font sizes)

  • max_font_size (int) – when guess_font_size is True, the font size is capped at max_font_size (to avoid huge fonts in edge cases)

Returns

SVG drawing; can be written to file with the saveas method

Return type

svgwrite.Drawing
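The guess_font_size option implies deriving a size from the width of a text block. The library's actual formula is not documented here; the following is one plausible, purely hypothetical heuristic showing how default_font_size and max_font_size could interact:

```python
def guessed_font_size(block_width: int, n_chars: int,
                      default: int = 40, max_value: int = 50,
                      char_aspect: float = 0.5) -> int:
    """Hypothetical heuristic: assume each glyph is roughly
    char_aspect * font_size pixels wide, then cap at max_value."""
    if n_chars == 0:
        # Nothing to measure: fall back to the default size.
        return default
    return min(max_value, int(block_width / (n_chars * char_aspect)))

print(guessed_font_size(block_width=200, n_chars=20))   # → 20
print(guessed_font_size(block_width=2000, n_chars=20))  # → 50 (capped)
```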

ocr_utils.alto_to_svg.alto_to_svg(input_filename: str, output_filename: str, default_font_size: int = 40, guess_font_size: bool = True, max_font_size: int = 50) → None[source]

Loads an ALTO XML file and generates an SVG image made of concatenated pages.

Parameters
  • input_filename (str) – Path of the input ALTO XML file

  • output_filename (str) – Path of the output SVG image

  • default_font_size (int) – font size in the output SVG

  • guess_font_size (bool) – if True, the font size is deduced from the block width when possible (to handle varying font sizes)

  • max_font_size (int) – when guess_font_size is True, the font size is capped at max_font_size (to avoid huge fonts in edge cases)

ocr_utils.commons module

ocr_utils.commons.assert_one_page_and_get_it(file_: alto.Alto) → alto.Page[source]

ocr_utils.pdf_to_svg module

ocr_utils.pdf_to_svg.pdf_to_svg(input_filename: str, output_filename: str, detect_tables: bool, lang: str) → None[source]

ocr_utils.table module

class ocr_utils.table.Cell(content: ~ T, colspan: int = 1, rowspan: int = 1)[source]

Bases: Generic[ocr_utils.table.T]

colspan: int = 1
content: T
classmethod from_dict(dict_: Dict, factory: Optional[Callable[[Dict], T]] = None) → ocr_utils.table.Cell[source]
rowspan: int = 1
class ocr_utils.table.Contour(x_0: int, x_1: int, y_0: int, y_1: int)[source]

Bases: object

x_0: int
x_1: int
y_0: int
y_1: int
class ocr_utils.table.DetectedCell(text: str, contour: ocr_utils.table.Contour, lines: List[alto.TextLine] = <factory>)[source]

Bases: object

contour: ocr_utils.table.Contour
lines: List[alto.TextLine]
text: str
class ocr_utils.table.LocatedTable(table: ocr_utils.table.Table[~ T], h_pos: int, v_pos: int, height: int, width: int)[source]

Bases: Generic[ocr_utils.table.T]

classmethod from_dict(dict_: Dict[str, Any], factory: Optional[Callable[[Dict], T]] = None) → ocr_utils.table.LocatedTable[source]
h_pos: int
height: int
table: ocr_utils.table.Table[T]
to_dict() → Dict[str, Any][source]
v_pos: int
width: int
class ocr_utils.table.Row(cells: List[ocr_utils.table.Cell[~ T]])[source]

Bases: Generic[ocr_utils.table.T]

cells: List[ocr_utils.table.Cell[T]]
classmethod from_dict(dict_: Dict, factory: Optional[Callable[[Dict], T]] = None) → ocr_utils.table.Row[source]
class ocr_utils.table.Table(headers: List[ocr_utils.table.Row[~ T]], rows: List[ocr_utils.table.Row[~ T]])[source]

Bases: Generic[ocr_utils.table.T]

classmethod from_dict(dict_: Dict, factory: Optional[Callable[[Dict], T]] = None) → ocr_utils.table.Table[source]
headers: List[ocr_utils.table.Row[T]]
rows: List[ocr_utils.table.Row[T]]
to_dict() → Dict[str, Any][source]
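Cell, Row, and Table form a small generic table model. The stand-alone sketch below mirrors the documented fields to show how they fit together; it is a reimplementation for illustration only, and the real to_dict/from_dict serialization format may differ:

```python
from dataclasses import dataclass, field
from typing import Generic, List, TypeVar

T = TypeVar("T")

@dataclass
class Cell(Generic[T]):
    content: T
    colspan: int = 1
    rowspan: int = 1

@dataclass
class Row(Generic[T]):
    cells: List[Cell[T]]

@dataclass
class Table(Generic[T]):
    headers: List[Row[T]]
    rows: List[Row[T]]

# A table whose single header cell spans both body columns.
table = Table(
    headers=[Row([Cell("Totals", colspan=2)])],
    rows=[Row([Cell("a"), Cell("b")])],
)
```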
ocr_utils.table.extract_and_hide_cells(image_filename: str, output_filename: str, lang: str) → List[ocr_utils.table.DetectedCell][source]

Detects cells and returns them with their parsed content, then saves the image with the detected cells covered by blank rectangles (using opencv for structure detection and pytesseract for cell content detection).

Parameters
  • image_filename (str) – Path of the input image.

  • output_filename (str) – Path of the output image (input image with detected cells covered by blank rectangles).

  • lang (str) – Language to use when performing OCR.

Returns

cells – List of detected cells

Return type

List[DetectedCell]

ocr_utils.table.extract_and_hide_tables(image_filename: str, output_filename: str, lang: str) → List[ocr_utils.table.LocatedTable][source]

Detects and returns tables in the image, then saves the image with the detected tables covered by blank rectangles (using opencv for structure detection and pytesseract for cell content detection).

Parameters
  • image_filename (str) – Path of the input image.

  • output_filename (str) – Path of the output image (input image with detected tables covered by blank rectangles).

  • lang (str) – Language to use when performing OCR.

Returns

tables – List of tables with their position in the original image

Return type

List[LocatedTable]

ocr_utils.table.extract_and_hide_tables_from_image(image: numpy.ndarray, lang: str) → Tuple[numpy.ndarray, List[ocr_utils.table.LocatedTable]][source]

Detects and returns tables in the image using opencv for structure detection and pytesseract for cell content detection, then hides the detected tables in the original image.

Parameters
  • image (np.ndarray) – Input image as an array of pixels (output of cv2.imread(image_filename, 0))

  • lang (str) – Language to use when performing OCR

Returns

  • image (np.ndarray) – Output image as an array of pixels with blank rectangle over detected tables

  • tables (List[LocatedTable]) – List of tables with their position in the original image
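The hiding step amounts to painting a blank rectangle over each LocatedTable-style region. The real code works on a numpy.ndarray (presumably via an opencv draw call); the stand-alone sketch below uses a plain list-of-rows grayscale image purely to illustrate the geometry:

```python
from typing import List

def hide_region(image: List[List[int]], h_pos: int, v_pos: int,
                width: int, height: int, blank: int = 255) -> List[List[int]]:
    # Paint the rectangle at (h_pos, v_pos) with the given size blank,
    # mirroring what covering a detected table with a white box does.
    for row in image[v_pos:v_pos + height]:
        row[h_pos:h_pos + width] = [blank] * width
    return image

# 3x4 black image; blank out a 2x2 region in the top-left area.
image = [[0] * 4 for _ in range(3)]
hide_region(image, h_pos=1, v_pos=0, width=2, height=2)
print(image)  # → [[0, 255, 255, 0], [0, 255, 255, 0], [0, 0, 0, 0]]
```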

ocr_utils.table.extract_tables(image_filename: str, lang: str) → List[ocr_utils.table.LocatedTable][source]

Detects and returns tables in the image using opencv for structure detection and pytesseract for cell content detection.

Parameters
  • image_filename (str) – Path of the input image.

  • lang (str) – Language to use when performing OCR.

Returns

tables – List of tables with their position in the original image

Return type

List[LocatedTable]

ocr_utils.table.extract_tables_from_image(image: numpy.ndarray, lang: str) → List[ocr_utils.table.LocatedTable][source]

Detects and returns tables in the image using opencv for structure detection and pytesseract for cell content detection.

Parameters
  • image (np.ndarray) – Input image as an array of pixels (output of cv2.imread(image_filename, 0))

  • lang (str) – Language to use when performing OCR

Returns

tables – List of tables with their position in the original image

Return type

List[LocatedTable]

ocr_utils.table.group_by_proximity(elements: List[T], are_neighbors: Callable[[T, T], bool]) → List[List[T]][source]
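group_by_proximity has no docstring; judging from the signature, it plausibly groups elements into connected components under the are_neighbors relation. A minimal stand-alone sketch under that assumption (not the library's actual implementation):

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

def group_by_proximity(elements: List[T],
                       are_neighbors: Callable[[T, T], bool]) -> List[List[T]]:
    # Treat are_neighbors as an edge relation and return the connected
    # components: each new element is merged with every group it touches.
    groups: List[List[T]] = []
    for element in elements:
        touching = [g for g in groups
                    if any(are_neighbors(element, member) for member in g)]
        for g in touching:
            groups.remove(g)
        groups.append([x for g in touching for x in g] + [element])
    return groups

# Integers within distance 1 of each other end up in the same group.
groups = group_by_proximity([1, 2, 10, 11, 3], lambda a, b: abs(a - b) <= 1)
```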

Module contents

Top-level package for OCR utils.

ocr_utils.get_module_version()[source]