alto package

Module contents

Top-level package for Alto.

class alto.Alternative(content: str)

Bases: object

content: str
classmethod from_xml(element: xml.etree.ElementTree.Element) → alto.Alternative
class alto.Alto(description: alto.Description, layout: alto.Layout)

Bases: object

Alto dataclass for manipulating Tesseract output files.

Parameters
  • description (Description) – The “Description” tag of an ALTO XML document, containing metadata

  • layout (Layout) – The “Layout” tag of an ALTO XML document, containing the parsed elements

description: alto.Description
extract_composed_blocks() → List[alto.ComposedBlock]
extract_grouped_words(group_by: Union[Literal['TextLine'], Literal['TextBlock'], Literal['ComposedBlock']]) → List[List[str]]

Extracts all parsed words grouped at the required level.

Args:

group_by (Union[Literal['TextLine'], Literal['TextBlock'], Literal['ComposedBlock']]): the level at which to group words

Returns:

List[List[str]]: one list of words for each entity at the target level
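
The grouping described above can be illustrated with the standard library alone. This is a minimal sketch of the idea, not the package's implementation: `group_words` and `SAMPLE` are hypothetical names, and the snippet assumes an ALTO-like document without an XML namespace in which each word is stored in the `CONTENT` attribute of a `String` element.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical, namespace-free ALTO-like snippet: two text blocks, three lines.
SAMPLE = """<alto><Layout><Page><PrintSpace><ComposedBlock>
<TextBlock>
<TextLine><String CONTENT="Hello"/><SP/><String CONTENT="world"/></TextLine>
<TextLine><String CONTENT="again"/></TextLine>
</TextBlock>
<TextBlock>
<TextLine><String CONTENT="Bye"/></TextLine>
</TextBlock>
</ComposedBlock></PrintSpace></Page></Layout></alto>"""


def group_words(xml_str: str, group_by: str) -> List[List[str]]:
    """Return one list of words per element whose tag equals group_by."""
    root = ET.fromstring(xml_str)
    return [
        # Collect the words of every String nested under this entity,
        # in document order; SP (space) elements are skipped.
        [string.get("CONTENT", "") for string in entity.iter("String")]
        for entity in root.iter(group_by)
    ]


by_line = group_words(SAMPLE, "TextLine")
by_block = group_words(SAMPLE, "TextBlock")
print(by_line)   # [['Hello', 'world'], ['again'], ['Bye']]
print(by_block)  # [['Hello', 'world', 'again'], ['Bye']]
```

The coarser the level, the fewer and longer the inner lists: the same four words form three groups by line and two groups by block.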

extract_text_blocks() → List[alto.TextBlock]
extract_text_lines() → List[alto.TextLine]
extract_words() → List[str]

Extracts all parsed words regardless of their positions.

Returns:

List[str]: List of words extracted from file

classmethod from_xml(element: xml.etree.ElementTree.Element) → alto.Alto
layout: alto.Layout
static parse(xml_str: str) → alto.Alto

Alto constructor from xml string.

Parameters

xml_str (str) – xml alto string

static parse_file(filename: str) → alto.Alto

Alto constructor from xml file.

Parameters

filename (str) – filename of the file to load
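
Once a document has been parsed, the extraction methods listed above can be combined into a quick summary. The helper below is a sketch only: `summarize` is a hypothetical name, and it relies on nothing beyond the `extract_words()` and `extract_grouped_words()` methods documented for `Alto`, so it accepts any object that provides them.

```python
from collections import Counter
from typing import Any, Dict


def summarize(doc: Any) -> Dict[str, Any]:
    """Summarize a parsed document using only its documented methods."""
    words = doc.extract_words()                     # flat list of words
    lines = doc.extract_grouped_words("TextLine")   # one list per text line
    return {
        "word_count": len(words),
        "line_count": len(lines),
        # Case-insensitive frequency of the three most common words.
        "most_common": Counter(w.lower() for w in words).most_common(3),
    }
```

With the package installed, this would be called as `summarize(alto.parse_file("page.xml"))`, where `page.xml` is a placeholder for an ALTO file of your own.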

class alto.ComposedBlock(id: str, height: float, width: float, hpos: float, vpos: float, text_blocks: List[alto.TextBlock])

Bases: object

extract_words() → List[str]

Extracts all parsed words regardless of their positions.

Returns:

List[str]: List of words extracted from file

classmethod from_xml(element: xml.etree.ElementTree.Element) → alto.ComposedBlock
height: float
hpos: float
id: str
text_blocks: List[alto.TextBlock]
vpos: float
width: float
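
The four geometry fields shared by `ComposedBlock`, `TextBlock`, `TextLine` and `String` can be turned into a bounding box. This sketch assumes, following the ALTO schema, that `hpos` and `vpos` are the offsets of the top-left corner from the left and top edges of the page; `bounding_box` is a hypothetical helper and the `SimpleNamespace` object merely stands in for a parsed block.

```python
from types import SimpleNamespace
from typing import Any, Tuple


def bounding_box(item: Any) -> Tuple[float, float, float, float]:
    """Return (left, top, right, bottom) for any object with the
    hpos, vpos, width and height attributes."""
    left, top = item.hpos, item.vpos
    return (left, top, left + item.width, top + item.height)


# Stand-in for a parsed block; the values are invented for illustration.
block = SimpleNamespace(hpos=100.0, vpos=50.0, width=200.0, height=30.0)
print(bounding_box(block))  # (100.0, 50.0, 300.0, 80.0)
```

The coordinates are in whatever measurement unit the source file declares; the helper does no unit conversion.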
class alto.Description(file_name: Optional[str])

Bases: object

file_name: Optional[str]
classmethod from_xml(element: xml.etree.ElementTree.Element) → alto.Description
class alto.Layout(pages: List[alto.Page])

Bases: object

classmethod from_xml(element: xml.etree.ElementTree.Element) → alto.Layout
pages: List[alto.Page]
class alto.Page(id: str, height: float, width: float, physical_img_nr: int, printed_img_nr: Optional[int], print_spaces: List[alto.PrintSpace])

Bases: object

extract_blocks() → List[alto.ComposedBlock]
extract_lines() → List[alto.TextLine]
extract_strings() → List[alto.String]
extract_text_blocks() → List[alto.TextBlock]
extract_words() → List[str]

Extracts all parsed words regardless of their positions.

Returns:

List[str]: List of words extracted from file

classmethod from_xml(element: xml.etree.ElementTree.Element) → alto.Page
height: float
id: str
physical_img_nr: int
print_spaces: List[alto.PrintSpace]
printed_img_nr: Optional[int]
width: float
class alto.PrintSpace(height: float, width: float, hpos: float, vpos: float, pc: Optional[float], composed_blocks: List[alto.ComposedBlock])

Bases: object

composed_blocks: List[alto.ComposedBlock]
extract_words() → List[str]

Extracts all parsed words regardless of their positions.

Returns:

List[str]: List of words extracted from file

classmethod from_xml(element: xml.etree.ElementTree.Element) → alto.PrintSpace
height: float
hpos: float
pc: Optional[float]
vpos: float
width: float
class alto.SP(width: float, hpos: float, vpos: float)

Bases: object

classmethod from_xml(element: xml.etree.ElementTree.Element) → alto.SP
hpos: float
vpos: float
width: float
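
Every class in this module exposes a `from_xml` classmethod that builds the object from an `xml.etree.ElementTree.Element`. The pattern can be sketched as follows. `SpaceSketch` is a hypothetical stand-in, not the package's `SP` class, and the sketch assumes the ALTO attribute names `WIDTH`, `HPOS` and `VPOS`; the real method may handle missing attributes differently.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class SpaceSketch:
    """Stand-in mirroring the three fields documented for alto.SP."""

    width: float
    hpos: float
    vpos: float

    @classmethod
    def from_xml(cls, element: ET.Element) -> "SpaceSketch":
        # XML attributes are strings, so each one is converted to float.
        return cls(
            width=float(element.attrib["WIDTH"]),
            hpos=float(element.attrib["HPOS"]),
            vpos=float(element.attrib["VPOS"]),
        )


sp = SpaceSketch.from_xml(ET.fromstring('<SP WIDTH="12" HPOS="100.5" VPOS="40"/>'))
print(sp)  # SpaceSketch(width=12.0, hpos=100.5, vpos=40.0)
```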
class alto.String(id: str, height: float, width: float, hpos: float, vpos: float, content: str, confidence: float, alternatives: List[alto.Alternative])

Bases: object

alternatives: List[alto.Alternative]
confidence: float
content: str
classmethod from_xml(element: xml.etree.ElementTree.Element) → alto.String
height: float
hpos: float
id: str
vpos: float
width: float
class alto.TextBlock(id: Optional[str], height: float, width: float, hpos: float, vpos: float, text_lines: List[alto.TextLine])

Bases: object

extract_string_lines() → List[str]
extract_words() → List[str]

Extracts all parsed words regardless of their positions.

Returns:

List[str]: List of words extracted from file

classmethod from_xml(element: xml.etree.ElementTree.Element) → alto.TextBlock
height: float
hpos: float
id: Optional[str]
text_lines: List[alto.TextLine]
vpos: float
width: float
class alto.TextLine(id: str, height: float, width: float, hpos: float, vpos: float, strings: List[Union[alto.String, alto.SP]])

Bases: object

extract_words() → List[str]

Extracts all parsed words regardless of their positions.

Returns:

List[str]: List of words extracted from file

classmethod from_xml(element: xml.etree.ElementTree.Element) → alto.TextLine
height: float
hpos: float
id: str
strings: List[Union[alto.String, alto.SP]]
vpos: float
width: float
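
Because `strings` mixes `String` and `SP` entries, code that walks a line has to skip the spaces before reading `content` or `confidence`. The sketch below filters words by recognition confidence. `confident_words` and the two stand-in classes are hypothetical, and the 0 to 1 confidence scale is an assumption based on the ALTO word-confidence attribute rather than something stated in this reference.

```python
from dataclasses import dataclass
from typing import Any, Iterable, List


@dataclass
class StandInString:
    """Stand-in carrying the two alto.String fields used here."""

    content: str
    confidence: float


@dataclass
class StandInSP:
    """Stand-in for a space entry, which has no content."""

    width: float


def confident_words(items: Iterable[Any], threshold: float = 0.8) -> List[str]:
    """Keep the content of entries whose confidence meets the threshold."""
    words = []
    for item in items:
        content = getattr(item, "content", None)
        if content is None:
            continue  # space entries carry no text
        if item.confidence >= threshold:
            words.append(content)
    return words


line = [StandInString("Hello", 0.95), StandInSP(4.0), StandInString("w0rld", 0.41)]
print(confident_words(line))  # ['Hello']
```

With real objects, `isinstance(item, alto.String)` is a stricter alternative to the `getattr` check.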
alto.get_module_version()
alto.parse(xml_string: str) → alto.Alto

Alto constructor from xml string.

Parameters

xml_string (str) – xml alto string

alto.parse_file(filename: str) → alto.Alto

Alto constructor from xml file.

Parameters

filename (str) – filename of the file to load