Data Structures in Pdfstruct

Physical Structure vs Logical Structure

A PDF document usually contains two types of information:

  • The physical information, which pertains to the Portable Document Format itself:

    • number and size of pages
    • images
    • coordinates
    • characters
    • links, etc.
  • The logical information, which is more abstract and organizes the elements of the document in a tree structure: thematic sections delineated by titles, and a table of contents to navigate between them.


Some software lets users store information about the logical structure of a PDF, but most of the time it is up to the user to find a way to extract it.

The objective of Pdfstruct is to use the physical structure of the document to infer its logical structure. This is done by using the regularities found in the document (a method called endogenous learning).

  • On the one hand, Pdfstruct stores information about the physical structure in the classes related to plain PDF data (Data Classes).

  • On the other hand, it stores information about the logical structure in the classes related to "markers" (see marker.py), which represent the different types of logical elements (section titles, table cells, TOC entries, etc.). Those are called Marker Classes.
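As a rough illustration of inferring logical structure from regularities, the sketch below flags text that recurs as the first line of most pages as a likely page header. It is not Pdfstruct's implementation: the function name `repeated_first_lines`, the `min_ratio` threshold, and the input shape (a list of pages, each a list of strings) are all assumptions made for this example.

```python
from collections import Counter


def repeated_first_lines(pages, min_ratio=0.6):
    """Return first-line texts that recur on at least `min_ratio` of pages.

    `pages` is a list of pages, each a list of line strings. This input
    shape is hypothetical and is not Pdfstruct's own data model.
    """
    # Count how often each text appears as the first line of a page.
    firsts = Counter(page[0].strip() for page in pages if page)
    threshold = min_ratio * len(pages)
    return {text for text, count in firsts.items() if count >= threshold}


pages = [
    ["Some Company Name - Project 12", "1 Introduction", "Body text"],
    ["Some Company Name - Project 12", "More body text"],
    ["Some Company Name - Project 12", "2 Methods", "Body text"],
    ["Appendix", "Tables"],
]
header_candidates = repeated_first_lines(pages)
print(header_candidates)
```

No external labels are needed here: the document's own repetition is the only signal used, which is the idea behind learning from regularities.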


Data Classes

  • A Collection contains a list of Document objects.
  • A Document contains a list of Page objects.
  • A Page contains a list of Line objects.
Class diagram:
classDiagram
  Collection *-- Document
  Document *-- Page
  Page *-- Line



  class Collection{
    +docs [Document]
    +metadata
  }

  class Document{

    +list pages [Page]

    +add_page(Page)
    +split()
    +detect_page_numbering()
    +detect_headers_and_footers()
    +check_and_filter_sections()
    +filter_other_indexes()
  }

  class Page{
    +list lines [Line]
    +bool is_tdm_page
    +add_line(Line)
    +set_orientation()
    +check_double_column()
    +detect_tables()
  }

  class Line{
    +str content
    +kind: Marker | None
    +is_potential_section()
    +is_potential_other_index()
  }
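The containment hierarchy above can be sketched in Python with dataclasses. This is a minimal illustration that mirrors only the attribute names shown in the diagram; the `dict` type for `metadata`, the default values, and the method bodies are assumptions, and the detection methods are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Line:
    content: str
    kind: Optional[object] = None  # a Marker instance, or None


@dataclass
class Page:
    lines: List[Line] = field(default_factory=list)
    is_tdm_page: bool = False

    def add_line(self, line: Line) -> None:
        self.lines.append(line)


@dataclass
class Document:
    pages: List[Page] = field(default_factory=list)

    def add_page(self, page: Page) -> None:
        self.pages.append(page)


@dataclass
class Collection:
    docs: List[Document] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)  # assumed type


page = Page()
page.add_line(Line("1 Introduction"))
page.add_line(Line("Body text"))
doc = Document()
doc.add_page(page)
collection = Collection(docs=[doc])

# Walk the hierarchy from the collection down to individual lines.
all_text = [ln.content for d in collection.docs for p in d.pages for ln in p.lines]
print(all_text)
```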

Marker Classes

  • A Line object has a .kind attribute, which can either be None or be a Marker object.
  • The Marker object categorizes the Line as a type of logical PDF element and stores useful information about that element. For example, the Section object, used for section titles, stores the numeration and the text of the title in separate attributes (for more information, see the Markers and Patterns pages).
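To make the idea of a marker carrying parsed information concrete, here is a small sketch of a `Section` that keeps the numeration and the title text apart. The attribute names `numeration` and `title`, the regular expression, and the helper `parse_section` are hypothetical; the actual patterns are described on the Markers and Patterns pages.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Marker:
    """Base class for logical element types (hypothetical shape)."""


@dataclass
class Section(Marker):
    numeration: str
    title: str


# Assumed pattern: roman or arabic parts joined by '-' or '.', then
# whitespace, then the title text. Deliberately naive: a sentence that
# starts with the word "I" would also match.
SECTION_RE = re.compile(
    r"^(?P<num>[IVXLC\d]+(?:[-.][IVXLC\d]+)*)\s+(?P<title>\S.*)$"
)


def parse_section(content: str) -> Optional[Section]:
    """Return a Section if the line looks like a numbered title, else None."""
    match = SECTION_RE.match(content.strip())
    if match is None:
        return None
    return Section(numeration=match.group("num"), title=match.group("title"))


kind = parse_section("III-2.1.3 The Title of a Section")
print(kind)
print(parse_section("Body text"))
```

A line whose content does not match would simply keep `kind` set to `None`.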

The types of Marker Classes are summarized in the table below:

| Marker | Description | Example |
|---|---|---|
| Section | The title of a section | III-2.1.3 The Title of a Section |
| SecIndex | A section title that is written in a table of contents (TOC) | III-2.1.3 The Title of a Section ............ page 4 |
| Other | Something like a section title but belonging to another type | Table 3.2 - Some Table |
| OtherIndex | Something like a section title but belonging to another type, written in a TOC | Table 3.2 - Some Table ............ page 34 |
| Header | Header of a page | Some Company Name - Project number - December 13th |
| Footer | Footer of a page | Page 133 |
| Cell | The cell of a table | / |
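In the examples above, the index variants (SecIndex, OtherIndex) differ from their base types by a run of dot leaders followed by a page number. The sketch below separates the two parts under that assumption. The helper `split_toc_entry` and its regular expression are illustrative only and are not taken from Pdfstruct.

```python
import re

# Assumed TOC shape: entry text, at least three dots, an optional
# "page" word, then the page number.
TOC_ENTRY_RE = re.compile(
    r"^(?P<entry>.*?\S)\s*\.{3,}\s*(?:page\s+)?(?P<page>\d+)\s*$",
    re.IGNORECASE,
)


def split_toc_entry(content):
    """Return (entry_text, page_number) for a TOC-like line, else None."""
    match = TOC_ENTRY_RE.match(content.strip())
    if match is None:
        return None
    return match.group("entry"), int(match.group("page"))


sec_index = split_toc_entry("III-2.1.3 The Title of a Section ............ page 4")
other_index = split_toc_entry("Table 3.2 - Some Table ............ page 34")
plain_title = split_toc_entry("III-2.1.3 The Title of a Section")
print(sec_index, other_index, plain_title)
```

The entry text recovered this way has the same shape as the corresponding Section or Other example, so the same title parsing could in principle be reused on it.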