Data Structures in Pdfstruct
Physical Structure vs Logical Structure
A PDF document usually contains two types of information:
- The physical information, which pertains to the Portable Document Format itself:
  - number and size of pages
  - images
  - coordinates
  - characters
  - links, etc.
- The logical information, which is more abstract and organizes the elements of the document in a tree structure. This implies thematic sections delineated by titles, and a table of contents to navigate between sections.
Some software lets users store information about the logical structure of a PDF, but most of the time it is up to the user to find a way to extract it.
The objective of Pdfstruct is to use the physical structure of the document to infer its logical structure. This is done by using the regularities found in the document (a method called endogenous learning).
- On the one hand, Pdfstruct stores information about the physical structure in the classes related to plain PDF data (Data Classes).
- On the other hand, it stores information about the logical structure in the classes related to "markers" (see `marker.py`), which refer to the different types of logical elements (section titles, table cells, TOC entries, etc.). Those are called Marker Classes.
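To make the idea of learning from a document's own regularities more concrete, here is a minimal sketch. It is not Pdfstruct's actual algorithm: the function names (`numbering_shape`, `recurring_shapes`), the regular expression, and the sample lines are all assumptions made for illustration. The sketch reduces the leading numbering of each line to an abstract shape and keeps only the shapes that recur, on the assumption that a numbering scheme used several times in the same document is more likely to mark section titles.

```python
import re
from collections import Counter
from typing import Iterable, Optional, Set

# Hypothetical pattern: dotted decimal numbering followed by some text.
NUMBERING = re.compile(r"^(\d+(?:\.\d+)*)\s+\S")


def numbering_shape(text: str) -> Optional[str]:
    """Return the abstract shape of a leading numbering, e.g. '2.1' -> 'N.N'."""
    match = NUMBERING.match(text)
    if match is None:
        return None
    return re.sub(r"\d+", "N", match.group(1))


def recurring_shapes(lines: Iterable[str], min_count: int = 2) -> Set[str]:
    """Keep only the numbering shapes seen at least `min_count` times."""
    shapes = (numbering_shape(text) for text in lines)
    counts = Counter(shape for shape in shapes if shape is not None)
    return {shape for shape, n in counts.items() if n >= min_count}


lines = [
    "1 Introduction",
    "This report describes the project.",
    "1.1 Context",
    "2 Methods",
    "2.1 Data",
    "1.2.3.4 Stray numbered note",
]

print(sorted(recurring_shapes(lines)))  # ['N', 'N.N']
```

The shape that appears only once (`N.N.N.N`) is discarded, which shows how regularities within a single document can separate likely titles from noise without any external training data.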
Data Classes
- A `Collection` contains a list of `Document` objects.
- A `Document` contains a list of `Page` objects.
- A `Page` contains a list of `Line` objects.
Class diagram (Mermaid):
classDiagram
Collection o-- Document
Document o-- Page
Page o-- Line
class Collection{
+docs [Document]
+metadata
}
class Document{
+list pages [Page]
+add_page(Page)
+split()
+detect_page_numbering()
+detect_headers_and_footers()
+check_and_filter_sections()
+filter_other_indexes()
}
class Page{
+list lines [Line]
+bool is_tdm_page
+add_line(Line)
+set_orientation()
+check_double_column()
+detect_tables()
}
class Line{
+str content
+kind: marker | None
+is_potential_section()
+is_potential_other_index()
}
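The containment hierarchy above can be sketched in Python as follows. This is a simplified illustration, not Pdfstruct's source code: the attribute names (`docs`, `metadata`, `pages`, `lines`, `is_tdm_page`, `content`, `kind`) and the `add_page` / `add_line` methods are taken from the diagram, while the use of dataclasses, the default values, and the method bodies are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Line:
    content: str
    kind: Optional[Any] = None  # a marker object, or None if uncategorized


@dataclass
class Page:
    lines: List[Line] = field(default_factory=list)
    is_tdm_page: bool = False

    def add_line(self, line: Line) -> None:
        self.lines.append(line)


@dataclass
class Document:
    pages: List[Page] = field(default_factory=list)

    def add_page(self, page: Page) -> None:
        self.pages.append(page)


@dataclass
class Collection:
    docs: List[Document] = field(default_factory=list)
    metadata: Dict[str, Any] = field(default_factory=dict)  # assumed to be a dict


# Build a one-document collection and walk it from top to bottom.
page = Page()
page.add_line(Line("1 Introduction"))
page.add_line(Line("This report describes the project."))

document = Document()
document.add_page(page)

collection = Collection(docs=[document])

all_lines = [
    line.content
    for doc in collection.docs
    for pg in doc.pages
    for line in pg.lines
]
print(all_lines)
```

Each level only holds a list of the level below it, so traversing the physical structure is a matter of nested iteration.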
Marker Classes
- A `Line` object has a `.kind` attribute, which can either be `None` or be a Marker object.
- The Marker object categorizes the `Line` as a type of logical PDF element. It also stores useful information about the element. For example, the `Section` object, used for section titles, stores the numeration and the text of the title in different attributes (for more info, see the Markers and Patterns pages).
The types of Marker Classes are summarized in the table below:
| Marker | Description | Example |
|---|---|---|
| Section | The title of a section | III-2.1.3 The Title of a Section |
| SecIndex | A section title that is written in a table of contents (TOC) | III-2.1.3 The Title of a Section ............ page 4 |
| Other | Something like a section title but belonging to another type | Table 3.2 - Some Table |
| OtherIndex | Something like a section title but belonging to another type, written in a TOC | Table 3.2 - Some Table ............ page 34 |
| Header | Header of a page | Some Company Name - Project number - December 13th |
| Footer | Footer of a page | Page 133 |
| Cell | The cell of a table | / |
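As an illustration of how a marker can both categorize a line and carry extra information, here is a minimal sketch built around the `Section` example from the table. It does not reproduce the contents of `marker.py`: the attribute names `numeration` and `title`, the `tag_section` helper, and the regular expression are hypothetical, chosen only to match the description above.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Section:
    """Hypothetical marker for a section title."""
    numeration: str
    title: str


@dataclass
class Line:
    content: str
    kind: Optional[Section] = None  # None until the line is categorized


# Assumed pattern: optional Roman prefix ("III-"), dotted digits, then the title.
SECTION_PATTERN = re.compile(
    r"^(?P<numeration>(?:[IVX]+-)?\d+(?:\.\d+)*)\s+(?P<title>.+)$"
)


def tag_section(line: Line) -> Line:
    """Attach a Section marker to the line if it looks like a section title."""
    match = SECTION_PATTERN.match(line.content)
    if match is not None:
        line.kind = Section(match.group("numeration"), match.group("title"))
    return line


title_line = tag_section(Line("III-2.1.3 The Title of a Section"))
body_line = tag_section(Line("Plain paragraph text."))

print(title_line.kind)
print(body_line.kind)
```

This sketch only handles one marker type. It would also tag a TOC entry as a `Section`, so telling `Section` apart from `SecIndex` would need additional checks, such as detecting the trailing dot leader and page number.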