Getting Started

Installation

git clone https://git.lab.sspcloud.fr/liriae/pdfstruct.git
cd pdfstruct

Install poetry and the pdfstruct package:

curl -sSL https://install.python-poetry.org | python3 -  # or: pip install poetry
poetry install
poetry shell

Or, using pip directly:

pip install -e .

Basic Usage

A PDF file may be a concatenation of several documents. For that reason, pdfstruct stores a PDF file in a main class called Collection.

First, instantiate a Collection object with your PDF file.

import fitz  # pymupdf  
from pdfstruct.collection import Collection  


collection = Collection.from_pdf(filename="my_pdf_file.pdf")

A Collection stores documents (Document) in a list.

If your file is an aggregation of several files, the Collection splits it into several documents. Otherwise, it contains a single Document. The documents are accessible in the .docs attribute of a Collection.

A Document has a list of titles, or Sections. They are accessible in the .sections attribute of the document.

You can use the information stored in each Section to build a table of contents.

from itertools import accumulate

import fitz  # pymupdf
from pdfstruct.collection import Collection


collection = Collection.from_pdf(filename="my_pdf_file.pdf")
# First page (1-based) of each document within the combined PDF
first_pages = list(accumulate([1] + [len(d.pages) for d in collection.docs[:-1]]))

# Build the toc from pdfstruct
pdfstruct_toc = []
for doc, offset in zip(collection.docs, first_pages):
    for s in doc.sections:
        # Position of the section title on its page
        point = fitz.Point()
        point.x, point.y = s.x0, s.y0
        # kind 1 is fitz.LINK_GOTO (a jump to a location inside the same PDF)
        dest = {"kind": 1, "page": s.page + offset, "to": point, "zoom": 0, "collapse": True}
        # One entry per section: [level, title, page, destination]
        pdfstruct_toc.append([s.kind.level, f"{s.kind.numbering} {s.kind.title}", s.page + offset, dest])
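
The page arithmetic in the loop above can be checked in isolation. The sketch below replaces a real Collection with made-up page counts and section tuples. The names page_counts, sections_per_doc and render_toc are hypothetical and not part of pdfstruct. It also assumes that a section's page number is zero-based within its document, which this page does not state.

```python
from itertools import accumulate

# Hypothetical page counts for three documents concatenated in one PDF.
page_counts = [4, 2, 5]

# 1-based page on which each document starts in the combined file.
first_pages = list(accumulate([1] + page_counts[:-1]))

# Hypothetical sections per document: (level, title, page within the document).
# Pages are assumed to be zero-based within each document.
sections_per_doc = [
    [(1, "1 Introduction", 0), (2, "1.1 Scope", 1)],
    [(1, "1 Summary", 0)],
    [(1, "1 Results", 2)],
]

# Same shape as the loop above, without the fitz destination dictionary.
toc = []
for sections, offset in zip(sections_per_doc, first_pages):
    for level, title, page in sections:
        toc.append([level, title, page + offset])


def render_toc(entries):
    """Return the table of contents as indented text, one entry per line."""
    return "\n".join(
        f"{'  ' * (level - 1)}{title} (p. {page})" for level, title, page in entries
    )


print(render_toc(toc))
```

With these sample values the three documents start on pages 1, 5 and 7 of the combined file, so a section on the third page of the last document lands on page 9.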

Tests

The tests folder in the Git repository is used for testing with the run_pdfstruct() function (see liriae-form).

Pytest Execution

Execute the following command to launch the tests:

pytest tests