Getting Started
Installation
git clone https://git.lab.sspcloud.fr/liriae/pdfstruct.git
cd pdfstruct
Install poetry and the pdfstruct package:
curl -sSL https://install.python-poetry.org | python3 -  # or: pip install poetry
poetry install
poetry shell
Or, with pip:
pip install -e .
Basic Usage
Your PDF might be a concatenation of several documents. That is why the main class, which stores the PDF file, is called Collection.
First, instantiate a Collection object with your PDF file.
import fitz # pymupdf
from pdfstruct.collection import Collection
collection = Collection.from_pdf(filename="my_pdf_file.pdf")
Collection stores documents (Document) in a list.
If your file is an aggregation of several files, the Collection splits it into several documents. Otherwise, it holds a single Document. The documents are accessible in the .docs attribute of a Collection.
A Document has a list of titles, or Sections. They are accessible in the .sections attribute of the document.
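To make the Collection, Document, Section nesting concrete, here is a minimal sketch that walks the hierarchy and prints an indented outline. The dataclasses are stand-ins written for illustration only, not the pdfstruct classes; they carry just the attribute names used elsewhere on this page (.docs, .sections, .kind.level, .kind.numbering, .kind.title, .page), and the outline() helper is hypothetical.

```python
from dataclasses import dataclass, field


# Stand-in classes for illustration only; they are NOT the pdfstruct classes.
@dataclass
class Kind:
    level: int
    numbering: str
    title: str


@dataclass
class Section:
    kind: Kind
    page: int


@dataclass
class Document:
    sections: list = field(default_factory=list)


@dataclass
class Collection:
    docs: list = field(default_factory=list)


def outline(collection):
    """Return one text line per document and per section (hypothetical helper)."""
    lines = []
    for i, doc in enumerate(collection.docs):
        lines.append(f"Document {i}")
        for s in doc.sections:
            # Indent each title by two spaces per heading level.
            indent = "  " * s.kind.level
            lines.append(f"{indent}{s.kind.numbering} {s.kind.title}")
    return lines


# Made-up content: two documents with a few sections each.
demo = Collection(docs=[
    Document(sections=[Section(Kind(1, "1", "Introduction"), 0),
                       Section(Kind(2, "1.1", "Scope"), 0)]),
    Document(sections=[Section(Kind(1, "1", "Summary"), 0)]),
])
print("\n".join(outline(demo)))
```

With a real Collection returned by Collection.from_pdf, the same two nested loops should apply, provided the attributes behave as they do in this sketch.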
You can access all the information about a Section to build a table of contents.
from itertools import accumulate

import fitz  # pymupdf
from pdfstruct.collection import Collection

collection = Collection.from_pdf(filename="my_pdf_file.pdf")
# First page of each document within the whole file, counted from 1
first_pages = list(accumulate([1] + [len(d.pages) for d in collection.docs[:-1]]))
# Build the toc from pdfstruct
pdfstruct_toc = []
for doc, offset in zip(collection.docs, first_pages):
    for s in doc.sections:
        # Position of the section title on its page
        point = fitz.Point(s.x0, s.y0)
        dest = {"kind": 1, "page": s.page + offset, "to": point, "zoom": 0, "collapse": True}
        pdfstruct_toc.append([s.kind.level, f"{s.kind.numbering} {s.kind.title}", s.page + offset, dest])
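The first_pages expression is the least obvious step in the example above, so here it is in isolation. The page counts are made up, and the global_page() helper is hypothetical; it also assumes that s.page is a zero-based index inside its own document, which this page does not state.

```python
from itertools import accumulate

# Hypothetical page counts for three documents stored back to back in one PDF.
page_counts = [3, 5, 2]

# Same expression as in the example, with len(d.pages) replaced by plain integers.
# The last count is dropped because no document starts after the last one.
first_pages = list(accumulate([1] + page_counts[:-1]))
print(first_pages)  # [1, 4, 9]


def global_page(local_page: int, first_page: int) -> int:
    """Map a page index inside one document to a page number in the whole file.

    Assumes local_page is zero-based (an assumption, not confirmed by pdfstruct).
    """
    return local_page + first_page


# A section on the first page of the second document lands on page 4 of the file.
print(global_page(0, first_pages[1]))  # 4
```

The running sum gives each document the page number where it starts in the whole file, and adding a section's own page to that value turns a per-document position into a position in the concatenated PDF.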
Tests
The tests folder in the Git repository is used for testing with the run_pdfstruct() function (see liriae-form).
Pytest Execution
Execute the following command to launch the tests:
pytest tests