Document

Document processing flowchart

graph LR
  A(Start)  -->|document.py| B{Document init}

subgraph "Step 1"
  B-->B1["fetch PyMuPdf Data

  add pages"]-->|page.py|C{Page init}
  C-->C1["set orientation

  check double columns

  add lines
  "]-->|line.py|D{Line init}
  D-->D1["collect potential markers"]-->|page.py|C5["check if the page is a TOC

  detect tables
  "]
end



subgraph "Optional Step"
C5-->|document.py|E{Document splitting}
E-->E1[Detect numbering ranges]
end

subgraph "Step 2"
C5-->|document.py|F{Single Document Processing}
E1-->F
F-->F3["detect Headers and Footers

filter Sections

filter OtherIndex"]
end

F3-->G(End)

The main class for pdf processing.

The methods handle:

- the conversion from a PyMuPdf format to a Pdfstruct format
- the detection of multiple documents via page numbering and table of contents
- the mapping between physical pages and logical pages
- the detection of headers and footers
- the detection of potential sections
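To make the mapping between physical and logical pages concrete, here is a minimal sketch. It assumes the convention stated in the source comments (`page_mapping` has logical page ids as keys and physical page ids as values) and a constant shift between the two. `build_page_mapping` is a hypothetical helper written for this example, not part of Pdfstruct.

```python
def build_page_mapping(start, end, shift):
    """Map logical page numbers to physical page indices (hypothetical helper).

    `start` and `end` are inclusive physical page indices, and `shift` is
    the physical index minus the printed (logical) page number.
    """
    return {physical - shift: physical for physical in range(start, end + 1)}


# Physical pages 2..6 carry the printed numbers 1..5 (shift = 2 - 1 = 1).
page_mapping = build_page_mapping(start=2, end=6, shift=1)
first_logical = page_mapping.get(1)   # physical index of printed page 1
missing = page_mapping.get(42)        # None: no such printed page
```

Looking up a logical page that was never detected returns `None`, which matches the behaviour of `Document.lpage()` when the key is absent from `page_mapping`.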

The processing timeline has two phases :

1) Incremental gathering of Pages and Lines info

- When creating a Collection, a single Document object is created.
- The Document adds Page objects to its .pages attribute.
- The Page object adds Line objects to its .lines attribute. It also checks for orientation, double columns and tables.
- The Line object checks with regex whether a line is a potential title (Section) or part of a table of contents (SecIndex), or another kind of title (OtherIndex).
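As a rough illustration of the regex check performed on each line, the sketch below classifies a line as a table-of-contents entry, a potential section title, or plain text. The two patterns and the `classify_line` helper are assumptions made for this example. The patterns actually used in `line.py` are not shown on this page and are likely more elaborate.

```python
import re

# Hypothetical patterns, for illustration only.
# TOC entry: numbered title, dot leaders, then a page number.
TOC_ENTRY = re.compile(
    r"^(?P<num>\d+(?:\.\d+)*)\s+(?P<title>.+?)\s*\.{3,}\s*(?P<page>\d+)$"
)
# Section title: numbered title starting with a capital letter.
SECTION = re.compile(r"^(?P<num>\d+(?:\.\d+)*)\.?\s+(?P<title>[A-Z].*)$")


def classify_line(text):
    """Return a (kind, groups) pair for one line of text."""
    text = text.strip()
    match = TOC_ENTRY.match(text)
    if match:
        return "SecIndex", match.groupdict()
    match = SECTION.match(text)
    if match:
        return "Section", match.groupdict()
    return "Text", {}


toc_kind, toc_info = classify_line("1.2 Introduction ........ 12")
sec_kind, sec_info = classify_line("2 Methods")
txt_kind, _ = classify_line("some plain text")
```

The TOC pattern is tested first because a TOC entry also looks like a numbered title. At this stage the result is only a candidate: true Sections are confirmed later, during the document-level processing.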

2) Document processing as a whole

The information in Lines and Pages is then used for further processing at the Document level:

- The Document is split into multiple Documents using the information of its Pages and page numbering.
- Each Document is then processed individually. The processing of a single document consists of:
    - detecting headers and footers,
    - identifying true Sections previously collected in 1),
    - removing invalid OtherIndexes previously collected in 1).
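The splitting step relies on page numbering: footers (or headers) are normalized so that only the number varies, and a run of pages with a constant difference between physical index and printed number is treated as one numbering range. The sketch below is a simplified version of that idea, not the actual `detect_page_numbering()` implementation. It assumes a single number per footer and ignores unnumbered inserted pages, gaps and overlaps. `constant_shift_runs` and its `min_size` parameter are hypothetical names.

```python
import re


def normalize(content):
    """Abstract away numbers and spacing so repeated footers share one key."""
    nums = [int(x) for x in re.findall(r"\d+", content)]
    key = re.sub(r"\d+", "<<NUM>>", content)
    key = re.sub(r"\s+", "", key).lower()
    return key, nums


def constant_shift_runs(footers, min_size=5):
    """Group consecutive physical pages sharing the same physical - logical shift.

    `footers` maps a physical page index to its footer text.
    Returns (start, end, shift) tuples for runs of at least `min_size` pages.
    """
    runs = []
    current = None  # [start, end, shift, key]
    for page_id in sorted(footers):
        key, nums = normalize(footers[page_id])
        if not nums:
            continue  # this footer cannot carry a page number
        shift = page_id - nums[0]  # assumption: the first number is the page number
        same_run = (
            current is not None
            and current[1] + 1 == page_id
            and current[2] == shift
            and current[3] == key
        )
        if same_run:
            current[1] = page_id
        else:
            if current and current[1] - current[0] + 1 >= min_size:
                runs.append(tuple(current[:3]))
            current = [page_id, page_id, shift, key]
    if current and current[1] - current[0] + 1 >= min_size:
        runs.append(tuple(current[:3]))
    return runs


# Two documents glued together: printed numbering restarts at physical page 6.
footers = {i: f"Page {i + 1}" for i in range(6)}
footers.update({6 + i: f"Page {i + 1}" for i in range(5)})
runs = constant_shift_runs(footers)
```

Each run corresponds to one candidate sub-document. With this convention the first physical page of a run is `shift + 1`, which is how the source derives the start of a numbering range.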

The raw input relies on the page data structure provided by PyMuPDF: https://pymupdf.readthedocs.io/en/latest/textpage.html#structure-of-dictionary-outputs

The words attribute relies on the word data structure provided by PyMuPDF: https://pymupdf.readthedocs.io/en/latest/textpage.html#TextPage.extractWORDS
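For readers unfamiliar with these two structures, the sketch below walks over hand-written sample data shaped like the PyMuPDF outputs: a page dictionary made of blocks, lines and spans, and word tuples of the form `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. The sample values and the `iter_line_texts` helper are made up for this example. Real dictionaries contain more keys than shown here.

```python
# Hand-written sample shaped like the output of page.get_text("dict").
sample_page = {
    "width": 595.0,
    "height": 842.0,
    "blocks": [
        {
            "type": 0,  # 0 = text block, 1 = image block
            "bbox": (72.0, 72.0, 300.0, 86.0),
            "lines": [
                {
                    "bbox": (72.0, 72.0, 300.0, 86.0),
                    "spans": [
                        {"text": "1 ", "size": 12.0, "font": "Helvetica-Bold"},
                        {"text": "Introduction", "size": 12.0, "font": "Helvetica-Bold"},
                    ],
                }
            ],
        }
    ],
}

# Hand-written sample shaped like the output of page.get_text("words").
sample_words = [
    (72.0, 72.0, 80.0, 86.0, "1", 0, 0, 0),
    (84.0, 72.0, 160.0, 86.0, "Introduction", 0, 0, 1),
]


def iter_line_texts(page_dict):
    """Yield (bbox, text) for every text line of a page dictionary."""
    for block in page_dict["blocks"]:
        if block.get("type") != 0:
            continue  # skip image blocks
        for line in block["lines"]:
            text = "".join(span["text"] for span in line["spans"])
            yield line["bbox"], text


line_texts = list(iter_line_texts(sample_page))
word_texts = [word[4] for word in sample_words]
```

A `Document` receives one such page dictionary per page as its `raw` input, and optionally one list of word tuples per page as `words`.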

Source code in pdfstruct/document.py

class Document:
    """The main class for pdf processing.


    --------------------------------
     The methods handle :

     - the conversion from a PyMuPdf format to a Pdfstruct format
     - the detection of multiple documents via page numbering and table of contents
     - the mapping between physical pages and logical pages
     - the detection of headers and footers
     - the detection of potential sections

     -------------------------------

     The processing timeline has two phases :

     **1) Incremental gathering of `Pages` and `Lines` info**

     - When creating a `Collection`, a single `Document` object is created.
     - The `Document` adds `Page` objects to its `.pages` attribute.
     - The `Page` object adds `Line` objects to its `.lines` attribute.
       It also checks for orientation, double columns and tables.
     - The `Line` object checks with regex whether a line is a potential title (`Section`)
       or part of a table of contents (`SecIndex`), or another kind of title (`OtherIndex`).

     **2) Document processing as a whole**

     The information in `Lines` and `Pages` is then used for further processing at the `Document` level:

     - The `Document` is split into multiple `Documents` using the information of its `Pages` and page numbering.
     - Each `Document` is then processed individually.
       The processing of a single document consists of :

         - detecting headers and footers,
         - identifying true Sections previously collected in 1),
         - removing invalid OtherIndexes previously collected in 1).

     -------------------------------

     The `raw` input relies on the page data structure provided by PyMuPDF:
     <https://pymupdf.readthedocs.io/en/latest/textpage.html#structure-of-dictionary-outputs>


     The `words` attribute relies on the word data structure provided by PyMuPDF:
     <https://pymupdf.readthedocs.io/en/latest/textpage.html#TextPage.extractWORDS>

     ------------------------------

    """

    def __init__(
        self,
        raw,
        words=None,
        links=None,
        page=None,
        images=None,
        n=None,
        metadata=None,
        filename=None,
        ocr=None,
    ):
        # On instantiation, the Document adds Page objects to the .pages attribute.
        # This triggers all the processing done when instantiating a Page object.
        self._tdm = defaultdict(list)
        self._tdm_light = defaultdict(list)
        # page_mapping will have logical page ids as keys and physical page ids as values
        self.page_mapping = {}
        self.words = words
        # the `words` attribute is optional and used for paragraph detection and logical sections.
        # If you want to use `get_logical_section`, this value needs to be filled by using first `.from_pdf()`
        # and specifying extract_words=True.
        self.links = links
        # The links are given by PYMUPDF.
        # They are used in `detect_headers_and_footers()` to map logical and physical pages.
        self.images = images
        # The images are given by PYMUPDF. The text is extracted from small images with ocr.
        self.metadata = metadata
        # The PyMuPDF metadata. Be careful when publishing it, as it sometimes contains the email address of
        # the creator of the pdf.
        self.filename = filename
        # The name of the pdf file
        self.ocr = ocr
        # The ocr parameter is unavailable for now. If ocr is True, an ocr library is used to extract text from images
        # or from files where the PyMuPDF text has too many issues.
        self.raw: Optional[List[Any]] = None

        # in some cases, we already instantiated the Page objects and put them in a list.
        if raw and isinstance(raw, list) and isinstance(raw[0], Page):
            self.pages = raw
            self.raw = [page.raw for page in self.pages]
        # but most of the time, we pass a raw input which is a list of PyMuPdf Page dictionaries,
        # and progressively add them to the .pages attribute in a list.
        elif raw:
            self.raw = raw
            self.pages = []
            if page:
                if not n:
                    n = 1
                for i in range(page, page + n):
                    self.add_page(
                        self.raw[i],
                        links=links[i] if links else None,
                        images=images[i] if images else None,
                    )
            else:
                for i, s in enumerate(self.raw):
                    self.add_page(
                        s,
                        links=links[i] if links else None,
                        images=images[i] if images else None,
                    )
        else:
            logger.warning("empty document")
            self.raw = None
            self.pages = []

    def single_doc_process(self):
        """Returns a Document object. Called after splitting document.
        Detects headers and footers, filters potential Sections, filters OtherIndexes."""
        # the following methods should be called after sub-document detection
        # (after building a collection of documents)
        logger.info("completing processing single document")
        self.detect_header_and_footer()
        self.check_and_filter_sections()
        self.filter_other_indexes()

        logger.info("done processing single document")
        return self

    def sub_doc(self, start=0, end=None):
        """Creates sub-`Document` from a larger `Document` with the provided start and end indices.
        Reassigns page ids and link targets according to new start and new end.
        Returns:
            (Document): the updated smaller `Document` object.
        """
        if not end:
            end = len(self.pages)
        logger.info(f"Potential sub-document between pages {start} and {end - 1} #pages={end - start}")
        pages = self.pages[start:end]
        doc_len = len(pages)
        for i, page in enumerate(pages):
            # updating page ids
            page.id = i
            page.doc_len = doc_len
            for line in page.lines:
                line.page = i
                # updating links
                if line.links and start:
                    old = line.links
                    line.links = [(target_id - start, src) for target_id, src in line.links]
                    logger.info(
                        f"Page {i} updated links with {start} {list(zip(*old))[0]} to {list(zip(*line.links))[0]}"
                    )
        return Document(pages, metadata=self.metadata, filename=self.filename)

    def split(self):
        """Splits a multi-part `Document` into a collection of `Documents`.
        Also issues warnings about the presence of a table of contents for each `Document`.

        Returns:
            (list(Document)): the list of sub-Documents according to the numbering ranges."""

        # TODO : change + reset of page numbering in headers or footers could be used
        # numbering ranges is a list of tuples (start, end)
        numbering_ranges = self.detect_page_numbering()
        if len(numbering_ranges) > 1:
            docs = []
            for start, end in numbering_ranges:
                logger.debug(f"handling numbering range start={start} end={end}")
                potential_tdm = []
                prev_tdm = None
                lpages = set([])
                # going through each page in the interval and collecting the tables of contents
                for page in self.pages[start : end + 1]:
                    if page.is_tdm_page():
                        if not prev_tdm:
                            potential_tdm.append(page.id)
                        prev_tdm = page
                        # collecting the logical page in the SecIndex matching regex (ex : "title .... page x"-> extract x) # noqa: E501
                        for line in page.tdm:
                            lpages.add(line.kind.page)
                    else:
                        prev_tdm = None
                if potential_tdm:
                    lpages_ = sorted(list(lpages))
                    lpage_max = lpages_[0]
                    # sometimes a bad ToC entry is returned with a very large page number
                    # we check that the max page number is close to the other page numbers
                    for lpage in lpages_[1:]:
                        if lpage > lpage_max + 20:
                            break
                        lpage_max = lpage
                    if len(potential_tdm) > 1:
                        logger.warning(f"Several Tables of Contents within page range {start} {end}")
                    logger.debug(
                        f"found ToC at page {potential_tdm[-1]} in page range {start} {end} "
                        f"with last reference on page {start + lpage_max - 1}"
                    )
                    if start + lpage_max > end + 1:
                        logger.warning(
                            f"Table of content entry for logical page={lpage_max} "
                            f"outside of physical page range {start} {end}"
                        )
                        # TODO : should extend the range of the sub-document if necessary
                        # specially if the next one was a gap filler
                else:
                    logger.warning(f"No Table of Content within page range {start} {end}")
                docs.append(self.sub_doc(start=start, end=end + 1))
            return [doc.single_doc_process() for doc in docs]

        logger.debug("single document: no splitting")
        return [self.single_doc_process()]

    @classmethod
    def from_json(cls, filename, **kwargs):
        """opens a json file and passes it as a `raw` input to create a `Document`.
        The json must follow the dictionary structure of PyMuPdf."""
        with open(filename, "r") as f:
            return cls(json.load(f), **kwargs)

    @classmethod
    def from_pdf(
        cls,
        filename,
        extract_words=None,
        extract_pages=None,
        path=None,
        extract_images=False,
        **kwargs,
    ):
        """Opens the pdf file with PyMuPdf, extracts the dictionaries (1 dictionary per page) and passes it as a `raw`
        input to create a `Document`.

        Args:
            filename (str): The path of the pdf file.
            extract_words (bool): Extracts the word data as given by PyMuPDF. This parameter is mandatory if one wants
                to also extract paragraphs (see `paragraph.py`)
            extract_pages (bool): Extracts the pages individually as pdf files. This is useful for online interfaces.
            path (str): the path where the extracted pages are saved if extract_pages is True.
            extract_images (bool): If True, the flags used for extraction also preserve images.
            kwargs : all other arguments passed when creating a Document.

        Returns:
            (Document): a Document object.

        """

        # should cache the json
        def get_pages(doc, path, kwargs):
            page, n = kwargs["page"], kwargs["n"]
            if path is None:
                path = "./pages"
            if not os.path.exists(path):
                os.makedirs(path)
            for i in range(page, page + n):
                out = fitz.open()
                out.insert_pdf(doc, from_page=i, to_page=i)
                out.save(f"{path}/page_{i}.pdf")

        doc = fitz.open(filename)
        metadata = doc.metadata
        try:
            toc = doc.get_toc()
        except Exception:
            toc = []

        logger.info(f"toc is {toc}")
        jsn = []
        words: Optional[List[Any]] = []
        links = []
        all_images = []
        flags = fitz.TEXTFLAGS_TEXT if not extract_images else fitz.TEXTFLAGS_DICT
        for i, page in enumerate(doc):
            # jsn.append(json.loads(page.get_text("json", flags=flags)))
            # getting the raw dict for the page, the links and the associated words.
            jsn.append(page.get_text("dict", flags=flags))
            words.append(page.get_text("words", flags=flags))
            links.append([(_link["page"], _link["from"]) for _link in page.get_links() if _link["kind"] == 1])
            if links[-1]:
                logger.debug(f"links on page {i}: {links[-1]}")

            # Image extraction
            try:
                images = page.get_image_info(xrefs=True)
            except Exception:
                images = []
            if images:

                def is_large_image(image):
                    bbox = image["bbox"]
                    return (bbox[2] - bbox[0]) > 100 or (bbox[3] - bbox[1]) > 100

                def get_content(image):
                    xref = image["xref"]
                    # return content['image'] if content else None
                    try:
                        content = doc.extract_image(xref)
                        return content["image"] if content else None
                        # TODO: ask Eric about that part that seems unused
                        pix = fitz.Pixmap(content["image"])
                        smask = content["smask"]
                        if smask > 0:
                            logger.debug(f"get content with mask page {i} {xref=} {smask=}")
                            mask = fitz.Pixmap(doc.extract_image(smask)["image"])
                            try:
                                if pix.alpha:
                                    pix = fitz.Pixmap(pix, 0)  # remove alpha channel
                                pix = fitz.Pixmap(pix, mask)
                            except Exception:
                                pass
                        ipath = "./images"
                        if not os.path.exists(ipath):
                            os.makedirs(ipath)
                        pix.save(f"{ipath}/img{i}_{xref}.png")
                        return pix.tobytes()
                    except Exception:
                        logger.exception(f"pb image content on page {i} {xref=}")
                        return None

                images = [(*image["bbox"], get_content(image)) for image in images if is_large_image(image)]
                logger.debug(f"images on page {i} #entry={len(images)}")
            all_images.append(images)
        # Page extraction if we need to have single pdf pages.
        if extract_pages:
            get_pages(doc, path, kwargs)
        # We discard the list of words unless extract_words was requested
        if not extract_words:
            words = None

        # TODO: easyocr commented because requires specific packages installation. Should be discussed
        # ocr = easyocr.Reader(['fr'])
        ocr = None
        return cls(
            jsn,
            words,
            links,
            images=all_images,
            metadata=metadata,
            filename=filename,
            ocr=ocr,
            **kwargs,
        )

    def add_page(self, content, links=None, images=None):
        """The equivalent of an `.append()` list method : the Page is added at the end of the .pages attribute.
        The Page object is instantiated here (and consequently all its Line objects.)
        The Lines that are SecIndex (Toc entries) are also saved in the Document Table of Contents (`process_indexes()`)
        """
        page = Page(
            len(self),
            content,
            prev=self.pages[-1] if self.pages else None,
            doc=self,
            links=links,
            images=images,
            ocr=self.ocr,
        )
        self.pages.append(page)
        self.process_indexes(page.tdm)

    def __len__(self):
        """
        Returns:
            (int): the number of pages in the Document.
        """
        return len(self.pages)

    def collect(self, kind=Section):
        """
        Providing a kind of marker, for each page of the Document, collects all the lines where `line.kind` = marker.


        Returns:
            (iter): a flattened iterator of all the occurrences of the marker in the Document.
        """
        return chain.from_iterable((p.collect(kind=kind) for p in self.pages))

    @property
    def sections(self):
        """Collects all section titles (`Section`) of the Document."""
        return self.collect(kind=Section)

    @property
    def tdm(self):
        """Collects all the section TOC entries (`SecIndex`) of the Document. Is equivalent to a Table of Contents."""
        return self.collect(kind=SecIndex)

    @property
    def captions(self):
        """Collects all the captions (Other) of the Document."""
        return self.collect(kind=Other)

    @property
    def caption_tdm(self):
        """Collects all the caption TOC entries (OtherIndex) of the Document.
        Is equivalent to a Table of Contents for captions."""
        return self.collect(kind=OtherIndex)

    @property
    def cells(self):
        """Collects all the Lines that are parts of a table (Cell)"""
        return self.collect(kind=Cell)

    def lpage(self, n):
        """
        For a given logical page number, returns the `Page` stored at the matching physical index, if the
        logical page is present in the `page_mapping` attribute dictionary.

        Args:
            n (int): the logical page number

        Returns:
            (Page): the matching `Page` object if available, else None.
        """
        return self.pages[self.page_mapping[n]] if n in self.page_mapping else None

    def __getitem__(self, n):
        """
        For a given logical page number, returns the `Page` stored at the matching physical index, if the
        logical page is present in the `page_mapping` attribute dictionary.

        Args:
            n (int): the logical page number

        Returns:
            (Page): the matching `Page` object if available, else None.
        """
        return self.pages[self.page_mapping[n]] if n in self.page_mapping else None

    def detect_page_numbering(self):
        """Detects frequent numberings and their ranges in the document.

        Returns:
            (list(tuple(int))): intervals of numbering ranges


        **Algorithm** :

        * going through each page, extracting top and bottom lines
        * normalizing lines (text + coordinates), extracting numbers to find a potential logical page number.
        * checking if the normalized versions occur more than 4 times (test_line())
        * if so, checking for a range (is_range()), retrieving the deltas between physical and logical pages.
        * storing all potential ranges
        * going through potential ranges, handling overlaps and gaps (handle_gap()) using the deltas.
        """

        entries = defaultdict(list)

        def normalize(line):
            """Returns the integer part of y0, a normalized version of the content and
            a list of all the numbers detected in the line."""
            content = line.content
            # extract all the numbers in the line
            nums = [int(x) for x in re.findall(r"(\d+)", content)]
            # replace all numbers in the line by <<NUM>>
            content = re.sub(r"\d+", r"<<NUM>>", content)
            # remove spacings and convert to lower case
            content = re.sub(r"\s+", r"", content)
            content = content.lower()
            y = int(line.y0)
            return ((y, content), nums)

        def is_range(info, key=None, line=None):
            """Input is a list of line infos : physical page ids and potential logical pages.
            **Algorithm**:

            * go over each line
            * compute the deltas between physical page id and logical numbers
            * compare the results with the previous line.
            * If there is a range, there should be the same deltas at the same places each time,
              modulo some exceptions (intercalary pages... )

            Returns:
                (tuple): contains the physical start and end, the delta/shift between the physical and logical ids,
                the normalized text and coordinates, and the range size.
            """
            # logger.info(f"entering is_range page={info[-1][0]} {key=} {line=} {info=}")
            # info is a list of tuples [(page id of line, list of detected numbers in the line)]
            orig = prev = info[-1]
            # comparing page ids and the detected numbers in the potential header/footer
            # (subtracting each detected number from the physical page id)
            prev_deltas = [prev[0] - page_id for page_id in prev[1]]
            # `where` is the position of the logical page number among the detected numbers
            # `shift` is the difference between the physical page and the logical page.
            where = None
            shift = None
            range_size = 1

            # d intercalary pages (no numbering)
            # p+d : k+1 prev_delta=(p+d)-k-1
            # p : k delta=p-k
            # prev_delta-delta = (p+d)-k-1-p+k=d-1

            # going backwards in the list, excluding the last element
            # x[0] = page id (physical page), x[1] = detected numbers (with a potential logical page)
            for x in info[:-1][::-1]:
                deltas = [x[0] - page_id for page_id in x[1]]
                # logger.debug(f"try down range {x[0]} potential={x[1]} {deltas=} {shift=} {range_size=}")

                # a physical page is the same as the previous one
                if x[0] == prev[0]:
                    break
                # the amount of detected numbers differs
                if len(deltas) != len(prev_deltas):
                    break
                # gap too big between previous page id and current page id
                if False and where is None and prev[0] > x[0] + 3:
                    break
                if where is None:
                    for i in range(0, len(deltas)):
                        # there should be the same difference between the physical page and the logical page and
                        # at the same spot in the line.
                        if deltas[i] == prev_deltas[i]:
                            where = i
                            shift = deltas[i]
                            # logger.debug(f"set shift={shift} where={where} {prev_deltas=} {deltas=} {range_size=}")
                            break
                    if where is None and x[0] + 1 < prev[0] < x[0] + 3:
                        # try with potential intercalary pages
                        for i in range(0, len(deltas)):
                            if prev_deltas[i] - deltas[i] == prev[0] - x[0] - 1:
                                where = i
                                shift = deltas[i]
                                break
                    if where is None:
                        # logger.debug(f"exit range check at {x[0]}")
                        break
                elif deltas[where] == prev_deltas[where]:
                    # normal page numbering: keep the same shift between the logical page and physical page
                    # logger.debug(f"going down range {x[0]} {shift=} {range_size=}")
                    pass
                elif prev_deltas[where] - deltas[where] == prev[0] - x[0] - 1 and x[0] + 1 < prev[0] < x[0] + 3:
                    # insertion of intercalary pages (no page numbering)
                    # may skip intercalary pages that do not really break numbering
                    # we could play with the thresholds here (1 and 3)
                    # logger.debug(f"changed shift={shift} where={where} {prev_deltas=} {deltas=} {range_size=}")
                    shift = deltas[where]
                else:
                    break

                if False and (
                    where is None
                    or (
                        prev_deltas[where] != deltas[where]
                        # the following condition could deal with document insertion within a larger document
                        # however, need to maintain a list of shifts between physical and logical page numbers
                        and deltas[where] != prev_deltas[where] + x[0] - prev[0]
                    )
                ):
                    break

                prev = x
                prev_deltas = deltas
                range_size += 1

            if range_size > 4:
                # logger.debug(f"range extension from {prev[0]} to {orig[0]} with key={key}
                # where={where} shift={shift} info={info}")
                # logger.debug(f"got range from={prev[0]} to={orig[0]} shift={shift} key={key} size={range_size}")
                return (prev[0], orig[0], shift, where, key, range_size)
            else:
                # logger.debug(f"no range extension")
                return None

        potential_range: Dict[Any, Any] = {}

        def test_line(line):
            """Checks whether normalized content and coordinates and their info could be used to detect ranges.
            If so, adds it to the potential_range dictionary."""
            # checking if there are numbers in the line (in which case it could be a header or footer)
            xkey, nums = normalize(line)
            # logger.warning(f"Test footer {page.id} {line.y0=} {xkey=} {nums=} info={footers[xkey]}")
            if nums:
                # entries = {(y0, normalized line) : [(page.id, [num1, num2...])] }
                entries[xkey].append((page.id, nums))
                info = entries[xkey]
                # if a normalized content and its coordinates have been detected more than 4 times,
                # we check if we have a range.
                if len(info) > 4:
                    # logger.warning(f"Try range {page.id} {line.y0=} {xkey=} {nums=} {line=}")
                    xrange = is_range(info, key=xkey, line=line)
                    if xrange:
                        # logger.warning(f"range extension from {info[0][0]} to {page.id} with footer {xkey}")
                        # potential range = {((y0, normalized text), start): (start, end, shift, where, ...)}
                        potential_range[(xkey, xrange[0])] = xrange

        for page in self.pages:
            height = page.height
            # logger.warning(f"Test page {page.id} {height=}")

            df = page.df.sort_values(by="y0")
            # taking the first 10 lines of the page, keeping only those in the top third
            for line in list(df.head(10).line.values):
                if line.y0 > 0.33 * height:
                    continue
                test_line(line)
            # taking the last 10 lines of the page, keeping only those in the bottom third
            for line in list(df.tail(10).line.values):
                if line.y0 < 0.66 * height:
                    continue
                test_line(line)

        ranges = list(potential_range.values())
        # sorting ranges by start (ascending)
        ranges = sorted(ranges, key=lambda x: x[0])
        xranges: List[Any] = []

        def handle_gap(start):
            if xranges and start > xranges[-1][1] + 1:
                gap = start - xranges[-1][1]
                logger.warning(f"Numbering gap of {gap} pages between pages {xranges[-1][1]} and {start}")
                if gap > 3:
                    # add new doc
                    logger.info(f"Adding gap page range start={xranges[-1][1] + 1} end={start - 1}")
                    xranges.append((xranges[-1][1] + 1, start - 1))
                else:
                    # extend previous document
                    logger.info(f"Extending previous page range start={xranges[-1][0]} end={start - 1}")
                    xranges[-1] = (xranges[-1][0], start - 1)

        # going through collected potential ranges and handling gaps and overlaps
        for xrange in ranges:
            # xstart = xrange[0]
            xend = xrange[1]
            shift = xrange[2]
            start = shift + 1
            x = (start, xend)
            if not xranges and start > 0:
                logger.warning(f"First pages 0 to {start - 1} not in a numbering range")
                start = 0
                x = (start, xend)
            if not xranges or start > xranges[-1][1]:
                handle_gap(start)
                logger.info(f"Adding page range start={start} end={xend} xrange={xrange}")
                xranges.append(x)
            elif start == xranges[-1][0]:
                logger.info(f"Compatible overlapping numbering ranges {xranges[-1]} and {x} xrange={xrange}")
                xranges[-1] = (start, max(xend, xranges[-1][1]))
            else:
                logger.warning(f"Non-compatible overlapping numbering ranges {xranges[-1]} and {x} xrange={xrange}")
        # handle potentially final gap at the end of the document
        handle_gap(len(self.pages))
        return xranges

    def detect_header_and_footer(self):
        """Detect headers and footers as the topmost and bottommost lines that are stable modulo numbers.

        Documents with fewer than 10 pages are skipped.

        **Algorithm**:

        * initializing 2 dictionaries (1 for headers, 1 for footers)
        * going through each page
            * extracting the top and bottom lines (up to 10 each, within the top and bottom 15% of the page)
            * going through each top line
                * normalizing the text of the line: replacing every run of digits by the same symbol, collapsing
                  whitespace, setting it to lowercase.
                * normalizing the vertical coordinate of the line (`y0` bucketed into bands of 15 units)
                * filling the header dictionary: key = tuple(normalized text, normalized coordinate), value = list
                  to which the real line is appended.
                * this way, each normalized tuple has a certain number of lines associated with it.
            * same process applied to bottom lines (footers).
        * final sets: keeping only the header and footer keys that are frequent enough (more lines than 45% of
          the number of pages)
        * identifying changing headers (retrieving infrequent headers that still recur on runs of consecutive
          or alternating pages and adding them to the final header set)
        * identifying changing footers
        * for each Line in the final sets
            * assigning the kind "Header" or "Footer" to the Line.
        * once the headers/footers are found, the mapping between logical pages and physical pages
          is made (`find_page_numbering()`):
            * extracting all numbers from each header line and checking that there are numbers that are in the same
              position and increasing.

        """
        # TODO : maybe we should rewrite this method to handle only one kind of part
        # and call it twice, once for header and once for footer
        # and maybe for side parts (left and right sides) present in some documents
        npages = len(self.pages)

        if npages < 10:
            return

        median_height = np.median(np.array([p.height for p in self.pages]))
        # median_width = np.median(np.array([p.width for p in self.pages]))

        headers = defaultdict(list)
        footers = defaultdict(list)

        def normalize(content):
            # we abstract on number parts
            # assuming potential page numbers are given as arabic numbers
            content = re.sub(r"\d+", r"<<NUM>>", content)
            content = re.sub(r"\s+", r" ", content)
            content = content.lower()
            return content
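        # Worked example (hypothetical line, for illustration only):
        #   normalize("Page 12  of 30") gives "page <<num>> of <<num>>"
        #   (the placeholder itself is lowercased by the final `lower()` call).
        # With y0 = 42.3, round(42.3 / 15) = 3, so the line is stored under the key
        # ("page <<num>> of <<num>>", 3), together with every other line sharing that pattern and band.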

        for page in self.pages:
            # we assume that headers and footers are relatively short (narrower than 60% of the page width,
            # despite the variable name `half_width`) and occur only in the top and bottom 15% of pages
            # we use local page height because of distinct page orientation

            top_page = page.height * 0.15
            bot_page = page.height * 0.85
            half_width = page.width * 0.6

            df = page.df.sort_values(by="y0")

            for line in list(df.head(10).line.values):
                if line.y1 < top_page and line.width < half_width:
                    headers[(normalize(line.content), round(line.y0 / 15))].append(line)
                    # logger.debug(f"page {page.id} potential header line {line.y0} {line}")
            for line in list(df.tail(10).line.values):
                if line.y0 > bot_page and line.width < half_width:
                    footers[(normalize(line.content), round(line.y0 / 15))].append(line)
        # we use a threshold of 0.45 because we may have different headers and footers for odd and even pages
        # the current version does not handle headers/footer varying along sections/chapters/...
        header_set = set((header for header in headers if len(headers[header]) / npages > 0.45))
        footer_set = set((footer for footer in footers if len(footers[footer]) / npages > 0.45))
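        # Worked example (hypothetical counts): in a 50-page document, a key collecting 23 lines is kept
        # (23 / 50 = 0.46 > 0.45), while a key collecting 22 lines is not (22 / 50 = 0.44).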

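        def identify_changing_header_or_footer(candidates, keep_set):
            """
            Checks whether the headers (or footers) that were not kept in the final set are actually worth keeping:
            sometimes the content of a header varies along the document (odd pages might carry a different header
            from even pages, for instance).

            A candidate is added to `keep_set` when its lines occur at a similar height (less than 10 units from
            the start of the run) on a run of consecutive pages, or of every other page, spanning more than 10 pages.
            """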
            # going through each normalized candidate
            for candidate in candidates:
                if candidate in keep_set:
                    continue
                stats = [None, None, None]

                def update(page, y, where, step=1):
                    if stats[where] is not None and stats[where][1] == page - step and abs(stats[where][2] - y) < 10:
                        stats[where][1] = page

                        if stats[where][1] - stats[where][0] > 10:
                            keep_set.add(candidate)
                            return True
                    else:
                        stats[where] = [page, page, y]

                for line in candidates[candidate]:
                    page = line.page
                    y = round(line.y0)
                    if update(page, y, where=0, step=1) or update(page, y, where=1 + (page % 2), step=2):
                        break

        identify_changing_header_or_footer(headers, header_set)
        identify_changing_header_or_footer(footers, footer_set)

        logger.debug(f"headers: {header_set}")
        logger.debug(f"footers: {footer_set}")
        # we type header/footer lines
        for header in header_set:
            for line in headers[header]:
                line.kind = Header()
                line.where = "header"
                # logger.debug(f"Page {line.page} handling header line {header=} {line}")
        for footer in footer_set:
            for line in footers[footer]:
                line.kind = Footer()
                line.where = "footer"

        # we can use elements in header_set/footer_set to identify the one carrying the page number
        # TODO : take this function out to make the main function more readable.
        def find_page_numbering(lines, template, y=None):
            """For a given header template and its associated lines, extracts the potential logical page number for each
            line and maps it to the physical page."""
            distincts = set(line.content for line in lines)
            if len(distincts) > 0.8 * len(lines):
                logger.info(f"potential page number in {template} {y=} {len(distincts)} lines out of {len(lines)}")
                info: Dict[Any, List[Any]] = defaultdict(list)
                for line in lines:
                    logger.debug(f"extract page number from {line.page} {line}")
                    for i, num in enumerate(re.findall(r"(\d+)", line.content)):
                        num = int(num)
                        # num is the potential logical page, line.page is the physical page
                        # i is the position of "num" in the text
                        # info[i][-1][0] = the previous "num" at the same position, taken from the previous header.
                        # That number should be smaller than the current "num".
                        if i not in info or info[i][-1][0] < num:
                            info[i].append((num, line.page))
                        else:
                            logger.debug(
                                f"strange page number in template {template} {i=} {num=} prev={info[i][-1][0]} {line.page} {line}"  # noqa: E501
                            )
                logger.debug(f"{info=}")
                for i in info:
                    logger.debug(f"{i=} {len(info[i])=} {len(distincts)}")
                    if len(info[i]) == len(distincts):
                        logger.info(f"page number in {template} as {i}th number")
                        for logical_pageid, physical_pageid in info[i]:
                            self.page_mapping[logical_pageid] = physical_pageid
                            self.pages[physical_pageid].page_number = logical_pageid

        for header in header_set:
            find_page_numbering(headers[header], header[0], y=header[1])
        for footer in footer_set:
            find_page_numbering(footers[footer], footer[0], y=footer[1])

        # use links on ToC entries to set page mapping
        # should check that the physical page numbers are correct versus sub-documents
        # also, that the corresponding section is indeed present on the target page

        # Collecting all Line.kind = SecIndex in Document
        tdm = list(self.tdm)
        if tdm:
            npages = len(self.pages)
            # entry is a Line object
            for entry in tdm:
                for link in entry.links:
                    # id is the logical page retrieved by the regex
                    # here, physical id is the page number provided in the link (PyMuPdf)
                    id = entry.kind.page
                    physical_id = link[0]
                    logger.debug(f"checking page mapping vs links {id=} {physical_id=}")
                    # if the link goes beyond the page scope of the document, we ignore it
                    if physical_id >= npages:
                        continue
                    # checking if the logical page is indeed in page_mapping.
                    # checking if the physical page contains a section title similar
                    # to the TOC entry (comparing numeration)
                    # handling mismatches (updating the mapping)
                    if id not in self.page_mapping:
                        for line in self.pages[physical_id].sections:
                            if line.kind.numbering == entry.kind.numbering:
                                logger.info(f"using link to set page mapping {id} to {physical_id}")
                                self.page_mapping[id] = physical_id
                                self.pages[physical_id].page_number = id
                                break
                    elif physical_id != self.page_mapping[id]:
                        logger.warning(
                            f"page mapping mismatch for {id} between link={physical_id} and current={self.page_mapping[id]}"  # noqa: E501
                        )
                        found = False
                        for line in self.pages[physical_id].sections:
                            logger.debug(f"compare {entry} with {line}")
                            if line.kind.numbering == entry.kind.numbering:
                                logger.info(f"using link to update page mapping {id} to {physical_id}")
                                self.page_mapping[id] = physical_id
                                self.pages[physical_id].page_number = id
                                found = True
                                break
                        if not found:
                            logger.info(f"keeping existing page mapping for {id}")

        # filling gaps in page mapping
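        # Worked example (hypothetical values): with 11 physical pages (0 to 10) and
        # page_mapping = {3: 4, 7: 8} (logical -> physical):
        #   * internal gap: 7 - 3 == 8 - 4, so logical pages 4, 5, 6 are mapped to physical pages 5, 6, 7
        #   * initial gap: logical pages 1, 2 are mapped to physical pages 2, 3
        #   * final gap: physical pages 9, 10 receive logical pages 8, 9
        # physical pages 0 and 1 stay unmapped since no logical page number below 1 is generated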
        logical_pages = sorted(list(self.page_mapping.keys()))
        logger.debug(f"mapped logical page ids {logical_pages}")
        prev = None

        # filling internal gaps
        for logical_pageid in logical_pages:
            if (
                prev is not None
                # finding a gap between logical pages (difference greater than 1)
                and logical_pageid - prev > 1
                # the delta should be the same between logical pages and physical pages
                and self.page_mapping[logical_pageid] - self.page_mapping[prev] == logical_pageid - prev
            ):
                for id in range(prev + 1, logical_pageid):
                    # filling the gap : a new key-value pair is added to page mapping
                    physical_id = self.page_mapping[prev] + id - prev
                    logger.debug(f"in middle gap, mapping logical page {id} to page {physical_id}")
                    self.page_mapping[id] = physical_id
                    self.pages[physical_id].page_number = id
            prev = logical_pageid

        # also filling what has not been detected before the first mapped page
        if logical_pages and logical_pages[0] > 1:
            first_mapped_page = logical_pages[0]
            first_mapped_physical_page = self.page_mapping[first_mapped_page]
            for id in range(1, first_mapped_page):
                physical_id = first_mapped_physical_page + id - first_mapped_page
                if physical_id >= 0:
                    logger.debug(f"in initial gap, mapping logical page {id} to page {physical_id}")
                    self.page_mapping[id] = physical_id
                    self.pages[physical_id].page_number = id

        # also filling what has not been detected after the last mapped page
        if logical_pages and self.page_mapping[logical_pages[-1]] < len(self.pages):
            last_mapped_page = logical_pages[-1]
            last_mapped_physical_page = self.page_mapping[last_mapped_page]
            for physical_id in range(last_mapped_physical_page + 1, len(self.pages)):
                id = last_mapped_page + physical_id - last_mapped_physical_page
                logger.debug(f"in final gap, mapping logical page {id} to page {physical_id}")
                self.page_mapping[id] = physical_id
                self.pages[physical_id].page_number = id

        # we can use elements in header_set/footer_set to potentially identify top and bottom coordinates
        # the final loops go over all candidate lines (not only those kept in the final sets), so that lines lying
        # above the header separator or below the footer separator are also labeled
        if header_set:
            page_tops = defaultdict(list)
            for i, header in enumerate(header_set):
                for line in headers[header]:
                    page_tops[line.page].append(round(line.y1))
            all_tops = Counter([max(tops) for tops in page_tops.values()])
            top = all_tops.most_common(1)[0][0] + 5 if all_tops else 0
            logger.debug(f"header sep line at {top} for median height={median_height} all_tops={all_tops}")
            for lines in headers.values():
                for line in lines:
                    if line.y1 <= top and type(line.kind) is not Header:
                        logger.debug(f"Page {line.page} setting line {line} to header")
                        line.kind = Header()
                        line.where = "header"

        if footer_set:
            page_bots = defaultdict(list)
            for footer in footer_set:
                for line in footers[footer]:
                    page_bots[line.page].append(round(line.y0))
            all_bots = Counter([min(bots) for bots in page_bots.values()])
            bot = all_bots.most_common(1)[0][0] - 5 if all_bots else 10000
            logger.debug(f"bottom sep line at {bot} for median height={median_height} all_bots={all_bots}")
            for lines in footers.values():
                for line in lines:
                    if line.y0 >= bot and type(line.kind) is not Footer:
                        logger.debug(f"Page {line.page} setting line {line} to footer")
                        line.kind = Footer()
                        line.where = "footer"

    def filter_other_indexes(self):
        """Discards caption ToC entries (OtherIndex) whose intro occurs 10 times or fewer in the document.
        Also warns when a remaining caption ToC entry is not found on its target page."""
        others = defaultdict(list)
        for line in self.collect(kind=OtherIndex):
            others[line.kind.intro].append(line)
        other_set = set((other for other in others if len(others[other]) > 10))
        logger.debug(f"other indexes: {other_set}")
        # delete rare other-like entries
        for other in others:
            if other not in other_set:
                for line in others[other]:
                    line.kind = None
        # handle other entries to retrieve them on pages
        if self.page_mapping:
            for other in other_set:
                for line in others[other]:
                    if type(line.kind) is not OtherIndex:
                        continue
                    found = False
                    page_number = line.kind.page
                    if page_number in self.page_mapping:
                        page_id = self.page_mapping[page_number]
                        for xline in self.pages[page_id].lines:
                            if xline.check_with_other_index(line):
                                found = True
                                logger.info(f"Page {page_id} found section for ToC entry {xline}")
                                break
                        if not found:
                            logger.warning(f"missing caption for ToC entry {line}")

    def process_indexes(self, indexes):
        """Stores the detected SecIndex lines in ._tdm and ._tdm_light.

        The keys of ._tdm are the normalized version of the SecIndex:
            - tuple(normalized numbering, normalized body of text)
        The keys of ._tdm_light are the normalized numbering alone.
        """
        for line in indexes:
            numbering, title = index = line.kind.index
            self._tdm[index].append(line)
            # self._tdm_light[line.kind.numbering].append(line)
            self._tdm_light[numbering].append(line)

    def is_section_in_index(self, section):
        """Checks whether a normalized Section is present in the ToC (always True when there is no ToC)."""
        if self._tdm:
            return section.index in self._tdm
        else:
            return True

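    def check_and_filter_sections(self):
        """Filters the collected potential sections.

        A section is kept when it matches a ToC entry (on the expected page when a page mapping is available),
        or when it fits a numbering series or relates to an already accepted sibling or parent section.
        Other candidates are discarded (their kind is reset to None) and tables are re-detected on changed pages.
        """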
        if not hasattr(self, "deleted_sections"):
            self.deleted_sections: List[Any] = []
        else:
            for line in self.deleted_sections:
                logger.debug(f"Page {line.page} potential section {line} already deleted")
        self._tdm = defaultdict(list)
        self._tdm_light = defaultdict(list)
        self.process_indexes(self.tdm)
        tdm = self._tdm
        tdm_light = self._tdm_light
        if not hasattr(self, "_tdm_light_extra"):
            self._tdm_light_extra: Dict[Any, Any] = defaultdict(list)
        else:
            logger.debug(f"{len(self._tdm_light_extra)} sections already checked")

        # tdm_light_extra collects accepted sections that have no associated ToC entries
        # used to anchor other sections
        # indexed by normalized numbering
        tdm_light_extra = self._tdm_light_extra
        reduced2full: Dict[Any, Any] = {}
        # list of accepted sections without ToC entries
        ordered_extra_sections: List[Any] = []

        def register_in_extra(numbering, line):
            sections[line.kind.index] = line
            tdm_light_extra[numbering].append(line)
            ordered_extra_sections.append(line)
            numbering_alt = line.kind.components.format_without_last_post()
            if numbering_alt != numbering:
                reduced2full[numbering_alt] = numbering
            logger.debug(f"Registered {numbering} in tdm_light_extra with {line}")

        sections = {}
        extra = []
        nsections = 0

        if not tdm:
            extra = list(self.sections)
            nsections = len(extra)
        else:
            if self.page_mapping:

                def check_page(entry, line):
                    page_number = entry.kind.page
                    if page_number in self.page_mapping:
                        page_id = self.page_mapping[page_number]
                        return page_id == line.page
                    else:
                        return False

            else:

                def check_page(entry, line):
                    return True

            def get_acceptable_entry(section):
                numbering, title = index = section.kind.index
                if index in tdm:
                    return [entry for entry in tdm[index] if check_page(entry, section)]
                elif numbering in tdm_light:
                    return [entry for entry in tdm_light[numbering] if check_page(entry, section)]
                else:
                    return []

            logger.debug(f"{len(tdm)} entries in ToC")
            if "" in tdm_light:
                for entry in tdm_light[""]:
                    logger.info(f"special ToC entry {entry}")
            # missing = []
            extra = []
            nsections = 0
            for line in self.sections:
                numbering, title = index = line.kind.index
                nsections += 1
                potential_entries = get_acceptable_entry(line)
                logger.debug(f"checking section {numbering=} {line} #potential={len(potential_entries)}")
                # special sections:
                if not numbering:
                    logger.info(f"special section {line}")

                if index in tdm and potential_entries:
                    logger.debug(f"found in tdm {line} in {potential_entries}")
                    sections[index] = line
                elif index in tdm:
                    logger.debug(f"found in tdm {line} but not at the right page")
                    extra.append(line)
                else:
                    logger.debug(f"not found in tdm {line} ")
                    extra.append(line)

        logger.warning(f"{nsections} potential sections, {len(extra)} not in ToC")
        extra2 = []
        changed = set([])

        def potential_serie(k, prev=None):
            """Checks whether extra[k] may start a numbering series: the next candidate (sharing the same intro,
            if any) should be either its first child or its next sibling."""
            candidate = extra[k]
            seqnum = candidate.kind.components
            current = seqnum.normalized
            if False and not prev and not seqnum.is_last_comp_first():
                return False
            if k + 1 == len(extra):
                return False
            next_candidate = None
            if candidate.kind.intro:
                for j in range(k + 1, len(extra)):
                    next_candidate = extra[j]
                    if next_candidate.kind.intro == candidate.kind.intro:
                        break
            if not next_candidate:
                next_candidate = extra[k + 1]
            logger.debug(f"try potential serie {candidate} against {next_candidate} prev={prev}")
            next_seqnum = next_candidate.kind.components
            next_prev = next_seqnum.prev().normalized
            next_up = next_seqnum.up().normalized
            next_seqnum.restore()
            logger.debug(f"{current=} {next_prev=} {next_up=} {next_seqnum.normalized}")
            if next_seqnum.level > 1 and next_up == current and next_seqnum.is_last_comp_first():
                return True
            elif next_prev == current:
                return candidate.kind.intro or potential_serie(k + 1, prev=current)
            else:
                return False

        for k, line in enumerate(extra):
            numbering, title = index = line.kind.index
            logger.debug(f"handling0 extra section level={numbering} {line}")
            if numbering not in tdm_light or not any(
                lline.kind.index[1].startswith(title) for lline in tdm_light[numbering]
            ):
                seqnum = line.kind.components
                current = seqnum.normalized  # actually, should be the same as numbering

                logger.debug(f"handling extra section level={seqnum.level} {line}")

                if (  # seqnum.level == 1
                    numbering not in tdm_light
                    # and numbering not in tdm_light_extra
                    and potential_serie(k)
                ):
                    logger.debug(f"Page {line.page} {line} accepted as section because of starting serie")
                    register_in_extra(numbering, line)
                    continue

                elif (
                    (seqnum.level > 1 or line.kind.intro) and numbering not in tdm_light
                    # and numbering not in tdm_light_extra
                ):

                    def check_related(related, kind="sibling"):
                        # we test if the related section is in the ToC or has been previously added
                        # and check page numbers
                        to_be_tried = [related]
                        if related in reduced2full:
                            to_be_tried.append(reduced2full[related])
                        for related in to_be_tried:
                            logger.debug(
                                f"check related numbering {related=} in_extra={related in tdm_light_extra} in_tdm={related in tdm_light}"  # noqa: E501
                            )

                            if related in tdm_light_extra:
                                for entry in tdm_light_extra[related][::-1]:
                                    if entry.page <= line.page and (
                                        line.page - entry.page < 30
                                        or (ordered_extra_sections and line.page - ordered_extra_sections[-1].page < 30)
                                    ):
                                        logger.debug(
                                            f"Page {line.page} {line} accepted as section because of section {kind} {related}"  # noqa: E501
                                            + f" on page {entry.page}"
                                        )
                                        register_in_extra(numbering, line)
                                        return True

                            if related in tdm_light:
                                for entry in tdm_light[related][::-1]:
                                    entry_page = entry.kind.page
                                    if self.page_mapping and entry_page in self.page_mapping:
                                        entry_page = self.page_mapping[entry_page]
                                    logger.debug(
                                        f"check vs ToC entry {entry} entry_page={entry.kind.page}/{entry_page} section_page={line.page} {self.page_mapping}"  # noqa: E501
                                    )
                                    if entry_page <= line.page and (
                                        line.page - entry_page < 30
                                        or (ordered_extra_sections and line.page - ordered_extra_sections[-1].page < 30)
                                    ):
                                        logger.debug(
                                            f"Page {line.page} {line} accepted as section because of entry {kind} {related} expected on page {entry_page}"  # noqa: E501
                                        )
                                        register_in_extra(numbering, line)
                                        return True
                        return False

                    other = seqnum.prev().normalized  # sibling (seqnum is modified when using navigation methods)
                    other2 = seqnum.prev().normalized  # sibling of sibling
                    other3 = seqnum.up().normalized  # parent
                    other4 = seqnum.up().normalized  # grand-parent
                    seqnum.restore()
                    logger.debug(f"test related to current {current}: {other=} {other2=} {other3=} {other4=}")
                    if current != other:
                        # if current has a potential previous sibling, we test it
                        logger.debug(f"test sibling related between {current} and {other}: {other in tdm_light_extra}")
                        if check_related(other):
                            continue
                        if other2 != other:
                            # we accept a gap of 1 by testing sibling of sibling
                            if check_related(other2, kind="sibling-gap"):
                                continue
                    else:
                        # otherwise we test its parent
                        if check_related(other3, kind="parent") or check_related(other3[:-1], kind="parent"):
                            continue
                        # and if deep enough, we accept a gap by testing its grand-parent !
                        if seqnum.level > 3 and (
                            check_related(other4, kind="grand-parent")
                            or check_related(other4[:-1], kind="grand-parent")
                        ):
                            continue

                logger.debug(f"Page {line.page} discarding section case1 {line}")
                self.deleted_sections.append(line)
                extra2.append(line)
            else:
                # the candidate section has a potential entry in ToC
                found = False
                for entry in tdm_light[numbering]:
                    if entry.kind.index[1].startswith(title) and check_page(entry, line):
                        sections[entry.kind.index] = line
                        tdm_light_extra[numbering].append(line)
                        logger.debug(f"case3 extra section associate {line} to tdm {entry}")
                        found = True
                        break
                if not found:
                    logger.debug(f"Page {line.page} discarding section case2 {line}")
                    self.deleted_sections.append(line)
                    extra2.append(line)

        if self.page_mapping:
            for index in tdm:
                logger.debug(f"Processing ToC entry index {index}")
                if index not in sections:
                    # sometimes there are several ToC for the same entries
                    # (a generic one, and one per chapter for instance)
                    # we try not to check for duplicate entries
                    # However, several ToC entries may be associated with distinct pages
                    # (e.g. a section Introduction at the beginning of each chapter)
                    logger.debug(f"handling ToC entry index {index} without associated section")
                    pages = set([])
                    for line in tdm[index]:
                        page_number = line.kind.page
                        logger.debug(f"handling ToC entry {line}")
                        if page_number in pages:
                            logger.debug("Duplicated ToC entry: skipping")
                            continue
                        pages.add(page_number)
                        if page_number in self.page_mapping:
                            page_id = self.page_mapping[page_number]
                            logger.debug(f"searching on physical page {page_id}")
                            found = False
                            for xline in self.pages[page_id].lines:
                                if (
                                    isinstance(xline.kind, Section)
                                    and line.kind.numbering
                                    and xline.kind.numbering == line.kind.numbering
                                ):
                                    logger.info(f"Page {page_id} relating section {xline} for ToC entry {line}")
                                    found = True
                                    # remove from sections to be deleted
                                    self.deleted_sections = [
                                        _line for _line in self.deleted_sections if _line is not xline
                                    ]
                                    break
                                elif xline.check_with_section_index(line):
                                    logger.info(f"Page {page_id} adding section {xline} for ToC entry {line}")
                                    changed.add(xline.page)
                                    found = True
                                    break
                            if not found:
                                for xline in self.pages[page_id].lines:
                                    if isinstance(xline.kind, Section):
                                        continue
                                    elif xline.light_check_with_section_index(line):
                                        logger.info(f"Page {page_id} adding light section {xline} for ToC entry {line}")
                                        changed.add(xline.page)
                                        found = True
                                        break
                            if not found:
                                logger.warning(f"missing section for ToC entry {line}")

        for line in self.deleted_sections:
            logger.warning(f"Page {line.page} section {line} not found in ToC: deleting")
            line.kind = None
            changed.add(line.page)

        for page_id in changed:
            logger.debug(f"Updating page {page_id}")
            self.pages[page_id].detect_tables()

    def models(self, kind=Section):
        return Counter(line.kind.model for line in self.collect(kind=kind)).most_common()

    def section_models(self):
        return self.models()

    def tdm_models(self):
        return self.models(kind=SecIndex)

    def other_index_models(self):
        return self.models(kind=OtherIndex)

    def other_models(self):
        return self.models(kind=Other)

    def display(self, use_logger=False):
        local_print: Callable[..., None] = (
            cast(Callable[..., None], logger.info) if use_logger else cast(Callable[..., None], print)
        )
        logger.info("Statistics")
        npages = len(self.pages)
        nempty = len([p for p in self.pages if not p.lines])
        local_print(f"#pages={npages} #empty={nempty} ratio={nempty / (npages + 0.001):.1%}")
        tdm = list(self.tdm)
        nentries = len(tdm)
        ntocpages = len(set(entry.page for entry in tdm))
        local_print(f"#toc={nentries} on #pages={ntocpages}")
        nsections = len(list(self.sections))
        local_print(f"#sections={nsections}")
        ncaptions = len(list(self.captions))
        local_print(f"#captions={ncaptions}")
        logger.info("ToC models")
        local_print(self.tdm_models())
        logger.info("Section models")
        local_print(self.section_models())
        logger.info("Caption ToC models")
        local_print(self.other_index_models())
        logger.info("Caption models")
        local_print(self.other_models())
        logger.info("Sections")
        for section in self.sections:
            local_print(f"Page {section.page} ({section.kind.numbering}) {str(section)}")
        logger.info("Captions")
        for caption in self.captions:
            local_print(str(caption))
        if not use_logger:
            logger.info("Displaying all pages")
            for page in self.pages:
                logger.debug(f"Page {page.id} display")
                page.display()

    def display_html_overview(self):
        html_display = []
        models = [
            ("TOC Models", self.tdm_models()),
            ("Section Models", self.section_models()),
            ("Caption TOC Models", self.other_index_models()),
            ("Caption models", self.other_models()),
        ]
        sections = [section for section in self.sections]
        captions = [section for section in self.captions]

        def make_table(lst):
            table = []
            styles = [("style", "width:30%;"), ("border", "1")]
            ths = [make_html("Model", "th"), make_html("Frequency", "th")]
            table.append(make_html("\n".join(ths), "tr", parent=True))
            for t in lst:
                td1, td2 = make_html(t[0], "td"), make_html(str(t[1]), "td")
                tr = [td1, td2]
                table.append(make_html("\n".join(tr), "tr", parent=True))
            table = make_html("\n".join(table), "table", parent=True, attval=styles)
            return table

        def make_list(lst):
            ul = []
            for t in lst:
                txt = t.content
                attval = [("style", HTML_COLORS[t.kind.color])]
                li = make_html(txt, tag="li", attval=attval)
                ul.append(li)
            ul = make_html("\n".join(ul), "ul", parent=True)
            return ul

        def make_footers():
            footer_div = []
            href_1 = make_html(
                text="Page précédente",
                tag="a",
                attval=[("href", f"./page_{len(self)}.html"), ("target", "_self")],
            )
            href_2 = make_html(
                text="Page suivante",
                tag="a",
                attval=[("href", "./page_1.html"), ("target", "_self")],
            )
            subdiv1 = make_html(
                text=href_1,
                tag="div",
                attval=[("style", "position:absolute; left:0%; width:50%; height:100%;")],
                parent=True,
            )
            subdiv2 = make_html(
                text=href_2,
                tag="div",
                attval=[("style", "position:absolute; left:50%; width:50%; height:100%;")],
                parent=True,
            )

            footer_div.append(subdiv1)
            footer_div.append(subdiv2)
            footer_div = make_html(
                text="\n".join(footer_div),
                tag="div",
                attval=[("style", "width:300px;height:100px;position:relative")],
                parent=True,
            )

            return footer_div

        for m in models:
            title, lst = m
            html_display.append(make_html(title, tag="h2"))
            html_display.append(make_table(lst))

        html_display.append(make_html("Sections", tag="h2"))
        html_display.append(make_list(sections))
        html_display.append(make_html("Captions", tag="h2"))
        html_display.append(make_list(captions))
        html_display.append(make_footers())
        html = make_html(text="\n".join(html_display), tag="html", parent=True)
        return html

    def to_html(self, path="./pages", pdf=None):
        """
        Produces an HTML version of the Document (1 file per page)
        """
        os.makedirs(path, exist_ok=True)
        with open(f"{path}/overview.html", "w", encoding="utf8", newline="\n") as inf1:
            inf1.write(self.display_html_overview())
        for i, page in enumerate(self.pages):
            with open(f"{path}/page_{str(i + 1)}.html", "w", encoding="utf8", newline="\n") as inf:
                inf.write(page.display_html())

    def get_logical_section(self, section=None, index=None):
        logger.info(f"try get_logical_section {section} {index}")
        if index is not None:
            return LogicalSection(doc=self, title=list(self.sections)[index], sections=list(self.sections))
        elif section is not None:
            return LogicalSection(doc=self, title=section, sections=list(self.sections))
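
The `models()` helper in the listing above boils down to a `Counter` over the `model` attribute of each collected marker. The sketch below illustrates only that counting step; the `Kind` class and the model strings are hypothetical stand-ins, not part of pdfstruct.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Kind:
    # Hypothetical stand-in for a Section/SecIndex marker; only `model` matters here.
    model: str


def count_models(kinds):
    """Return (model, frequency) pairs, most frequent first."""
    return Counter(kind.model for kind in kinds).most_common()


# Invented model strings, for illustration only.
kinds = [Kind("1.1"), Kind("1.1"), Kind("A."), Kind("1.1"), Kind("A.")]
result = count_models(kinds)
print(result)
```

The result is a list of tuples, which is the shape that `make_table()` reads as `t[0]` (model) and `t[1]` (frequency).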

`caption_tdm` `property`

Collects all the caption TOC entries (OtherIndex) of the Document. It is the equivalent of a Table of Contents for captions.

`captions` `property`

Collects all the captions (Other) of the Document.

`cells` `property`

Collects all the Lines that are part of a table (Cell).

`sections` `property`

Collects all section titles (Section) of the Document.

`tdm` `property`

Collects all the section TOC entries (SecIndex) of the Document. It is the equivalent of a Table of Contents.

`__getitem__(n)`

For a given logical page number, returns the Page stored at the physical index associated with it, if the number is present in the page_mapping attribute dictionary.

Parameters:

Name	Type	Description	Default
`n`	`int`	the logical page number	required

Returns:

Type	Description
`Page`	the Page at the mapped physical index if available, else None.

Source code in pdfstruct/document.py

def __getitem__(self, n):
    """
    For a given logical page number, returns the Page stored at the physical index associated with it,
    if the number is present in the `page_mapping` attribute dictionary.

    Args:
        n (int): the logical page number

    Returns:
        (Page): the Page at the mapped physical index if available, else None.
    """
    return self.pages[self.page_mapping[n]] if n in self.page_mapping else None
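
As a minimal sketch of this lookup, the example below reads `page_mapping` as a dictionary from logical page number to physical page index, which is how `find_page_numbering()` fills it in the `detect_header_and_footer()` source. `MiniDocument` and its string "pages" are hypothetical stand-ins for the real classes.

```python
class MiniDocument:
    """Hypothetical stand-in for Document, reduced to the lookup logic."""

    def __init__(self, pages, page_mapping):
        self.pages = pages                # list indexed by physical page id
        self.page_mapping = page_mapping  # logical page number -> physical page id

    def __getitem__(self, n):
        return self.pages[self.page_mapping[n]] if n in self.page_mapping else None

    def __len__(self):
        return len(self.pages)


# Two unnumbered front-matter pages, then logical pages 1 and 2.
doc = MiniDocument(
    pages=["cover", "toc", "intro", "chapter 1"],
    page_mapping={1: 2, 2: 3},
)
first = doc[1]    # logical page 1 lives at physical index 2
missing = doc[7]  # no mapping for logical page 7
```

Returning None for an unmapped number, rather than raising `KeyError`, lets callers probe page numbers taken from ToC entries without wrapping each access in a `try` block.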

`__len__()`

Returns:

Type	Description
`int`	the number of pages in the Document.

Source code in pdfstruct/document.py

def __len__(self):
    """
    Returns:
        (int): the number of pages in the Document.
    """
    return len(self.pages)

`add_page(content, links=None, images=None)`

The equivalent of an .append() list method: the Page is added at the end of the .pages attribute. The Page object is instantiated here (and consequently all its Line objects). The Lines that are SecIndex (ToC entries) are also saved in the Document Table of Contents (process_indexes()).

Source code in pdfstruct/document.py

def add_page(self, content, links=None, images=None):
    """The equivalent of an `.append()` list method : the Page is added at the end of the .pages attribute.
    The Page object is instantiated here (and consequently all its Line objects).
    The Lines that are SecIndex (Toc entries) are also saved in the Document Table of Contents (`process_indexes()`)
    """
    page = Page(
        len(self),
        content,
        prev=self.pages[-1] if self.pages else None,
        doc=self,
        links=links,
        images=images,
        ocr=self.ocr,
    )
    self.pages.append(page)
    self.process_indexes(page.tdm)

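`check_and_filter_sections()`

Filters the candidate sections collected so far. A candidate is kept when it matches a ToC entry (on the expected page when a page mapping is available), when it starts a plausible numbering series, or when it follows an already accepted sibling or parent section; the other candidates are discarded.

Source code in pdfstruct/document.py

def check_and_filter_sections(self):
    """Filters the candidate sections collected so far.

    A candidate is kept when it matches a ToC entry (on the expected page when a page mapping is available),
    when it starts a plausible numbering series, or when it follows an already accepted sibling or parent
    section; the other candidates are discarded.
    """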
    if not hasattr(self, "deleted_sections"):
        self.deleted_sections: List[Any] = []
    else:
        for line in self.deleted_sections:
            logger.debug(f"Page {line.page} potential section {line} already deleted")
    self._tdm = defaultdict(list)
    self._tdm_light = defaultdict(list)
    self.process_indexes(self.tdm)
    tdm = self._tdm
    tdm_light = self._tdm_light
    if not hasattr(self, "_tdm_light_extra"):
        self._tdm_light_extra: Dict[Any, Any] = defaultdict(list)
    else:
        logger.debug(f"{len(self._tdm_light_extra)} sections already checked")

    # tdm_light_extra collects accepted sections that have no associated ToC entries
    # used to anchor other sections
    # indexed by normalized numbering
    tdm_light_extra = self._tdm_light_extra
    reduced2full: Dict[Any, Any] = {}
    # list of accepted sections without ToC entries
    ordered_extra_sections: List[Any] = []

    def register_in_extra(numbering, line):
        sections[line.kind.index] = line
        tdm_light_extra[numbering].append(line)
        ordered_extra_sections.append(line)
        numbering_alt = line.kind.components.format_without_last_post()
        if numbering_alt != numbering:
            reduced2full[numbering_alt] = numbering
        logger.debug(f"Registered {numbering} in tdm_light_extra with {line}")

    sections = {}
    extra = []
    nsections = 0

    if not tdm:
        extra = list(self.sections)
        nsections = len(extra)
    else:
        if self.page_mapping:

            def check_page(entry, line):
                page_number = entry.kind.page
                if page_number in self.page_mapping:
                    page_id = self.page_mapping[page_number]
                    return page_id == line.page
                else:
                    return False

        else:

            def check_page(entry, line):
                return True

        def get_acceptable_entry(section):
            numbering, title = index = section.kind.index
            if index in tdm:
                return [entry for entry in tdm[index] if check_page(entry, section)]
            elif numbering in tdm_light:
                return [entry for entry in tdm_light[numbering] if check_page(entry, section)]
            else:
                return []

        logger.debug(f"{len(tdm)} entries in ToC")
        if "" in tdm_light:
            for entry in tdm_light[""]:
                logger.info(f"special ToC entry {entry}")
        # missing = []
        extra = []
        nsections = 0
        for line in self.sections:
            numbering, title = index = line.kind.index
            nsections += 1
            potential_entries = get_acceptable_entry(line)
            logger.debug(f"checking section {numbering=} {line} #potential={len(potential_entries)}")
            # special sections:
            if not numbering:
                logger.info(f"special section {line}")

            if index in tdm and potential_entries:
                logger.debug(f"found in tdm {line} in {potential_entries}")
                sections[index] = line
            elif index in tdm:
                logger.debug(f"found in tdm {line} but not at the right page")
                extra.append(line)
            else:
                logger.debug(f"not found in tdm {line} ")
                extra.append(line)

    logger.warning(f"{nsections} potential sections, {len(extra)} not in ToC")
    extra2 = []
    changed = set([])

    def potential_serie(k, prev=None):
        candidate = extra[k]
        seqnum = candidate.kind.components
        current = seqnum.normalized
        if False and not prev and not seqnum.is_last_comp_first():
            return False
        if k + 1 == len(extra):
            return False
        next_candidate = None
        if candidate.kind.intro:
            for j in range(k + 1, len(extra)):
                next_candidate = extra[j]
                if next_candidate.kind.intro == candidate.kind.intro:
                    break
        if not next_candidate:
            next_candidate = extra[k + 1]
        logger.debug(f"try potential serie {candidate} against {next_candidate} prev={prev}")
        next_seqnum = next_candidate.kind.components
        next_prev = next_seqnum.prev().normalized
        next_up = next_seqnum.up().normalized
        next_seqnum.restore()
        logger.debug(f"{current=} {next_prev=} {next_up=} {next_seqnum.normalized}")
        if next_seqnum.level > 1 and next_up == current and next_seqnum.is_last_comp_first():
            return True
        elif next_prev == current:
            return candidate.kind.intro or potential_serie(k + 1, prev=current)
        else:
            return False

    for k, line in enumerate(extra):
        numbering, title = index = line.kind.index
        logger.debug(f"handling0 extra section level={numbering} {line}")
        if numbering not in tdm_light or not any(
            lline.kind.index[1].startswith(title) for lline in tdm_light[numbering]
        ):
            seqnum = line.kind.components
            current = seqnum.normalized  # actually, should be the same as numbering

            logger.debug(f"handling extra section level={seqnum.level} {line}")

            if (  # seqnum.level == 1
                numbering not in tdm_light
                # and numbering not in tdm_light_extra
                and potential_serie(k)
            ):
                logger.debug(f"Page {line.page} {line} accepted as section because of starting serie")
                register_in_extra(numbering, line)
                continue

            elif (
                (seqnum.level > 1 or line.kind.intro) and numbering not in tdm_light
                # and numbering not in tdm_light_extra
            ):

                def check_related(related, kind="sibling"):
                    # we test if the related section is in the ToC or has been previously added
                    # and check page numbers
                    to_be_tried = [related]
                    if related in reduced2full:
                        to_be_tried.append(reduced2full[related])
                    for related in to_be_tried:
                        logger.debug(
                            f"check related numbering {related=} in_extra={related in tdm_light_extra} in_tdm={related in tdm_light}"  # noqa: E501
                        )

                        if related in tdm_light_extra:
                            for entry in tdm_light_extra[related][::-1]:
                                if entry.page <= line.page and (
                                    line.page - entry.page < 30
                                    or (ordered_extra_sections and line.page - ordered_extra_sections[-1].page < 30)
                                ):
                                    logger.debug(
                                        f"Page {line.page} {line} accepted as section because of section {kind} {related}"  # noqa: E501
                                        + f" on page {entry.page}"
                                    )
                                    register_in_extra(numbering, line)
                                    return True

                        if related in tdm_light:
                            for entry in tdm_light[related][::-1]:
                                entry_page = entry.kind.page
                                if self.page_mapping and entry_page in self.page_mapping:
                                    entry_page = self.page_mapping[entry_page]
                                logger.debug(
                                    f"check vs ToC entry {entry} entry_page={entry.kind.page}/{entry_page} section_page={line.page} {self.page_mapping}"  # noqa: E501
                                )
                                if entry_page <= line.page and (
                                    line.page - entry_page < 30
                                    or (ordered_extra_sections and line.page - ordered_extra_sections[-1].page < 30)
                                ):
                                    logger.debug(
                                        f"Page {line.page} {line} accepted as section because of entry {kind} {related} expected on page {entry_page}"  # noqa: E501
                                    )
                                    register_in_extra(numbering, line)
                                    return True
                    return False

                other = seqnum.prev().normalized  # sibling (seqnum is modified when using navigation methods)
                other2 = seqnum.prev().normalized  # sibling of sibling
                other3 = seqnum.up().normalized  # parent
                other4 = seqnum.up().normalized  # grand-parent
                seqnum.restore()
                logger.debug(f"test related to current {current}: {other=} {other2=} {other3=} {other4=}")
                if current != other:
                    # if current has a potential previous brother, we test it
                    logger.debug(f"test sibling related between {current} and {other}: {other in tdm_light_extra}")
                    if check_related(other):
                        continue
                    if other2 != other:
                        # we accept a gap of 1 by testing sibling of sibling
                        if check_related(other2, kind="sibling-gap"):
                            continue
                else:
                    # otherwise we test its parent
                    if check_related(other3, kind="parent") or check_related(other3[:-1], kind="parent"):
                        continue
                    # and if deep enough, we accept a gap by testing its grand-parent !
                    if seqnum.level > 3 and (
                        check_related(other4, kind="grand-parent")
                        or check_related(other4[:-1], kind="grand-parent")
                    ):
                        continue

            logger.debug(f"Page {line.page} discarding section case1 {line}")
            self.deleted_sections.append(line)
            extra2.append(line)
        else:
            # the candidate section has a potential entry in ToC
            found = False
            for entry in tdm_light[numbering]:
                if entry.kind.index[1].startswith(title) and check_page(entry, line):
                    sections[entry.kind.index] = line
                    tdm_light_extra[numbering].append(line)
                    logger.debug(f"case3 extra section associate {line} to tdm {entry}")
                    found = True
                    break
            if not found:
                logger.debug(f"Page {line.page} discarding section case2 {line}")
                self.deleted_sections.append(line)
                extra2.append(line)

    if self.page_mapping:
        for index in tdm:
            logger.debug(f"Processing ToC entry index {index}")
            if index not in sections:
                # sometimes there are several ToCs for the same entries
                # (a generic one, and one per chapter for instance)
                # we try not to check duplicate entries
                # However, several ToC entries may be associated with distinct pages
                # (e.g. a section Introduction at the beginning of each chapter)
                logger.debug(f"handling ToC entry index {index} without associated section")
                pages = set([])
                for line in tdm[index]:
                    page_number = line.kind.page
                    logger.debug(f"handling ToC entry {line}")
                    if page_number in pages:
                        logger.debug("Duplicated ToC entry: skipping")
                        continue
                    pages.add(page_number)
                    if page_number in self.page_mapping:
                        page_id = self.page_mapping[page_number]
                        logger.debug(f"searching on physical page {page_id}")
                        found = False
                        for xline in self.pages[page_id].lines:
                            if (
                                isinstance(xline.kind, Section)
                                and line.kind.numbering
                                and xline.kind.numbering == line.kind.numbering
                            ):
                                logger.info(f"Page {page_id} relating section {xline} for ToC entry {line}")
                                found = True
                                # remove from sections to be deleted
                                self.deleted_sections = [
                                    _line for _line in self.deleted_sections if _line is not xline
                                ]
                                break
                            elif xline.check_with_section_index(line):
                                logger.info(f"Page {page_id} adding section {xline} for ToC entry {line}")
                                changed.add(xline.page)
                                found = True
                                break
                        if not found:
                            for xline in self.pages[page_id].lines:
                                if isinstance(xline.kind, Section):
                                    continue
                                elif xline.light_check_with_section_index(line):
                                    logger.info(f"Page {page_id} adding light section {xline} for ToC entry {line}")
                                    changed.add(xline.page)
                                    found = True
                                    break
                        if not found:
                            logger.warning(f"missing section for ToC entry {line}")

    for line in self.deleted_sections:
        logger.warning(f"Page {line.page} section {line} not found in ToC: deleting")
        line.kind = None
        changed.add(line.page)

    for page_id in changed:
        logger.debug(f"Updating page {page_id}")
        self.pages[page_id].detect_tables()
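
The "related section" test used for candidates that have no ToC entry can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's implementation: numberings are plain dotted strings with numeric components, `prev_sibling` and `parent` are hypothetical helpers standing in for the `prev()` and `up()` navigation methods of the numbering components, and the sibling-gap and grand-parent fallbacks are left out.

```python
MAX_PAGE_GAP = 30  # same window as the `< 30` tests in the source


def prev_sibling(numbering):
    """'2.3' -> '2.2'; returns the input unchanged when the last component is 1."""
    parts = numbering.split(".")
    last = int(parts[-1])
    if last <= 1:
        return numbering
    return ".".join(parts[:-1] + [str(last - 1)])


def parent(numbering):
    """'2.3' -> '2'; a top-level numbering is its own parent in this sketch."""
    parts = numbering.split(".")
    return ".".join(parts[:-1]) if len(parts) > 1 else numbering


def is_anchored(numbering, page, accepted):
    """`accepted` maps a numbering to the page of an already accepted section."""
    sibling = prev_sibling(numbering)
    # Test the previous sibling when there is one, otherwise the parent.
    related = sibling if sibling != numbering else parent(numbering)
    if related == numbering or related not in accepted:
        return False
    anchor_page = accepted[related]
    # The anchor must come first and be close enough.
    return anchor_page <= page and page - anchor_page < MAX_PAGE_GAP


accepted = {"2": 10, "2.1": 11}
close_sibling = is_anchored("2.2", 14, accepted)  # sibling 2.1 is 3 pages earlier
no_parent = is_anchored("3.1", 50, accepted)      # parent 3 was never accepted
too_far = is_anchored("2.2", 60, accepted)        # sibling 2.1 is 49 pages earlier
```

The page window keeps a stray numbered line (a list item or a table row, for instance) from being accepted merely because a section with a compatible numbering exists far away in the document.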

`collect(kind=Section)`

Given a kind of marker, collects, for each page of the Document, all the lines where line.kind = marker.

Returns:

Type	Description
`iter`	a flattened iterator of all the occurrences of the marker in the Document.

Source code in pdfstruct/document.py

def collect(self, kind=Section):
    """
    Given a kind of marker, collects, for each page of the Document, all the lines where `line.kind` = marker.


    Returns:
        (iter): a flattened iterator of all the occurrences of the marker in the Document.
    """
    return chain.from_iterable((p.collect(kind=kind) for p in self.pages))
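
A small sketch of the flattening step, assuming each page exposes its own `collect()` that yields the matching lines. `MiniPage` and the `(kind, text)` pairs are hypothetical stand-ins for the real Page and Line objects.

```python
from itertools import chain


class MiniPage:
    """Hypothetical page holding (kind, text) pairs instead of Line objects."""

    def __init__(self, lines):
        self.lines = lines

    def collect(self, kind):
        return (text for k, text in self.lines if k == kind)


pages = [
    MiniPage([("Section", "1 Introduction"), ("Text", "Some body text")]),
    MiniPage([]),
    MiniPage([("Section", "2 Methods"), ("Section", "2.1 Data")]),
]

sections = chain.from_iterable(p.collect("Section") for p in pages)
titles = list(sections)    # consumes the iterator
leftover = list(sections)  # nothing is left on a second pass
```

Because the result is a one-shot iterator, callers that need a length or several passes have to materialize it first, as the source does with `list(self.sections)`.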

`detect_header_and_footer()`

Detects headers and footers as the topmost and bottommost lines that are stable modulo numbers.

Algorithm:

- initializing 2 dictionaries (1 for headers, 1 for footers)
- going through each page
  - extracting the top and bottom lines
  - going through each top line
    - normalizing the text of the line: replacing all numbers by the same symbol, collapsing whitespace, setting it to lowercase.
    - normalizing the coordinates of the line
    - filling the header dictionary: key = tuple(normalized text, normalized coordinates), value = list to which the real line is appended.
    - this way, each normalized tuple will have a certain number of lines associated with it.
  - same process applied to bottom lines (footers).
- final dictionaries: filtering out the header and footer keys that are not frequent enough (threshold relative to the length of the document)
- identifying changing headers (retrieving the headers that could be variants of a frequent header and putting them in the final header dictionary)
- identifying changing footers
- for each Line in the final dictionaries
  - assigning the kind "Header" or "Footer" to the Line.
- once the headers/footers are found, the mapping between logical pages and physical pages is made (find_page_numbering()):
  - extracting all numbers for each header line and checking that the numbers are in the same position and increasing.
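
The grouping and page-number steps above can be sketched as follows. This is a simplified illustration, not the library's code: lines are plain `(page_id, y0, text)` tuples rather than Line objects, `frequent_templates` and `page_numbers` are hypothetical helper names, and the page-number step takes only the first number of each line instead of checking that each number position increases. The 0.45 threshold and the 15-unit vertical bucket are taken from the source.

```python
import re
from collections import defaultdict


def normalize(content):
    # Same three steps as the `normalize` helper in the source.
    content = re.sub(r"\d+", "<<NUM>>", content)
    content = re.sub(r"\s+", " ", content)
    return content.lower()


def frequent_templates(top_lines, npages, threshold=0.45, y_bucket=15):
    """Group lines by (normalized text, vertical bucket) and keep frequent groups."""
    groups = defaultdict(list)
    for page_id, y0, text in top_lines:
        groups[(normalize(text), round(y0 / y_bucket))].append((page_id, text))
    return {key: lines for key, lines in groups.items() if len(lines) / npages > threshold}


def page_numbers(lines):
    """Map the first number found in each line (logical page) to its physical page id."""
    mapping = {}
    for page_id, text in lines:
        match = re.search(r"\d+", text)
        if match:
            mapping[int(match.group())] = page_id
    return mapping


npages = 12
top_lines = [(i, 30.0, f"Annual Report   Page {i + 1}") for i in range(npages)]
top_lines.append((3, 31.0, "Chapter 2"))  # appears once, so it is filtered out

kept = frequent_templates(top_lines, npages)
key = ("annual report page <<num>>", 2)
mapping = page_numbers(kept[key])
```

Replacing digits before grouping is what lets "Page 1", "Page 2", and so on fall under a single template, while the vertical bucket keeps apart lines that share a text but sit at different heights.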

Source code in pdfstruct/document.py

def detect_header_and_footer(self):
    """Detect headers and footers as topmost and bottommost lines stable modulo numbers.

    **Algorithm**:

    * initiating 2 dictionaries (1 for headers, 1 for footers)
    * going through each page
        * extracting the top and bottom lines
        * going through each top line
            * normalizing the text of the line : replacing all numbers by the same symbol, collapsing whitespace,
              setting it to lowercase.
            * normalizing the coordinates of the line
            * filling the header dictionary : key = tuple(normalized text , normalized coordinates), value = list
              where the real line is appended.
            * this way, each normalized tuple will have a certain number of lines associated to it.
        * same process applied to bottom lines (footers).
    * final dictionaries : filtering the header and footer keys that are not frequent enough (threshold relative to
      the length of the document)
    * identifying changing headers (retrieving the headers that could be variants of a frequent header and
      putting them in the final header dictionary)
    * identifying changing footers
    * for each Line in the final dictionaries
        * assigning the kind "Header" or "Footer" to the Line.
    * once the headers/footers are found, the page mapping between logical page and physical page
      is made (`find_page_numbering()`):
        * extracting all numbers for each header line and checking that the numbers are in the same position
          and increasing.

    """
    # TODO : maybe we should rewrite this method to handle only one kind of part
    # and call it twice, once for header and once for footer
    # and maybe for side parts (left and right sides) present in some documents
    npages = len(self.pages)

    if npages < 10:
        return

    median_height = np.median(np.array([p.height for p in self.pages]))
    # median_width = np.median(np.array([p.width for p in self.pages]))

    headers = defaultdict(list)
    footers = defaultdict(list)

    def normalize(content):
        # we abstract on number parts
        # assuming potential page numbers are given as arabic numbers
        content = re.sub(r"\d+", r"<<NUM>>", content)
        content = re.sub(r"\s+", r" ", content)
        content = content.lower()
        return content

    for page in self.pages:
        # we assume that headers and footers are relatively short
        # and occur only in the top and bottom 15% of pages
        # we use local page height because of distinct page orientation

        top_page = page.height * 0.15
        bot_page = page.height * 0.85
        half_width = page.width * 0.6

        df = page.df.sort_values(by="y0")

        for line in list(df.head(10).line.values):
            if line.y1 < top_page and line.width < half_width:
                headers[(normalize(line.content), round(line.y0 / 15))].append(line)
                # logger.debug(f"page {page.id} potential header line {line.y0} {line}")
        for line in list(df.tail(10).line.values):
            if line.y0 > bot_page and line.width < half_width:
                footers[(normalize(line.content), round(line.y0 / 15))].append(line)
    # we use a threshold of 0.45 because we may have different headers and footers for odd and even pages
    # the current version does not handle headers/footer varying along sections/chapters/...
    header_set = set((header for header in headers if len(headers[header]) / npages > 0.45))
    footer_set = set((footer for footer in footers if len(footers[footer]) / npages > 0.45))

    def identify_changing_header_or_footer(candidates, keep_set):
        """
        Checks whether the headers (or footers) that were not kept in the final set were actually worth keeping :
        sometimes the content of a header varies with the page number (those with odd page numbers might differ
        from those with even page numbers).

        """
        # going through each normalized candidate
        for candidate in candidates:
            if candidate in keep_set:
                continue
            stats = [None, None, None]

            def update(page, y, where, step=1):
                if stats[where] is not None and stats[where][1] == page - step and abs(stats[where][2] - y) < 10:
                    stats[where][1] = page

                    if stats[where][1] - stats[where][0] > 10:
                        keep_set.add(candidate)
                        return True
                else:
                    stats[where] = [page, page, y]

            for line in candidates[candidate]:
                page = line.page
                y = round(line.y0)
                if update(page, y, where=0, step=1) or update(page, y, where=1 + (page % 2), step=2):
                    break

    identify_changing_header_or_footer(headers, header_set)
    identify_changing_header_or_footer(footers, footer_set)

    logger.debug(f"headers: {header_set}")
    logger.debug(f"footers: {footer_set}")
    # we type header/footer lines
    for header in header_set:
        for line in headers[header]:
            line.kind = Header()
            line.where = "header"
            # logger.debug(f"Page {line.page} handling header line {header=} {line}")
    for footer in footer_set:
        for line in footers[footer]:
            line.kind = Footer()
            line.where = "footer"

    # we can use elements in header_set/footer_set to identify the one carrying the page number
    # TODO : take this function out to make the main function more readable.
    def find_page_numbering(lines, template, y=None):
        """For a given header template and its associated lines, extracts the potential logical page number for each
        line and makes the mapping with the physical page."""
        distincts = set(line.content for line in lines)
        if len(distincts) > 0.8 * len(lines):
            logger.info(f"potential page number in {template} {y=} {len(distincts)} lines out of {len(lines)}")
            info: Dict[Any, List[Any]] = defaultdict(list)
            for line in lines:
                logger.debug(f"extract page number from {line.page} {line}")
                for i, num in enumerate(re.findall(r"(\d+)", line.content)):
                    num = int(num)
                    # num is the potential logical page, line.page is the physical page
                    # i is the position of "num" in the text
                    # info[i][-1][0] = the previous "num" in the same position, taken from the previous header.
                    # That number should be lower than the current "num".
                    if i not in info or info[i][-1][0] < num:
                        info[i].append((num, line.page))
                    else:
                        logger.debug(
                            f"strange page number in template {template} {i=} {num=} prev={info[i][-1][0]} {line.page} {line}"  # noqa: E501
                        )
            logger.debug(f"{info=}")
            for i in info:
                logger.debug(f"{i=} {len(info[i])=} {len(distincts)}")
                if len(info[i]) == len(distincts):
                    logger.info(f"page number in {template} as {i}th number")
                    for logical_pageid, physical_pageid in info[i]:
                        self.page_mapping[logical_pageid] = physical_pageid
                        self.pages[physical_pageid].page_number = logical_pageid

    for header in header_set:
        find_page_numbering(headers[header], header[0], y=header[1])
    for footer in footer_set:
        find_page_numbering(footers[footer], footer[0], y=footer[1])

    # use links on ToC entries to set page mapping
    # should check that the physical page numbers are correct versus sub-documents
    # also, that the corresponding section is indeed present on the target page

    # Collecting all Line.kind = SecIndex in Document
    tdm = list(self.tdm)
    if tdm:
        npages = len(self.pages)
        # entry is a Line object
        for entry in tdm:
            for link in entry.links:
                # id is the logical page retrieved by the regex
                # here, physical id is the page number provided in the link (PyMuPdf)
                id = entry.kind.page
                physical_id = link[0]
                logger.debug(f"checking page mapping vs links {id=} {physical_id=}")
                # if the link goes beyond the page scope of the document, we ignore it
                if physical_id >= npages:
                    continue
                # checking if the logical page is indeed in page_mapping.
                # checking if the physical page contains a section title similar
                # to the TOC entry (comparing numeration)
                # handling mismatches (updating the mapping)
                if id not in self.page_mapping:
                    for line in self.pages[physical_id].sections:
                        if line.kind.numbering == entry.kind.numbering:
                            logger.info(f"using link to set page mapping {id} to {physical_id}")
                            self.page_mapping[id] = physical_id
                            self.pages[physical_id].page_number = id
                            break
                elif physical_id != self.page_mapping[id]:
                    logger.warning(
                        f"page mapping mismatch for {id} between link={physical_id} and current={self.page_mapping[id]}"  # noqa: E501
                    )
                    found = False
                    for line in self.pages[physical_id].sections:
                        logger.debug(f"compare {entry} with {line}")
                        if line.kind.numbering == entry.kind.numbering:
                            logger.info(f"using link to update page mapping {id} to {physical_id}")
                            self.page_mapping[id] = physical_id
                            self.pages[physical_id].page_number = id
                            found = True
                            break
                    if not found:
                        logger.info(f"keeping existing page mapping for {id}")

    # filling gaps in page mapping
    logical_pages = sorted(list(self.page_mapping.keys()))
    logger.debug(f"mapped logical page ids {logical_pages}")
    prev = None

    # filling internal gaps
    for logical_pageid in logical_pages:
        if (
            prev
            # there is a gap between the logical pages (difference greater than 1)
            and logical_pageid - prev > 1
            # the delta should be the same between logical pages and physical pages
            and self.page_mapping[logical_pageid] - self.page_mapping[prev] == logical_pageid - prev
        ):
            for id in range(prev + 1, logical_pageid):
                # filling the gap : a new key-value pair is added to page mapping
                physical_id = self.page_mapping[prev] + id - prev
                logger.debug(f"in middle gap, mapping logical page {id} to page {physical_id}")
                self.page_mapping[id] = physical_id
                self.pages[physical_id].page_number = id
        prev = logical_pageid

    # also filling what has not been detected before the first mapped page
    if logical_pages and logical_pages[0] > 1:
        first_mapped_page = logical_pages[0]
        first_mapped_physical_page = self.page_mapping[first_mapped_page]
        for id in range(1, first_mapped_page):
            physical_id = first_mapped_physical_page + id - first_mapped_page
            if physical_id >= 0:
                logger.debug(f"in initial gap, mapping logical page {id} to page {physical_id}")
                self.page_mapping[id] = physical_id
                self.pages[physical_id].page_number = id

    # also filling what has not been detected after the last mapped page
    if logical_pages and self.page_mapping[logical_pages[-1]] < len(self.pages):
        last_mapped_page = logical_pages[-1]
        last_mapped_physical_page = self.page_mapping[last_mapped_page]
        for physical_id in range(last_mapped_physical_page + 1, len(self.pages)):
            id = last_mapped_page + physical_id - last_mapped_physical_page
            logger.debug(f"in final gap, mapping logical page {id} to page {physical_id}")
            self.page_mapping[id] = physical_id
            self.pages[physical_id].page_number = id

    # we can use elements in header_set/footer_set to potentially identify top and bottom coordinates
    # TODO : ask Eric why the final loop since all headers and footers in the final sets have already been labeled
    if header_set:
        page_tops = defaultdict(list)
        for i, header in enumerate(header_set):
            for line in headers[header]:
                page_tops[line.page].append(round(line.y1))
        all_tops = Counter([max(tops) for tops in page_tops.values()])
        top = all_tops.most_common(1)[0][0] + 5 if all_tops else 0
        logger.debug(f"header sep line at {top} for median height={median_height} all_tops={all_tops}")
        for lines in headers.values():
            for line in lines:
                if line.y1 <= top and type(line.kind) is not Header:
                    logger.debug(f"Page {line.page} setting line {line} to header")
                    line.kind = Header()
                    line.where = "header"

    if footer_set:
        page_bots = defaultdict(list)
        for footer in footer_set:
            for line in footers[footer]:
                page_bots[line.page].append(round(line.y0))
        all_bots = Counter([min(bots) for bots in page_bots.values()])
        bot = all_bots.most_common(1)[0][0] - 5 if all_bots else 10000
        logger.debug(f"bottom sep line at {bot} for median height={median_height} all_bots={all_bots}")
        for lines in footers.values():
            for line in lines:
                if line.y0 >= bot and type(line.kind) is not Footer:
                    logger.debug(f"Page {line.page} setting line {line} to footer")
                    line.kind = Footer()
                    line.where = "footer"
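The internal gap-filling step of the page mapping can be illustrated in isolation. The sketch below is a simplified, standalone version that works on a plain dictionary instead of a Document; the function name `fill_internal_gaps` and the sample mapping are assumptions made for illustration, and the sketch uses an explicit `None` check where the source tests `prev` for truthiness.

```python
def fill_internal_gaps(page_mapping):
    """Return a copy of a logical -> physical page mapping with internal gaps filled.

    A gap between two mapped logical pages is filled only when the physical
    pages advance by the same amount, i.e. no unnumbered page sits in between.
    """
    filled = dict(page_mapping)
    prev = None
    for logical in sorted(page_mapping):
        if (
            prev is not None
            and logical - prev > 1
            and page_mapping[logical] - page_mapping[prev] == logical - prev
        ):
            for missing in range(prev + 1, logical):
                # same offset from the previous mapped page on both sides
                filled[missing] = page_mapping[prev] + missing - prev
        prev = logical
    return filled


# hypothetical mapping: logical pages 3 and 4 are missing and can be inferred,
# logical pages 6 to 8 cannot because the physical delta (6) differs from the logical one (4)
mapping = {1: 4, 2: 5, 5: 8, 9: 14}
filled = fill_internal_gaps(mapping)
```

With this input, logical pages 3 and 4 are mapped to physical pages 6 and 7, while the range between logical pages 5 and 9 is left untouched.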

`detect_page_numbering()`

Detects frequent numberings and their ranges in the document.

Returns:

Type	Description
`list(tuple(int))`	intervals of numbering ranges

Algorithm :

going through each page, extracting top and bottom lines
normalizing lines (text + coordinates), extracting numbers to find a potential logical page number
checking if the normalized versions occur more than 4 times (test_line())
if so, checking for a range (is_range()), retrieving the deltas between physical and logical pages
storing all potential ranges
going through potential ranges, handling overlaps and gaps (handle_gap()) using the deltas
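The normalization step is what lets two headers or footers that differ only by their numbers end up under the same key. A minimal sketch, assuming plain `content` and `y0` values instead of a Line object (the sample strings and coordinates are invented for illustration):

```python
import re


def normalize(content, y0):
    """Return ((truncated y0, normalized text), numbers found in the text)."""
    # collect every number in the line as a potential logical page number
    nums = [int(x) for x in re.findall(r"\d+", content)]
    # replace numbers by a placeholder, drop whitespace, lower-case everything
    text = re.sub(r"\d+", "<<NUM>>", content)
    text = re.sub(r"\s+", "", text).lower()
    return (int(y0), text), nums


key_a, nums_a = normalize("Page 12 / 40", 801.3)
key_b, nums_b = normalize("page 13 / 40", 801.9)
```

Both lines produce the same key, so their numbers are appended to the same list of entries, which is the input of the range check.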

Source code in pdfstruct/document.py

def detect_page_numbering(self):
    """Detects frequent numberings and their ranges in the document.

    Returns:
        (list(tuple(int))): intervals of numbering ranges


    **Algorithm** :

    * going through each page, extracting top and bottom lines
    * normalizing lines (text + coordinates), extracting numbers to find a potential logical page number.
    * checking if the normalized versions occur more than 4 times (test_line())
    * if so, checking for a range (is_range()), retrieving the deltas between physical and logical pages.
    * storing all potential ranges
    * going through potential ranges, handling overlaps and gaps (handle_gap()) using the deltas.
    """

    entries = defaultdict(list)

    def normalize(line):
        """Returns the y0 coordinate truncated to an integer, a normalized version of the content and
        a list of all the numbers detected in the line."""
        content = line.content
        # extract all the numbers in the line
        nums = [int(x) for x in re.findall(r"(\d+)", content)]
        # replace all numbers in the line by <<NUM>>
        content = re.sub(r"\d+", r"<<NUM>>", content)
        # remove spacings and convert to lower case
        content = re.sub(r"\s+", r"", content)
        content = content.lower()
        y = int(line.y0)
        return ((y, content), nums)

    def is_range(info, key=None, line=None):
        """Input is a list of line infos : physical page ids and potential logical pages.
        **Algorithm**:

        * go over each line
        * compute the deltas between physical page id and logical numbers
        * compare the results with the previous line.
        * If there is a range, there should be the same deltas at the same places each time,
          modulo some exceptions (intercalary pages without numbering...)

        Returns:
            (tuple): contains the physical start and end, the delta/shift between the physical and logical ids,
            the normalized text and coordinates, and the range size.
        """
        # logger.info(f"entering is_range page={info[-1][0]} {key=} {line=} {info=}")
        # info is a list of tuples [(page id of line, list of detected numbers in the line)]
        orig = prev = info[-1]
        # comparing page ids and the detected numbers in the potential header/footer
        # (subtracting the logical page from the physical page)
        prev_deltas = [prev[0] - page_id for page_id in prev[1]]
        # where is the position of the logical page number in the line
        # shift is the difference between the physical page and the logical page.
        where = None
        shift = None
        range_size = 1

        # d intercalary pages (no numbering)
        # p+d : k+1 prev_delta=(p+d)-k-1
        # p : k delta=p-k
        # prev_delta-delta = (p+d)-k-1-p+k=d-1

        # going backwards in the list, excluding the last element
        # x[0] = page id (physical page), x[1] = detected numbers (with a potential logical page)
        for x in info[:-1][::-1]:
            deltas = [x[0] - page_id for page_id in x[1]]
            # logger.debug(f"try down range {x[0]} potential={x[1]} {deltas=} {shift=} {range_size=}")

            # a physical page is the same as the previous one
            if x[0] == prev[0]:
                break
            # the amount of detected numbers differs
            if len(deltas) != len(prev_deltas):
                break
            # gap too big between previous page id and current page id
            if False and where is None and prev[0] > x[0] + 3:
                break
            if where is None:
                for i in range(0, len(deltas)):
                    # there should be the same difference between the physical page and the logical page and
                    # at the same spot in the line.
                    if deltas[i] == prev_deltas[i]:
                        where = i
                        shift = deltas[i]
                        # logger.debug(f"set shift={shift} where={where} {prev_deltas=} {deltas=} {range_size=}")
                        break
                if where is None and x[0] + 1 < prev[0] < x[0] + 3:
                    # try with potential intercalary pages!
                    for i in range(0, len(deltas)):
                        if prev_deltas[i] - deltas[i] == prev[0] - x[0] - 1:
                            where = i
                            shift = deltas[i]
                            break
                if where is None:
                    # logger.debug(f"exit range check at {x[0]}")
                    break
            elif deltas[where] == prev_deltas[where]:
                # normal page numbering: keep the same shift between the logical page and physical page
                # logger.debug(f"going down range {x[0]} {shift=} {range_size=}")
                pass
            elif prev_deltas[where] - deltas[where] == prev[0] - x[0] - 1 and x[0] + 1 < prev[0] < x[0] + 3:
                # insertion of intercalary pages (no page numbering)
                # may skip intercalary pages that do not really break numbering
                # we could play with the thresholds here (1 and 3)
                # logger.debug(f"changed shift={shift} where={where} {prev_deltas=} {deltas=} {range_size=}")
                shift = deltas[where]
            else:
                break

            if False and (
                where is None
                or (
                    prev_deltas[where] != deltas[where]
                    # the following condition could deal with document insertion within a larger document
                    # however, need to maintain a list of shifts between physical and logical page numbers
                    and deltas[where] != prev_deltas[where] + x[0] - prev[0]
                )
            ):
                break

            prev = x
            prev_deltas = deltas
            range_size += 1

        if range_size > 4:
            # logger.debug(f"range extension from {prev[0]} to {orig[0]} with key={key}
            # where={where} shift={shift} info={info}")
            # logger.debug(f"got range from={prev[0]} to={orig[0]} shift={shift} key={key} size={range_size}")
            return (prev[0], orig[0], shift, where, key, range_size)
        else:
            # logger.debug(f"no range extension")
            return None

    potential_range: Dict[Any, Any] = {}

    def test_line(line):
        """Checks whether normalized content and coordinates and their info could be used to detect ranges.
        If so, adds it to the potential_range dictionary."""
        # checking if there are numbers in the line (in which case it could be a header or footer)
        xkey, nums = normalize(line)
        # logger.warning(f"Test footer {page.id} {line.y0=} {xkey=} {nums=} info={footers[xkey]}")
        if nums:
            # entries = {(y0, normalized line) : [(page.id, [num1, num2...])] }
            entries[xkey].append((page.id, nums))
            info = entries[xkey]
            # if a normalized content and its coordinates have been detected more than 4 times,
            # we check if we have a range.
            if len(info) > 4:
                # logger.warning(f"Try range {page.id} {line.y0=} {xkey=} {nums=} {line=}")
                xrange = is_range(info, key=xkey, line=line)
                if xrange:
                    # logger.warning(f"range extension from {info[0][0]} to {page.id} with footer {xkey}")
                    # potential range = {((y0, normalized text), start): (start, end, shift, where, ...)}
                    potential_range[(xkey, xrange[0])] = xrange

    for page in self.pages:
        height = page.height
        # logger.warning(f"Test page {page.id} {height=}")

        df = page.df.sort_values(by="y0")
        # taking the first 10 lines at the top of the page (everything before 1/3rd of the page)
        for line in list(df.head(10).line.values):
            if line.y0 > 0.33 * height:
                continue
            test_line(line)
        # taking the last 10 lines at the bottom of the page (everything after 2/3rds of the page)
        for line in list(df.tail(10).line.values):
            if line.y0 < 0.66 * height:
                continue
            test_line(line)

    ranges = list(potential_range.values())
    # sorting ranges by start (ascending)
    ranges = sorted(ranges, key=lambda x: x[0])
    xranges: List[Any] = []

    def handle_gap(start):
        if xranges and start > xranges[-1][1] + 1:
            gap = start - xranges[-1][1]
            logger.warning(f"Numbering gap of {gap} pages between pages {xranges[-1][1]} and {start}")
            if gap > 3:
                # add new doc
                logger.info(f"Adding gap page range start={xranges[-1][1] + 1} end={start - 1}")
                xranges.append((xranges[-1][1] + 1, start - 1))
            else:
                # extend previous document
                logger.info(f"Extending previous page range start={xranges[-1][0]} end={start - 1}")
                xranges[-1] = (xranges[-1][0], start - 1)

    # going through collected potential ranges and handling gaps and overlappings
    for xrange in ranges:
        # xstart = xrange[0]
        xend = xrange[1]
        shift = xrange[2]
        start = shift + 1
        x = (start, xend)
        if not xranges and start > 0:
            logger.warning(f"First pages 0 to {start - 1} not in a numbering range")
            start = 0
            x = (start, xend)
        if not xranges or start > xranges[-1][1]:
            handle_gap(start)
            logger.info(f"Adding page range start={start} end={xend} xrange={xrange}")
            xranges.append(x)
        elif start == xranges[-1][0]:
            logger.info(f"Compatible overlapping numbering ranges {xranges[-1]} and {x} xrange={xrange}")
            xranges[-1] = (start, max(xend, xranges[-1][1]))
        else:
            logger.warning(f"Non-compatible overlapping numbering ranges {xranges[-1]} and {x} xrange={xrange}")
    # handle potentially final gap at the end of the document
    handle_gap(len(self.pages))
    return xranges
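The core idea behind `is_range()` is that, inside a numbering range, one of the numbers found in the header or footer keeps a constant offset from the physical page id. The sketch below is a deliberately reduced illustration of that idea: it ignores intercalary pages and the minimum range size handled by the source, and the helper name `constant_shift` and the sample data are assumptions.

```python
def constant_shift(info):
    """info: list of (physical page id, numbers found in the line), in page order.

    Return (position, shift) for the first number position whose offset
    physical - number is identical on every entry, or None if there is none.
    """
    if not info:
        return None
    first_page, first_nums = info[0]
    for pos in range(len(first_nums)):
        shift = first_page - first_nums[pos]
        if all(
            len(nums) == len(first_nums) and page - nums[pos] == shift
            for page, nums in info
        ):
            return pos, shift
    return None


# hypothetical footers "3 / 40", "4 / 40", ... found on physical pages 7 to 11
info = [(7, [3, 40]), (8, [4, 40]), (9, [5, 40]), (10, [6, 40]), (11, [7, 40])]
result = constant_shift(info)
# a constant number such as the total page count never yields a stable shift
no_result = constant_shift([(1, [40]), (2, [40])])
```

Here the first number carries the logical page with a shift of 4, so logical page 1 would sit on physical page `shift + 1 = 5`, which is how the source derives the start of a range from the shift.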

`filter_other_indexes()`

Deletes caption TOC entries whose intros are too rare. Also warns if a caption TOC entry was not found in the actual Document.

Source code in pdfstruct/document.py

def filter_other_indexes(self):
    """Deletes caption TOC entries whose intros are too rare.
    Also warns if a caption TOC entry was not found in the actual Document."""
    others = defaultdict(list)
    for line in self.collect(kind=OtherIndex):
        others[line.kind.intro].append(line)
    other_set = set((other for other in others if len(others[other]) > 10))
    logger.debug(f"other indexes: {other_set}")
    # delete rare other-like entries
    for other in others:
        if other not in other_set:
            for line in others[other]:
                line.kind = None
    # handle other entries to retrieve them on pages
    if self.page_mapping:
        for other in other_set:
            for line in others[other]:
                if type(line.kind) is not OtherIndex:
                    continue
                found = False
                page_number = line.kind.page
                if page_number in self.page_mapping:
                    page_id = self.page_mapping[page_number]
                    for xline in self.pages[page_id].lines:
                        if xline.check_with_other_index(line):
                            found = True
                            logger.info(f"Page {page_id} found section for ToC entry {xline}")
                            break
                    if not found:
                        logger.warning(f"missing caption for ToC entry {line}")
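The first half of the method is a frequency filter on the intro of each caption entry (for instance "Figure" or "Table"). A minimal sketch of that filter on plain tuples, where the helper name `frequent_intros`, its `min_count` parameter and the sample entries are assumptions; the threshold of 10 is the one used in the source:

```python
from collections import defaultdict


def frequent_intros(entries, min_count=10):
    """entries: iterable of (intro, text). Keep intros seen more than min_count times."""
    groups = defaultdict(list)
    for intro, text in entries:
        groups[intro].append(text)
    return {intro for intro, lines in groups.items() if len(lines) > min_count}


# twelve "Figure" entries survive, a lone "Photo" entry is considered noise
entries = [("Figure", f"Figure {i}") for i in range(12)] + [("Photo", "Photo 1")]
kept = frequent_intros(entries)
```

In the source, lines whose intro is not kept have their `kind` reset to `None` rather than being removed from the page.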

`from_json(filename, **kwargs)` `classmethod`

Opens a JSON file and passes it as a raw input to create a Document. The JSON must follow the dictionary structure of PyMuPdf.

Source code in pdfstruct/document.py

@classmethod
def from_json(cls, filename, **kwargs):
    """opens a json file and passes it as a `raw` input to create a `Document`.
    The json must follow the dictionary structure of PyMuPdf."""
    with open(filename, "r") as f:
        return cls(json.load(f), **kwargs)

`from_pdf(filename, extract_words=None, extract_pages=None, path=None, extract_images=False, **kwargs)` `classmethod`

Opens the pdf file with PyMuPdf, extracts the dictionaries (1 dictionary per page) and passes it as a raw input to create a Document.

Parameters:

Name	Type	Description	Default
`filename`	`str`	The path of the pdf file.	required
`extract_words`	`bool`	Extracts the word data as given by PyMuPdf. This parameter is mandatory if one wants to also extract paragraphs (see `paragraph.py`).	`None`
`extract_pages`	`bool`	Extracts the pages individually as pdf files. This is useful for online interfaces.	`None`
`path`	`str`	the path where the extracted pages are saved if extract_pages is True.	`None`
`extract_images`	`bool`	If True, the flags used for extraction also preserve images.	`False`
`kwargs`		all other arguments passed when creating a Document.	`{}`

Returns:

Type	Description
`Document`	a Document object.

Source code in pdfstruct/document.py

@classmethod
def from_pdf(
    cls,
    filename,
    extract_words=None,
    extract_pages=None,
    path=None,
    extract_images=False,
    **kwargs,
):
    """Opens the pdf file with PyMuPdf, extracts the dictionaries (1 dictionary per page) and passes it as a `raw`
    input to create a `Document`.

    Args:
        filename (str): The path of the pdf file.
        extract_words (bool): Extracts the word data as given by PyMuPdf. This parameter is mandatory if one wants
            to also extract paragraphs (see `paragraph.py`).
        extract_pages (bool): Extracts the pages individually as pdf files. This is useful for online interfaces.
        path (str): the path where the extracted pages are saved if extract_pages is True.
        extract_images (bool): If True, the flags used for extraction also preserve images.
        kwargs : all other arguments passed when creating a Document.

    Returns:
        (Document): a Document object.

    """

    # should cache the json
    def get_pages(doc, path, kwargs):
        page, n = kwargs["page"], kwargs["n"]
        if path is None:
            path = "./pages"
        if not os.path.exists(path):
            os.makedirs(path)
        for i in range(page, page + n):
            out = fitz.open()
            out.insert_pdf(doc, from_page=i, to_page=i)
            out.save(f"{path}/page_{i}.pdf")

    doc = fitz.open(filename)
    metadata = doc.metadata
    try:
        toc = doc.get_toc()
    except Exception:
        toc = []

    logger.info(f"toc is {toc}")
    jsn = []
    words: List[str] = []
    links = []
    all_images = []
    flags = fitz.TEXTFLAGS_TEXT if not extract_images else fitz.TEXTFLAGS_DICT
    for i, page in enumerate(doc):
        # jsn.append(json.loads(page.get_text("json", flags=flags)))
        # getting the raw dict for the page, the links and the associated words.
        jsn.append(page.get_text("dict", flags=flags))
        words.append(page.get_text("words", flags=flags))
        links.append([(_link["page"], _link["from"]) for _link in page.get_links() if _link["kind"] == 1])
        if links[-1]:
            logger.debug(f"links on page {i}: {links[-1]}")

        # Image extraction
        try:
            images = page.get_image_info(xrefs=True)
        except Exception:
            images = []
        if images:

            def is_large_image(image):
                bbox = image["bbox"]
                return (bbox[2] - bbox[0]) > 100 or (bbox[3] - bbox[1]) > 100

            def get_content(image):
                xref = image["xref"]
                # return content['image'] if content else None
                try:
                    content = doc.extract_image(xref)
                    return content["image"] if content else None
                    # TODO: ask Eric about that part that seems unused
                    pix = fitz.Pixmap(content["image"])
                    smask = content["smask"]
                    if smask > 0:
                        logger.debug(f"get content with mask page {i} {xref=} {smask=}")
                        mask = fitz.Pixmap(doc.extract_image(smask)["image"])
                        try:
                            if pix.alpha:
                                pix = fitz.Pixmap(pix, 0)  # remove alpha channel
                            pix = fitz.Pixmap(pix, mask)
                        except Exception:
                            pass
                    ipath = "./images"
                    if not os.path.exists(ipath):
                        os.makedirs(ipath)
                    pix.save(f"{ipath}/img{i}_{xref}.png")
                    return pix.tobytes()
                except Exception:
                    logger.exception(f"pb image content on page {i} {xref=}")
                    return None

            images = [(*image["bbox"], get_content(image)) for image in images if is_large_image(image)]
            logger.debug(f"images on page {i} #entry={len(images)}")
        all_images.append(images)
    # Page extraction if we need to have single pdf pages.
    if extract_pages:
        get_pages(doc, path, kwargs)
    # We discard the list of words if needed
    if extract_words is None:
        words = None

    # TODO: easyocr commented because requires specific packages installation. Should be discussed
    # ocr = easyocr.Reader(['fr'])
    ocr = None
    return cls(
        jsn,
        words,
        links,
        images=all_images,
        metadata=metadata,
        filename=filename,
        ocr=ocr,
        **kwargs,
    )
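During image extraction, only images with a sufficiently large bounding box are kept. The size test can be tried on plain dictionaries shaped like the entries used in the source (a `bbox` key holding `x0, y0, x1, y1`); the `min_size` parameter and the sample boxes below are assumptions added for illustration, with 100 being the threshold used in the source.

```python
def is_large_image(image, min_size=100):
    """True when the image is wider or taller than min_size (in page units)."""
    x0, y0, x1, y1 = image["bbox"]
    return (x1 - x0) > min_size or (y1 - y0) > min_size


# hypothetical image entries: a small icon, a wide banner and a full figure
images = [
    {"bbox": (10.0, 10.0, 40.0, 40.0), "xref": 5},
    {"bbox": (0.0, 0.0, 595.0, 60.0), "xref": 6},
    {"bbox": (72.0, 200.0, 520.0, 500.0), "xref": 7},
]
large = [image["xref"] for image in images if is_large_image(image)]
```

Because the two conditions are combined with `or`, a thin but wide banner is kept as well as a full-size figure; only the small icon is dropped.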

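`is_section_in_index(section)`

Checks whether a normalized Section is present in the table of contents (._tdm). Returns True when no table of contents has been collected.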

Source code in pdfstruct/document.py

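def is_section_in_index(self, section):
    """Checks whether a normalized Section is present in the table of contents (`._tdm`).
    Returns True when no table of contents has been collected."""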
    if self._tdm:
        return section.index in self._tdm
    else:
        return True

`lpage(n)`

For a given logical page number, returns the Page associated with it if the number is present in the page_mapping attribute dictionary (which maps logical page numbers to physical page indices).

Parameters:

Name	Type	Description	Default
`n`	`int`	the logical page number	required

Returns:

Type	Description
`Page`	the Page at the mapped physical index if available, else None.

Source code in pdfstruct/document.py

def lpage(self, n):
    """
    For a given logical page number, returns the `Page` associated with it if the number is present in
    the `page_mapping` attribute dictionary (logical page number -> physical page index).

    Args:
        n (int): the logical page number

    Returns:
        (Page): the page at the mapped physical index if available, else None.
    """
    return self.pages[self.page_mapping[n]] if n in self.page_mapping else None

`process_indexes(indexes)`

Stores the detected SecIndex lines in ._tdm and ._tdm_light.

The keys of ._tdm are the normalized version of the SecIndex: tuple(normalized numbering, normalized body of text). The keys of ._tdm_light are the normalized numbering alone.

Source code in pdfstruct/document.py

def process_indexes(self, indexes):
    """Stores the detected SecIndex lines in ._tdm and ._tdm_light.

    The keys of ._tdm are the normalized version of the SecIndex :
        - tuple(normalized numbering, normalized body of text)

    The keys of ._tdm_light are the normalized numbering alone.
    """
    for line in indexes:
        numbering, title = index = line.kind.index
        self._tdm[index].append(line)
        # self._tdm_light[line.kind.numbering].append(line)
        self._tdm_light[numbering].append(line)
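The two dictionaries can be pictured with plain tuples standing in for Line objects. This is only a sketch: the sample entries are invented, and each entry pairs an assumed normalized index `(numbering, title)` with the raw text of the table of contents line.

```python
from collections import defaultdict

tdm = defaultdict(list)        # (numbering, title) -> lines
tdm_light = defaultdict(list)  # numbering only -> lines

# hypothetical table of contents entries: (normalized index, raw line text)
toc_lines = [
    (("1.2", "scope"), "1.2 Scope ..... 4"),
    (("1.2", "scope"), "1.2 Scope ..... 4"),
    (("2", "methods"), "2 Methods ..... 9"),
]
for index, raw in toc_lines:
    numbering, _title = index
    tdm[index].append(raw)
    tdm_light[numbering].append(raw)

# membership test of the kind performed by is_section_in_index()
known = ("1.2", "scope") in tdm
unknown = ("3", "results") in tdm
```

Keeping a second dictionary keyed by numbering alone allows a lookup even when the title text of a section does not match its table of contents entry exactly.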

`single_doc_process()`

Returns a Document object. Called after splitting document. Detects headers and footers, filters potential Sections, filters OtherIndexes.

Source code in pdfstruct/document.py

def single_doc_process(self):
    """Returns a Document object. Called after splitting document.
    Detects headers and footers, filters potential Sections, filters OtherIndexes."""
    # the following methods should be called after sub-document detection
    # (after building a collection of documents)
    logger.info("completing processing single document")
    self.detect_header_and_footer()
    self.check_and_filter_sections()
    self.filter_other_indexes()

    logger.info("done processing single document")
    return self

`split()`

Splits a multi-part Document into a collection of Documents. Also issues warnings about the presence of a table of contents for each Document.

Returns:

Type	Description
`list(Document)`	the list of sub-Documents according to the numbering ranges.

Source code in pdfstruct/document.py

def split(self):
    """Splits a multi-part `Document` into a collection of `Documents`.
    Also issues warnings about the presence of a table of contents for each `Document`.

    Returns:
        (list(Document)): the list of sub-Documents according to the numbering ranges."""

    # TODO : change + reset of page numbering in headers or footers could be used
    # numbering ranges is a list of tuples (start, end)
    numbering_ranges = self.detect_page_numbering()
    if len(numbering_ranges) > 1:
        docs = []
        for start, end in numbering_ranges:
            logger.debug(f"handling numbering range start={start} end={end}")
            potential_tdm = []
            prev_tdm = None
            lpages = set([])
            # going through each page in the interval and collecting the tables of contents
            for page in self.pages[start : end + 1]:
                if page.is_tdm_page():
                    if not prev_tdm:
                        potential_tdm.append(page.id)
                    prev_tdm = page
                    # collecting the logical page in the SecIndex matching regex (ex : "title .... page x"-> extract x) # noqa: E501
                    for line in page.tdm:
                        lpages.add(line.kind.page)
                else:
                    prev_tdm = None
            if potential_tdm:
                lpages_ = sorted(list(lpages))
                lpage_max = lpages_[0]
                # sometimes a bad ToC entry is returned with a very large page number
                # we check that the max page number is close to the other page numbers
                for lpage in lpages_[1:]:
                    if lpage > lpage_max + 20:
                        break
                    lpage_max = lpage
                if len(potential_tdm) > 1:
                    logger.warning(f"Several Tables of Contents within page range {start} {end}")
                logger.debug(
                    f"found ToC at page {potential_tdm[-1]} in page range {start} {end} "
                    f"with last reference on page {start + lpage_max - 1}"
                )
                if start + lpage_max > end + 1:
                    logger.warning(
                        f"Table of content entry for logical page={lpage_max} "
                        f"outside of physical page range {start} {end}"
                    )
                    # TODO : should extend the range of the sub-document if necessary
                    # specially if the next one was a gap filler
            else:
                logger.warning(f"No Table of Content within page range {start} {end}")
            docs.append(self.sub_doc(start=start, end=end + 1))
        return [doc.single_doc_process() for doc in docs]

    logger.debug("single document: no splitting")
    return [self.single_doc_process()]
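When checking a table of contents against a numbering range, the source guards against a stray entry with an implausibly large page number by walking the sorted page numbers and stopping at the first jump of more than 20. That guard can be sketched on its own; the function name `last_plausible_page` and its `max_jump` parameter are assumptions, and the sample set is invented.

```python
def last_plausible_page(pages, max_jump=20):
    """Return the largest page number reachable without a jump above max_jump."""
    ordered = sorted(pages)
    last = ordered[0]
    for page in ordered[1:]:
        if page > last + max_jump:
            # everything after a large jump is treated as a bad entry
            break
        last = page
    return last


# hypothetical logical pages collected from a table of contents; 250 is a bad entry
toc_pages = {3, 7, 12, 18, 250}
lpage_max = last_plausible_page(toc_pages)
```

The resulting value is then compared with the end of the physical page range to warn about entries pointing outside the sub-document.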

`sub_doc(start=0, end=None)`

Creates a sub-Document from a larger Document with the provided start and end indices. Reassigns page ids and link targets according to new start and new end.

Returns:

Type	Description
`Document`	the updated smaller Document object.

Source code in pdfstruct/document.py

def sub_doc(self, start=0, end=None):
    """Creates a sub-`Document` from a larger `Document` with the provided start and end indices.
    Reassigns page ids and link targets according to new start and new end.

    Returns:
        (Document): the updated smaller `Document` object.
    """
    if not end:
        end = len(self.pages)
    logger.info(f"Potential sub-document between pages {start} and {end - 1} #pages={end - start}")
    pages = self.pages[start:end]
    doc_len = len(pages)
    for i, page in enumerate(pages):
        # updating page ids
        page.id = i
        page.doc_len = doc_len
        for line in page.lines:
            line.page = i
            # updating links
            if line.links and start:
                old = line.links
                line.links = [(target_id - start, src) for target_id, src in line.links]
                logger.info(
                    f"Page {i} updated links with {start} {list(zip(*old))[0]} to {list(zip(*line.links))[0]}"
                )
    return Document(pages, metadata=self.metadata, filename=self.filename)
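
The renumbering performed by `sub_doc` can be illustrated without the library. The sketch below uses hypothetical stand-in classes `Page` and `Line` that carry only the attributes touched in the listing above; the real pdfstruct classes hold much more. The second element of each link tuple is an opaque placeholder here, since its meaning is not shown in this excerpt.

```python
from dataclasses import dataclass, field


@dataclass
class Line:  # hypothetical stand-in, not the pdfstruct Line
    page: int
    links: list = field(default_factory=list)


@dataclass
class Page:  # hypothetical stand-in, not the pdfstruct Page
    id: int
    lines: list
    doc_len: int = 0


def renumber_pages(pages, start, end):
    """Slice pages[start:end] and re-base ids and link targets on `start`."""
    sub = pages[start:end]
    for i, page in enumerate(sub):
        page.id = i
        page.doc_len = len(sub)
        for line in page.lines:
            line.page = i
            if line.links and start:
                # link targets are page indices, so they shift by the same offset
                line.links = [(target - start, src) for target, src in line.links]
    return sub


pages = [Page(id=i, lines=[Line(page=i, links=[(i + 1, "ref")])]) for i in range(5)]
sub = renumber_pages(pages, 2, 5)
print(sub[0].id, sub[0].lines[0].links)  # 0 [(1, 'ref')]
```

Note that the slice is shallow: the page objects in the result are the same objects as in the original list, so the ids and links of the parent are modified in place as well. The same holds for the listing above, as far as this excerpt shows.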

`to_html(path='./pages', pdf=None)` ⚓︎

Produces an HTML version of the Document (1 file per page)

Source code in pdfstruct/document.py

def to_html(self, path="./pages", pdf=None):
    """
    Produces an HTML version of the Document (1 file per page)
    """
    os.makedirs(path, exist_ok=True)
    with open(f"{path}/overview.html", "w", encoding="utf8", newline="\n") as inf1:
        inf1.write(self.display_html_overview())
    for i, page in enumerate(self.pages):
        with open(f"{path}/page_{i + 1}.html", "w", encoding="utf8", newline="\n") as inf:
            inf.write(page.display_html())
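
The output layout produced by `to_html` (one `overview.html` plus one 1-based `page_N.html` per page) can be reproduced with plain strings. This is a minimal sketch, assuming the HTML has already been rendered; `write_pages` is a hypothetical helper, not a pdfstruct function.

```python
import tempfile
from pathlib import Path


def write_pages(html_pages, overview, path="./pages"):
    """Write overview.html and page_1.html .. page_N.html under `path`.

    Returns the file names in the order they were written.
    """
    out = Path(path)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    named = [("overview.html", overview)]
    named += [(f"page_{i + 1}.html", html) for i, html in enumerate(html_pages)]
    for name, html in named:
        # newline="\n" keeps line endings identical across platforms
        with open(out / name, "w", encoding="utf8", newline="\n") as fh:
            fh.write(html)
        written.append(name)
    return written


with tempfile.TemporaryDirectory() as tmp:
    print(write_pages(["<p>first</p>", "<p>second</p>"], "<h1>overview</h1>", tmp))
```

File names are 1-based while page ids inside a Document are 0-based, so page id `i` ends up in `page_{i + 1}.html`.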

Document

caption_tdm property ⚓︎

captions property ⚓︎

cells property ⚓︎

sections property ⚓︎

tdm property ⚓︎

__getitem__(n) ⚓︎

__len__() ⚓︎

add_page(content, links=None, images=None) ⚓︎

check_and_filter_sections() ⚓︎

collect(kind=Section) ⚓︎

detect_header_and_footer() ⚓︎

detect_page_numbering() ⚓︎

filter_other_indexes() ⚓︎

from_json(filename, **kwargs) classmethod ⚓︎

from_pdf(filename, extract_words=None, extract_pages=None, path=None, extract_images=False, **kwargs) classmethod ⚓︎

is_section_in_index(section) ⚓︎

lpage(n) ⚓︎

process_indexes(indexes) ⚓︎

single_doc_process() ⚓︎

split() ⚓︎

sub_doc(start=0, end=None) ⚓︎

to_html(path='./pages', pdf=None) ⚓︎
