Parsing LaTeX PDFs with Python's PyMuPDF library

Why?

In the Fall of 2020, I needed to switch an entire class to an all-online (asynchronous) format. All my previous slides were made using LaTeX's beamer package. I needed to either redo all this content or repurpose the material for online use. I had previously made MP4 videos for a portion of this class, but was dissatisfied with the videos for several reasons:

  1. Videos are relatively large files and some students did not have good internet connections,
  2. Videos were difficult to edit. I needed to listen to the entire video to find errors and edit them. This was very time-consuming,
  3. I saw no easy way to modify or update a video.

I decided that slideshows with voice overs would provide a better format both for viewing and for maintenance. After some searching I settled on using reveal.js to create my slideshows. This provided a way to mix audio and graphics, had plugins that allowed quizzes to be created for the slideshows, allowed inclusion of math, and allowed the inclusion of gifs and videos. Packages in Emacs (a text editor) provided a way of creating slideshows without the need to write HTML, simplifying the process. Because HTML is a plain text format, writing scripts to modify the files would be fairly easy.

Instead of rewriting the LaTeX files used to make the PDF files, I decided to convert the PDF pages into PNG graphic files. This allowed me to quickly reuse slides I'd made. Automating this process is the theme for the rest of this post.

Automating slide conversion

As I use Python in my daily work, I searched for a Python module that would provide tools for manipulating PDFs, and settled on PyMuPDF. The goal was to be able to read in a PDF file containing my class slides and convert each page to a PNG (or similar) graphic file.

The basic module in PyMuPDF for accessing the PyMuPDF functions is called fitz, and the Page class within fitz is used for manipulating PDF pages. To write each PDF page out as a PNG graphic file, the script opens the PDF file, renders each page to a pixel map, and saves that pixel map as a PNG. Each PNG file is named after the original PDF followed by a zero-padded page index (e.g. PdfFile_000.png for the first page). The code below is part of a larger Python script used to convert PDF files into a PNG-based slideshow.

from pathlib import Path
from shutil import copyfile  # used elsewhere in the larger script
import fitz  # PyMuPDF; pil_save() below also needs Pillow installed


class make_base:
    def __init__(self, pdf_file='energy.pdf', topic='25_energy'):
        '''Set up the location of the input PDF file.'''
        self.lect_path = Path('../lect')
        self.topic = Path(f'./{topic}')
        self.pdf_file = pdf_file
        self.source = self.lect_path.joinpath(self.topic, self.pdf_file)

    def split_pdf(self):
        '''
        Split the PDF pages into separate PNG images.
        dest: location of the PDF file inside the topic's slides directory
        self.slide_titles: stores the slide header from each LaTeX beamer page for future use
        self.slide_num: stores the slide number for future use
        '''
        slide_dir = self.topic.joinpath('slides')
        dest = slide_dir.joinpath(self.pdf_file)
        pdf_obj = fitz.open(dest)
        self.slide_titles = []
        self.slide_num = []
        slide_base = Path(self.pdf_file).stem
        # The method names below are the snake_case spellings used by recent
        # PyMuPDF releases; older releases spelled them getText, getPixmap
        # and pillowWrite.
        for i, page in enumerate(pdf_obj):
            pgtxt = page.get_text()
            # Read in the page text and keep the 2nd item from the back of the
            # split text (the last item is empty when the text ends with a newline).
            # The first item in the '/'-separated list is the page number.
            pagenum = pgtxt.split('\n')[-2].split('/')
            self.slide_num.append(int(pagenum[0]))
            # Get the header. If no header exists in the beamer file,
            # incorrect data will be saved here.
            self.slide_titles.append(page.get_text('blocks')[0][4])
            # The matrix represents the first two columns of a 3 by 3 matrix and
            # can magnify, shear and rotate the page. Here it scales the page by 3
            # in both directions, making the pixmap larger and increasing resolution.
            png = page.get_pixmap(matrix=fitz.Matrix(3, 3))
            out_file = slide_dir.joinpath(f'{slide_base}_{i:03}.png')
            png.pil_save(str(out_file), dpi=(300, 300))
        # Close the document only after every page has been written.
        pdf_obj.close()
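
The listing above assumes every page ends with a footer of the form "page/total" and starts with a header. As its own comment notes, a page without a header stores incorrect data, and a page without a footer would make the int() call fail. Below is a minimal sketch of a more defensive way to pull out the same two pieces of information. It works on plain text, so it can be tried without a PDF. The helper names parse_slide_footer and first_nonempty_line are hypothetical and are not part of PyMuPDF or of my script. Reading the number after the slash as the total slide count, and using the first non-blank line as the title, are also assumptions about how beamer lays out the page.

```python
import re
from typing import Optional, Tuple

# Matches a footer line such as "12/40" or "12 / 40".
# Assumption: the footer is "current slide / total slides" on a line of its own.
FOOTER_RE = re.compile(r"^\s*(\d+)\s*/\s*(\d+)\s*$")


def parse_slide_footer(page_text: str) -> Optional[Tuple[int, int]]:
    """Return (slide_number, total_slides) from the last footer-like line, or None."""
    # Search from the bottom of the page, where the footer is expected.
    for line in reversed(page_text.splitlines()):
        match = FOOTER_RE.match(line)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


def first_nonempty_line(page_text: str, default: str = "") -> str:
    """Return the first non-blank line, used here as a stand-in for the slide title."""
    for line in page_text.splitlines():
        if line.strip():
            return line.strip()
    return default


# Made-up page text for illustration; real text would come from the PDF page.
sample = "Conservation of Energy\nKinetic energy\nPotential energy\n12/40\n"
print(parse_slide_footer(sample))
print(first_nonempty_line(sample))
print(parse_slide_footer("A title slide without a footer\n"))
```

With helpers like these, the loop could skip or flag a page when parse_slide_footer returns None instead of stopping with an exception partway through the slides.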