Diving into WebAssembly for an upcoming work presentation, I was enthusiastic about showcasing its capabilities to my colleagues. But as with many technical explorations, theory met practice in an unexpected way. While gearing up for the presentation, I realized I had a personal challenge on my hands: uploading a hefty PDF to my bank's portal, which had a restrictive file size limit.
To my surprise, these two seemingly unrelated ventures suddenly appeared interconnected...

Online file converters are not safe

Now, why not just default to an online PDF compressor? It's about the content—my confidential bank statements. Entrusting such vital information to a random online server is like playing with fire. There's more at stake than just privacy; there's the undeniable issue of data security. Moreover, in our current landscape where impersonation has become a common tool for malicious activities, the risk of personal data falling into the wrong hands could lead to catastrophic consequences.

The idea

My aim was to combine the accessibility of a web interface (so simple even my mum could use it), the security of running a script locally on my machine, and the efficacy of a top-tier PDF compressor.

WebAssembly to the rescue?

WebAssembly is a binary instruction format that lets code written in languages like C and C++ run in the browser. This provides an opportunity to port almost any existing codebase into a web environment. Taking advantage of this, one could integrate even long-standing, robust tools like Ghostscript—a leading library for PDF command-line manipulation that's been around since the 1980s.
By leveraging WebAssembly, it would become feasible to utilize Ghostscript's powerful PDF processing capabilities directly within a browser.
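
To make "binary instruction format" a little more concrete, here is a minimal sketch that has nothing to do with Ghostscript: a hand-assembled module exporting a single add function, instantiated from JavaScript. The byte array and the add export are purely illustrative; real projects get their .wasm files from a compiler toolchain such as Emscripten.

```javascript
// A tiny hand-written WebAssembly module (illustrative only) that exports
// add(a, b) for two 32-bit integers.
const wasmBytes = new Uint8Array([
    0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // "\0asm" magic + version 1
    0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type section: (i32, i32) -> i32
    0x03, 0x02, 0x01, 0x00,                               // function section: one function of type 0
    0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export section: "add" -> function 0
    0x0a, 0x09, 0x01, 0x07, 0x00,                         // code section: one body, no locals
    0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b,                   // local.get 0, local.get 1, i32.add, end
]);

// Synchronous compilation is fine for a module this small; larger modules
// should go through WebAssembly.instantiate or instantiateStreaming.
const wasmModule = new WebAssembly.Module(wasmBytes);
const wasmInstance = new WebAssembly.Instance(wasmModule, {});

console.log(wasmInstance.exports.add(2, 3)); // 5
```

Ghostscript compiled to WASM is the same idea at a much larger scale, with Emscripten generating both the binary and the JavaScript glue code around it.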

And that's exactly what GitHub user ochachacha did: he built a PostScript file viewer that runs in the browser. Ghostscript, compiled to WASM, converts the PostScript file into a PDF, and the browser then renders that PDF natively.

My turn

The work of ochachacha was interesting but I wanted something different:

  • it would need to run in a web page and not in a Chrome extension
  • it would need to run a different command: compressing a PDF instead of converting a PostScript file to PDF
  • it would need to use a modern bundler so that the WASM is loaded only when required (the Ghostscript WASM is around 18MB!). We will use Vite.
  • it would need to use a modern frontend framework I am familiar with. We will use React.
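
The "loaded only when required" requirement can be sketched as a memoized loader. The helper name createLazyLoader is hypothetical, and the sketch assumes the bundler emits a separate chunk for code reached through a dynamic import(), which is how Vite treats dynamic imports.

```javascript
// Hypothetical helper: run an expensive loader at most once, on first use.
function createLazyLoader(loader) {
    let pending = null;
    return function load() {
        if (pending === null) {
            pending = loader(); // in a real app: () => import("./gs.js")
        }
        return pending;
    };
}

// Stand-in loader so the sketch runs without a bundler.
let loadCount = 0;
const loadGhostscript = createLazyLoader(() => {
    loadCount += 1;
    return Promise.resolve({ name: "fake-gs-module" });
});

loadGhostscript();
loadGhostscript();
console.log(loadCount); // 1
```

Nothing heavy is fetched until the first call, and later calls reuse the same promise instead of loading the module again.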

Crunchy bits

Compression command

In my opinion, the best balance between quality and size for a PDF is the /ebook preset, which downsamples images to 150 dpi (72 dpi is the more aggressive /screen preset). It doesn't destroy the text, it compresses the images properly, and everything still looks nice when printed. It's a delight.

You can obtain it with:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf
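
Ghostscript's pdfwrite device accepts other -dPDFSETTINGS presets besides /ebook (/screen, /printer, /prepress and /default). Here is a small sketch of building the same argument list for any of them. buildCompressionArgs is a hypothetical helper name, not part of Ghostscript or of this project.

```javascript
// Presets accepted by -dPDFSETTINGS on the pdfwrite device.
const PDF_PRESETS = ["screen", "ebook", "printer", "prepress", "default"];

// Hypothetical helper: returns the argument list for one preset.
function buildCompressionArgs(preset = "ebook", input = "input.pdf", output = "output.pdf") {
    if (!PDF_PRESETS.includes(preset)) {
        throw new Error(`Unknown PDFSETTINGS preset: ${preset}`);
    }
    return [
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.4",
        `-dPDFSETTINGS=/${preset}`,
        "-dNOPAUSE",
        "-dQUIET",
        "-dBATCH",
        `-sOutputFile=${output}`,
        input,
    ];
}

// Prints the arguments for the smaller, lower-quality /screen preset.
console.log(buildCompressionArgs("screen").join(" "));
```

An array like this is also the shape Emscripten expects when a command-line program is run in the browser instead of a shell.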

Passing the file around

The UI looks like this:

Passing the file between the frontend and the WASM is not trivial. Here is how I did it:

  • Once the user selects the file, we create a local URL with createObjectURL
  • Once the user submits the PDF, we call the _GSPS2PDF function, which I barely changed:
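// Load the Emscripten glue code lazily. The parameters are leftovers from the
// original code and are unused: the dynamic import lets the bundler split gs.js out.
function loadScript(url, onLoadCallback) {
    return import("./gs.js");
}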

export function _GSPS2PDF(dataStruct, responseCallback, progressCallback, statusUpdateCallback) {
    // first read the PDF data back from its Blob URL
    var xhr = new XMLHttpRequest();
    xhr.open("GET", dataStruct.psDataURL);
    xhr.responseType = "arraybuffer";
    xhr.onload = function () {
        // release the URL
        window.URL.revokeObjectURL(dataStruct.psDataURL);
        // set up the Emscripten environment
        var Module = {
            preRun: [function () {
                const FS = window.FS;
                // write the uploaded bytes into Emscripten's virtual filesystem
                FS.writeFile('input.pdf', new Uint8Array(xhr.response));
            }],
            postRun: [function () {
                const FS = window.FS;
                var uarray = FS.readFile('output.pdf', {encoding: 'binary'}); //Uint8Array
                var blob = new Blob([uarray], {type: "application/octet-stream"});
                var pdfDataURL = window.URL.createObjectURL(blob);
                responseCallback({pdfDataURL: pdfDataURL, url: dataStruct.url});
            }],
            arguments: ['-sDEVICE=pdfwrite', '-dCompatibilityLevel=1.4', '-dPDFSETTINGS=/ebook', '-dNOPAUSE', '-dQUIET', '-dBATCH',
                '-sOutputFile=output.pdf', 'input.pdf'],
            print: function (text) {
                statusUpdateCallback(text);
            },
            printErr: function (text) {
                statusUpdateCallback('Error: ' + text);
                console.error(text);
            },
            setStatus: function (text) {
                if (!Module.setStatus.last) Module.setStatus.last = {time: Date.now(), text: ''};
                if (text === Module.setStatus.last.text) return;
                var m = text.match(/([^(]+)\((\d+(\.\d+)?)\/(\d+)\)/);
                var now = Date.now();
                if (m && now - Module.setStatus.last.time < 30) // if this is a progress update, skip it if too soon
                    return;
                Module.setStatus.last.time = now;
                Module.setStatus.last.text = text;
                if (m) {
                    text = m[1];
                    progressCallback(false, parseInt(m[2]) * 100, parseInt(m[4]) * 100);
                } else {
                    progressCallback(true, 0, 0);
                }
                statusUpdateCallback(text);
            },
            totalDependencies: 0
        };
        Module.setStatus('Loading Ghostscript...');
        window.Module = Module;
        loadScript('gs.js', null);
    };
    xhr.send();
}

Step by step:

  1. Use xhr to fetch the PDF content as an array buffer from a Blob URL.
  2. Once the content is loaded (xhr.onload), release the Blob URL and prepare the Emscripten Module object for the WASM execution.
  3. preRun: Happens before the WASM module runs. Here, the fetched file data is written to the Emscripten virtual filesystem as 'input.pdf'. This step is crucial because WASM cannot interact directly with the actual filesystem but can access this virtual one.
  4. arguments: This provides command-line arguments to the WASM module, just as if you were running a console-based program. In this instance, the arguments instruct Ghostscript to convert 'input.pdf' into a compressed 'output.pdf'.
  5. postRun: Executes after the WASM module has finished its processing. 'output.pdf', created by the WASM module, is read from the virtual filesystem. It is then converted into a Blob and, subsequently, a Blob URL, which is passed back through responseCallback for further use.
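
Because the result only arrives through a callback fired from postRun, callers may find it convenient to wrap the whole flow in a Promise. Below is a minimal sketch, assuming a function with the same four-argument signature as _GSPS2PDF. compressAsync is a hypothetical name, and the stand-in fakeCompress exists only so the snippet runs outside a browser.

```javascript
// Hypothetical wrapper around a callback-style compressor such as _GSPS2PDF.
// Note: the callback API has no error callback, so this Promise never rejects.
function compressAsync(compressFn, psDataURL, onStatus = () => {}) {
    return new Promise((resolve) => {
        compressFn(
            { psDataURL },
            (result) => resolve(result.pdfDataURL), // called once the output is ready
            () => {},                               // progress updates are ignored here
            onStatus                                // status and error text
        );
    });
}

// Stand-in with the same signature, used only for illustration.
function fakeCompress(dataStruct, responseCallback, progressCallback, statusUpdateCallback) {
    statusUpdateCallback("Loading Ghostscript...");
    responseCallback({ pdfDataURL: "blob:compressed", url: dataStruct.url });
}

const statusLog = [];
const compressed = compressAsync(fakeCompress, "blob:input", (text) => statusLog.push(text));
compressed.then((url) => console.log(url, statusLog));
```

With such a wrapper, a component could simply await the compressed Blob URL instead of nesting callbacks.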

The frontend

Peeking at the frontend code, it's clear how handy abstraction can be. With all the heavy-duty logic tucked away in the WebAssembly functions, our frontend stays clean and simple. We manage everything with just a few states (init, selected, loading, toBeDownloaded), making the app's flow straightforward. Functions like changeHandler and onSubmit take care of user interactions, no fuss. The setup is a great example of doing the hard work behind the scenes, while keeping things easy up front.

    const [state, setState] = useState("init")
    const [file, setFile] = useState(undefined)
    const [downloadLink, setDownloadLink] = useState(undefined)

    function compressPDF(pdf, filename) {
        const dataObject = {psDataURL: pdf}
        _GSPS2PDF(dataObject,
            (element) => {
                console.log(element);
                setState("toBeDownloaded")
                loadPDFData(element, filename).then(({pdfURL}) => {
                    setDownloadLink(pdfURL)
                });
            },
            (...args) => console.log("Progress:", JSON.stringify(args)),
            (element) => console.log("Status Update:", JSON.stringify(element)))
    }

    const changeHandler = (event) => {
        const file = event.target.files[0]
        const url = window.URL.createObjectURL(file);
        setFile({filename: file.name, url})
        setState('selected')
    };

    const onSubmit = (event) => {
        event.preventDefault();
        const {filename, url} = file;
        compressPDF(url, filename)
        setState("loading")
        return false;
    }
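
The four states form a small linear state machine. One way to make the allowed transitions explicit is a lookup table. The event names below (select, submit, done) and the nextState helper are illustrative assumptions, not code from the app.

```javascript
// Allowed transitions between the UI states (event names are hypothetical).
const TRANSITIONS = {
    init: { select: "selected" },
    selected: { submit: "loading" },
    loading: { done: "toBeDownloaded" },
    toBeDownloaded: {},
};

const hasOwn = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Returns the next state, or the current one if the event is not allowed.
function nextState(state, event) {
    if (hasOwn(TRANSITIONS, state) && hasOwn(TRANSITIONS[state], event)) {
        return TRANSITIONS[state][event];
    }
    return state;
}

let uiState = "init";
for (const event of ["select", "submit", "done"]) {
    uiState = nextState(uiState, event);
}
console.log(uiState); // toBeDownloaded
```

A table like this makes it obvious that, for example, submitting before a file is selected leaves the app in the init state.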

Result

The result is live. If you keep an eye on the network tab while using it, you will see that nothing gets sent to any server: https://laurentmmeyer.github.io/ghostscript-pdf-compress.wasm/

The code is open source and available at https://github.com/laurentmmeyer/ghostscript-pdf-compress.wasm