Generate PDFs for Phortune invoices
Open, Low, Public

Description

It's common for billing/accounts organizations to have processes which are heavily focused on emailing PDFs around.

We can likely generate a PDF version of an invoice without enormous amounts of difficulty. The pathways to go from HTML to PDF seem very messy and all seem to involve running a headless browser and simulating "Print > To PDF" in it, but actually writing the PDF format directly is not that much of a mess.

FPDF seems to have a functional writer implementation in ~1,800 lines and we can likely cheat through most of this.

Event Timeline

epriestley created this task.

PDF files appear to consist of a series of objects that (mostly) look like this:

<object number> 0 obj
<<

<... various different kinds of data ...>

>>

The object numbers are 1-based. You don't need to write the objects in order, so you can write object 3 before object 1 if you want, but you do have to write an index table of offsets at the end of the file. It looks like you could pick sparse object numbers, but you'd end up with bogus lines in the index table (at a minimum) and maybe have to write actual empty object definitions, so you're probably better off starting at 1 and using every number from there on up.
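
A minimal sketch of the offset bookkeeping described above, assuming contiguous object numbers starting at 1. The `build_pdf()` helper and its input shape are hypothetical, made up for illustration; this is not FPDF's code or the Phabricator implementation. It writes object 3 before objects 1 and 2, records where each object starts, and then emits the index ("xref") table in object-number order.

```php
<?php

// Hypothetical helper: serializes objects and builds the offset index.
// $objects maps 1-based object numbers to dictionary bodies.
function build_pdf(array $objects) {
  $out = "%PDF-1.4\n";
  $offsets = array();

  // Objects may be emitted in any order; only the recorded offsets matter.
  foreach ($objects as $number => $body) {
    $offsets[$number] = strlen($out);
    $out .= "{$number} 0 obj\n<< {$body} >>\nendobj\n";
  }

  // The index table lists entries in object-number order, so sort by key.
  // This sketch assumes the numbers are contiguous (1, 2, 3, ...).
  ksort($offsets);
  $count = count($offsets) + 1;
  $xref_offset = strlen($out);

  $out .= "xref\n0 {$count}\n";
  // Entry 0 is a fixed placeholder (the head of the "free" list).
  $out .= "0000000000 65535 f \n";
  foreach ($offsets as $offset) {
    // Each entry is exactly 20 bytes: 10-digit offset, generation, "n".
    $out .= sprintf("%010d 00000 n \n", $offset);
  }

  $out .= "trailer\n<< /Size {$count} /Root 1 0 R >>\n";
  $out .= "startxref\n{$xref_offset}\n%%EOF\n";

  return $out;
}

// Object 3 is deliberately written before objects 1 and 2.
$pdf = build_pdf(array(
  3 => '/Type /Page /Parent 2 0 R /MediaBox [0 0 612 792]',
  1 => '/Type /Catalog /Pages 2 0 R',
  2 => '/Type /Pages /Kids [3 0 R] /Count 1',
));
```

Writing `$pdf` to disk with `file_put_contents()` should produce a single blank page. The sketch leaves out fonts, resources, and content, so some viewers may be pickier about it than others.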

Additionally, a lot of objects refer to other objects. For example, the root "Catalog" object refers to a "Pages" object, which refers to a list of "Page" objects. All these references need to use the number of the referenced object.
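
One way to avoid hard-coding these numbers is to build the object graph first and assign numbers in a single pass before writing anything. The sketch below is an assumption about how that could look, not how FPDF or the eventual Phabricator classes work: `SketchPDFObject`, `assign_numbers()`, `render_reference()`, and `render_object()` are all hypothetical names, and required keys such as `/Count` are omitted to keep the focus on references.

```php
<?php

// Hypothetical minimal object: a type name plus named references. Each
// reference is either one object or a list of objects (like /Kids).
final class SketchPDFObject {
  public $type;
  public $refs = array();
  public $number;

  public function __construct($type) {
    $this->type = $type;
  }
}

// Walks the graph from the root, handing out 1, 2, 3, ... in visit order.
function assign_numbers(SketchPDFObject $root) {
  $ordered = array();
  $queue = array($root);
  while ($queue) {
    $object = array_shift($queue);
    if ($object->number !== null) {
      continue; // Already numbered, e.g. a /Parent back-reference.
    }
    $object->number = count($ordered) + 1;
    $ordered[] = $object;
    foreach ($object->refs as $value) {
      $children = is_array($value) ? $value : array($value);
      foreach ($children as $child) {
        $queue[] = $child;
      }
    }
  }
  return $ordered;
}

// Renders a reference as "<number> 0 R", or a list as "[<n> 0 R ...]".
function render_reference($value) {
  if (is_array($value)) {
    return '['.implode(' ', array_map('render_reference', $value)).']';
  }
  return "{$value->number} 0 R";
}

function render_object(SketchPDFObject $object) {
  $parts = array("/Type /{$object->type}");
  foreach ($object->refs as $key => $value) {
    $parts[] = "/{$key} ".render_reference($value);
  }
  return "{$object->number} 0 obj\n<< ".implode(' ', $parts)." >>\nendobj\n";
}

$catalog = new SketchPDFObject('Catalog');
$pages = new SketchPDFObject('Pages');
$page = new SketchPDFObject('Page');

$catalog->refs['Pages'] = $pages;
$pages->refs['Kids'] = array($page);
$page->refs['Parent'] = $pages;

$ordered = assign_numbers($catalog);
$rendered = implode('', array_map('render_object', $ordered));
```

Because numbering happens after the graph is complete, a back-reference like the page's `/Parent` resolves without any guessing about which number the parent will get.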

I'm cribbing heavily from FPDF, which has a series of different, very clever solutions for managing these object references, involving a mixture of hard-codes, guesses, and maybe a pinch o' the ol' luck.

So far, I'm writing a PDF which technically opens but isn't showing text yet. An identical-on-the-face-of-it PDF from FPDF shows text, but I can't figure out what's different with diff since the FPDF output is in a wacky order.

Hopefully this is just FPDF being wacky and not some kind of actual requirement for PDFs to work. I'm somewhat heartened and somewhat confused that Print > Save as PDF produces output in a different order, albeit also a wacky non-sequential order.

Also, this is sort of interesting -- the Mac PDF does this:

10 0 obj
<< /Length 11 0 R ...

...
>>
11 0 obj
1116
endobj

I think that instead of encoding "/Length 1116", it's encoding "Length $OBJECT_11" and then writing the length into object #11. Powerful!
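
A rough sketch of that indirect-length trick, under the assumption that the reading above is right. The helper name `render_stream_with_indirect_length()` and the object numbers 10 and 11 are illustrative only. The presumable appeal is that the writer can emit the stream data first and only needs to know the byte count afterwards, when it writes the second object.

```php
<?php

// Hypothetical helper: emits a stream object whose /Length is a reference to
// a second object, followed by that second object holding the byte count.
function render_stream_with_indirect_length($number, $length_number, $data) {
  $out = "{$number} 0 obj\n";
  $out .= "<< /Length {$length_number} 0 R >>\n";
  $out .= "stream\n";
  $out .= $data;
  // The newline before "endstream" is not counted in the length.
  $out .= "\nendstream\nendobj\n";

  // The length is only needed here, after the data has been written.
  $out .= "{$length_number} 0 obj\n";
  $out .= strlen($data)."\n";
  $out .= "endobj\n";

  return $out;
}

$content = "BT /F1 16.00 Tf 31.19 794.57 Td (Hello World!) Tj ET";
$chunk = render_stream_with_indirect_length(10, 11, $content);
```

If the content is already fully in memory, as in this example, writing `/Length` inline with the literal byte count is simpler and should work just as well.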

in-depth-tutorial.php
<?php

require_once 'scripts/init/init-script.php';


$info_object = new PhabricatorPDFInfoObject();

$catalog_object = new PhabricatorPDFCatalogObject();

$pages_object = new PhabricatorPDFPagesObject();
$catalog_object->setPagesObject($pages_object);

$page_object = new PhabricatorPDFPageObject();
$pages_object->addPageObject($page_object);

$contents_object = new PhabricatorPDFContentsObject();
$contents_object->setRawContent(<<<EOCONTENT
2 J
0.57 w
BT /F1 16.00 Tf ET
BT 31.19 794.57 Td (Hello World!) Tj ET

EOCONTENT
  );
$page_object->setContentsObject($contents_object);

$resources_object = new PhabricatorPDFResourcesObject();
$page_object->setResourcesObject($resources_object);

$font_object = new PhabricatorPDFFontObject();
$resources_object->addFontObject($font_object);

$generator = id(new PhabricatorPDFGenerator())
  ->setCatalogObject($catalog_object)
  ->setInfoObject($info_object);

$iterator = $generator->newIterator();
foreach ($iterator as $chunk) {
  echo $chunk;
}

Just some observations after popping open PDF invoices which have shown up in my inbox recently:

  • Google's PDF has no data identifying the generator.
  • MailGun's PDF is generated by wkhtmltopdf. (I did attempt to get this working, but was less excited about it when I realized it requires the host to be running an X11 server, or requires compiling against a patched Qt? This felt like a very deep swamp to wade into.)
  • According to the document metadata, SendGrid's PDF is generated by "Zachary Nelson", an accountant who works at SendGrid. The "Aspose.Words" Java module is used to write a Word file, which Zachary opens in Word and prints to PDF. He then sends you an email with the PDF attached. 😃

I've worked with wkhtmltopdf several times, and I don't remember it actually requiring any special setup. In all cases, I basically dropped the binary on the machine (although it's possible all my machines had X11 built in for some reason; I've never intentionally installed X11).

When I apt-get install wkhtmltopdf (or something like that) on aux001, which is Ubuntu 14, I get this:

$ wkhtmltopdf
...

Name:
  wkhtmltopdf 0.9.9

...

Reduced Functionality:
  This version of wkhtmltopdf has been compiled against a version of QT without
  the wkhtmltopdf patches. Therefore some features are missing, if you need
  these features please use the static version.

  Currently the list of features only supported with patch QT includes:

...

 * Running without an X11 server.

...

Then:

$ wkhtmltopdf http://example.org out.pdf
wkhtmltopdf: cannot connect to X server 

This may not actually be difficult to resolve, but I'd like to be convinced that generating PDFs is genuinely hard before we add a Webkit dependency, even if X11 isn't really a dependency. So far, I haven't hit any major issues.

Just my two cents here, but Chrome has a headless mode that lets you "print to PDF": https://developers.google.com/web/updates/2017/04/headless-chrome

But it might require an X server too.

Yeah, I'm really hoping to not require us to run an entire browser or depend on an external service to generate PDFs. The approaches I found in my research were:

| Approach | Major Drawback |
| --- | --- |
| wkhtmltopdf | Requires Webkit |
| Node html-pdf | Requires PhantomJS, which is Webkit in a ghost costume |
| Headless Chrome (or Node: chrome-headless-render-pdf) | Requires Chrome |
| Pandoc/LaTeX | Pro: Writing invoices in LaTeX. Con: Markup > LaTeX > pdflatex > PDF still feels like a lot of steps. |
| FPDF | PHP libraries are inherently suspicious, a lot of the code is very fragile/magical |
| PHP pdflib bindings | Scary API, dependency on PDFlib, PDFlib seems weird (commercial-ish?) |
| mPDF | FPDF in a ghost costume, code has a bunch of very fragile FPDF-isms |
| TCPDF | Everyone writing PHP but me is crazy / TCPDF is mostly contained in one 24,512 line file / full of fragile patterns |

I'd eagerly trade away almost any other benefit to avoid adding an external dependency we don't control.

For example, if the resulting PDF looks incredibly stupid but we don't have to keep an eye on security vulnerabilities in all of Chrome, that's a huge win in my book.

Since master can already generate stupid-looking PDFs after I spent an afternoon fiddling with it, I'd be surprised if we end up on any path other than first-party PDF generation, although I haven't actually made it through the PostScript part yet.

You can add libreoffice/soffice to your list. Last time we had to do something like this, we used soffice --headless --convert-to pdf
Not really better than running a browser, though... In our use case we needed to use merchant-uploaded doc/docx templates in combination with our data, so we had even fewer options.
But yeah, most options around PDF generation are clunky.