PDF to reStructuredText

Published: Fri 07 August 2015
By EWS

In Blog.

If everybody else is using Markdown, I like to choose something else for drafting documentation, in this case: reStructuredText (aka RST). I guess I happened upon it because it's the default "go-to" markup choice for Pelican (even though Pelican does still support other standards like Markdown as well as AsciiDoc). A democratically determined consensus is that Markdown achieves its goal of being the closest thing to what one would expect formatted text to look like (that is, the least contrived form of "marked up text") whilst reStructuredText is more powerful. I'm sure the original inventors of US ASCII didn't intend it to have to perform formatting contortions, so this issue belongs firmly in the department of "Hotly Debated Tradeoffs".

Recently I had to convert a document from PDF to "something editable" since editing was required. There are various PDF to MS Word conversion utilities available, with an associated per-usage fee, but I didn't need to be too fussy about format. Some basic eliding of sensitive information was required; otherwise, getting the "gist" across was all that mattered. Adobe have sneakily inserted a pay-for-use option in their Reader to convert to MS Word, as well:

[Image: Adobe Pay per Use]

The source document was large and detailed, and contained all the usual annotations, including:

  • Hierarchically numbered headings;
  • Tables;
  • Figures;
  • Code and program output, both of which needed to be rendered "verbatim".

There's a natural itch that every programmer wants to scratch in this situation:

"Don't do it manually ... write a program"

This is naturally followed by an OCD-like internal battle between the Angel-of-Pragmatism on one shoulder and the Devil-of-Automation on the other (or should it be the other way around?). In my case, after trying various things and discovering the limits imposed by my existing Python rustiness, I opted for:

  1. "Save As" text from Adobe Reader;

  2. Do some simple "fixups" with Vim (score one for Angel-of-Pragmatism), notably:

    1. Replace smart quotes (“”) with their more boring, but easier-to-manage 7-bit ASCII cousin (")

      A handy table for smart double-quote unicode values:

      Quote         Unicode
      ------------  -------
      Left Double   0x201C
      Right Double  0x201D
    2. Replace em-dashes (—) with their more boring, but easier-to-manage 7-bit ASCII cousin (-)

      The unicode value for em-dash is 0x2014.

  3. Write a Python script to convert all hierarchically numbered sections to look like reStructuredText (score one for Devil-of-Automation);

  4. Format bulleted lists manually (score one-hundred for Angel-of-Pragmatism);

  5. Insert RST .. image:: directives pointing to image-snipped versions of the corresponding PDF content in the case of tables and figures. I contemplated writing something to convert the table-dumped text into RST-compliant tables, but that would have been a level of masochism unpalatable to the most hardened self-flagellator.
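The quote and dash fixups from step 2 were done in Vim, but they could also be scripted. The following is only a sketch of that alternative, not what was actually used; the names ASCII_FIXUPS and asciify are made up for illustration. It relies on Python 3's str.translate, which accepts a mapping from code points to replacement strings.

```python
# Hypothetical helper: map each typographic character to a 7-bit ASCII stand-in.
ASCII_FIXUPS = {
    0x201C: '"',  # left double quotation mark
    0x201D: '"',  # right double quotation mark
    0x2014: '-',  # em-dash
}


def asciify(text):
    """Return text with smart double quotes and em-dashes replaced."""
    return text.translate(ASCII_FIXUPS)


if __name__ == '__main__':
    print(asciify('\u201cFoo\u201d \u2014 bar'))
```

Any character not listed in the mapping passes through untouched, so other non-ASCII text in the document is preserved.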

Point 3 was the interesting bit, and here's the Python script that I ended up with:

 1  import sys
 2  import re
 3
 4  h_underlines = ['=', '-', '~', '*', '+', '^']  # one underline character per heading level
 5
 6  input_file = sys.argv[1]
 7
 8  with open(input_file, 'r') as f:
 9      for l in f:
10
11          heading_level = None
12
13          mo = re.match(r'^(\d+\.(?:[\d]+\.?)*) ', l)  # leading section number, e.g. "1.2.3 "
14          if mo:
15              heading_level = len(mo.group(1).rstrip('.').split('.'))  # count the dotted components
16
17          if heading_level:
18              heading_text = l[mo.end():].rstrip()
19              words = len(heading_text.split())
20              if heading_level < 4 and words > 7:  # long line: print it unchanged
21                  print(l.strip())
22              else:
23                  print(heading_text)
24                  print(h_underlines[heading_level-1]*len(heading_text))
25              continue
26
27          print(l.rstrip())

Everybody likes to show their regular expressions off, so I'll kick off with that: line 13 pulls out the heading number, so something like 1.2.3 How to Foo your Bar results in a match of 1.2.3 (no surprises there).
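To see what the pattern does and does not accept, here is a small sketch that runs it against a few sample lines. The sample lines and the section_number helper are invented for illustration; only the regular expression itself comes from the script.

```python
import re

# Same pattern as in the script; group 1 is the section number without the trailing space.
SECTION_NUMBER = re.compile(r'^(\d+\.(?:[\d]+\.?)*) ')


def section_number(line):
    """Return the leading section number of a line, or None if there is none."""
    mo = SECTION_NUMBER.match(line)
    return mo.group(1) if mo else None


if __name__ == '__main__':
    for sample in ['1.2.3 How to Foo your Bar', '2. Introduction',
                   '1 Introduction', 'Plain body text']:
        print(repr(sample), '->', section_number(sample))
```

Note that the pattern insists on at least one dot after the first number, so a bare "1 Introduction" is not treated as a heading, while "2. Introduction" is.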

The only other interesting bit is line 24 which uses a neat feature of Python whereby a character (or string) can be repeated n times simply by multiplying it by n (in our case, n is the length of the text of the heading).

Using the previous example, we end up with the following conversion:

1.2.3 How to Foo your Bar

becomes:

How to Foo your Bar
~~~~~~~~~~~~~~~~~~~
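The conversion above can be reproduced in isolation with a small function. This is a sketch rather than the script itself: the name rst_heading is hypothetical, the word-count check is left out, and the level is computed from the captured number (group 1) so that a trailing dot, as in "2. Introduction", does not count as an extra level.

```python
import re

H_UNDERLINES = ['=', '-', '~', '*', '+', '^']


def rst_heading(line):
    """Return an RST heading for a numbered line, or None if it has no section number."""
    mo = re.match(r'^(\d+\.(?:[\d]+\.?)*) ', line)
    if not mo:
        return None
    # "1.2.3" has three dotted components, so it is a level-3 heading.
    level = len(mo.group(1).rstrip('.').split('.'))
    text = line[mo.end():].rstrip()
    # Multiplying a string by n repeats it n times: one underline character per heading character.
    return text + '\n' + H_UNDERLINES[level - 1] * len(text)


if __name__ == '__main__':
    print(rst_heading('1.2.3 How to Foo your Bar'))
```

With only six underline characters in the list, a heading nested more than six levels deep would raise an IndexError; that limit is inherited from the original list.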

At the end of the day, I spent way too much time formatting bullet lists manually, so Angel-of-Pragmatism thumped the competition ... something tells me my metaphor needs inverting.
