so like many other weekends, i spent the last 2 days wrapping my head around an interesting problem: reading raw PDF bytes and figuring out how a page renders on screen. interestingly, i learnt that PDFs grew out of a "graphical description" language, PostScript, which is turing-complete (the modern PDF spec keeps only a restricted, PostScript-derived set of drawing operators). one could implement pretty much anything they can dream of in PostScript! (of course, only if the interpreter of your choice supports it).
funnily enough, all this goes back to the history of the language, when it was used for telling printers how to literally "draw" vectors/lines/text on a page. so the script tells the print head whether to move left/right/up/down and/or which position (x, y) on the page to go to in the next command.
The PostScript Example
for example, to print "Hello" on a page the script would go like -
/Helvetica findfont
24 scalefont
setfont
100 700 moveto
(Hello) show
showpage
break this down:
- /Helvetica findfont - find the Helvetica font
- 24 scalefont - scale it to 24 points
- setfont - set it as current font
- 100 700 moveto - move cursor to position (100, 700)
- (Hello) show - draw the text "Hello"
- showpage - render the page
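to make the stack-based flow concrete, here is a minimal sketch of a toy interpreter for just these operators, written in Go. it is not a real PostScript implementation: the names (Interp, TextRun, Run) are made up for illustration, fonts are reduced to a size, and string literals must not contain spaces.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// TextRun is a piece of text plus the position it was drawn at.
type TextRun struct {
	X, Y float64
	Text string
}

// Interp is a hypothetical toy interpreter for the operators used above.
type Interp struct {
	stack    []any     // operand stack: float64 numbers and strings
	x, y     float64   // current point
	FontSize float64   // set by scalefont (fonts are otherwise ignored)
	Runs     []TextRun // text collected by "show"
}

func (in *Interp) push(v any) { in.stack = append(in.stack, v) }

func (in *Interp) pop() (any, error) {
	if len(in.stack) == 0 {
		return nil, fmt.Errorf("stack underflow")
	}
	v := in.stack[len(in.stack)-1]
	in.stack = in.stack[:len(in.stack)-1]
	return v, nil
}

func (in *Interp) popNumber() (float64, error) {
	v, err := in.pop()
	if err != nil {
		return 0, err
	}
	n, ok := v.(float64)
	if !ok {
		return 0, fmt.Errorf("expected number, got %v", v)
	}
	return n, nil
}

// Run executes a whitespace-separated program, one token at a time.
func (in *Interp) Run(src string) error {
	for _, tok := range strings.Fields(src) {
		switch {
		case len(tok) >= 2 && tok[0] == '(' && tok[len(tok)-1] == ')':
			in.push(tok[1 : len(tok)-1]) // string literal
		case strings.HasPrefix(tok, "/"):
			in.push(tok) // name literal such as /Helvetica
		case tok == "findfont" || tok == "showpage":
			// no-op in this sketch: the name stays on the stack as a stand-in font
		case tok == "scalefont":
			size, err := in.popNumber()
			if err != nil {
				return err
			}
			in.FontSize = size
		case tok == "setfont":
			if _, err := in.pop(); err != nil {
				return err
			}
		case tok == "moveto":
			// operands were pushed as "x y", so y comes off the stack first
			y, err := in.popNumber()
			if err != nil {
				return err
			}
			x, err := in.popNumber()
			if err != nil {
				return err
			}
			in.x, in.y = x, y
		case tok == "show":
			v, err := in.pop()
			if err != nil {
				return err
			}
			s, ok := v.(string)
			if !ok {
				return fmt.Errorf("show: expected string, got %v", v)
			}
			in.Runs = append(in.Runs, TextRun{X: in.x, Y: in.y, Text: s})
		default:
			n, err := strconv.ParseFloat(tok, 64)
			if err != nil {
				return fmt.Errorf("unknown token %q", tok)
			}
			in.push(n)
		}
	}
	return nil
}

func main() {
	in := &Interp{}
	src := "/Helvetica findfont 24 scalefont setfont 100 700 moveto (Hello) show showpage"
	if err := in.Run(src); err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, r := range in.Runs {
		fmt.Printf("%q at (%g, %g), size %g\n", r.Text, r.X, r.Y, in.FontSize)
	}
}
```

the point of the sketch is only that operands pile up on a stack and each operator consumes what it needs, which is why "100 700" comes before "moveto".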
Why Did I Do This?
one might wonder why i did this, and tell me to "go get a life bro - spend your time outside on a weekend!" yeah, i know, but when i hear that a language is turing-complete, it interests me beyond measure. i just get too excited.
a couple of months back i fell into the rabbit hole of how text is drawn on a PDF, and this week i happened to come across another use case where i needed to grab text from PDFs. that is when i decided to understand the spec, and playfully write my own Go lib to render it.
The Nuances
this sounds simple on paper but is actually nuanced. here's why -
1. Text as Vectors
rendering text on a pdf is similar to drawing vectors: each character is just a shape placed at a position. as we saw, text is just another drawing operation.
2. Chunked Text
text is usually chunked: a single line could be split and put in 5 different places, each preceded by its own "moveto", so one would possibly need to literally "render" the whole script before finding even the first line.
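one way to put chunks back together, once every chunk has a position, is to group them by y and sort each group by x. the sketch below assumes that approach; Chunk, GroupLines and the tolerance value are hypothetical names and numbers, and joining chunks with a single space is a simplification.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strings"
)

// Chunk is a piece of text and the position it was drawn at.
type Chunk struct {
	X, Y float64
	Text string
}

// GroupLines rebuilds reading order: chunks whose Y is within tol of the
// first chunk of a line belong to that line; lines run top to bottom.
func GroupLines(chunks []Chunk, tol float64) []string {
	sorted := append([]Chunk(nil), chunks...) // do not reorder the caller's slice
	// larger Y is higher on the page when the origin is at the bottom-left
	sort.SliceStable(sorted, func(i, j int) bool { return sorted[i].Y > sorted[j].Y })

	var lines [][]Chunk
	for _, c := range sorted {
		n := len(lines)
		if n > 0 && math.Abs(lines[n-1][0].Y-c.Y) <= tol {
			lines[n-1] = append(lines[n-1], c)
		} else {
			lines = append(lines, []Chunk{c})
		}
	}

	out := make([]string, 0, len(lines))
	for _, line := range lines {
		sort.SliceStable(line, func(i, j int) bool { return line[i].X < line[j].X })
		parts := make([]string, len(line))
		for i, c := range line {
			parts[i] = c.Text
		}
		out = append(out, strings.Join(parts, " "))
	}
	return out
}

func main() {
	// chunks arrive in drawing order, not reading order
	chunks := []Chunk{
		{X: 300, Y: 700, Text: "world"},
		{X: 100, Y: 680, Text: "second"},
		{X: 100, Y: 700.2, Text: "hello"},
		{X: 180, Y: 680, Text: "line"},
	}
	for _, l := range GroupLines(chunks, 1.0) {
		fmt.Println(l)
	}
}
```

this only works after the whole page has been processed, which is exactly the "render everything before finding the first line" problem.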
The Experiment
so i began the experiment, and wrote a text extraction lib as the starting point. there were some cool nuances with it too -
1. Reading the Spec
the usual, i had to read the full spec before even being able to begin. the PDF spec is... massive. like 1000+ pages massive.
2. Glyph Space Units
i messed up something called "glyph space units" initially, causing weird spaces between words. postscript uses a cartesian coordinate system for positions and spacing, and glyph widths are expressed in glyph space, where 1000 units correspond to 1 unit of text space. if you forget to apply the adjustment (dividing by 1000, then applying font size and scaling) you end up with gaps that are off by a factor of 1000, which is what i did.
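a minimal sketch of that adjustment, assuming the usual 1000-units-per-text-unit convention and the advance formula as i understand it from the text section of the PDF spec (worth checking against the spec itself). TextState, its fields and Advance are hypothetical names, not the actual lib.

```go
package main

import "fmt"

// TextState holds the parameters that affect how far the cursor advances.
// Field names are made up for this sketch.
type TextState struct {
	FontSize    float64 // e.g. 24 after "24 scalefont"
	CharSpacing float64 // extra space added after every glyph
	WordSpacing float64 // extra space added after a space character
	HScale      float64 // horizontal scaling, 1.0 means 100%
}

// Advance converts a glyph width given in glyph space (1/1000 units)
// into the horizontal distance the cursor moves.
func (ts TextState) Advance(glyphWidth float64, isSpace bool) float64 {
	// the easy-to-forget step: glyph space to text space is a division by 1000
	tx := glyphWidth/1000*ts.FontSize + ts.CharSpacing
	if isSpace {
		tx += ts.WordSpacing
	}
	return tx * ts.HScale
}

func main() {
	ts := TextState{FontSize: 24, HScale: 1}
	width := 500.0 // hypothetical glyph width in glyph space units

	fmt.Println("correct advance:", ts.Advance(width, false))
	// what happens when the division is skipped
	fmt.Println("without /1000: ", width*ts.FontSize)
}
```

comparing the two printed values makes the bug obvious: the second one pushes the next glyph far off the page.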
3. Security Concerns
as interesting as all this is, i realised that if pdf allows you to write an algorithm using stacks, it could lead to arbitrary code execution, which makes it a very plausible attack vector.
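one generic way to limit the damage when interpreting untrusted stack programs is to put hard bounds around the interpreter. the sketch below is an assumption about what such a guard could look like, not something from the spec or from the lib: Guard, Check and the limit values are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrNotAllowed   = errors.New("operator not on allowlist")
	ErrTooManySteps = errors.New("step budget exceeded")
	ErrStackTooDeep = errors.New("operand stack too deep")
)

// Guard is consulted by an interpreter before it executes each operator.
type Guard struct {
	Allowed  map[string]bool // operators the interpreter agrees to run
	MaxSteps int             // upper bound on executed operators
	MaxStack int             // upper bound on operand stack depth
	steps    int
}

// Check returns an error if running op would break one of the limits.
func (g *Guard) Check(op string, stackDepth int) error {
	if !g.Allowed[op] {
		return ErrNotAllowed
	}
	g.steps++
	if g.steps > g.MaxSteps {
		return ErrTooManySteps
	}
	if stackDepth > g.MaxStack {
		return ErrStackTooDeep
	}
	return nil
}

func main() {
	g := &Guard{
		Allowed:  map[string]bool{"moveto": true, "show": true},
		MaxSteps: 1000,
		MaxStack: 100,
	}
	for _, op := range []string{"moveto", "show", "exec"} {
		if err := g.Check(op, 2); err != nil {
			fmt.Printf("%s rejected: %v\n", op, err)
		}
	}
}
```

a step budget matters because a turing-complete input can loop forever, and an allowlist keeps the interpreter from ever touching operators it does not need for text extraction.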
The Go Implementation
here's a simplified version of what i ended up with:
type TextExtractor struct {
	currentPage int
	fonts       map[string]Font
	stack       []float64
}

// ProcessCommands dispatches on operators; pushing operands is elided here.
func (te *TextExtractor) ProcessCommands(commands []string) {
	for _, cmd := range commands {
		switch cmd {
		case "moveto":
			// operands are pushed as "x y", so y comes off the stack first
			y := te.pop()
			x := te.pop()
			te.moveTo(x, y)
		case "show":
			text := te.popString()
			te.showText(text)
		// ... more commands
		}
	}
}

func (te *TextExtractor) showText(text string) {
	// Apply current font transformations
	// Convert glyph space to user space
	// Extract actual text content
}
Next Steps
so next weekend i'll be spending time on sanitizing pdfs before/during rendering. the security implications are non-trivial.
Why This Matters
find postscript interesting? thank Adobe, who created it. read more about it on wikipedia [1].
but more practically, understanding this helps with:
- PDF text extraction (my original use case)
- PDF generation
- Security analysis of PDFs
- Understanding document rendering systems
The Rabbit Hole
what started as a simple text extraction problem led me down this rabbit hole of:
- PostScript language basics
- PDF internal structure
- Font rendering systems
- Coordinate space transformations
- Security implications of turing-complete document formats
Conclusion
was this practical? maybe not immediately. but understanding how things work at the fundamental level is what makes engineering interesting.
sometimes you need to go deep to understand the surface.
references:
[1] PostScript on Wikipedia - https://en.wikipedia.org/wiki/PostScript
[2] PDF Security Considerations - upcoming research
code will be open-sourced once i clean it up a bit more.