An overview of PDF inaccessibility
There’s a lot of help online about making Portable Document Format (PDF) accessible. Even with all the advice out there, I still encounter people who find it difficult to make their documents friendly to people with disabilities.
It seems a lot of people have a hard time translating what’s in a tutorial to the PDF they’re working on. Unlike fixing accessibility errors in HTML, the solution for most PDF errors is usually the same for any PDF. It’s actually difficult to have multiple solutions when fixing accessibility errors in PDFs. Unlike HTML, a PDF that’s only got rudimentary accessibility is usually still too hard to use.
But why? What makes PDF inaccessible in the first place?
Different approaches to the same problem
Both HTML and PDF are documents. Both need some sort of application to read them. But what’s the fundamental difference between an HTML document and a PDF document?
An HTML document is made with its structure in mind (at least it should be). On the other hand, PDFs are made with the layout in mind. PDFs essentially make your computer act like a printer. The output strives to be an exact representation of what the original document looked like.
In HTML, the structure needs to be established in order to achieve the desired output. Typing a bunch of text into a text editor will have different outcomes when making HTML than a PDF. Formatting such as a clump of spaces or many hits of the enter key disappears when an HTML file is rendered. A PDF will look exactly the way it did in your text editor with formatting applied.
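To make the whitespace point concrete, here is a rough Python sketch of what a browser does to text in ordinary HTML flow. The `collapse_whitespace` name is made up for this illustration, and real browsers apply more detailed rules, so treat it as an approximation only.

```python
import re

def collapse_whitespace(text):
    """Approximate HTML rendering: any run of spaces, tabs, or
    line breaks is shown as a single space."""
    return re.sub(r"\s+", " ", text).strip()

# What was typed in the text editor, extra spaces and blank lines included.
typed = "Dear   reader,\n\n\n    welcome."

rendered = collapse_whitespace(typed)
print(rendered)  # Dear reader, welcome.
```

A PDF made from the same text would keep every gap, because it records where things were drawn rather than what was typed.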
Yet, there’s an important consideration going on with these two files. HTML is displaying text on a screen, while the PDF is displaying a representation of the text on the screen. In reality, what you’re seeing in a PDF is not actually text.
First some history
Remember I said that PDFs make your computer act like a printer? There’s a reason for that.
The PDF file format is based on a subset of the PostScript page description language. A page description language is a programming language that describes the appearance of a printed page.
PostScript changed the way printers work. Before PostScript, it was hard to print text along with images. Vector drawings took expensive machines to print. Glyphs for text had to be physically changed if you wanted to use a different font. Printing raster images required separate people to write programs for various printers.
PostScript converts pictures and text into a series of straight lines and Bézier curves. Vectors let you create a file with a code-based representation for everything that exists on a page. In years past, many printers shipped with fonts, but with PostScript, people could use fonts like those on their computer. While these features were super cool, PostScript came with its share of nuisances.
For example, while it was possible to display PostScript on the screen, it wasn’t very practical. If you wanted to view page 150, pages 1 through 149 had to process first. That took a lot of processing power on computers, which made the process very expensive.
The Portable Document Format still generates output like PostScript. But it also introduced a structure storage system and advanced compression. And it interpreted the PostScript through tokenization. Tokenization is a way of breaking up something into a series of chunks to classify and label what the code means. Storing tokens that repeat allows software to display page 150 without having to render any of the pages before it.
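The jump-to-page-150 idea can be sketched in a few lines of Python. This is a toy model and not the real PDF file layout: the `build_file` and `read_page` names and the length-prefixed byte format are assumptions made up for the example. A real PDF keeps a cross-reference table of byte offsets that serves a similar purpose.

```python
import io

def build_file(pages):
    """Write pages back to back and record the byte offset of each one."""
    buffer = io.BytesIO()
    offsets = []
    for text in pages:
        offsets.append(buffer.tell())               # where this page starts
        data = text.encode("ascii")
        buffer.write(len(data).to_bytes(4, "big"))  # 4-byte length prefix
        buffer.write(data)
    return buffer.getvalue(), offsets

def read_page(blob, offsets, page_number):
    """Jump straight to one page (1-based) without reading earlier pages."""
    start = offsets[page_number - 1]
    length = int.from_bytes(blob[start:start + 4], "big")
    return blob[start + 4:start + 4 + length].decode("ascii")

pages = [f"content of page {n}" for n in range(1, 201)]
blob, offsets = build_file(pages)

page_150 = read_page(blob, offsets, 150)
print(page_150)  # content of page 150
```

The reader never touches pages 1 through 149. It looks up one offset and starts reading there, which is the practical difference from having to process a PostScript file from the top.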
PDF also offered the ability to embed fonts inside the file format. This meant that end users didn’t need to have a bazillion fonts on their computers. PDF draws text on the page as vector lines and curves that are derived from the embedded fonts.
A PDF is a visual representation of the source document. To display where text resides on a page, it uses the page as the reference for the text’s location. The structure of the page defines its content because it is a set of printing instructions.
On its own, there is no inherent structure assigned to the objects in a PDF. Everything in a PDF exists in relation to the layout of the page. For PDF documents to have structure, there needs to be a means to separate the useless content from the content that matters (after many confusing conversations with me, this has been a dream of my wife’s for years).
There are three basic ways content in a PDF is rendered. The first is how content is presented to the end user. This is the primary purpose of PDFs: displaying electronic print output on your computer screen. The second is how content is organized based on the way the document looks when you’re looking at it on the computer. The last way is optional. Defined tags can provide meaning to those tokens I mentioned earlier.
A closer look behind the scenes
PDF is basically a sequence of bytes. These bytes’ values can be represented as defined in ASCII plus white space characters. When the bytes are grouped together, they’re called tokens. The tokens are the means of describing what you see in a PDF. Objects are data structures built from small groups of these tokens, drawn from a small set of basic object types.
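Here is a minimal sketch of what grouping bytes into tokens can look like. The sample line uses real PDF text operators (`BT`, `Tf`, `Td`, `Tj`, `ET`), but the `tokenize` function and its pattern are a simplification I made up for illustration. A real PDF parser handles many more cases, such as nested parentheses, hex strings, arrays, and dictionaries.

```python
import re

# Simplified token classes. Real PDF syntax has more of them.
TOKEN_PATTERN = re.compile(
    r"""
    (?P<string>\((?:[^()\\]|\\.)*\))        # literal string such as (Hello)
  | (?P<name>/[^\s/()<>\[\]]+)              # name such as /F1
  | (?P<number>[+-]?(?:\d+\.?\d*|\.\d+))    # integer or real number
  | (?P<operator>[A-Za-z*]+)                # operator keyword such as Tj
    """,
    re.VERBOSE,
)

def tokenize(stream):
    """Return (kind, text) pairs for each token found in the stream."""
    return [(m.lastgroup, m.group()) for m in TOKEN_PATTERN.finditer(stream)]

# Begin text, pick font F1 at 12 points, move to (72, 720), show "Hello", end text.
sample = "BT /F1 12 Tf 72 720 Td (Hello) Tj ET"

tokens = tokenize(sample)
for kind, text in tokens:
    print(f"{kind:<9}{text}")
```

Notice that nothing in the token list says "this is a heading" or "this is a paragraph." It only says which font to use, where to go on the page, and what to draw.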
A PDF file consists of one or more pages and the objects in it. The fact that it actually contains a page is very important. All content that came from the application that created the PDF is broken down into tokenized objects. These tokens describe the graphics and text on each page.
Images in PDF documents aren’t added to the PDF file. Images are extracted and represented as data. Objects describe their transparency, values, location, color space, and so on. But the raw image is stored in the PDF file structure so it can be extracted from the PDF.
Like images, text is extracted when the PDF is created. Therefore, text in a PDF isn’t actual characters on the page. It is a rendering of lines and Bézier curves that represent the shapes of letters and numbers that you’d see on the page. From an object perspective, there’s no actual correlation between the text on the page and what you see when you look at a PDF. When you see a table, your mind visually interprets it as a table. In a PDF, a table is just a sequence of objects, representing other objects, representing bytes on the page. The content in a PDF needs to be structured if you want to be able to extract text from the PDF.
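A small Python sketch shows why a table is "just a sequence of objects." The `TextRun` class and the coordinates are invented for this example. The only PDF fact it leans on is that the default page origin is the bottom-left corner, so a larger `y` means higher on the page.

```python
from dataclasses import dataclass

@dataclass
class TextRun:
    x: float   # distance from the left edge of the page
    y: float   # distance from the bottom edge of the page
    text: str

# The order in which an authoring tool happened to write the runs.
runs = [
    TextRun(300, 700, "Price"),
    TextRun(72, 680, "Apples"),
    TextRun(72, 700, "Item"),
    TextRun(300, 680, "$2"),
]

def stream_order(runs):
    """Read the runs in the order they appear in the file."""
    return " ".join(r.text for r in runs)

def visual_order(runs):
    """Guess a reading order: top to bottom, then left to right."""
    ordered = sorted(runs, key=lambda r: (-r.y, r.x))
    return " ".join(r.text for r in ordered)

print(stream_order(runs))  # Price Apples Item $2
print(visual_order(runs))  # Item Price Apples $2
```

Even the sorted version is only a guess, and it still yields a flat string. Nothing tells you that "Item" and "Price" are column headers or that "Apples" and "$2" belong to the same row.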
When you need to interact with multimedia on a page, PDF adds something called annotations to the file. Annotations are objects layered over the content to provide interactivity.
Objects that make up text, images, and annotations aren’t connected to each other. Instead, these objects relate to where they are in relation to the page. As objects are independent from one another, they only describe what the page looked like before it became a PDF.
What tags do
PDFs are essentially an organized mess. Letters, numbers, and spaces aren’t automatically connected to themselves or to the graphics they sit next to. Any hyperlinks on the page are only clickable rectangles that float willy-nilly above the content. You can actually move these clickable rectangles separately from the URLs that happen to be underneath them.
For content to make any kind of sense, you must define the relationships between these objects. You do that by giving a PDF what’s called a logical structure. The logical structure ranks groups of objects on the page. Objects receive attributes that further define their meaning. The logical structure lets you organize objects independently from the chaos of the PDF content. Under the hood, logical structure adds operands around the tokens in the content streams.
Operands give special instructions to the tokens to provide undefined semantics. To define tokens, tags must be included along with the logical structure. Tags are a set of instructions for PDF Objects that provide meaning and purpose within the logical structure. Tags can be organized by adding them to the “structure tree,” also known as a “tag tree.” This contains the logical structure of a PDF. The tag tree has a root, where all structure begins. Inside this root is the first child and subsequent grouped structure elements, represented by tags. That first child generally describes the type of document that’s being structured. The first child of most PDFs will be a <Document> tag explaining that the PDF is, well, a document.
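As a rough picture of a tag tree, here is a Python sketch. The `StructElem` class and `outline` function are hypothetical names for this illustration. The tag names (`Document`, `H1`, `P`, `L`, `LI`) are standard structure types, but a real structure tree stores references to page content rather than plain text, and real list items carry more nesting than shown here.

```python
from dataclasses import dataclass, field

@dataclass
class StructElem:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)

def outline(elem, depth=0):
    """Return the tree as indented lines, one per structure element."""
    label = f"<{elem.tag}>" + (f" {elem.text}" if elem.text else "")
    lines = ["  " * depth + label]
    for child in elem.children:
        lines.extend(outline(child, depth + 1))
    return lines

# The first child of the root is the <Document> tag; everything hangs off it.
tree = StructElem("Document", children=[
    StructElem("H1", "An overview of PDF inaccessibility"),
    StructElem("P", "First paragraph"),
    StructElem("L", children=[
        StructElem("LI", "First item"),
        StructElem("LI", "Second item"),
    ]),
])

print("\n".join(outline(tree)))
```

The tree exists alongside the page content. The page can still draw its objects in any order it likes, while the tree says what each piece is and how the pieces nest.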
Consider a simplified example: Suppose that you were to add a tag root to the PDF in Acrobat. At this point, there’d be no actual tags associated with the document yet. In the PDF’s source code, all PDF objects now have a bunch of operands, or markers, assigned to them. These markers expose every single object in the PDF, regardless of how useful this information is.
In order to provide a semantic relationship between what’s important and what’s not, we associate tags to the content. This separates real content from the content used to construct the page. For the parts of the document that are not important to understand the meaning of the content, we add a special designation. That designation is called an artifact.
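The split between real content and artifacts can be sketched like this. The list of `(tag, text)` pairs is a made-up stand-in for marked content on a page, and `real_content` is a hypothetical helper, not something from a PDF library.

```python
# Each pair is (tag, text). "Artifact" marks content that only exists
# to construct the page, such as running headers and page numbers.
page_content = [
    ("Artifact", "Quarterly Report - Confidential"),
    ("H1", "Results"),
    ("P", "Revenue grew this quarter."),
    ("Artifact", "Page 3 of 12"),
]

def real_content(items):
    """Keep only the content a reader needs; skip artifacts."""
    return [text for tag, text in items if tag != "Artifact"]

print(real_content(page_content))
# ['Results', 'Revenue grew this quarter.']
```

Software that respects the designation can skip the running header and the page number and read only the heading and the paragraph.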
The PDF specification, ISO 32000, defines how certain content types need to be marked up. For example, hyperlinks must be tagged to associate the hyperlink with both the clickable rectangle and the text or image it belongs to. The PDF specification explains the requirements about how to use specific tags. But it doesn’t always require you to use specific tags when associating content.
For example, a PowerPoint presentation that’s exported to a PDF will represent its slides as containers. No matter what layout you use for your slides, each slide within that PDF will contain a first-level heading and a slew of paragraphs, even if the slide itself were to have the appearance of a bulleted list. In other words, nothing requires that elements within a PDF be tagged as the type they look like.
“Tagged PDF” vs Accessible PDF
ISO 32000 doesn’t include a requirement for software to create PDFs as Tagged PDFs. It only requires software that supports Tagged PDFs to adhere to the chapter in ISO 32000 that covers Tagged PDF. Further, making a Tagged PDF does not necessarily render it accessible.
Making PDFs work with assistive technologies (AT) was a bit of an afterthought. Tagged PDF was initially used to export to other formats like HTML or XML. The requirements for Tagged PDF leave much to interpretation. That means that software companies can determine which tags they want to include when exporting to a PDF. Worse, there’s no consensus for which tags matter most to AT. What we’ve ended up with is a bunch of PDFs that could have provided more semantic information to AT. But AT tends to ignore information that it doesn’t recognize because it conflicts with its own interpretation. It has basically boiled down to some really poor communication between developers and assistive technology companies in the past.
Not a solution, but a step in the right direction
PDF/UA was released to try to resolve this miscommunication. PDF/UA makes requirements for both the software that creates Tagged PDFs and the AT that reads them. PDF/UA is what we call a companion standard. A companion standard is a standard that’s used alongside other standards. Yet, believe it or not, saving a PDF file as PDF/UA doesn’t automatically make the PDF accessible.
PDF/UA primarily deals with making programmatically determined PDFs. PDF/UA places the responsibility for the content’s accessibility on the author. To make a PDF accessible, the document must be able to do a few things:
- The content that matters needs to be separate from the content that doesn’t;
- Content needs to be organized to reflect the visual representation;
- Relevant stuff needs to make enough sense to someone who isn’t looking at it;
- It needs to follow the rules of ISO 32000;
- Something needs to tell assistive technologies what to expect when reading it.
Putting it all together
When you create a PDF, the file format creates a vector reproduction of the file’s visual presentation. To convey the presentation to other technology, the PDF must include some logic to define what’s what. Tagging a PDF does that.
When assistive technologies access a Tagged PDF, they need to understand what’s going on in the document. They use something called a Document Object Model, or DOM, to understand it. In PDF, the DOM is the structure tree.
The authoring software decides the tags that are assigned to the PDF objects. The assignment of tags to objects can be (and usually must be) fine-tuned by an end user. Unfortunately, one person’s interpretation of how to tag a document can be different from someone else’s. We agree to a common interpretation by following standards. Standards represent an agreement on how to structure content.
The PDF/UA specification helps to define a common set of semantics for PDF. The Web Content Accessibility Guidelines provide guidance on how to describe content. Combined, those can go a long way toward helping you use the most semantic approach when you provide structure and meaning to PDFs.
In my next article, I’ll be covering how using PDF/UA and WCAG together will help to provide better accessibility for PDFs.