Regex Pattern to Find Illegal Characters in ePub

Unsatisfied with the tools I’ve come across, I decided to build the ePub file for “Upload” from scratch. I followed the detailed instructions in this very helpful article, and I ran into a problem when I got to the part where you’re supposed to “Package and check your EPUB”. When I ran the epubcheck tool (java -jar epubcheck-3.0b5.jar Upload.epub), I got this output:

Epubcheck Version 3.0b5

Validating against EPUB version 2.0
ERROR: Upload.epub: I/O error reading OEBPS/content.html
ERROR: Upload.epub/OEBPS/toc.ncx(60,45): 'Chapter7': fragment identifier is not
defined in 'OEBPS/content.html'

After doing a bit of searching, I discovered that the I/O error was probably due to a character that is not valid XML content. The subsequent “fragment identifier is not defined” error stemmed from the fact that the I/O error had stopped the epubcheck tool from reading past some illegal character in Chapter 6. Since all of my chapters are in one XHTML file, “content.html”, the tool never reached the Chapter 7 anchor that toc.ncx points to.
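One possible way to narrow the search, sketched in Python: if the I/O error comes from bytes that are not valid UTF-8 (an assumption on my part; epubcheck's message doesn't say so), then trying to decode the file yourself will point at the first offending byte. The function name `first_bad_utf8` is made up for this illustration.

```python
def first_bad_utf8(data: bytes):
    """Return (line, byte offset, byte value) of the first invalid UTF-8
    sequence, or None if the data decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # Count the newlines before the failure to get a 1-based line number.
        line_no = data.count(b"\n", 0, err.start) + 1
        return line_no, err.start, data[err.start]
    return None


# Byte 0x92 is how Windows-1252 stores a curly apostrophe; it is not valid UTF-8.
sample = b"<p>Fine.</p>\n<p>It\x92s broken.</p>\n"
print(first_bad_utf8(sample))
```

To try it on the real file, read it in binary mode, e.g. `first_bad_utf8(open("OEBPS/content.html", "rb").read())`. If this returns None, the culprit is something other than the encoding and the regex search described next is the better tool.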

The problem, then, was locating the illegal character in Chapter 6 that was causing the I/O error. I started with the brute-force approach: visually scan the content of Chapter 6 in search of something that seems wonky. This proved to be fairly tricky, and I realized that a simple regex search should uncover any characters outside the usual set of letters, numbers, and punctuation characters that one finds in a novel. I opened my content.html file in good old Textpad, searched for illegal characters using the regex pattern below, and quickly discovered that I had copied and pasted a “smart quote” apostrophe from Word. If you have additional punctuation, you may find that you have to tweak this pattern, but I hope it will at least get you pointed in the right direction:

[^-a-zA-Z <>"',./0-9:=\?&;#!()]

The basic idea is to look for any character that is NOT one of the common characters in U.S. English prose.
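If you don't have a regex-capable editor handy, the same search can be sketched in a few lines of Python. This is only an illustration: the pattern is the one above, unchanged, while the function name `find_suspects` and the sample text are invented for the example.

```python
import re

# The pattern from this post: flag anything NOT in the "normal" class.
SUSPECT = re.compile(r"""[^-a-zA-Z <>"',./0-9:=\?&;#!()]""")


def find_suspects(text):
    """Return (line number, column, character) for each flagged character."""
    hits = []
    # splitlines() drops the line endings, so newlines are never flagged.
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in SUSPECT.finditer(line):
            hits.append((line_no, match.start() + 1, match.group()))
    return hits


# Made-up sample: the first line contains a curly apostrophe (U+2019).
sample = '<p id="Chapter6">It\u2019s fine.</p>\n<p>All plain here.</p>'
for line_no, col, ch in find_suspects(sample):
    print(f"line {line_no}, col {col}: {ch!r} (U+{ord(ch):04X})")
```

Note that tabs are not in the class, so a tab-indented file will produce a lot of hits; adding `\t` inside the brackets is one way to quiet that down.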

In case you are not familiar with regular expressions, I will give you a VERY brief breakdown of that pattern.

First off, the square brackets define a class of characters. Typically, with square brackets, you’re telling the regex software to look for any character which falls within the defined class. For example, this simple class definition matches any lower-case letter:

[a-z]

However, that’s the white-list approach to searching: you’re defining the class of characters that you want to match. What we really want here is to define a class of characters that we want to skip over, so that we can find the illegal characters. This problem is easily solved by adding a caret (‘^’) as the first character in square brackets. For example, the pattern below will match any character which is not a lower-case letter:

[^a-z]

This black-list approach allows you to define the class of characters you don’t want to match on. If anything else comes along, match that instead.
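Here is a small Python sketch of the two classes side by side, in case seeing them run helps. The sample string is arbitrary.

```python
import re

text = "Upload, Chapter 6"

# White-list: match only the characters inside the class.
lower = re.findall(r"[a-z]", text)

# Black-list: the leading caret inverts the class, so everything else matches.
not_lower = re.findall(r"[^a-z]", text)

print("".join(lower))
print(not_lower)
```

The first search picks up only the lower-case letters; the second picks up the capitals, the comma, the spaces, and the digit.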

Extending that concept, I came up with what proved to be a sufficiently large class of characters to catch any character in my novel that wasn’t “normal”. Anything outside the “normal” class was probably something that XML doesn’t like, which would have to be replaced by either a “normal” character (e.g., a simple apostrophe instead of a smart-quotes apostrophe) or an XML character reference for that character (e.g., &#8217; for the smart-quotes apostrophe). (There’s a pretty good list of XML-encoded characters on Wikipedia.)
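The substitution step could be automated along these lines. Treat it as a rough sketch rather than what I actually did (I made my fixes by hand in Textpad): the `PLAIN` table, the `clean` function, and the decision to leave flagged ASCII characters alone are all assumptions made for the example.

```python
import re

# The pattern from this post: flag anything NOT in the "normal" class.
SUSPECT = re.compile(r"""[^-a-zA-Z <>"',./0-9:=\?&;#!()]""")

# Hypothetical table of characters to swap for a plain equivalent.
PLAIN = {"\u2018": "'", "\u2019": "'"}


def clean(text):
    """Replace flagged characters with a plain equivalent or a numeric
    character reference such as &#233;."""

    def fix(match):
        ch = match.group()
        if ord(ch) < 128:
            # Assumption: flagged ASCII (tabs, newlines, extra punctuation)
            # is left as it is.
            return ch
        return PLAIN.get(ch, f"&#{ord(ch)};")

    return SUSPECT.sub(fix, text)


print(clean("It\u2019s a caf\u00e9.\n"))
```

It is worth diffing the cleaned text against the original before rebuilding the ePub, since a blanket substitution can touch characters you meant to keep.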

So, with my handy-dandy regex pattern, I can search my novel for illegal characters, make substitutions as necessary, rebuild my ePub file, and then run it through epubcheck again.

Hope that helps!
