Recovered: Hapaxes in the Ultraviolet Grasslands
At the beginning of the glossary of the Ultraviolet Grasslands (UVG), Luka asks: “What have I missed? What needs more details?” One way to find things that might be missing is to look for hapaxes in the work. This is not a good plan, but I tried anyway.
Process
The following was done in bash. I assume some familiarity with the commands, but I comment on the particular decisions I made. It could be cleaned up.
First, we need the corpus as text so that we can work with it:1
$ python3-pdf2txt.py -o UVG.txt UVG.pdf
Then we clean up the text, and select all the words that only appear once:
$ cat UVG.txt |\
tr A-Z a-z |\
sed -e 's/\s/\n/g' |\
sed -E 's/[][<>.,();:+?!%/©&]//g' |\
sed -e "s/[‘’]/'/g" |\
sed -e 's/[“”"]//g' |\
sed -e 's/[–—]/-/g' |\
sed -e 's/[-"'\'']$//g' |\
sed -e 's/^[-"'\'']//g' |\
grep -Ev "^[-0-9'd]+$" |\
sort | uniq -u > UVG.hapax
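The core trick here is the final sort | uniq -u, which keeps only lines that occur exactly once. A toy illustration (input text invented):

```shell
# Lowercase, split into one word per line, then keep only the
# words that appear exactly once.
printf 'The cat saw the dog\n' |
tr 'A-Z' 'a-z' |
tr ' ' '\n' |
sort | uniq -u
# "the" appears twice and is dropped; cat, dog, saw survive
```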
Line breaks have been added for clarity. Parts of this bear closer examination:
sed -e 's/[“”"]//g' |\
This could be folded into the second sed statement, but keeping it separate makes it easy to normalize double quotes instead of stripping them, should that ever be useful.
sed -e 's/[-"'\'']$//g' |\
sed -e 's/^[-"'\'']//g' |\
Quotes and hyphens at the beginning or end of a word are unlikely to carry much information, so they are stripped. This must happen after all the dash and quote characters have been “normalized.”
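As a toy example of this edge stripping (the word is invented; note that interior hyphens are untouched):

```shell
# Strip one quote or hyphen from each end of a word; interior
# hyphens are kept. Mirrors the anchored sed expressions above.
printf '%s\n' "-six-lives'" |
sed -e "s/[-\"']\$//g" -e "s/^[-\"']//g"
# yields: six-lives
```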
Lots of the words that only appear once (6832 now) are not exciting. So we’ll remove all the dictionary words:
$ /bin/diff -i /usr/share/dict/words UVG.hapax |\
grep ">" |\
cut -d " " -f2 > UVG.hapax.new
Again, line breaks have been added for clarity. The full path to diff is specified because I’ve otherwise aliased diff to colordiff.
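The same set difference could also be sketched with comm(1) instead of diff. comm has no case-insensitive flag, so the dictionary would need to be lowercased first (this variant is untested against the real corpus; filenames follow the rest of this post):

```shell
# Hypothetical alternative to the diff|grep|cut pipeline.
# comm -13 keeps only lines unique to the second input; both
# inputs must be sorted with the same collation.
tr 'A-Z' 'a-z' < /usr/share/dict/words | sort -u > dict.lower
sort UVG.hapax > hapax.sorted
comm -13 dict.lower hapax.sorted > UVG.hapax.new
```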
Results
Of the 1612 hapaxes now left, it might be interesting to see how the characters are distributed.
$ cat UVG.hapax.new | fold -c1 | sort | uniq -c | sort -gr
This gives a table of character frequency:
3223
1647 e
1295 a
1155 i
1106 o
1097 r
1016 s
1010 n
916 t
877 l
837 -
…
4 3
3
3 8
2 ô
2 ç
2 9
2 7
1 Ö
1 ñ
1 ë
1 â
The most common “character” is blank, and I suspect this is related to newlines (3223=2*1612-1). The other “blank” character appears to be a space that did not get stripped out initially, or which was later re-introduced. Perhaps it is some kind of other whitespace.
The most exciting thing in this table (I think) is the high occurrence of the hyphen. This means that roughly half of the “hapaxes” are likely composite words, and worth considering separately. For example:
- sub-node
- six-lives
- noble-pillared
- mercy-is-weakness
- marrow-beet
- curse-maddened
- six-limbed
- force-glass
- stock-piled
- self-regenerating
Disregarding hyphens, these are all words a dictionary knows, but which Luka may be using in novel ways.
The remaining (unhyphenated) words are a mixed bag. Take this random sampling:
- pyrokinetic
- skalin
- psionics
- dustland
- irshe
- replicator
- 10x
- visec
- mearls
- mirodar
Many of these just show the limitations of my dictionary (“pyrokinetic,” “replicator”). Some of them show the limitations of the process (“10x,” “jrientsblogspotcom”). Some are ad-hoc compound words (“dustland,” “malicereflective”). The rest are either made-up words, proper nouns, or typos, and I don’t have a way to distinguish between them. It’s possible that some of these were “created” by pdf2txt, which uses tunable heuristics to decide where to draw word boundaries.
If you’re interested in playing with the lists, I’ve uploaded them here. They are split into “hyphen” and “nohyphen,” and should be alphabetical.
Hapaxes in Context
Not satisfied with the strength of these conclusions, I attempted to add context to the hapaxes through the use of some inadvisable perl scripts.
Process
Again, I’ll assume some familiarity with bash. I’m also using a pair of perl scripts I wrote which probably don’t generalize well, but which were fit for purpose. If I’d planned ahead, I’d have written them both as one script.
First, we’ll do some very similar things to what we did last time:
$ python3-pdf2txt.py -o UVG.txt UVG.pdf
$ cat UVG.txt |\
tr A-Z a-z |\
sed -E "s/\s+|['‘’]s\s+|[–—-]+/\n/g" |\
sed -E 's/[][<>.,();:+?!%/©&“”"#*]//g' |\
sed -e "s/^['’‘]//g" |\
sed -e "s/['’‘]$//g" |\
grep -Ev "^[0-9d]+$" |\
sort | uniq -u > UVG.hapax
$ /bin/diff -i /usr/share/dict/words UVG.hapax |\
grep ">" |\
cut -d " " -f2 > UVG.hapax.new
As before, linebreaks have been added for clarity. Also note that this time we remove the possessive “s” from the ends of strings, and we split on hyphens as well as spaces.
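A toy check of that splitting step (input invented; this assumes GNU sed, where \s in the pattern and \n in the replacement behave as shown):

```shell
# Split on whitespace, possessive 's, and runs of hyphens/dashes,
# each becoming a line break.
printf '%s\n' "luka's six-lives" |
sed -E "s/\s+|['‘’]s\s+|[–—-]+/\n/g"
# yields: luka / six / lives, one per line
```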
Next, we go back to the PDF, but we extract it as XML to access the character position data:
$ python3-pdf2txt.py -t xml -o UVG.xml UVG.pdf
$ cat UVG.xml | ./xml2tsv.pl > UVG.tsv
$ cat UVG.hapax.new | ./pdfmarker.pl UVG.tsv > UVG.pdfmark
The scripts I mentioned earlier are xml2tsv.pl and pdfmarker.pl. The former (very naively) strips extraneous markup, leaving each line with a character, four coordinates, and a page number. The latter reads that tsv file (given as an argument) and locates the coordinates of each word piped to it. (As each word only appears once, this is straightforward.) It outputs these coordinates in pdfmark format, a way to annotate PDFs.
Finally, we merge the annotations back into the original PDF as highlights:
$ gs -o UVG.ann.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress UVG.pdf UVG.pdfmark
Results
This method is much pickier, in part because of the difficulty of parsing a PDF consistently. From the text file, we extracted 164 words of interest, but there are only 139 annotations in the final count. I expect that the difference is words that are represented differently between the txt and xml formats. For example, if the space between two words isn’t represented by a whitespace character in the xml, it does not detect as a word boundary when we look for it. But the heuristics that build the text output may still correctly “add” the space back in. This method also considers each half of a hyphenated word separately, so they are more likely to appear multiple times or to be in the dictionary.
These numbers are smaller than before for another reason as well: I have been using the free sample version, so that I can share the results. That version is 78 pages, down from the 158 pages of the backer version I was using before. We can still get a list of the output as before:
- rewatch
- pusca
- eskatin
- ashwhite
- demiwarlock
- vidy
- engobes
- tollmistress
- dejus
- orangeware
We can then also go and find where these words are highlighted in the PDF.
In the highlighted PDF, it’s easier to see that the majority of the hapaxes are proper names and normal words that my dictionary doesn’t contain, like “lunchbox” and “calcinous”. There are still lots of gems though, like a sign that reads “No Lones to Adventerers, Frybooters or Wagonbonds,” “the goddess Hazmaat,” and “zombastodon lair.” You can take a look here:2
“Hapaxes in the Ultraviolet Grasslands” was first shared on March 19, 2019, and “Hapaxes in Context” on April 7, 2019, both during the initial Kickstarter campaign for UVG. The crowdfunding campaign is over, but Luka’s Patreon continues. The initial post was received quite flatteringly.
1. python3-pdf2txt.py is provided by PDFMiner. The specific name looks like it may have since changed to pdf2txt.py.↩︎
2. The free version of the PDF (available unannotated in the Kickstarter description) is licensed under a CC BY-NC-ND 4.0 license. Arguably, because all the changes I have made to it were procedural, this may still comply with the “NoDerivatives” part of that license. But I don’t actually know, so I went ahead and asked Luka, and he said this was OK.↩︎