Recovered: Hapaxes in the Ultraviolet Grasslands

At the beginning of the glossary of the Ultraviolet Grasslands (UVG), Luka asks: “What have I missed? What needs more details?” One way to find things that might be missing is to look for hapaxes (words that occur only once) in the work. This is not a good plan, but I tried anyway.

An early cover of a draft of UVG. Largely black with bold purple letters and two sketchy figures on the horizon. A magenta gradient forms the ground beneath them. Image by Luka Rejec, © 2019.

Process

The following was all done in bash. I assume some familiarity with the commands, but I comment on particular decisions I made. It could be cleaned up.

First, we need the corpus as text so that we can work with it:1

$ python3-pdf2txt.py -o UVG.txt UVG.pdf

Then we clean up the text, and select all the words that only appear once:

$ cat UVG.txt |\
    tr A-Z a-z |\
    sed -e 's/\s/\n/g' |\
    sed -E 's/[][<>.,();:+?!%/©&]//g' |\
    sed -e "s/[‘’]/'/g" |\
    sed -e 's/[“”"]//g' |\
    sed -e 's/[–—]/-/g' |\
    sed -e 's/[-"'\'']$//g' |\
    sed -e 's/^[-"'\'']//g' |\
    grep -Ev "^[-0-9'd]+$" |\
    sort | uniq -u > UVG.hapax

Line breaks have been added for clarity. Parts of this bear closer examination:

sed -e 's/[“”"]//g' |\

This could be folded into the second sed statement, but for some purposes it might be useful to keep double quotes and merely normalize them, so it gets a statement of its own.

sed -e 's/[-"'\'']$//g' |\
sed -e 's/^[-"'\'']//g' |\

Quotes and hyphens at the beginning or end of a word are unlikely to carry much information, so they are stripped. This must happen after all the dash and quote characters have been normalized.
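The actual hapax selection happens in the last line of the pipeline: once sort has grouped identical words together, uniq -u prints only the lines that occur exactly once, which is precisely the hapax condition. A toy example:

$ printf 'grass\nviolet\ngrass\nultra\n' | sort | uniq -u
ultra
violet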

Lots of the words that only appear once (6832 now) are not exciting. So we’ll remove all the dictionary words:

$ /bin/diff -i /usr/share/dict/words UVG.hapax |\
    grep ">" |\
    cut -d " " -f2 > UVG.hapax.new

Again, line breaks have been added for clarity. The full path to diff is specified because I’ve otherwise aliased diff to colordiff.
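As an aside, comm is arguably the more purpose-built tool for this kind of set subtraction. Something like the following should be equivalent (a sketch, untested against this corpus): comm has no -i flag, so the dictionary is lowercased by hand, and both inputs are re-sorted under the same collation, which comm requires.

$ comm -13 <(tr A-Z a-z < /usr/share/dict/words | LC_ALL=C sort -u) \
    <(LC_ALL=C sort UVG.hapax) > UVG.hapax.new

The -13 suppresses the lines unique to the dictionary and the lines common to both files, leaving only the words the dictionary doesn’t know.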

Results

With 1612 hapaxes now left, it might be interesting to see how their characters are distributed.

$ cat UVG.hapax.new | fold -c1 | sort | uniq -c | sort -gr

This gives a table of character frequency:

3223    
1647    e
1295    a
1155    i
1106    o
1097    r
1016    s
1010    n
 916    t
 877    l
 837    -
...
   4    3
   3     
   3    8
   2    ô
   2    ç
   2    9
   2    7
   1    Ö
   1    ñ
   1    ë
   1    â
The most common “character” is blank, and I suspect this is related to newlines (3223 = 2*1612 - 1). The other “blank” character appears to be a space that did not get stripped out initially, or which was later re-introduced. Perhaps it is some other kind of whitespace.

The most exciting thing in this table (I think) is the high occurrence of the hyphen. This means that roughly half of the “hapaxes” are likely composite words, and worth considering separately. For example:

  • sub-node
  • six-lives
  • noble-pillared
  • mercy-is-weakness
  • marrow-beet
  • curse-maddened
  • six-limbed
  • force-glass
  • stock-piled
  • self-regenerating

Disregarding the hyphens, these are all words a dictionary knows, but ones Luka may be using in novel ways.

The remaining (unhyphenated) words are a mixed bag. Take this random sampling:

  • pyrokinetic
  • skalin
  • psionics
  • dustland
  • irshe
  • replicator
  • 10x
  • visec
  • mearls
  • mirodar

Many of these just show the limitations of my dictionary (“pyrokinetic,” “replicator”). Some of them show the limitations of the process (“10x,” “jrientsblogspotcom”). Some are ad-hoc compound words (“dustland,” “malicereflective”). The rest are either made-up, proper nouns, or typos, and I don’t have a way to distinguish between them. It’s possible that some of these were “created” by pdf2txt, which uses tunable heuristics to decide where to draw word boundaries.
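If I remember pdfminer’s interface correctly, those heuristics can be tuned from the command line; for instance, -W sets the word margin that decides when two characters are far enough apart to count as separate words (the value below is purely illustrative):

$ python3-pdf2txt.py -W 0.2 -o UVG.txt UVG.pdf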

If you’re interested in playing with the lists, I’ve uploaded them here. They are split into “hyphen” and “nohyphen,” and should be alphabetical.

Hapaxes in Context

Not satisfied with the strength of these conclusions, I attempted to add context to the hapaxes through the use of some inadvisable perl scripts.

Process

Again, I’ll assume some familiarity with bash. I’m also using a pair of perl scripts I wrote which probably don’t generalize well, but which were fit for purpose. If I’d planned ahead, I’d have written them both as one script.

First, we’ll do some very similar things to what we did last time:

$ python3-pdf2txt.py -o UVG.txt UVG.pdf
$ cat UVG.txt |\
    tr A-Z a-z |\
    sed -E "s/\s+|['‘’]s\s+|[–—-]+/\n/g" |\
    sed -E 's/[][<>.,();:+?!%/©&“”"#*]//g' |\
    sed -e "s/^['’‘]//g" |\
    sed -e "s/['’‘]$//g" |\
    grep -Ev "^[0-9d]+$" |\
    sort | uniq -u > UVG.hapax
$ /bin/diff -i /usr/share/dict/words UVG.hapax |\
    grep ">" |\
    cut -d " " -f2 > UVG.hapax.new

As before, line breaks have been added for clarity. Also note that this time we remove the possessive “s” from the ends of strings, and we split on hyphens as well as spaces.
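To illustrate that new split expression (this assumes GNU sed, which the \s and \n extensions require):

$ echo "the wizard's six-lives" | sed -E "s/\s+|['‘’]s\s+|[–—-]+/\n/g"
the
wizard
six
lives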

Next, we go back to the PDF, but we extract it as XML to access the character position data:

$ python3-pdf2txt.py -t xml -o UVG.xml UVG.pdf
$ cat UVG.xml | ./xml2tsv.pl > UVG.tsv
$ cat UVG.hapax.new | ./pdfmarker.pl UVG.tsv > UVG.pdfmark

The scripts I mentioned earlier are xml2tsv.pl and pdfmarker.pl. The former (very naively) strips extraneous markup, leaving each line with a character, four coordinates, and a page number. The latter reads that tsv file (given as an argument) and locates the coordinates of each word piped to it. (As each word only appears once, this is straightforward.) It outputs these coordinates in pdfmark format, a way to annotate PDFs.
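Neither script appears in this post, so here is a minimal sketch of the xml2tsv.pl idea, not the script as I actually ran it. It assumes pdfminer’s XML output puts each character in a <text ... bbox="x0,y0,x1,y1"> element and each page in a <page id="N"> element:

#!/usr/bin/env perl
# Hypothetical sketch: emit character, x0, y0, x1, y1, page as tab-separated fields.
use strict;
use warnings;

my $page = 0;
while (my $line = <STDIN>) {
    # Track the current page number as <page> elements go by.
    $page = $1 if $line =~ /<page id="(\d+)"/;
    # Each printable character carries its own bounding box.
    if ($line =~ m{<text[^>]*bbox="([\d.]+),([\d.]+),([\d.]+),([\d.]+)"[^>]*>(.)</text>}) {
        print join("\t", $5, $1, $2, $3, $4, $page), "\n";
    }
}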

Finally, we merge the annotations back into the original PDF as highlights:

$ gs -o UVG.ann.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress UVG.pdf UVG.pdfmark
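For reference, each entry in UVG.pdfmark is a PostScript pdfmark operator that gs interprets while rewriting the document. A single highlight looks roughly like this (coordinates and page number hypothetical; /SrcPg is Ghostscript’s extension for addressing a page):

[ /Rect [ 108 550 180 565 ]
  /SrcPg 12
  /Subtype /Highlight
  /QuadPoints [ 108 565 180 565 108 550 180 550 ]
  /Contents (rewatch)
  /ANN pdfmark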

Results

This method is much pickier, in part because of the difficulty of parsing a PDF consistently. From the text file, we extracted 164 words of interest, but there are only 139 annotations in the final count. I expect that the difference is words that are represented differently between the txt and xml formats. For example, if the space between two words isn’t represented by a whitespace character in the xml, it is not detected as a word boundary when we look for it. But the heuristics that build the text output may still correctly “add” the space back in. This method also considers each half of a hyphenated word separately, so the halves are more likely to appear multiple times or to be in the dictionary.

These numbers are smaller than before for another reason as well: I have been using the free sample version, so that I can share the results. That version is 78 pages, down from the 158 pages of the backer version I was using before. So while we can still get a list of the output as before:

  • rewatch
  • pusca
  • eskatin
  • ashwhite
  • demiwarlock
  • vidy
  • engobes
  • tollmistress
  • dejus
  • orangeware

We can also then go find where these words are highlighted in the PDF:

“rewatch” in “Makes for great rewatch value!”

“Pusca” in “Yuan di Pusca.”

“Eskatin” in “Tri Eskatin,”

In the highlighted PDF, it’s easier to see that the majority of the hapaxes are proper names and normal words that my dictionary doesn’t contain, like “lunchbox” and “calcinous.” There are still lots of gems though, like a sign that reads “No Lones to Adventerers, Frybooters or Wagonbonds,” the goddess “Hazmaat,” and a “zombastodon lair.” You can take a look here:2

A smaller version of the cover image from the top of the post. Here! Click me!

“Hapaxes in the Ultraviolet Grasslands” was first shared on March 19, 2019, and “Hapaxes in Context” on April 7, 2019, both during the initial Kickstarter campaign for UVG. The crowdfunding campaign is over, but Luka’s Patreon continues. The initial post was received quite flatteringly.


  1. python3-pdf2txt.py is provided by PDFMiner. The specific name looks like it may have since changed to pdf2txt.py.↩︎

  2. The free version of the PDF (available unannotated in the Kickstarter description) is licensed under a CC BY-NC-ND 4.0 license. Arguably, because all the changes I have made to it were procedural, this may still comply with the “NoDerivatives” part of that license. But I don’t actually know, so I went ahead and asked Luka, and he said this was OK.↩︎



Date
November 3, 2023


