More Erroneous Pictures of Whales

Recently, Project Gutenberg added a new feature. Can you spot it? Let’s look at Moby-Dick:

A screenshot of the Project Gutenberg page for Moby-Dick (no. 15, of the several versions they host). Captured 20 Aug 2025.

Summaries! But there’s something off about them…

“Moby-Dick; or, The Whale” by Herman Melville

This part is redundant, appearing immediately below the same information.

is a novel

This information also appears elsewhere on the page under “Similar Books,” as “In Category: Novels.” But that isn’t the most obvious place for it, and less popular books are less consistently tagged. It could possibly be inferred from other factors like length and genre, given a dedicated tool. However, of the whole summary, this classification (“novel,” “essay,” “poetry collection,” “almanac,” etc.) is probably the most useful information.

written in the mid-19th century.

This is less than helpful; it’s a loss of knowledge. For most books we can’t know when they were written (although in this case, we know that it was written between February 1850 and October 1851), but in almost every case, we should know when they were published (October 1851). This is surely information that Project Gutenberg already has, but unfortunately it isn’t currently exposed.

The book explores themes of obsession, vengeance, and humanity’s relationship with nature through the experiences of its central character, Ishmael,

This is not at all what I want from Project Gutenberg. I don’t want a Cliff’s Notes study guide. I don’t even want a carefully synthesized survey of expert analyses. I want the facts about the book, not its contents: Who wrote it? What kind of thing is it? When and where was it published? Why am I looking at it? How did it get here?

A long-time strength of Project Gutenberg has been its sometimes bull-headed impartiality. They offer books that are factually incorrect, racist, bizarre, and propagandistic, all alongside books that are classic, forgotten, delightful, and unremarkable. They offer no judgment. Their inclusion criteria are only “it’s in the public domain and someone thought it worth the effort of preserving.” This summary unduly elevates Project Gutenberg itself to the role of a salesperson or reviewer, inserted awkwardly between myself and my books.

who embarks on a whaling voyage aboard the Pequod, captained by the enigmatic and vengeful Ahab.

This is fine. If we stop here, and excise the previous interpretive segment, I think it would be an OK summary (with one massive caveat, see below). A summary like that would serve to orient you to the book and provide the barest, most minimal means to recognize the book: a few proper nouns (Ishmael, Pequod, Ahab) and a couple big ideas (whaling, vengeance). (I thought it reasonable to include vengeance here as a plot element, not merely as a theme, especially since it appears in the sentence twice already.)

The opening of “Moby-Dick” introduces Ishmael, who shares his existential musings and the reasons for his desire to go to sea.

I question the utility of the remaining point-by-point breakdown of the story, which continues below the fold, and which gets so caught up in details that it effectively elides the bulk of the book.

He portrays the bustling port city of New Bedford, highlighting the magnetic pull of the ocean on the hearts of men. As he prepares for his journey, Ishmael reflects on his own internal struggles and motivations, ultimately leading him to desire adventure in the whaling industry. He arrives in New Bedford, confronts the challenges of finding a place to stay, and has a rather amusing encounter with the landlord and an unexpected harpooneer, setting the stage for his subsequent adventures at sea.

This isn’t technically inaccurate, but it’s very weird. Proportionally, New Bedford is only about a tenth of the book. From this text alone you could almost conclude that Ishmael doesn’t actually go to sea, leaving you searching for Moby-Dick 2: Ishmael on the Waves. It also doesn’t even name Queequeg, leaving him equal in importance to the landlord, another strange choice.

(This is an automatically generated summary.)

But now I’m mad! I can’t even trust it! I’m sure the LLM[1] has been fed enough about Moby-Dick and other classics to get by, but this explains the imprecision and the waffling and the 8th-grade-summer-book-report voice. How can an LLM know anything external to the text? How can I trust it to faithfully repeat the presented facts internal to it? Without more details about the process, I can’t even say how much might be correct. And it feels like a betrayal to find that admission hidden below the fold, after I’ve already read it.

But you can ignore it

First of all, I can’t. My brain sees a little chunk of text at the top of a page and it starts reading. Every time. I’m at least thru the first sentence before I catch myself and remember not to trust anything I just read. The experience of reading, noticing, and unlearning repeatedly in short bursts, on every page while browsing is viscerally unpleasant to me. I enjoy the website less because of it.

But also, it’s everywhere. I know enough about Moby-Dick to critique and catch errors, but what happens when I’m browsing a less familiar work? I have to scroll down to the “Subject” tags and “Similar Books” (when there are any) to have any idea what I’m looking at. And those are often incomplete or over-specific. The automatic summary serves only to make that scroll longer.

Why blurb?

I can see some motivation for this. As I mentioned, that single one-line type descriptor (“pulp short story,” “Gothic romance novel,” etc.) would be incredibly useful to put at the top, if it were trustworthy. I have been stung before when I downloaded a book based on its title, only to learn it was something else completely.

The one-sentence overview that could be constructed from the longer summary could also be worthwhile, or at least inoffensive:

The central character, Ishmael, embarks on a whaling voyage aboard the Pequod, captained by the enigmatic and vengeful Ahab.

Keeping the summary so short leaves no room to meander, to digress, to focus in and out. But it’s still enough to recognize a book or to mentally make some note about it. I stress again, though: only if it were trustworthy.

In principle, a blurb of this sort can be used to create excitement or curiosity, to encourage click-through. So far, none of these automated summaries have generated such a response in me, but maybe they do for someone else.

Project Gutenberg’s official reason for the blurbs is that the “landing pages are modeled after typical online bookstore experiences,” which generally do have some kind of overview or prominent summary. Superficially, this may sound poorly considered. After all, what purpose does a summary serve in a typical bookstore, and shouldn’t we aim at that instead? But I find myself surprisingly sympathetic to it. Project Gutenberg is at once the largest and most available source of public domain texts, and also every e-reader’s most-tertiary supported bookstore, with an early-aughts web design that implicitly suggests the hoops you’re about to jump thru. The landing page is their chance to put on a friendly face, something with a designed typeface and a chosen color palette.

What now?

Well, the first thing I did was complain. I tried to stay reasoned and polite, and I actually got a really reassuring response. Apparently, while some people have complained about the use of AI and ML tools in general, I was the first to complain about the summaries specifically. If you have concerns similar to mine, I would say that it is worth the time to briefly and kindly articulate them, keeping in mind that the team is small and overworked.

But I’m not a total Luddite (I am an automation engineer, alas), so I’m also left thinking about what I would prefer instead. The first and simplest thing would be to elevate the information that Project Gutenberg already has, starting with publication date and word count, which currently aren’t exposed at all in the “About this eBook” table. Simpler and more robust algorithms exist for automated tagging, which could be useful here, but just as easily might not be, given the oddities of the current “Subject” tags. “Reading Categories” are already assigned in some automated fashion, but I find them not quite granular or universally applied enough to be reliably helpful. On a project designed from scratch I might suggest a mechanism for user-suggested tags, but I don’t know nearly enough about the backend to say what challenges that implementation might involve.
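To make the word-count idea concrete, here is a minimal sketch of how a raw count could feed a coarse length descriptor for fiction. Everything in it is an assumption for illustration: the function names are hypothetical, the bands are borrowed from the Nebula Awards fiction categories rather than anything Project Gutenberg uses, and counting whitespace-separated tokens is only a rough proxy for a real word count.

```python
# Upper bounds (exclusive) for each band, borrowed from the Nebula Awards
# fiction categories. This is an illustrative assumption, not a Project
# Gutenberg rule, and it only makes sense for prose fiction.
LENGTH_BANDS = [
    (7_500, "short story"),
    (17_500, "novelette"),
    (40_000, "novella"),
]


def count_words(text: str) -> int:
    """Rough word count: the number of whitespace-separated tokens."""
    return len(text.split())


def length_label(word_count: int) -> str:
    """Map a word count to a coarse length descriptor."""
    for upper_bound, label in LENGTH_BANDS:
        if word_count < upper_bound:
            return label
    return "novel"


if __name__ == "__main__":
    sample = "Call me Ishmael. Some years ago"
    print(count_words(sample), length_label(count_words(sample)))
    print(length_label(21_000))
```

A label like this says nothing about whether a text is fiction at all (a dictionary is not a novel), so at best it would complement the existing tags rather than replace them.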

Finally, if we want to keep them, what can we do about summaries? As outlined above, my ideal summary is informative, distinctive, engaging, neutral, brief, and friendly. That’s a tall order.

Standard Ebooks appears to write them by hand. Consider their Moby Dick:

“Call me Ishmael” says Moby Dick’s protagonist, and with this famous first line launches one of the acclaimed great American novels. Part adventure story, part quest for vengeance, part biological textbook, and part whaling manual, Moby Dick was first published in 1851. The story follows Ishmael as he abandons his humdrum life on shore for an adventure on the waves. Finding the whaler Pequod at harbour in Nantucket, he signs up for a three year term without meeting the Captain of the ship, a mysterious figure called Ahab. It’s only well into the voyage that Ahab’s thirst for vengeance against the eponymous white whale Moby Dick—and the consequences—become clear. […]

It then continues with biographical notes and a little bit of publication history. This is obviously labor- and research-intensive, and doesn’t scale.

For better-known books, Wikipedia is a reliable source of context. Consider the opening of their Moby-Dick article:

Moby-Dick; or, The Whale is an 1851 epic novel by American writer Herman Melville. The book is centered on the sailor Ishmael’s narrative of the maniacal quest of Ahab, captain of the whaling ship Pequod, for vengeance against Moby Dick, the giant white sperm whale that bit off his leg on the ship’s previous voyage. A contribution to the literature of the American Renaissance, Moby-Dick was published to mixed reviews, was a commercial failure, and was out of print at the time of the author’s death in 1891. Its reputation as a Great American Novel was established only in the 20th century, after the 1919 centennial of its author’s birth. William Faulkner said he wished he had written the book himself, and D. H. Lawrence called it “one of the strangest and most wonderful books in the world” and “the greatest book of the sea ever written”. Its opening sentence, “Call me Ishmael”, is among world literature’s most famous.

While the prose is free of speculative interpretation and dense with the simple facts I really want (none of the LLM summaries seem to have hit on “epic novel,” the year is right up front, and it even gives the author’s nationality!), it doesn’t exactly inspire a need to know more. Still, where available, a direct link to Wikipedia could provide useful context.

Commercial bookstores pay for copy. But many books in Project Gutenberg were once commercial endeavors themselves. I would not be opposed to contemporaneous copy being pulled to the top. Consider Witch of the Demon Seas by Poul Anderson. Currently, the Project Gutenberg summary reads:

“Witch of the Demon Seas” by A. A. Craig[2] is a fantasy novel[3] written in the early 1950s.[4] The story revolves around Corun, a pirate condemned to death who finds himself entwined with powerful sorcery and an ambitious witch named Chryseis. Together with a sorcerer and a formidable crew, Corun embarks on a perilous quest to harness the powers of the elusive Xanthi, the Sea Demons, while facing betrayal and intrigue that could change the fate of kingdoms. The beginning of the novel introduces Corun, a proud pirate captured by King Khroman and facing execution, when he is offered a chance at life by the sorcerous duo Shorzon and Chryseis. […]

It’s a bit leaden and it struggles with the basic facts. But the blurb from Planet Stories is included in the front matter of the text itself:

Guide a black galleon to the lost, fear-haunted Citadel of the Xanthi wizards—into the very jaws of Doom? Corun, condemned pirate of Conahur, laughed. Aye, he’d do it, and gladly. It would mean a reprieve from the headsman’s axe—a few more precious moments of life and love … though his lover be a witch!

If your goal is to build excitement or to look more polished, it would be hard to top that. And it minimizes the imposition on the reader: this is not an interpretation, this is historical context.

That’s not too far off from the paperback practice of previewing a dramatic passage out of context before even the front matter, which honestly could also be useful, if perhaps awkward for more varied book types. (Imagining here a dictionary excerpt, for example.) Still, some kind of sentiment analysis could be adapted to select a short “exciting” or “characteristic” passage, if only as a first pass.

An “ideal solution” to this problem, if we admit that one has value, would likely be some combination of these options, probably with a crowd-sourced component that I’m not sure the current Project Gutenberg system can easily accommodate. None of these ideas alone can match the breadth of works on Project Gutenberg, and all require effort, judgment, and transparency. I fear the current solution requires none but offers the appearance of all three.


  1. I am assuming an LLM here. There are other, older machine-learning summarization techniques with their own issues, but not the ones we see here.

  2. The pen name of Poul Anderson.

  3. At ~21,000 words, perhaps a “novella” by modern standards, although it calls itself “a novel of alien sorcery.”

  4. Published 1951, by the magazine cover date.


Tags
Essays AI

Date
September 4, 2025

