Archaeology

30 Oct, 2024

Men digging in the desert under a hot sun

Raiders of the Lost Ark (1981)

For a few years I’ve been using Pocket to save articles for research, and it had been working great. At the end of every week I could export a list of everything I’d saved and drop it in Dropbox.

All good.

Then in the last month or so they’ve disabled the export feature, so I need to find another solution. Preferably something immune to sudden disruption. Something under my control. Which ultimately means keeping everything locally.

And thinking about it, Pocket’s export is only a list of URLs, so if a website goes down or a link changes, I’m left with a dead link. So I really need something controllable, un-disruptable and local that retains the full text. Not surprisingly, given that this post exists, there are a number of tools that can accomplish this.

NOTE: When choosing what format to keep this research archive in, the obvious answers are html or plain text. Text is small and easily parsable, while html gives access to images (as long as the page stays up), and keeps the look and feel of the original page. I will decide whether that’s important to me later.

Tools

First, what I didn’t choose. I didn’t want to dive into the various self-hosted Pocket alternatives like ArchiveBox because they do too much. I don’t need tags or built-in search or even export. I also didn’t want to pick a browser addon like SingleFile, as they can’t be automated easily.

So what’s left? There’s wget and curl. Both tools have been around forever and are designed for this kind of use case. Another older tool is Pandoc, a document processor which can grab webpages and convert them into dozens of formats. Monolith is relatively new, but its main use is bundling entire pages (including images, media, css and JavaScript) into a single html file. Lastly, there’s readability-cli, an installable version of Firefox’s reader mode, which strips the cruft from a website and saves clean html. For testing purposes I’m using Shed, a little ease-of-use bash script I wrote which calls readability. These are all cross-platform, although Shed is a bit finicky to install on Windows.
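To give a feel for how these tools are driven, here is a minimal sketch that builds one fetch command per tool. The flags are standard ones for each program, but they are not necessarily the commands used in the tests below, and the function name and the tool labels (such as "pandoc-plain") are made up for illustration. Shed and readability-cli are left out because their invocation isn’t covered here.

```python
import shlex


def fetch_command(tool: str, url: str, out: str) -> list[str]:
    """Return an argv list that saves `url` to the file `out` using `tool`.

    The tool labels are hypothetical names chosen for this sketch.
    """
    commands = {
        # -L follows redirects, -s hides the progress meter, -o names the output file
        "curl": ["curl", "-L", "-s", "-o", out, url],
        # -q is quiet mode, -O writes the fetched document to the named file
        "wget": ["wget", "-q", "-O", out, url],
        # Pandoc accepts a URL as input; -s produces a standalone html document
        "pandoc": ["pandoc", "-s", "-f", "html", "-t", "html", "-o", out, url],
        # Same fetch, converted to plain text instead
        "pandoc-plain": ["pandoc", "-f", "html", "-t", "plain", "-o", out, url],
        # Monolith bundles the page and its assets into a single html file
        "monolith": ["monolith", url, "-o", out],
    }
    return commands[tool]


if __name__ == "__main__":
    # Print the commands rather than running them, so nothing is downloaded
    for tool in ("curl", "wget", "pandoc", "pandoc-plain", "monolith"):
        argv = fetch_command(tool, "https://example.com/", f"page-{tool}.out")
        print(shlex.join(argv))
```

Each list can be passed straight to `subprocess.run` once the tool in question is installed.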

Testing

When testing I didn’t try anything too fancy; I just used each tool to grab three random pages (from Wikipedia, Ars Technica and the BBC), and measured how long each took and how big the resulting files were.
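A measurement like this can be sketched in a few lines. This is not the harness used for the numbers below, just one way to time a command and check the size of the file it writes. The `measure` function is a hypothetical helper, and it assumes 1 KB = 1024 bytes, which may not match how the tables were calculated.

```python
import os
import subprocess
import sys
import time


def measure(argv: list[str], out_path: str) -> tuple[float, float]:
    """Run a command, then return (elapsed seconds, size of out_path in KB)."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)  # raises if the command fails
    elapsed = time.perf_counter() - start
    size_kb = os.path.getsize(out_path) / 1024  # assumption: 1 KB = 1024 bytes
    return elapsed, size_kb


if __name__ == "__main__":
    # Stand-in command so the sketch runs without network access:
    # it writes 2048 bytes to a file in place of a real page fetch.
    out = "sample.txt"
    argv = [sys.executable, "-c", f"open({out!r}, 'w').write('x' * 2048)"]
    secs, kb = measure(argv, out)
    print(f"{secs:.2f} s, {kb:.2f} KB")
    os.remove(out)
```

Swapping the stand-in for a real fetch command (wget, curl, pandoc and so on) gives one time and one size per page.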

Speed (secs) results

Tool	Wiki	BBC	Ars	Total
Curl	1.09	1.16	n/a	2.25
Monolith	0.89	1.36	2.03	4.28
Monolith Full	17.75	16.51	7.27	41.53
Pandoc	2.97	0.75	2.86	6.58
Pandoc Plain	2.66	0.61	4.56	7.83
Shed	24.48	10.37	15.11	49.97
Wget	1.35	3.96	3.06	8.37

Not surprisingly, the tools that do the most post-processing, Shed and Monolith Full, take the longest. There’s little difference among the others, with Monolith being slightly quicker. (Curl’s total covers only two pages, so it isn’t directly comparable.) Strangely, Curl failed to grab the full text of the Ars page, even on repeated tries. I’m inclined to blame Ars rather than Curl itself.

Update: curl grabbed the Ars page correctly when I tested it today.

Size (KBs) results

Tool	Wiki	BBC	Ars	Total
Curl	778.45	324.91	n/a	1,103.36
Monolith	814.00	152.73	56.81	1,023.54
Monolith Full	2,682.52	31,993.95	7,761.37	42,437.84
Pandoc	708.47	95.48	28.48	832.43
Pandoc Plain	252.31	18.00	5.36	275.67
Shed	231.44	18.80	2.88	253.12
Wget	778.45	324.91	113.00	1,216.36

The slowest tools, Monolith Full and Shed, resulted in the largest and smallest file sizes, respectively. The next smallest, Pandoc Plain, was about 6x quicker than Shed overall. The rest of the tools were much of a muchness, except Monolith Full, which output relatively enormous files, though you do get a perfect copy of the webpage with that.
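For a rough sense of scale, the totals from the two tables can be turned into ratios. The sketch below just copies the Total columns for a few tools. Curl is left out because its totals cover only two pages, and the helper name is invented for illustration.

```python
# Totals copied from the speed and size tables above
SPEED_TOTAL_SECS = {
    "Monolith Full": 41.53,
    "Pandoc Plain": 7.83,
    "Shed": 49.97,
    "Wget": 8.37,
}
SIZE_TOTAL_KB = {
    "Monolith Full": 42437.84,
    "Pandoc Plain": 275.67,
    "Shed": 253.12,
    "Wget": 1216.36,
}


def times_bigger(table: dict[str, float], a: str, b: str) -> float:
    """How many times larger tool a's total is than tool b's."""
    return table[a] / table[b]


if __name__ == "__main__":
    # Time: how much slower Shed is than Pandoc Plain across the three pages
    print(round(times_bigger(SPEED_TOTAL_SECS, "Shed", "Pandoc Plain"), 1))
    # Size: how much bigger Monolith Full's output is than Shed's
    print(round(times_bigger(SIZE_TOTAL_KB, "Monolith Full", "Shed"), 1))
```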

See the commands I tested with here.

Conclusion

Basically, for html, wget. It performs well, has been around forever, and it’s pre-installed on good OSs. But given the results and my use case, it’s Pandoc Plain. It’s fast enough and the files are tiny. You lose the “look” of the page but for the kind of research I do, it isn’t a big deal. Any images can be saved separately.

Now to figure out how to run my Pocket export through Pandoc…
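One possible shape for that, as a sketch only: it assumes the export has already been boiled down to a plain text file with one URL per line, which is a guess rather than Pocket’s actual export format. The function names, the `archive` folder and the file naming scheme are all made up for illustration.

```python
import re
import subprocess
import sys
from pathlib import Path
from urllib.parse import urlparse


def slug_for(url: str) -> str:
    """Build a filesystem-friendly name from a URL's host and path."""
    parts = urlparse(url)
    raw = f"{parts.netloc}{parts.path}"
    # Collapse every run of non-alphanumeric characters into a single hyphen
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()
    return slug or "page"


def archive(url_file: str, out_dir: str = "archive") -> None:
    """Save each URL listed in url_file as plain text using Pandoc."""
    Path(out_dir).mkdir(exist_ok=True)
    for line in Path(url_file).read_text().splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines
        target = Path(out_dir) / f"{slug_for(url)}.txt"
        result = subprocess.run(
            ["pandoc", "-f", "html", "-t", "plain", "-o", str(target), url]
        )
        if result.returncode != 0:
            print(f"failed: {url}")  # keep going so one dead link doesn't stop the run


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python archive.py urls.txt
    archive(sys.argv[1])
```

Two URLs that differ only in their query string would end up with the same file name here, so a real version would need something extra (a date or a short hash, say) to keep them apart.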

Current Status

Cat: Shedding

Writing: Much

Hands: A-tingle