Pythonesque
I know a little about coding. Enough to sometimes brute force a small quality-of-life tweak into someone else’s code, or write a janky-as-heck shell script. But not enough to build anything substantial from the ground up.
So to fill this knowledge hole, I’ve been reading a few introductions to Python. And while learning in the abstract is a start, I didn’t have a project.
Until my research workflow fell over and I had to spend a couple afternoons in vim writing a replacement: a simple wrapper around Pandoc to make archiving webpages easier. I called it Jones. Because he was into archeology. And research is a kind of archeology. Anyway.
I’m putting the full code here as the internet equivalent of sticking it to my fridge. That way I can smile at its crudeness whenever I put away the milk. It’s also on GitHub. Yeah, I’ve got one of those.
#!/usr/bin/env python3
# research archiving with pandoc
import re
import subprocess
import sys
from pathlib import Path

import bs4
import requests

# fail if no url (message goes to stderr, exit status is 1)
if len(sys.argv) < 2:
    sys.exit("no url")

# path setup: make sure the archive directory exists
archive_path = Path.home() / "ark"
if not archive_path.is_dir():
    archive_path.mkdir()
    print("archive path created")

print("Getting page info...")
# get url
url = sys.argv[1]
res = requests.get(url)
# get title
soup = bs4.BeautifulSoup(res.text, "html.parser")
page_title = soup.title.string
# format the title into a filename (name_of_article.txt)
word_title = re.sub(r"\W", "_", page_title)       # non-word chars -> underscores
short_title = word_title[:36]                     # trim to a sane length
dedup_title = re.sub(r"_{2,}", "_", short_title)  # collapse repeated underscores
clip_title = dedup_title.rstrip("_")              # drop any trailing underscore
actual_title = clip_title.lower() + ".txt"

# build destination path
destination = archive_path / actual_title
# convert the page to plain text with pandoc
print("Running Pandoc...")
subprocess.run(["pandoc", "-f", "html", "-t", "plain", url, "-o", str(destination)], check=True)

# append the url to the archive catalog
catalog_path = archive_path / "ark.catalog"
with open(catalog_path, "a") as file:
    file.write(url + "\n")
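To use it, I save the script on my PATH as jones (that’s also how I call it in the bonus below) and hand it a URL. Made-up address, obviously:

jones "https://example.com/interesting-article"

That drops a plain-text copy of the page into ~/ark and jots the URL down in ~/ark/ark.catalog.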
Bonus: Un-Pocket
Last time, I wanted to convert the list of links exported from Pocket into full-text copies. That turned out to be pretty easy: I just needed htmlq. Then, to get an unformatted list of links, I ran this in Git Bash on Windows:
cat export.html | htmlq --attribute href a | sed '/^http/!d' > pocket.txt
And ran the list through Jones:
while IFS='' read -r line; do
    jones "$line"
done < pocket.txt
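For the record, xargs would’ve made that even shorter. Same effect, one jones run per link, assuming none of the links contain spaces (valid URLs shouldn’t):

xargs -n1 jones < pocket.txt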