# WIP: Software Hashes

**Branch:** `feature/software-hashes`
**Started:** 2026-02-17
**Status:** Complete

## Plan

Implements [docs/plans/software-hashes.md](software-hashes.md): a derived `software_hashes` table storing MD5, CRC32 and size for tape-image contents extracted from download zips.

### Tasks

- [x] Create `data/zxdb/` directory (for JSON snapshot)
- [x] Add `software_hashes` Drizzle schema model
- [x] Create `bin/update-software-hashes.mjs` (main pipeline script)
  - [x] DB query for tape-image downloads (filetype_id IN 8, 22)
  - [x] Resolve local zip path via CDN mapping (uses CDN_CACHE env var)
  - [x] Extract `_CONTENTS` (skip if exists)
  - [x] Find tape file (.tap/.tzx/.pzx/.csw) with priority order
  - [x] Compute MD5, CRC32, size_bytes
  - [x] Upsert into software_hashes
  - [x] State file for resume support
  - [x] JSON export after bulk update (atomic write)
- [x] Update `bin/import_mysql.sh` to reimport snapshot on DB wipe
- [x] Add pnpm script entries

## Progress Log

### 2026-02-17T16:00Z

- Started work. Branch created from `main` at `b361201`.
- Explored codebase: understood DB schema, CDN mapping, import pipeline.
- Key findings:
  - filetype_id 8 = "Tape image" (33,427 rows), 22 = "BUGFIX tape image" (98 rows)
  - CDN_CACHE = /Volumes/McFiver/CDN, paths: SC/ (zxdb) and WoS/ (pub)
  - `_CONTENTS` dirs exist in WoS but not yet in SC
  - data/zxdb/ directory needs creation
  - import_mysql.sh needs software_hashes reimport step

### 2026-02-17T16:04Z

- Implemented Drizzle schema model for `software_hashes`.
- Created `bin/update-software-hashes.mjs` pipeline script.
- Updated `bin/import_mysql.sh` with JSON snapshot reimport.
- Added `update:hashes` and `export:hashes` pnpm scripts.

### 2026-02-17T16:09Z

- First full run completed successfully:
  - 33,525 total tape-image downloads in DB
  - 32,305 rows hashed and inserted into software_hashes
  - ~1,220 skipped (missing local zips, `/denied/` prefix, `.p` ZX81 files with no tape content)
  - JSON snapshot exported: 7.2MB, 32,305 rows at `data/zxdb/software_hashes.json`
- All plan steps verified working.

## Decisions & Notes

- Target filetype IDs: 8 and 22 (tape image + bugfix tape image).
- Tape file priority: .tap > .tzx > .pzx > .csw (most common first).
- CDN_CACHE comes from env var (not hard-coded, unlike sync-downloads.mjs).
- JSON snapshot at data/zxdb/software_hashes.json (7.2MB, committed to repo).
- Node.js built-in `crypto` for MD5; custom CRC32 lookup table (no external deps).
- `inner_path` column added (not in original plan) to record which file inside the zip was hashed.
- `/denied/` and `/nvg/` prefix downloads (~443) are logged and skipped (no local mirror).
- `.p` files (ZX81 programs) are categorized as tape images but contain no .tap/.tzx/.pzx/.csw; logged as "no tape file".
- Uses system `unzip` for extraction (handles bracket-heavy filenames via `execFile` not shell).

## Blockers

None.

## Commits

- b361201 - Ready to start adding hashes
- 944a2dc - wip: start feature/software-hashes — init progress tracker
- f5ae89e - feat: add software_hashes table schema and reimport pipeline
- edc937a - feat: add update-software-hashes.mjs pipeline script
- 9bfebc1 - feat: add initial software_hashes JSON snapshot (32,305 rows)
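
For reference, the tape-file priority and the dependency-free CRC32 noted under Decisions & Notes could look roughly like the sketch below. This is not the code in `bin/update-software-hashes.mjs`. The names `TAPE_EXTENSIONS`, `pickTapeFile`, `CRC32_TABLE` and `crc32` are hypothetical, and case-insensitive extension matching is an assumption. The CRC32 shown is the standard reflected variant (polynomial `0xEDB88320`) used by zip tools.

```javascript
// Hypothetical names throughout; a sketch, not the pipeline's actual code.

// Priority order from the notes: .tap > .tzx > .pzx > .csw
const TAPE_EXTENSIONS = [".tap", ".tzx", ".pzx", ".csw"];

// Return the first file name matching the highest-priority extension,
// or null when the archive holds no tape file (e.g. ZX81 `.p` programs).
// Case-insensitive matching is an assumption.
function pickTapeFile(names) {
  for (const ext of TAPE_EXTENSIONS) {
    const match = names.find((name) => name.toLowerCase().endsWith(ext));
    if (match) return match;
  }
  return null;
}

// Build the 256-entry lookup table once (reflected polynomial 0xEDB88320).
const CRC32_TABLE = new Uint32Array(256);
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let k = 0; k < 8; k++) {
    c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
  }
  CRC32_TABLE[n] = c >>> 0;
}

// Compute CRC32 over a Uint8Array or Buffer, returned as an unsigned 32-bit integer.
function crc32(bytes) {
  let crc = 0xffffffff;
  for (const b of bytes) {
    crc = CRC32_TABLE[(crc ^ b) & 0xff] ^ (crc >>> 8);
  }
  return (crc ^ 0xffffffff) >>> 0;
}

// Usage: "123456789" is the conventional CRC32 check input.
const sample = new TextEncoder().encode("123456789");
console.log(crc32(sample).toString(16).padStart(8, "0")); // cbf43926
console.log(pickTapeFile(["README.txt", "Game.TZX", "game.tap"])); // game.tap
```

Iterating over extensions first, then over file names, means a `.tap` anywhere in the archive wins over a `.tzx` listed earlier, which matches the stated priority.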