explorer/docs/plans/software-hashes.md

# Software Hashes Plan

This plan adds a derived `software_hashes` table, the pipeline that updates it, and a JSON snapshot lifecycle that lets the table's contents survive DB wipes.

---

## 1) Goals and Scope (Plan Step 1)

- Create and maintain `software_hashes`, initially for tape-image downloads only.
- Preserve existing `_CONTENTS` folders; only create missing ones.
- Export `software_hashes` to JSON after each bulk update.
- Reimport `software_hashes` JSON during DB wipe in `bin/import_mysql.sh` (or a helper script it invokes).
- Ensure all scripts are idempotent and resume-safe.

---

## 2) Confirm Pipeline Touchpoints (Plan Step 2)

- Verify `bin/import_mysql.sh` is the authoritative DB wipe/import entry point.
- Confirm `bin/sync-downloads.mjs` remains responsible only for CDN cache sync.
- Confirm that `downloads.id`, as defined in `src/server/schema/zxdb.ts`, is the natural FK target.

---

## 3) Define Data Model: `software_hashes` (Plan Step 3)

### Table naming and FK alignment

- Table: `software_hashes`.
- FK: `download_id` → `downloads.id`.
- Column names follow existing DB `snake_case` conventions.

### Planned columns

- `download_id` (PK or unique index; FK to `downloads.id`)
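- `md5` (MD5 digest of the selected tape file)
- `crc32` (CRC32 checksum of the same file)
- `size_bytes` (size of the same file in bytes)
- `updated_at` (time the row was last written)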

### Planned indexes / constraints

- Unique index on `download_id`.
- Index on `md5` for reverse lookup.
- Index on `crc32` for reverse lookup.
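
The column list and types are still open (see section 10), so the following is only a sketch of how the row type and table might look. It assumes MySQL (implied by `bin/import_mysql.sh`), an `INT` `downloads.id`, and lowercase hex strings for both digests. The index names, the constraint name, and the example values are hypothetical.

```typescript
// Sketch only: column types, index names, and the constraint name are assumptions.

/** One row of `software_hashes`, mirroring the planned columns. */
interface SoftwareHashRow {
  download_id: number; // FK to downloads.id, one row per download
  md5: string;         // assumed: 32 lowercase hex characters
  crc32: string;       // assumed: 8 lowercase hex characters
  size_bytes: number;
  updated_at: string;  // ISO 8601 timestamp
}

/** Hypothetical MySQL DDL; download_id must use the same type as downloads.id. */
const SOFTWARE_HASHES_DDL = `
CREATE TABLE IF NOT EXISTS software_hashes (
  download_id INT NOT NULL,
  md5         CHAR(32) NOT NULL,
  crc32       CHAR(8) NOT NULL,
  size_bytes  BIGINT UNSIGNED NOT NULL,
  updated_at  DATETIME NOT NULL,
  PRIMARY KEY (download_id),
  KEY idx_software_hashes_md5 (md5),
  KEY idx_software_hashes_crc32 (crc32),
  CONSTRAINT fk_software_hashes_download
    FOREIGN KEY (download_id) REFERENCES downloads (id)
)`;

// Illustrative row; the digests are those of the three-byte input "abc".
const exampleRow: SoftwareHashRow = {
  download_id: 123,
  md5: "900150983cd24fb0d6963f7d28e17f72",
  crc32: "352441c2",
  size_bytes: 3,
  updated_at: "2026-02-17T15:18:00.000Z",
};
```

Making `download_id` the primary key also gives the unique index listed above, so a separate unique index would be redundant in this variant.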

---

## 4) Define JSON Snapshot Format (Plan Step 4)

### Location

- Default: `data/zxdb/software_hashes.json` (or another agreed path).

### Structure

```json
{
  "exportedAt": "2026-02-17T15:18:00.000Z",
  "rows": [
    {
      "download_id": 123,
      "md5": "...",
      "crc32": "...",
      "size_bytes": 12345,
      "updated_at": "2026-02-17T15:18:00.000Z"
    }
  ]
}
```

### Planned import policy

- If the snapshot exists: truncate `software_hashes`, then bulk insert the snapshot rows.
- If the snapshot is missing: log that fact and continue without error.
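
Because the import truncates the table before inserting, it may be worth validating a snapshot before trusting it. The sketch below types the structure shown above and rejects malformed input. The function names are hypothetical, and the lowercase-hex digest patterns are an assumption about how the hashes will be stored.

```typescript
interface SoftwareHashRow {
  download_id: number;
  md5: string;
  crc32: string;
  size_bytes: number;
  updated_at: string;
}

interface SoftwareHashSnapshot {
  exportedAt: string;
  rows: SoftwareHashRow[];
}

/** Type guard for one row. Assumes digests are stored as lowercase hex. */
function isRow(value: unknown): value is SoftwareHashRow {
  if (typeof value !== "object" || value === null) return false;
  const row = value as Record<string, unknown>;
  return (
    Number.isInteger(row.download_id) &&
    typeof row.md5 === "string" && /^[0-9a-f]{32}$/.test(row.md5) &&
    typeof row.crc32 === "string" && /^[0-9a-f]{8}$/.test(row.crc32) &&
    Number.isInteger(row.size_bytes) &&
    typeof row.updated_at === "string"
  );
}

/** Parse snapshot text and throw a descriptive error if its shape is wrong. */
function parseSnapshot(text: string): SoftwareHashSnapshot {
  const data: unknown = JSON.parse(text);
  if (typeof data !== "object" || data === null) {
    throw new Error("snapshot must be a JSON object");
  }
  const { exportedAt, rows } = data as { exportedAt?: unknown; rows?: unknown };
  if (typeof exportedAt !== "string") throw new Error("exportedAt must be a string");
  if (!Array.isArray(rows)) throw new Error("rows must be an array");
  rows.forEach((row, index) => {
    if (!isRow(row)) throw new Error(`rows[${index}] is malformed`);
  });
  return { exportedAt, rows: rows as SoftwareHashRow[] };
}
```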

---

## 5) Implement Tape Image Update Workflow (Plan Step 5)

### Planned script

- `bin/update-software-hashes.mjs` (name can be adjusted).

### Planned input dataset

- Query `downloads` for tape-image rows (filter by `filetype_id` or joined `filetypes` table).

### Planned per-item process

1. Resolve local zip path using the same CDN mapping used by `sync-downloads`.
2. Compute `_CONTENTS` folder name: `<zip filename>_CONTENTS` (exact match).
3. If `_CONTENTS` exists, keep it untouched.
4. If missing, extract zip into `_CONTENTS` using a library that avoids shell expansion issues with brackets.
5. Locate tape file inside (`.tap`, `.tzx`, `.pzx`, `.csw`):
   - Apply a deterministic priority order.
   - If multiple candidates remain, log and skip (or record ambiguity).
6. Compute `md5`, `crc32`, and `size_bytes` for the selected file.
7. Upsert into `software_hashes` keyed by `download_id`.
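
Steps 2, 5, and 6 can be sketched as pure functions. This is a minimal illustration under stated assumptions: the `_CONTENTS` folder sits beside the zip, the priority order follows the order in which the extensions are listed in step 5, and ambiguity is reported to the caller rather than resolved. All function names are hypothetical. Zip extraction and the DB upsert are left out because the plan does not yet name a library for either.

```typescript
import { createHash } from "node:crypto";
import { extname } from "node:path";

/** Step 2: `<zip filename>_CONTENTS`, assumed to live next to the zip. */
function contentsDirFor(zipPath: string): string {
  return `${zipPath}_CONTENTS`;
}

// Assumed priority order; the plan only requires that it be deterministic.
const TAPE_PRIORITY = [".tap", ".tzx", ".pzx", ".csw"];

type TapePick =
  | { kind: "ok"; file: string }
  | { kind: "none" }
  | { kind: "ambiguous"; candidates: string[] };

/** Step 5: take the highest-priority extension present; report ties. */
function pickTapeFile(fileNames: string[]): TapePick {
  for (const ext of TAPE_PRIORITY) {
    const matches = fileNames
      .filter((name) => extname(name).toLowerCase() === ext)
      .sort();
    if (matches.length === 1) return { kind: "ok", file: matches[0] };
    if (matches.length > 1) return { kind: "ambiguous", candidates: matches };
  }
  return { kind: "none" };
}

// Lookup table for the standard reflected CRC-32 (polynomial 0xEDB88320).
const CRC_TABLE = new Uint32Array(256);
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let k = 0; k < 8; k++) c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
  CRC_TABLE[n] = c >>> 0;
}

function crc32(data: Uint8Array): number {
  let crc = 0xffffffff;
  for (let i = 0; i < data.length; i++) {
    crc = CRC_TABLE[(crc ^ data[i]) & 0xff] ^ (crc >>> 8);
  }
  return (crc ^ 0xffffffff) >>> 0;
}

/** Step 6: md5, crc32 (8 hex digits), and size for the file's bytes. */
function hashBuffer(data: Uint8Array) {
  return {
    md5: createHash("md5").update(data).digest("hex"),
    crc32: crc32(data).toString(16).padStart(8, "0"),
    size_bytes: data.length,
  };
}
```

Returning a tagged result from `pickTapeFile` keeps the "log and skip" decision in the caller, so the same function would still work if the plan later chooses to record ambiguity instead.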

### Planned error handling

- Log missing zips or missing tape files.
- Continue after recoverable errors; fail only on critical DB errors.

---

## 6) Implement JSON Export Lifecycle (Plan Step 6)

- After each bulk update, export `software_hashes` to JSON.
- Write atomically (temp file + rename).
- Include `exportedAt` timestamp in snapshot.
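
A minimal sketch of the temp-file-plus-rename export follows. The function names and the temp-file suffix are hypothetical. Sorting rows by `download_id` is an added assumption, intended only to keep successive snapshots easy to diff. The rename is atomic only when the temp file and the target are on the same filesystem, which is why the temp file is created next to the target.

```typescript
import { renameSync, rmSync, writeFileSync } from "node:fs";

interface SoftwareHashRow {
  download_id: number;
  md5: string;
  crc32: string;
  size_bytes: number;
  updated_at: string;
}

/** Build the snapshot text; rows are sorted for stable output (assumption). */
function serializeSnapshot(rows: SoftwareHashRow[], now: Date = new Date()): string {
  const sorted = [...rows].sort((a, b) => a.download_id - b.download_id);
  return JSON.stringify({ exportedAt: now.toISOString(), rows: sorted }, null, 2) + "\n";
}

/** Write to a sibling temp file, then rename it over the target. */
function writeFileAtomic(targetPath: string, text: string): void {
  const tempPath = `${targetPath}.${process.pid}.tmp`;
  try {
    writeFileSync(tempPath, text, "utf8");
    renameSync(tempPath, targetPath);
  } catch (err) {
    // Remove the partial temp file; the previous snapshot stays untouched.
    rmSync(tempPath, { force: true });
    throw err;
  }
}
```

A caller would run something like `writeFileAtomic("data/zxdb/software_hashes.json", serializeSnapshot(rows))` once the bulk update has finished, so readers see either the old snapshot or the complete new one.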

---

## 7) Reimport During Wipe (`bin/import_mysql.sh`) (Plan Step 7)

### Planned placement

- Immediately after database creation and the ZXDB SQL import complete.

### Planned behavior

- Attempt to read JSON snapshot.
- If present, truncate and reinsert `software_hashes`.
- Log imported row count.
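
If the reimport is delegated to a helper script, its core might look like the sketch below. `HashStore` is a hypothetical adapter standing in for whatever MySQL client the project uses, and the batch size of 1000 is an arbitrary assumption. The snapshot is parsed before the table is truncated, so a corrupt file cannot leave the table empty.

```typescript
import { existsSync, readFileSync } from "node:fs";

interface SoftwareHashRow {
  download_id: number;
  md5: string;
  crc32: string;
  size_bytes: number;
  updated_at: string;
}

/** Hypothetical DB adapter; a real helper would wrap the project's MySQL client. */
interface HashStore {
  truncate(): Promise<void>;
  insertMany(rows: SoftwareHashRow[]): Promise<void>;
}

/** Split items into fixed-size batches so each INSERT stays reasonably small. */
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

/** Returns the number of rows imported (0 when no snapshot exists). */
async function reimportSoftwareHashes(
  snapshotPath: string,
  store: HashStore,
  batchSize = 1000,
): Promise<number> {
  if (!existsSync(snapshotPath)) {
    console.log(`software_hashes: no snapshot at ${snapshotPath}, skipping reimport`);
    return 0;
  }
  // Parse first: a JSON error aborts here, before any data is removed.
  const snapshot = JSON.parse(readFileSync(snapshotPath, "utf8")) as {
    rows: SoftwareHashRow[];
  };
  await store.truncate();
  for (const batch of chunk(snapshot.rows, batchSize)) {
    await store.insertMany(batch);
  }
  console.log(`software_hashes: imported ${snapshot.rows.length} rows`);
  return snapshot.rows.length;
}
```

This assumes the `software_hashes` table already exists when the helper runs; since the table is derived rather than part of the ZXDB SQL import, the wipe script would need to create it first.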

---

## 8) Add Idempotency and Resume Support (Plan Step 8)

- Keep a state file, similar to `.sync-downloads.state.json`, that tracks the last `download_id` processed.
- CLI flags:
  - `--resume` (default)
  - `--start-from-id`
  - `--rebuild-all`
- Reprocess a download when its zip file's size or mtime changes.
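
The flags and the change check could be sketched as follows, using Node's built-in `util.parseArgs`. The state-file shape, the function names, and the rule that `--rebuild-all` takes precedence over `--start-from-id` are assumptions. How the resume point interacts with change detection for downloads below it is deliberately left open here.

```typescript
import { parseArgs } from "node:util";

interface ZipFingerprint {
  size: number;
  mtimeMs: number;
}

/** Hypothetical state-file shape; fingerprints are keyed by download_id. */
interface ResumeState {
  lastDownloadId: number;
  zips: Record<string, ZipFingerprint>;
}

interface CliFlags {
  resume: boolean;
  rebuildAll: boolean;
  startFromId: number | undefined;
}

function parseCli(argv: string[]): CliFlags {
  const { values } = parseArgs({
    args: argv,
    options: {
      resume: { type: "boolean" },
      "start-from-id": { type: "string" },
      "rebuild-all": { type: "boolean" },
    },
  });
  const raw = values["start-from-id"];
  let startFromId: number | undefined;
  if (raw !== undefined) {
    startFromId = Number(raw);
    if (!Number.isInteger(startFromId) || startFromId < 0) {
      throw new Error(`invalid --start-from-id: ${raw}`);
    }
  }
  const rebuildAll = values["rebuild-all"] === true;
  // Resume is the default whenever neither override is given.
  return { resume: !rebuildAll && startFromId === undefined, rebuildAll, startFromId };
}

/** First download_id to process (inclusive). */
function resolveStartId(flags: CliFlags, state: ResumeState | null): number {
  if (flags.rebuildAll) return 0;
  if (flags.startFromId !== undefined) return flags.startFromId;
  return state ? state.lastDownloadId + 1 : 0;
}

/** True when the zip is new to the state file or its size or mtime changed. */
function needsReprocess(previous: ZipFingerprint | undefined, current: ZipFingerprint): boolean {
  return (
    previous === undefined ||
    previous.size !== current.size ||
    previous.mtimeMs !== current.mtimeMs
  );
}
```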

---

## 9) Validation Checklist (Plan Step 9)

- `_CONTENTS` folders are never deleted.
- Hashes match expected MD5/CRC32 for known samples.
- JSON snapshot is created and reimported correctly.
- Reverse lookup by `md5`/`crc32`/`size_bytes` identifies misnamed files.
- Script can resume safely after interruption.

---

## 10) Open Questions / Confirmations (Plan Step 10)

- Final `software_hashes` column list and types.
- Exact JSON snapshot path.
- Filetype IDs that map to “Tape Image” in `downloads`.