# KRB live scrape → Webflow JS data plan

Purpose: speed up the KRB WordPress → Webflow page migration by scraping only component-relevant data from live KRB pages, normalising it into a reusable JSON contract, and feeding that contract into Iggy Code Lab / Designer JS imports, with attention flags for anything that needs manual review.

## Recommendation
Yes. This is the right next step before trying to populate 20+ pages at once. The current About Us import already proves the target data shape (`SOURCE`, `GALLERY_MANIFEST`, `EXPECTED`), but too much of the source extraction is still manual and fragile. A Python scraper scoped to known WordPress block patterns should produce cleaner, repeatable input and reduce Code Lab work to mapping and verification.

## Proposed pipeline

1. Input
   - List of live KRB URLs and matching Webflow page IDs/slugs.
   - Optional expected component map per page, e.g. `Section / Text Content`, `Section / Two Column Text & Image`, `Section / Gateway CTA`, `Section / Downloads`, `Section / Next Pages`.

2. Fetch
   - Python `requests` + `BeautifulSoup`.
   - Store raw HTML snapshots for audit/replay.
   - Do not scrape the whole site indiscriminately; use explicit URL batches.
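
A minimal sketch of the snapshot part of the fetch step, assuming one HTML file per URL under a local snapshot directory. The names `snapshot_path` and `save_snapshot` and the file naming scheme are hypothetical; the HTTP request itself is left out so the sketch runs offline.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlparse


def snapshot_path(url: str, root: Path) -> Path:
    """Map a URL to a stable file path so reruns overwrite the same snapshot."""
    parsed = urlparse(url)
    # Slug from the URL path; fall back to "index" for the site root.
    slug = parsed.path.strip("/").replace("/", "__") or "index"
    # A short hash of the full URL keeps query-string variants distinct.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:8]
    return root / parsed.netloc / f"{slug}-{digest}.html"


def save_snapshot(url: str, html: str, root: Path) -> Path:
    """Write raw HTML to disk for audit/replay and return the file path."""
    path = snapshot_path(url, root)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path
```

In the scraper, `html` would be the response body from `requests.get(url, timeout=...)` for each URL in an explicit batch; extraction can then replay from the saved files without hitting the live site again.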

3. Extract only known block families
   - Hero: h1/title, primary hero image candidates.
   - Intro/text content: leading paragraphs/headings.
   - Two-column/person/profile blocks: heading, body, image, alt, source order.
   - Gateway/link blocks: title, URL, page relationship if known.
   - Downloads: label, file URL, file type.
   - Galleries/sliders: image list and captions.
   - Accordions: heading + body pairs.
   - Next pages: either from the page nav/breadcrumb/sidebar or generated from a configured relationship map.
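
As an illustration of one block family, the sketch below extracts downloads (label, file URL, file type) from a page. It uses the standard-library `html.parser` rather than `BeautifulSoup` only to stay dependency-free; the extension list, the output key names and the rule "any link to a document file is a download" are assumptions, not confirmed KRB block patterns.

```python
from html.parser import HTMLParser
from typing import List, Optional
from urllib.parse import urljoin, urlparse

# Assumed document extensions; adjust to what the live pages actually link to.
DOWNLOAD_EXTENSIONS = {"pdf", "doc", "docx", "xls", "xlsx"}


class DownloadExtractor(HTMLParser):
    """Collect label, file URL and file type for links that point at documents."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.downloads: List[dict] = []
        self._current: Optional[dict] = None  # document link being read, if any

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        url = urljoin(self.base_url, href)  # resolve relative links
        ext = urlparse(url).path.rsplit(".", 1)[-1].lower()
        if ext in DOWNLOAD_EXTENSIONS:
            self._current = {"label": "", "fileUrl": url, "fileType": ext}

    def handle_data(self, data):
        if self._current is not None:
            self._current["label"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            # Collapse whitespace so labels compare cleanly later.
            self._current["label"] = " ".join(self._current["label"].split())
            self.downloads.append(self._current)
            self._current = None


def extract_downloads(html: str, base_url: str) -> List[dict]:
    parser = DownloadExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.downloads
```

The other block families would follow the same pattern: one small extractor per family, each returning plain dicts in source order.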

4. Normalise to JSON contract
   - Per page:
     - `sourceUrl`
     - `webflowPageId`
     - `hero`
     - `textContent[]`
     - `twoColumn[]`
     - `gatewayLinks[]`
     - `downloads[]`
     - `galleries[]`
     - `accordions[]`
     - `nextPages[]`
     - `attention[]`
   - Keep both HTML and plain-text versions of a field where useful.
   - Keep image source URL + filename for Webflow asset matching.
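
A sketch of the per-page record, using the key names listed above. The helper names `new_page_record` and `image_ref`, the `None` default for `hero`, and the inner shape of an image entry (`src`, `filename`, `alt`) are assumptions for illustration.

```python
import json
from pathlib import PurePosixPath
from urllib.parse import urlparse

# List-valued keys from the per-page contract.
LIST_FIELDS = (
    "textContent", "twoColumn", "gatewayLinks", "downloads",
    "galleries", "accordions", "nextPages", "attention",
)


def new_page_record(source_url: str, webflow_page_id: str) -> dict:
    """Return an empty page record with every contract key present."""
    record = {"sourceUrl": source_url, "webflowPageId": webflow_page_id, "hero": None}
    for field in LIST_FIELDS:
        record[field] = []  # a fresh list per field, never shared
    return record


def image_ref(src: str, alt: str = "") -> dict:
    """Keep the source URL plus the bare filename for Webflow asset matching."""
    filename = PurePosixPath(urlparse(src).path).name
    return {"src": src, "filename": filename, "alt": alt}


def to_json(pages: list) -> str:
    """Serialise page records, keeping non-ASCII text readable."""
    return json.dumps(pages, indent=2, ensure_ascii=False)
```

Keeping every key present, even when empty, lets the import script read the contract without existence checks.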

5. Validate before Webflow writes
   - Compare extracted block counts to expected Webflow component counts.
   - Flag:
     - missing/extra source blocks
     - likely duplicate blocks
     - missing images
     - unresolved internal links
     - documents not yet uploaded/matched
     - HTML/rich-text formatting needing manual review
     - component count mismatch on target page
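
Two of these checks can be sketched directly: count comparison against the expected component map, and likely-duplicate detection. The flag codes (`missing_source_blocks`, `extra_source_blocks`), the `text` key on a block, and the exact-match-after-normalisation rule for duplicates are assumptions.

```python
def count_mismatches(page: dict, expected_counts: dict) -> list:
    """Compare extracted block counts with expected Webflow component counts."""
    flags = []
    for field, expected in expected_counts.items():
        actual = len(page.get(field) or [])
        if actual < expected:
            code = "missing_source_blocks"
        elif actual > expected:
            code = "extra_source_blocks"
        else:
            continue  # counts agree, nothing to flag
        flags.append(
            {"code": code, "field": field, "expected": expected, "actual": actual}
        )
    return flags


def duplicate_blocks(blocks: list, key: str = "text") -> list:
    """Return indexes of blocks whose normalised text repeats an earlier block."""
    seen = set()
    dupes = []
    for index, block in enumerate(blocks):
        # Lower-case and collapse whitespace before comparing.
        norm = " ".join(str(block.get(key, "")).lower().split())
        if norm and norm in seen:
            dupes.append(index)
        seen.add(norm)
    return dupes
```

The returned flags would be appended to the page's `attention[]` list rather than raised as errors, so a batch can continue past pages that need review.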

6. Generate/import
   - The JS import script should consume this JSON rather than embedding manually curated page data.
   - Run first in read-only/dry-run mode.
   - Then run writes in page-group chunks, e.g. 5–20 pages per chunk, with structured output for each chunk.
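
The dry-run and chunking behaviour can be sketched as follows. The actual writes happen in the JS import, so `write_page` here is a hypothetical callable standing in for that step, and the result fields (`batch`, `status`, `error`) are assumed names.

```python
from typing import Callable, Iterator, List


def chunked(pages: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive groups of at most `size` pages."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(pages), size):
        yield pages[start:start + size]


def run_batches(
    pages: List[dict],
    size: int,
    write_page: Callable[[dict], None],
    dry_run: bool = True,
) -> List[dict]:
    """Process pages in chunks and return one structured result per page."""
    results = []
    for batch_no, batch in enumerate(chunked(pages, size), start=1):
        for page in batch:
            entry = {"batch": batch_no, "webflowPageId": page["webflowPageId"]}
            if dry_run:
                entry["status"] = "dry-run"  # nothing is written
            else:
                try:
                    write_page(page)
                    entry["status"] = "written"
                except Exception as exc:
                    # Record the failure and carry on with the next page.
                    entry["status"] = "failed"
                    entry["error"] = str(exc)
            results.append(entry)
    return results
```

Defaulting `dry_run` to `True` means a write only happens when it is asked for explicitly.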

7. Output reports
   - `krb-scrape-normalized.json`
   - `krb-scrape-attention.csv`
   - `krb-webflow-import-dry-run-report.json`
   - `krb-webflow-import-result.json`
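
A sketch of how the attention report could be flattened from the normalised pages, one row per attention item. The column set and the `code` / `detail` keys on an attention item are assumptions.

```python
import csv
import io

# Assumed column layout for krb-scrape-attention.csv.
ATTENTION_COLUMNS = ["sourceUrl", "webflowPageId", "code", "detail"]


def attention_rows(pages):
    """Yield one flat row per attention item across all pages."""
    for page in pages:
        for item in page.get("attention", []):
            yield {
                "sourceUrl": page["sourceUrl"],
                "webflowPageId": page["webflowPageId"],
                "code": item.get("code", ""),
                "detail": item.get("detail", ""),
            }


def write_attention_csv(pages, handle) -> int:
    """Write the attention report to an open text handle; return the row count."""
    writer = csv.DictWriter(handle, fieldnames=ATTENTION_COLUMNS)
    writer.writeheader()
    count = 0
    for row in attention_rows(pages):
        writer.writerow(row)
        count += 1
    return count


def attention_csv_text(pages) -> str:
    """Render the report in memory, e.g. for previewing before saving."""
    buffer = io.StringIO(newline="")
    write_attention_csv(pages, buffer)
    return buffer.getvalue()
```

To write the file itself, open it with `open("krb-scrape-attention.csv", "w", newline="", encoding="utf-8")` and pass the handle to `write_attention_csv`.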

## Why this helps Iggy Code Lab

- Code Lab should become the executor/verifier, not the place where source extraction logic lives.
- The scraper can run locally and deterministically, then Code Lab only applies a known contract to Webflow components.
- Batch import can record page-level attention spots instead of stopping the workflow every time something does not map cleanly.

## App/product improvement observations

- Add a migration data-import mode to the Iggy app: upload/paste JSON, choose page batch, dry run, then execute.
- Show page/component match table before writes.
- Show partial-success prop-level results after writes.
- Persist run history and unresolved attention items.
- Add retry for failed props/components only, instead of rerunning the whole page.
- Provide asset filename matching and an unresolved-asset queue.
