mirror of
https://github.com/harvard-edge/cs249r_book.git
synced 2026-05-08 02:28:25 -05:00
Goal: get the nightly link-rot tracker (#1424) to permanent green.

Real broken links were already addressed in PRs #1552 and #1553. The remaining tracker noise is dominated by anti-bot false positives: sites that respond 999/403/503/transient-5xx to Lychee's HEAD requests but are reachable in a browser. Manually verified each pattern below returns 200 to a real user agent.

shared/config/.lycheeignore adds patterns covering:
- LinkedIn (always 999 anti-scraping)
- X / Twitter (Cloudflare bot challenge)
- Harvard SEAS faculty pages (university CDN bot block)
- Edge AI Foundation (verified live; intermittent 5xx)
- discuss.tinymlx.org (Discourse 403 to HEAD)
- edX professional-certificate / course pages (throttling)
- mpstewart.net (302 redirects lychee mishandles)
- Medium / Towards Data Science (Cloudflare HEAD blocks)
- Forbes / WSJ / Reuters (paywall + bot detection)
- StackOverflow / Stack Exchange (Cloudflare challenge)
- YouTube channel URLs (4xx to HEAD, live in browser)

book/config/linting/.lycheeignore adds book-bucket entries for the same false-positive sites (book uses its own lycheeignore file per the workflow's lycheeignore_path config), specifically:
- edgeaifoundation.org (the 2 flagged URLs in #1424's Book bucket)
- LinkedIn / Twitter / Forbes / WSJ / TowardsDataScience

After this lands, the next nightly link-rot run should report zero on every site that doesn't have a real broken-content issue. The tracker auto-closes when broken count = 0.
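
For reviewers who want to sanity-check ignore patterns before the nightly run, here is a minimal sketch of how a regex-per-line ignore file can be tested against sample URLs. The patterns in EXAMPLE_PATTERNS are hypothetical illustrations, not the entries added in this commit, and the helper names load_patterns and is_ignored are made up for this example. Skipping blank lines and "#" lines is an assumption of the sketch, and Python's re module may differ in edge cases from the regex engine lychee itself uses.

```python
import re

# Hypothetical patterns for illustration only; the real entries live in
# shared/config/.lycheeignore and book/config/linting/.lycheeignore.
EXAMPLE_PATTERNS = r"""
# anti-bot false positives (illustrative)
^https?://(www\.)?linkedin\.com/
^https?://(www\.)?(twitter|x)\.com/
^https?://discuss\.tinymlx\.org/
"""


def load_patterns(text):
    """Compile one regex per non-empty line, skipping '#' lines (an assumption)."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        patterns.append(re.compile(line))
    return patterns


def is_ignored(url, patterns):
    """Return True if any pattern matches somewhere in the URL."""
    return any(p.search(url) for p in patterns)


patterns = load_patterns(EXAMPLE_PATTERNS)

for url in (
    "https://www.linkedin.com/in/someone",
    "https://x.com/some_account",
    "https://netflix.com/title/1",
    "https://example.com/chapter/1",
):
    print(url, "ignored" if is_ignored(url, patterns) else "checked")
```

Anchoring each pattern at the scheme and host keeps a short domain such as x.com from also suppressing unrelated hosts that merely end in the same characters (netflix.com in the sample above), so real broken links on other sites are still reported.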