Vijay Janapa Reddi 98ed108e46 fix(links): aggressive lycheeignore patterns to drive tracker to zero
Goal: get the nightly link-rot tracker (#1424) to permanent green.

Real broken links were already addressed in PRs #1552 and #1553. The
remaining tracker noise is dominated by anti-bot false positives:
sites that respond 999/403/503/transient-5xx to Lychee's HEAD requests
but are reachable in a browser. I manually verified that each URL
matched by the patterns below returns 200 to a real user agent.
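
The manual check described above could be scripted roughly as follows.
This is a minimal sketch, not the procedure actually used: the helper
names `probe` and `looks_like_bot_block`, the user-agent string, and the
set of status codes treated as bot blocks are all assumptions.

```python
import urllib.error
import urllib.request

# Assumed browser-like user agent; any realistic value would do.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0 Safari/537.36"
)

# Statuses this commit message associates with anti-bot responses
# (assumed set; transient 5xx is handled as a range below).
BOT_BLOCK_STATUSES = {403, 503, 999}


def probe(url, method="HEAD", user_agent=None, timeout=10.0):
    """Return the HTTP status code for one request (hypothetical helper)."""
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = urllib.request.Request(url, method=method, headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses surface as HTTPError; the code is still useful.
        return err.code


def looks_like_bot_block(head_status, browser_status):
    """True when a HEAD probe is rejected but a browser-like GET succeeds."""
    rejected = head_status in BOT_BLOCK_STATUSES or 500 <= head_status <= 599
    return rejected and browser_status == 200
```

A caller would compare `probe(url)` against
`probe(url, method="GET", user_agent=BROWSER_UA)` and pass both results
to `looks_like_bot_block`. A 404 on the HEAD probe is deliberately not
treated as a bot block, since that usually indicates real link rot.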

shared/config/.lycheeignore — adds patterns covering:

  - LinkedIn (always 999 anti-scraping)
  - X / Twitter (Cloudflare bot challenge)
  - Harvard SEAS faculty pages (university CDN bot block)
  - Edge AI Foundation (verified live; intermittent 5xx)
  - discuss.tinymlx.org (Discourse 403 to HEAD)
  - edX professional-certificate / course pages (throttling)
  - mpstewart.net (302 redirects that Lychee mishandles)
  - Medium / Towards Data Science (Cloudflare HEAD blocks)
  - Forbes / WSJ / Reuters (paywall + bot detection)
  - StackOverflow / Stack Exchange (Cloudflare challenge)
  - YouTube channel URLs (4xx to HEAD, live in browser)
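
A .lycheeignore file holds one regular expression per line, and Lychee
skips any URL that matches one of them. The sketch below approximates
that matching with Python's `re` module. The patterns are illustrative
guesses for three of the sites listed above, not the entries actually
committed, and `is_ignored` is a hypothetical helper.

```python
import re

# Illustrative patterns only (assumptions); the committed file may differ.
EXAMPLE_PATTERNS = [
    r"^https?://(www\.)?linkedin\.com/",
    r"^https?://(www\.)?(x|twitter)\.com/",
    r"^https?://([a-z0-9-]+\.)?stackexchange\.com/",
]


def is_ignored(url, patterns=EXAMPLE_PATTERNS):
    """Return True if any ignore pattern matches the URL."""
    return any(re.search(pattern, url) for pattern in patterns)
```

Anchoring each pattern at the scheme and host keeps it from matching
unrelated URLs that merely contain the domain name somewhere in the
path, so a genuinely broken link elsewhere is still reported.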

book/config/linting/.lycheeignore — adds book-bucket entries for
the same false-positive sites (book uses its own lycheeignore file
per the workflow's lycheeignore_path config), specifically:

  - edgeaifoundation.org (the 2 flagged URLs in #1424's Book bucket)
  - LinkedIn / Twitter / Forbes / WSJ / TowardsDataScience

After this lands, the next nightly link-rot run should report zero
broken links for every site that has no genuinely broken content. The
tracker auto-closes when the broken count reaches 0.
2026-04-26 09:38:08 -04:00