cs249r_book/interviews/CONTRIBUTING.md
Vijay Janapa Reddi a33296df5f feat(vault): corpus licensed CC-BY-NC-4.0 (explicit user decision)
User concern: preventing commercial reuse of the corpus (e.g., a
vendor training a paid product on the questions, selling access to
them). CC-BY-NC-4.0 permits research citation + non-commercial
derivatives while requiring written permission for commercial use.

interviews/vault/questions/LICENSE (NEW)
  CC-BY-NC-4.0 full text with BibTeX template tied to release_hash.
  Commercial licensing contact noted.

interviews/vault/ARCHITECTURE.md §15 #1
  Marked DECIDED. Rationale recorded. vault-cli license
  intentionally left at historical status (not relicensed as part
  of this change).

interviews/vault/REVIEWS.md
  License state: DECIDED. Removed from Phase-3 blocker list.

interviews/CONTRIBUTING.md
  New 'License' section: NC constraint explicit. External corpus
  PRs assumed offered under same CC-BY-NC-4.0. Contact for commercial
  licensing specified.
2026-04-16 13:48:29 -04:00

Contributing to StaffML

Thanks for your interest. This guide covers contributing to the StaffML vault (question corpus) and the site (interviews/staffml/).

For the full architecture of the vault pipeline, see vault/ARCHITECTURE.md. For the review ledger behind it, see vault/REVIEWS.md.


Quick start — from clone to first-question-visible

# 1. Clone and pick the staffml worktree
git clone https://github.com/harvard-edge/cs249r_book.git
cd cs249r_book/interviews

# 2. Install vault-cli (Python 3.12+)
pip install -e "vault-cli/[dev]"    # quoted so shells like zsh don't glob-expand [dev]
vault --version
pytest vault-cli/tests/

# 3. Phase-aware explore
#    At Phase 0 only --version/--help exist. At Phase 1 the corpus is split
#    and `build` / `check` work end-to-end. At Phase 2+, `stats` / `verify`
#    / `export-paper` exist. See §Phase-by-phase scope at the bottom of this
#    file for exactly which subcommands are live at your checkout.
vault --help                # always works — shows subcommands live at your phase

# 4. Run the local API shim (requires a prior `vault build` so there's a vault.db)
#    Serves the Worker endpoint surface from a local vault.db so you don't
#    need a Cloudflare account to develop the site. Available from Phase 1 onward.
vault build                                      # produces interviews/vault/vault.db
vault api --db vault/vault.db --port 8002 &      # path is relative to interviews/, the current directory

# 5. Run the site against your local API
cd staffml/
cp .env.example .env.local
# edit .env.local: NEXT_PUBLIC_VAULT_API=http://localhost:8002
pnpm install
pnpm dev
# visit http://localhost:3000

The goal is to get from clone to a visible question in under 10 minutes on a fresh machine. If it takes longer, file an issue titled "CONTRIBUTING.md getting-started friction".


What can I contribute?

| Contribution type | Where | How |
|---|---|---|
| New question | `vault/questions/<track>/<level>/<zone>/` | `vault new` |
| Fix a question | same | `vault edit <id>` |
| Reclassify a question | same | `vault move <id> --to <track>/<level>/<zone>` |
| New topic | `vault/taxonomy.yaml` | PR with a §7 entry in EVOLUTION.md |
| Website UX | `interviews/staffml/src/` | Next.js; see AGENTS.md in staffml if present |
| Worker API | `interviews/staffml-vault-worker/` (Phase 3+) | Wrangler project |
| Schema evolution | `vault/schema/` | RFC-style PR per EVOLUTION.md |
| New vault-cli subcommand | `vault-cli/src/vault_cli/commands/` | Land with tests + docs update |
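
The `<track>/<level>/<zone>` layout in the table can be read off a file path mechanically. The following is a minimal sketch, assuming question files sit directly inside the zone directory. The `QuestionLocation` class, the `parse_question_path` helper, and the example track, level, zone, and filename are all hypothetical and are not taken from vault-cli.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class QuestionLocation:
    track: str
    level: str
    zone: str


def parse_question_path(path: str) -> QuestionLocation:
    """Split .../questions/<track>/<level>/<zone>/<file> into its three parts."""
    parts = PurePosixPath(path).parts
    try:
        start = parts.index("questions")
    except ValueError:
        raise ValueError(f"not under a questions/ directory: {path}") from None
    rest = parts[start + 1:]
    # Assumption: exactly one file name follows the three classification directories.
    if len(rest) != 4:
        raise ValueError(f"expected <track>/<level>/<zone>/<file>, got: {path}")
    track, level, zone, _filename = rest
    return QuestionLocation(track, level, zone)


# Hypothetical path; real track/level/zone names come from the vault taxonomy.
location = parse_question_path("vault/questions/cloud/l4/design/q-0001.yaml")
print(location)
```

A reclassification with `vault move` would change these three components while the question keeps its identity.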

Workflow

Branching

  • Start from dev for standalone work: git checkout -b feat/short-description dev.
  • One logical change per branch. Atomic commits. No git add -A on vault changes.
  • No Co-Authored-By tags, no "made with ..." footers. Commit messages should read like regular engineering work.

Before opening a PR

vault check --strict                 # invariants
pytest vault-cli/tests/              # unit + integration + contract
vault codegen --check                # LinkML ↔ Pydantic/DDL/TS drift check (Phase 1+)

CI re-runs these. PRs are merge-blocked on red CI.

PR review

  • Corpus PRs: at least one maintainer review, CI green.
  • Code PRs (vault-cli, worker): one review, CI green.
  • Schema-evolution PRs: two reviews required once external-contributor onboarding opens (Phase 7+).
  • Schema-breaking PRs must include a migration script under vault-cli/migrations/.

Provenance honesty

Every question's provenance field must honestly reflect how it was made:

  • human — written from scratch by a human.
  • llm-draft — produced by vault generate; not yet human-reviewed.
  • llm-then-human-edited — an LLM draft substantially revised by a human (the common case). generation_meta.human_reviewed_at records when.
  • imported — from an external source (e.g., book, published paper). Include source in tags.

Misattributing LLM content as human is a correctness bug, not a style nit.
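
The four provenance values and their side conditions can be expressed as a small consistency check. This is a sketch under stated assumptions: it treats a question as a plain dict, it assumes `generation_meta.human_reviewed_at` is mandatory for `llm-then-human-edited`, and it uses "tags is non-empty" as a rough stand-in for "source is named in tags". The `Provenance` enum and `provenance_problems` function are hypothetical, not the vault-cli schema.

```python
from enum import Enum


class Provenance(str, Enum):
    HUMAN = "human"
    LLM_DRAFT = "llm-draft"
    LLM_THEN_HUMAN_EDITED = "llm-then-human-edited"
    IMPORTED = "imported"


def provenance_problems(question: dict) -> list[str]:
    """Return human-readable problems; an empty list means the record looks consistent."""
    raw = question.get("provenance")
    try:
        provenance = Provenance(raw)
    except ValueError:
        return [f"unknown provenance: {raw!r}"]

    problems = []
    meta = question.get("generation_meta") or {}
    # Assumption: a human-edited LLM draft must record when the review happened.
    if provenance is Provenance.LLM_THEN_HUMAN_EDITED and not meta.get("human_reviewed_at"):
        problems.append("llm-then-human-edited requires generation_meta.human_reviewed_at")
    # Assumption: any non-empty tags list counts as naming the source.
    if provenance is Provenance.IMPORTED and not question.get("tags"):
        problems.append("imported questions should name their source in tags")
    return problems
```

A check like this can only catch records that are internally inconsistent. It cannot tell whether a question labeled `human` was in fact LLM-written, which is why honest labeling remains the contributor's responsibility.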

Author attribution

vault new populates authors from your git config user.email via vault/contributors.yaml (mapping from email → handle). Submit a PR updating that file to add yourself.

For external PRs: commit signatures (GPG/SSH) or GitHub-verified-email match is required — CI rejects authors: claims that don't match the committer identity.


Style

  • Keep questions focused on a single concept. "Good napkin math" is usually a signal; "grab-bag of facts" is usually a code smell.
  • Use realistic hardware specs — check mlsysbook/constants.py and the vault/schema/models.yaml registry for the canonical values.
  • Paper-cite URL format only: https://mlsysbook.ai/book/chapters/<slug>.
  • Scenarios are plaintext; solutions and napkin math can use restricted Markdown + KaTeX.
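
The paper-cite URL rule above lends itself to a quick pattern check before you open a PR. A minimal sketch, assuming slugs consist of lowercase letters and digits separated by single hyphens or underscores and that no trailing slash is allowed; neither assumption is stated in this guide, and `is_canonical_chapter_url` is a hypothetical helper.

```python
import re

# Assumption: slug = lowercase alphanumerics joined by single "-" or "_".
CHAPTER_URL = re.compile(r"https://mlsysbook\.ai/book/chapters/[a-z0-9]+(?:[-_][a-z0-9]+)*")


def is_canonical_chapter_url(url: str) -> bool:
    """True only for https://mlsysbook.ai/book/chapters/<slug> with nothing after the slug."""
    return CHAPTER_URL.fullmatch(url) is not None


# "training" is a made-up slug used only to exercise the pattern.
print(is_canonical_chapter_url("https://mlsysbook.ai/book/chapters/training"))   # True
print(is_canonical_chapter_url("http://mlsysbook.ai/book/chapters/training"))    # False
```

Using `fullmatch` rather than `search` rejects URLs with extra path segments, query strings, or fragments after the slug.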

Things that block external PRs from merging

  1. Provenance lie: vault mark-exemplar or vault promote --reviewed-by fields that don't match the git committer.
  2. Registry mutation: any commit that deletes lines from id-registry.yaml. The registry is append-only.
  3. Schema mixing: questions at different schema_version in the same PR.
  4. Unsigned schema-evolution PR: schema bumps require two maintainer approvals.
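
Blockers 2 and 3 are mechanical enough to check locally before pushing. The sketch below compares two versions of the registry text line by line and collects the distinct `schema_version` values across a set of questions. It assumes questions are plain dicts and that any line missing from the new registry counts as a deletion (so an edited line is reported too). The function names and the example registry entries are hypothetical, and the real CI checks may work differently.

```python
from collections import Counter


def deleted_registry_lines(old_text: str, new_text: str) -> list[str]:
    """Lines present in the old registry that no longer appear in the new one."""
    remaining = Counter(new_text.splitlines())
    deleted = []
    for line in old_text.splitlines():
        if remaining[line] > 0:
            remaining[line] -= 1  # consume one matching line from the new text
        else:
            deleted.append(line)
    return deleted


def schema_versions(questions: list[dict]) -> set:
    """Distinct schema_version values; more than one means the PR mixes schemas."""
    return {question.get("schema_version") for question in questions}


# Hypothetical registry contents, one id per line.
old = "q-0001\nq-0002\n"
print(deleted_registry_lines(old, "q-0001\nq-0002\nq-0003\n"))  # []
print(deleted_registry_lines(old, "q-0001\nq-0003\n"))          # ['q-0002']
```

The `Counter` makes the comparison a multiset difference, so removing one of two identical lines is still reported.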

Phase-by-phase scope

As of Phase 0, the vault pipeline is scaffolded but not operational end-to-end. What works today:

  • vault --version, vault --help.
  • pip install -e vault-cli/[dev] and pytest.
  • Documentation (ARCHITECTURE.md, REVIEWS.md, TESTING.md, EVOLUTION.md, this file).

What's coming per Phase (see vault/ARCHITECTURE.md §14):

  • Phase 1: new, edit, move, rm, restore, build, check, serve, api. YAML split lands.
  • Phase 2: publish + primitives, paper-exporter rewrite, rollback-symmetry CI.
  • Phase 3: D1 + Worker + @staffml/vault-types + FTS5 load-test gate.
  • Phase 4: Website cutover + service worker + rollback drill.
  • Phase 5: Chain pre-reveal indicator + instrumentation.
  • Phase 6: About-page paper prominence.

External contributions to vault/questions/ become feasible at Phase 1 exit.


License

The corpus under interviews/vault/ is licensed CC-BY-NC-4.0 — see vault/questions/LICENSE. Summary:

  • Share and adapt for non-commercial purposes with attribution.
  • Commercial use (training a paid product on the corpus, selling access, building paid services around it) requires separate written permission from the copyright holders. Contact: vjreddi@g.harvard.edu.
  • Contributions to the corpus are assumed to be offered under the same CC-BY-NC-4.0 terms. Do not submit content you are not entitled to license this way.

The interviews/vault-cli/ tooling is a separate artifact; its license is unchanged from the repository's historical state.


Asking for help

  • Architecture questions → read vault/ARCHITECTURE.md.
  • "Why was X decided this way?" → check vault/REVIEWS.md. Most non-obvious decisions map to a reviewer finding.
  • "Is this bug or intended?" → open an issue with the command you ran and the output.

Thanks for contributing.