cs249r_book/interviews/CONTRIBUTING.md
Vijay Janapa Reddi a33296df5f feat(vault): corpus licensed CC-BY-NC-4.0 (explicit user decision)
User concern: preventing commercial reuse of the corpus (e.g., a
vendor training a paid product on the questions, selling access to
them). CC-BY-NC-4.0 permits research citation + non-commercial
derivatives while requiring written permission for commercial use.

interviews/vault/questions/LICENSE (NEW)
  CC-BY-NC-4.0 full text with BibTeX template tied to release_hash.
  Commercial licensing contact noted.

interviews/vault/ARCHITECTURE.md §15 #1
  Marked DECIDED. Rationale recorded. vault-cli license
  intentionally left at historical status (not relicensed as part
  of this change).

interviews/vault/REVIEWS.md
  License state: DECIDED. Removed from Phase-3 blocker list.

interviews/CONTRIBUTING.md
  New 'License' section: NC constraint explicit. External corpus
  PRs assumed offered under same CC-BY-NC-4.0. Contact for commercial
  licensing specified.
2026-04-16 13:48:29 -04:00


# Contributing to StaffML
Thanks for your interest. This guide covers contributing to the StaffML vault
(question corpus) and the site (`interviews/staffml/`).
For the full architecture of the vault pipeline, see
[`vault/ARCHITECTURE.md`](vault/ARCHITECTURE.md). For the review ledger behind
it, see [`vault/REVIEWS.md`](vault/REVIEWS.md).

---
## Quick start — from clone to first-question-visible
```bash
# 1. Clone and pick the staffml worktree
git clone https://github.com/harvard-edge/cs249r_book.git
cd cs249r_book/interviews
# 2. Install vault-cli (Python 3.12+)
pip install -e "vault-cli/[dev]"   # quoted so zsh does not treat [dev] as a glob
vault --version
pytest vault-cli/tests/
# 3. Phase-aware explore
# At Phase 0 only --version/--help exist. At Phase 1 the corpus is split
# and `build` / `check` work end-to-end. At Phase 2+, `stats` / `verify`
# / `export-paper` exist. See §Phase-by-phase scope at the bottom of this
# file for exactly which subcommands are live at your checkout.
vault --help # always works — shows subcommands live at your phase
# 4. Run the local API shim (requires a prior `vault build` so there's a vault.db)
# Serves the Worker endpoint surface from a local vault.db so you don't
# need a Cloudflare account to develop the site. Available from Phase 1 onward.
vault build # produces interviews/vault/vault.db
vault api --db vault/vault.db --port 8002 &   # path is relative to interviews/, the current directory
# 5. Run the site against your local API
cd staffml/
cp .env.example .env.local
# edit .env.local: NEXT_PUBLIC_VAULT_API=http://localhost:8002
pnpm install
pnpm dev
# visit http://localhost:3000
```
The goal is clone → question-visible in under 10 minutes on a fresh machine. If
it takes longer, file an issue titled "CONTRIBUTING.md getting-started friction".

---
## What can I contribute?
| Contribution type | Where | How |
|---|---|---|
| New question | `vault/questions/<track>/<level>/<zone>/` | `vault new` |
| Fix a question | same | `vault edit <id>` |
| Reclassify a question | same | `vault move <id> --to <track>/<level>/<zone>` |
| New topic | `vault/taxonomy.yaml` | PR with a §7 entry in EVOLUTION.md |
| Website UX | `interviews/staffml/src/` | Next.js; see AGENTS.md in staffml if present |
| Worker API | `interviews/staffml-vault-worker/` (Phase 3+) | Wrangler project |
| Schema evolution | `vault/schema/` | RFC-style PR per EVOLUTION.md |
| New `vault-cli` subcommand | `vault-cli/src/vault_cli/commands/` | Land with tests + docs update |
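The `<track>/<level>/<zone>` layout used by `vault new` and `vault move --to` can be pictured with a small path helper. This is only an illustrative sketch, not vault-cli's code: the function names `parse_move_target` and `question_dir` are hypothetical, and the placeholder values in the usage note are not real tracks, levels, or zones.

```python
from pathlib import PurePosixPath

# Hypothetical helpers; vault-cli's real implementation may differ.
QUESTIONS_ROOT = PurePosixPath("vault/questions")


def parse_move_target(target: str) -> tuple[str, str, str]:
    """Split a '<track>/<level>/<zone>' string into its three components."""
    parts = PurePosixPath(target).parts
    # Reject absolute paths, parent references, and wrong depth.
    if len(parts) != 3 or any(p in ("/", "..") for p in parts):
        raise ValueError(f"expected <track>/<level>/<zone>, got {target!r}")
    return parts[0], parts[1], parts[2]


def question_dir(track: str, level: str, zone: str) -> PurePosixPath:
    """Directory (relative to interviews/) that holds questions for this slot."""
    return QUESTIONS_ROOT / track / level / zone
```

For example, `question_dir(*parse_move_target("trackA/levelB/zoneC"))` yields `vault/questions/trackA/levelB/zoneC`, while a target such as `../trackA/levelB` raises `ValueError`.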
---
## Workflow
### Branching
- Start from `dev` for standalone work: `git checkout -b feat/short-description dev`.
- One logical change per branch. Atomic commits. No `git add -A` on vault changes.
- No `Co-Authored-By` tags, no "made with `<tool>`" footers. Commit messages read like regular engineering work.
### Before opening a PR
```bash
vault check --strict # invariants
pytest vault-cli/tests/ # unit + integration + contract
vault codegen --check # LinkML ↔ Pydantic/DDL/TS drift check (Phase 1+)
```
CI re-runs these. PRs are merge-blocked on red CI.
### PR review
- Corpus PRs: at least one maintainer review, CI green.
- Code PRs (vault-cli, worker): one review, CI green.
- Schema-evolution PRs: two reviews required once external-contributor onboarding opens (Phase 7+).
- Schema-breaking PRs must include a migration script under `vault-cli/migrations/`.
### Provenance honesty
Every question's `provenance` field must honestly reflect how it was made:
- `human` — written from scratch by a human.
- `llm-draft` — produced by `vault generate`; not yet human-reviewed.
- `llm-then-human-edited` — an LLM draft substantially revised by a human (the
common case). `generation_meta.human_reviewed_at` records when.
- `imported` — from an external source (e.g., book, published paper). Include
source in `tags`.

Misattributing LLM content as `human` is a correctness bug, not a style nit.
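The provenance rules above could be checked mechanically along these lines. This is a minimal sketch, not the check that `vault check` performs: the function name `provenance_problems` is hypothetical, questions are assumed to be plain dicts, and treating a missing `human_reviewed_at` or empty `tags` as an error is an assumption layered on top of the rules as written.

```python
# Allowed values, copied from the list above.
PROVENANCE_VALUES = {"human", "llm-draft", "llm-then-human-edited", "imported"}


def provenance_problems(question: dict) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    provenance = question.get("provenance")
    if provenance not in PROVENANCE_VALUES:
        return [f"unknown provenance: {provenance!r}"]

    problems = []
    meta = question.get("generation_meta") or {}
    # Assumption: a human-edited draft must carry its review timestamp.
    if provenance == "llm-then-human-edited" and not meta.get("human_reviewed_at"):
        problems.append("llm-then-human-edited requires generation_meta.human_reviewed_at")
    # Assumption: an imported question needs at least one tag naming its source.
    if provenance == "imported" and not question.get("tags"):
        problems.append("imported questions should name their source in tags")
    return problems
```

A check like this only catches structural inconsistencies. It cannot detect an LLM draft labeled `human`, which is why honest labeling remains the contributor's responsibility.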
### Author attribution
`vault new` populates `authors` from your `git config user.email` via
`vault/contributors.yaml` (mapping from email → handle). Submit a PR updating
that file to add yourself.
For external PRs, a commit signature (GPG/SSH) or a GitHub-verified email match
is required. CI rejects `authors:` claims that don't match the committer
identity.
---
## Style
- Keep questions focused on a **single concept**. "Good napkin math" is usually
a good sign; a "grab-bag of facts" is usually a smell.
- Use realistic hardware specs — check `mlsysbook/constants.py` and the
`vault/schema/models.yaml` registry for the canonical values.
- Paper-cite URL format only: `https://mlsysbook.ai/book/chapters/<slug>`.
- Scenarios are plaintext; solutions and napkin math can use restricted
Markdown + KaTeX.
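The paper-cite URL rule above lends itself to a simple pattern check. The sketch below is illustrative only: the name `is_paper_cite_url` is hypothetical, and the slug alphabet (lowercase letters, digits, single hyphens) is an assumption, since this guide does not define what a valid `<slug>` looks like.

```python
import re

# Assumption: slugs are lowercase alphanumerics separated by single hyphens.
PAPER_CITE_URL = re.compile(
    r"https://mlsysbook\.ai/book/chapters/[a-z0-9]+(?:-[a-z0-9]+)*"
)


def is_paper_cite_url(url: str) -> bool:
    """True when the whole string matches the documented chapter URL shape."""
    return PAPER_CITE_URL.fullmatch(url) is not None
```

Using `fullmatch` rather than `search` means trailing paths, query strings, and `http://` variants are all rejected.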
---
## Things that block external PRs from merging
1. **Provenance lie** — `vault mark-exemplar` or `vault promote --reviewed-by`
fields that don't match the git committer.
2. **Registry mutation** — any commit that deletes lines from
`id-registry.yaml`. The registry is append-only.
3. **Schema mixing** — questions at different `schema_version` in the same PR.
4. **Unsigned schema-evolution PR** — schema bumps require two maintainer
approvals.
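Blockers 2 and 3 are mechanical enough to sketch in a few lines. The code below is not the project's CI: both function names are hypothetical, the registry is compared as raw text lines rather than parsed YAML, and questions are assumed to be dicts carrying a `schema_version` key.

```python
from collections import Counter


def deleted_registry_lines(old_text: str, new_text: str) -> list[str]:
    """Lines of the old registry that no longer appear in the new one."""
    remaining = Counter(new_text.splitlines())
    deleted = []
    for line in old_text.splitlines():
        if remaining[line] > 0:
            remaining[line] -= 1  # consume one matching occurrence
        else:
            deleted.append(line)  # nothing left to match: the line was removed
    return deleted


def mixed_schema_versions(questions: list[dict]) -> set:
    """Set of versions if the questions disagree on schema_version, else empty."""
    versions = {q.get("schema_version") for q in questions}
    return versions if len(versions) > 1 else set()
```

Counting occurrences instead of using a plain set means that removing one of two identical lines is still reported. Note that this sketch flags deletions only; it does not flag reordering, which a stricter "new file must start with the old file" check would also catch.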
---
## Phase-by-phase scope
As of Phase 0, the vault pipeline is scaffolded but not operational end-to-end.
What works today:
- `vault --version`, `vault --help`.
- `pip install -e vault-cli/[dev]` and `pytest`.
- Documentation (ARCHITECTURE.md, REVIEWS.md, TESTING.md, EVOLUTION.md, this file).

What's coming per Phase (see [`vault/ARCHITECTURE.md`](vault/ARCHITECTURE.md) §14):
- **Phase 1**: `new`, `edit`, `move`, `rm`, `restore`, `build`, `check`, `serve`, `api`. YAML split lands.
- **Phase 2**: `publish` + primitives, paper-exporter rewrite, rollback-symmetry CI.
- **Phase 3**: D1 + Worker + `@staffml/vault-types` + FTS5 load-test gate.
- **Phase 4**: Website cutover + service worker + rollback drill.
- **Phase 5**: Chain pre-reveal indicator + instrumentation.
- **Phase 6**: About-page paper prominence.

External contributions to `vault/questions/` become feasible at Phase 1 exit.

---
## License
The corpus under `interviews/vault/` is licensed **CC-BY-NC-4.0** — see
[`vault/questions/LICENSE`](vault/questions/LICENSE). Summary:
- **Share and adapt for non-commercial purposes** with attribution.
- **Commercial use** (training a paid product on the corpus, selling access,
building paid services around it) requires separate written permission
from the copyright holders. Contact: `vjreddi@g.harvard.edu`.
- Contributions to the corpus are assumed to be offered under the same
CC-BY-NC-4.0 terms. Do not submit content you are not entitled to license
this way.

The `interviews/vault-cli/` tooling is a separate artifact; its license is
unchanged from the repository's historical state.

---
## Asking for help
- Architecture questions → read [`vault/ARCHITECTURE.md`](vault/ARCHITECTURE.md).
- "Why was X decided this way?" → check [`vault/REVIEWS.md`](vault/REVIEWS.md).
Most non-obvious decisions map to a reviewer finding.
- "Is this a bug or intended?" → open an issue with the command you ran and the
output.
---
**Thanks for contributing.**