How do I export generated tests for CI (JSON, CSV, JSONL)?
Choose Export → Format and select JSON, JSONL, or CSV. Exports include input prompts, expected outputs or assertion rules, and metadata (template name, version, author). For CI, commit the exported file(s) to your repository and configure your test runner to load the test suite and call your model endpoint. The typical flow is: 1) add tests to the repo, 2) run an evaluation script in the CI job that loads the JSON/JSONL and asserts pass/fail, 3) surface failures as PR checks.
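The CI step in that flow can be sketched as a small harness. This is a minimal sketch: the `id`, `prompt`, and `expected` keys and the substring-match assertion are illustrative assumptions, not the tool's exact export schema; adapt them to the fields in your export.

```python
import json
import sys

def load_suite(path):
    """Load a JSONL export: one JSON test case per line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def evaluate(cases, call_model):
    """Run each case against the model and return the ids of failing cases.

    Assumes each case carries 'id', 'prompt', and an 'expected'
    substring -- adapt the keys to the actual exported schema.
    """
    failures = []
    for case in cases:
        output = call_model(case["prompt"])
        if case["expected"] not in output:
            failures.append(case["id"])
    return failures

# In the CI job, exit non-zero so the run is marked failed:
#   failures = evaluate(load_suite("tests/suite.jsonl"), call_model)
#   sys.exit(1 if failures else 0)
```

The non-zero exit code is what turns assertion failures into a failed PR check in most CI systems.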
Which model types and hosting setups can run tests generated here?
Exported test packages are provider-agnostic and can be executed against OpenAI-compatible endpoints, Anthropic-style APIs, other hosted provider APIs, or local/on-prem inference for Llama-style models. The exported format focuses on inputs and expected outputs, so your evaluation harness translates those into provider-specific calls.
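That translation layer is typically a small adapter per provider. A sketch, assuming a case with a single `prompt` field; the request-body shapes follow the general pattern of OpenAI- and Anthropic-style chat APIs, and the model names are placeholders:

```python
def to_openai_payload(case, model="MODEL_NAME"):
    """Map a provider-agnostic case to an OpenAI-style chat request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": case["prompt"]}],
    }

def to_anthropic_payload(case, model="MODEL_NAME"):
    """Map the same case to an Anthropic-style messages request body."""
    return {
        "model": model,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": case["prompt"]}],
    }

# One adapter per target; the harness picks by configuration.
ADAPTERS = {"openai": to_openai_payload, "anthropic": to_anthropic_payload}

def build_request(provider, case):
    return ADAPTERS[provider](case)
```

Keeping the adapters this thin means the same exported suite runs unchanged when you switch or add providers.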
Can I keep my test inputs private and generate tests locally or offline?
Yes. The generator supports privacy-conscious workflows: generate test suites in the browser and download exports locally, or use an offline build to produce packages without sending test content to external services. You decide whether to run evaluations locally, on-prem, or against hosted endpoints.
How should I define pass/fail criteria for subjective tasks like summarization?
Combine machine-evaluable assertions with human-review scaffolds. For summarization, use a mix of objective checks (contains main claim, preserves specific numeric facts) and a short rubric for human reviewers (scale for faithfulness, coverage, and hallucination). Export both assertions and the rubric so CI can run quick checks and route ambiguous cases to reviewers.
What templates are available for code generation and multi-turn dialogue tests?
Code templates include small, self-contained tasks with explicit function signatures, input-output examples, and unit-style assertions to mark pass/fail. Multi-turn dialogue templates include scripted user corrections and expected assistant acknowledgements to verify statefulness across turns. Each template contains example variants you can expand to increase coverage.
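A multi-turn template with a scripted user correction can be sketched as follows. The field names (`turns`, `expect`, `contains`, `not_contains`) are illustrative assumptions, not the tool's exact export schema:

```python
# A multi-turn dialogue case: the user corrects themselves mid-conversation,
# and the assistant must acknowledge the correction and drop the stale fact.
dialogue_case = {
    "id": "booking-correction-01",
    "turns": [
        {"role": "user", "content": "Book a table for 4 on Friday."},
        {"role": "assistant", "expect": {"contains": "Friday"}},
        {"role": "user", "content": "Sorry, I meant Saturday."},
        {"role": "assistant", "expect": {"contains": "Saturday",
                                         "not_contains": "Friday"}},
    ],
}

def check_turn(expect, output):
    """Evaluate one assistant turn against its expectation."""
    ok = expect.get("contains", "") in output
    if "not_contains" in expect:
        ok = ok and expect["not_contains"] not in output
    return ok
```

The `not_contains` check on the final turn is what catches a stateless model that keeps repeating the original, corrected detail.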
How do I convert a manual QA checklist into machine-evaluable assertions?
Map each checklist item to an assertion that can be evaluated automatically where possible. Example: a checklist item 'Return dates in ISO format' becomes an assertion that the assistant's output matches an ISO date regex. For subjective checks, include an explicit rubric and example good/bad outputs so human reviewers can apply consistent judgments.
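The ISO-date example from that mapping can be written directly as a pair of checks. A minimal sketch; the function names are my own:

```python
import re
from datetime import date

# Checklist item "Return dates in ISO format" as a regex assertion.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def contains_iso_date(output):
    """Pass if the output contains at least one YYYY-MM-DD date."""
    return bool(ISO_DATE.search(output))

def contains_valid_iso_date(output):
    """Stricter variant: the matched string must be a real calendar date."""
    m = ISO_DATE.search(output)
    if not m:
        return False
    try:
        date.fromisoformat(m.group())
        return True
    except ValueError:
        return False
```

The stricter variant matters in practice: a bare regex accepts impossible dates like 2024-13-40, while `date.fromisoformat` rejects them.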
Can generated tests be used for hiring or technical interviews? Any recommended best practices?
Yes: exported tests provide reproducibility and comparable evaluation conditions. Best practices: avoid leaking proprietary data, include clear instructions and time limits, pair automated checks with human review for creative tasks, and document the rubric and grading criteria alongside the test package to ensure fairness and reproducibility.
How do I integrate exported tests into GitHub Actions or other CI runners?
Add the exported JSON/JSONL files to your repository or fetch them as artifacts. In your workflow: install the evaluation harness, load the test suite, call the target model endpoint or local runner, evaluate assertions, and fail the job on unacceptable regressions. Keep model credentials in secure secrets and split long-running or expensive tests into separate CI stages to control cost.
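The "fail the job on unacceptable regressions" step is usually a small gating script that the workflow invokes after the evaluation run. A sketch, assuming you record a baseline pass rate on the main branch; the tolerance value is an arbitrary example:

```python
import sys

def gate(results, baseline_pass_rate, tolerance=0.02):
    """Return a CI exit code: 1 if the pass rate regressed beyond tolerance.

    results: list of booleans, one per test case.
    baseline_pass_rate: pass rate recorded on the main branch.
    """
    rate = sum(results) / len(results)
    if rate < baseline_pass_rate - tolerance:
        print(f"Regression: pass rate {rate:.1%} below baseline "
              f"{baseline_pass_rate:.1%}")
        return 1
    print(f"OK: pass rate {rate:.1%}")
    return 0

# In the workflow step, propagate the code so the job fails on regression:
#   sys.exit(gate(results, baseline_pass_rate))
```

Gating on a baseline with a tolerance, rather than requiring 100% pass, keeps noisy or borderline cases from blocking every PR while still catching real regressions.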