
generate command #79


Open

pelikhan wants to merge 154 commits into main

Conversation

@pelikhan commented Jul 24, 2025

Implement PromptPex strategy to generate tests for prompts automatically.


🚀 PromptPex Power-Up: Smarter Prompt Test Generation, Output Cleaning, and Dev Experience Improvements

This PR supercharges the gh models CLI extension with a suite of new features and quality-of-life upgrades focused on prompt test generation, output handling, and developer workflow. Highlights include:

  • 🧪 PromptPex-Based Test Generation

    • New generate command leveraging the PromptPex framework for systematic, rules-driven prompt test creation.
    • Extensive documentation and examples to guide users through advanced prompt testing workflows.
  • 🧹 Model Output Cleaning Utilities

    • Added helpers to normalize and sanitize LLM-generated outputs, removing formatting artifacts and detecting refusal responses (see the sketch below).
    • Comprehensive unit tests ensure robust text cleaning.
  • 🧠 Context Persistence & Effort Configuration

    • Support for loading, saving, and merging prompt test contexts, enabling session persistence and collaborative editing.
    • Configurable "effort levels" to control the intensity and scope of test generation runs.
  • 🏆 Rules-Based Output Evaluation

    • Integrated rules compliance evaluator for automated, granular assessment of LLM outputs against prompt rules.
  • 🔬 Advanced Parsing & Robustness

    • Improved JSON extraction and normalization from LLM responses, handling markdown code blocks, JS-style concatenations, and malformed content.
    • Flexible rules parsing from varied list formats.
  • 🛠️ Utility Functions & CLI Enhancements

    • New helpers for string slicing, map merging, and template variable parsing.
    • Enhanced CLI output formatting and verbose logging for better feedback.
  • 🧪 Expanded Unit Test Coverage

    • Table-driven and scenario-based tests for flag parsing, context management, error handling, and end-to-end prompt rendering.
  • 📝 Developer Experience & CI Improvements

    • Updated Makefile with new linting targets and improved build workflows.
    • Rich project and AI agent instructions for onboarding and code consistency.
  • 🗂️ Prompt File Handling Upgrades

    • Cleaner YAML outputs with omitempty struct tags.
    • Round-trip file save support and improved type safety for test data.

These changes collectively deliver smarter, more reliable prompt test automation, improved output handling, and a better developer experience for working with LLM prompts and tests.

AI-generated content by prd may be incorrect.
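
To make the output-cleaning item concrete, here is a minimal sketch of a refusal detector of the kind described above; the function name and the specific heuristics are illustrative assumptions, not the PR's actual code:

```go
package generate

import "strings"

// isRefusalSketch is an illustrative stand-in for refusal detection:
// it normalizes the model output and checks it against a few common
// refusal openings.
func isRefusalSketch(output string) bool {
	normalized := strings.ToLower(strings.TrimSpace(output))
	refusalPrefixes := []string{
		"i can't", "i cannot", "i'm sorry", "i am sorry", "as an ai",
	}
	for _, prefix := range refusalPrefixes {
		if strings.HasPrefix(normalized, prefix) {
			return true
		}
	}
	return false
}
```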

pelikhan added 30 commits July 21, 2025 13:41
- Implement tests for Float32Ptr to validate pointer creation for float32 values.
- Create tests for ExtractJSON to ensure correct extraction of JSON from various input formats.
- Add tests for cleanJavaScriptStringConcat to verify string concatenation handling in JavaScript context.
- Introduce tests for StringSliceContains to check for string presence in slices.
- Implement tests for MergeStringMaps to validate merging behavior of multiple string maps, including overwrites and handling of nil/empty maps.
… tests in export_test.go

- Changed modelParams from pointer to value in toGitHubModelsPrompt function for better clarity and safety.
- Updated the assignment of ModelParameters to use the value directly instead of dereferencing a pointer.
- Introduced a new test suite in export_test.go to cover various scenarios for GitHub models evaluation generation, including edge cases and expected outputs.
- Ensured that the tests validate the correct creation of files and their contents based on the provided context and options.
- Added NewPromptPex function to create a new PromptPex instance.
- Implemented Run method to execute the PromptPex pipeline with context management.
- Created context from prompt files or loaded existing context from JSON.
- Developed pipeline steps including intent generation, input specification, output rules, and tests.
- Added functionality for generating groundtruth outputs and evaluating test results.
- Implemented test expansion and rating features for improved test coverage.
- Introduced error handling and logging throughout the pipeline execution.
- Implemented TestCreateContext to validate various prompt YAML configurations and their expected context outputs.
- Added TestCreateContextRunIDUniqueness to ensure unique RunIDs are generated for multiple context creations.
- Created TestCreateContextWithNonExistentFile to handle cases where the prompt file does not exist.
- Developed TestCreateContextPromptValidation to check for valid and invalid prompt formats.
- Introduced TestGithubModelsEvalsGenerate to test the generation of GitHub Models eval files with various scenarios.
- Added TestToGitHubModelsPrompt to validate the conversion of prompts to GitHub Models format.
- Implemented TestExtractTemplateVariables and TestExtractVariablesFromText to ensure correct extraction of template variables.
- Created TestGetMapKeys and TestGetTestScenario to validate utility functions related to maps and test scenarios.
…se and restore its implementation; remove obsolete promptpex.go and summary_test.go files
…covering various scenarios and error handling
…neFlags function and update flag parsing to use consistent naming
README.md Outdated
# Specify effort level (low, medium, high)
gh models generate --effort high my_prompt.prompt.yml

# Use a specific model for groundtruth generation
Member


I haven't seen the term groundtruth before. Is there any good resource we could include a link to, to explain the concept?

Author


It is used loosely in ML for an annotated dataset carrying the 'expected' labels; apparently the term is borrowed from meteorology. "Groundtruth" may be a bit strong here and we could reword it to something else. It computes the "expected" field in the LLM eval.

promptContent := RenderMessagesToString(context.Prompt.Messages)
rulesContent := strings.Join(context.Rules, "\n")

systemPrompt := fmt.Sprintf(`Your task is to very carefully and thoroughly evaluate the given output generated by a chatbot in <chatbot_output> to find out if it complies with its prompt and the output rules that are extracted from the description and provided to you in <output_rules>.
Member


Might be nice to have this prompt template in its own file, one that this function reads in, so it's not mixed in with Go.

Author


I'm not exactly sure how we handle resource files in Go, but yes, it would be much easier to keep the prompts in a separate file.

Author


@sgoedecke do you have an idea about resource files and rendering prompts from files? I did not see that in the initial promptpex-go impl.
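
For reference, Go 1.16+ supports resource files via the embed package, which compiles a file's contents into the binary at build time; a minimal sketch (the file path and variable name here are hypothetical):

```go
package generate

import (
	_ "embed"
)

// The //go:embed directive below bundles the prompt file into the
// compiled binary; prompts/rules_evaluator.txt is a hypothetical path.
//
//go:embed prompts/rules_evaluator.txt
var rulesEvaluatorSystemPrompt string
```

The embedded string could then replace the inline template currently passed to fmt.Sprintf.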

Comment on lines +236 to +237
system := `Based on the following <output_rules>, generate inverse rules that describe what would make an INVALID output.
These should be the opposite or negation of the original rules.`
Member


This description of what inverse rules are might be nice to have in the readme, too.

Author


Do you mean an example of prompt/rules/inverse rules in the readme?

pelikhan and others added 12 commits July 31, 2025 18:25
Co-authored-by: Sarah Vessels <[email protected]>
Author

@pelikhan left a comment


@copilot Analyze comments around documentation and apply fixes.

@pelikhan
Author

I addressed a few comments.

Copilot

This comment was marked as outdated.

Copilot AI review requested due to automatic review settings, August 2, 2025 16:26
Contributor

Copilot AI left a comment


Pull Request Overview

This PR implements a comprehensive PromptPex-based test generation system for the gh models CLI extension. The primary purpose is to add a new generate command that automatically creates systematic test cases for prompt files using AI-driven analysis and rule extraction.

  • Adds complete generate command with PromptPex methodology for automated prompt test generation
  • Refactors template variable parsing into shared utility functions for reuse across commands
  • Enhances prompt file structure with improved YAML serialization and test data handling

Reviewed Changes

Copilot reviewed 36 out of 37 changed files in this pull request and generated 4 comments.

Summary per file:

  • cmd/generate/*.go: complete implementation of the generate command with pipeline, parsing, rendering, and LLM integration
  • pkg/util/util.go: extracted shared template variable parsing functionality from the run command
  • pkg/prompt/prompt.go: enhanced prompt file structure with omitempty tags and a SaveToFile method
  • internal/azuremodels/*.go: added HTTP logging context support for debugging API calls
  • examples/test_generate.yml: example prompt file demonstrating generate command usage
  • cmd/run/run.go: updated to use the shared ParseTemplateVariables utility function
  • cmd/root.go: registered the new generate command in the CLI
Comments suppressed due to low confidence (1)

cmd/generate/utils.go:9

  • [nitpick] The function name 'ExtractJSON' is ambiguous - it's unclear whether it extracts JSON from text or converts content to JSON format. Consider renaming to 'ExtractJSONFromText' or 'CleanJSONContent' to clarify its purpose.
func ExtractJSON(content string) string {
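
As context for the naming discussion, a minimal sketch of the kind of fence-stripping such a helper performs (an illustration of the idea, not the PR's implementation):

```go
package generate

import "strings"

// stripFenceSketch removes a surrounding markdown code fence (three
// backticks, optionally with a language tag) so the inner JSON payload
// can be parsed.
func stripFenceSketch(content string) string {
	s := strings.TrimSpace(content)
	if !strings.HasPrefix(s, "```") {
		return s
	}
	// Drop the opening fence line, including any language tag.
	if i := strings.Index(s, "\n"); i >= 0 {
		s = s[i+1:]
	}
	return strings.TrimSpace(strings.TrimSuffix(strings.TrimSpace(s), "```"))
}
```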

for _, rawTest := range rawTests {
	test := PromptPexTest{}

	for _, key := range []string{"testInput", "testinput", "input"} {

Copilot AI Aug 2, 2025


This hardcoded slice of field name variants creates maintenance overhead and is error-prone. Consider defining these as package-level constants or using reflection to make the field mapping more explicit and maintainable.

Suggested change
- for _, key := range []string{"testInput", "testinput", "input"} {
+ for _, key := range testInputFieldNames {

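For concreteness, the package-level variable this suggestion presupposes could be declared like this (a sketch):

```go
// testInputFieldNames lists the accepted key variants for a test's
// input field, in lookup order.
var testInputFieldNames = []string{"testInput", "testinput", "input"}
```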

messages = append(messages,
	azuremodels.ChatMessage{Role: azuremodels.ChatMessageRoleUser, Content: &prompt},
)


Copilot AI Aug 2, 2025


[nitpick] This message construction pattern is repeated multiple times throughout the pipeline. Consider extracting a helper method like 'buildMessagesWithCustomInstruction(systemPrompt, userPrompt, customInstruction)' to reduce code duplication and improve maintainability.

Suggested change
var customInstruction *string
if h.options.Instructions != nil && h.options.Instructions.Tests != "" {
	customInstruction = &h.options.Instructions.Tests
}
messages := buildMessagesWithCustomInstruction(system, prompt, customInstruction)

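A sketch of what such a helper might look like, assuming the azuremodels types shown in the diff; the import path and the system-role constant are assumptions here, not confirmed by the PR:

```go
package generate

// Import path assumed from this repository's module layout.
import "github.com/github/gh-models/internal/azuremodels"

// buildMessagesWithCustomInstruction assembles the system and user
// messages, inserting an optional custom instruction as an extra
// system message.
func buildMessagesWithCustomInstruction(system, user string, custom *string) []azuremodels.ChatMessage {
	messages := []azuremodels.ChatMessage{
		{Role: azuremodels.ChatMessageRoleSystem, Content: &system},
	}
	if custom != nil && *custom != "" {
		messages = append(messages, azuremodels.ChatMessage{
			Role: azuremodels.ChatMessageRoleSystem, Content: custom,
		})
	}
	messages = append(messages, azuremodels.ChatMessage{
		Role: azuremodels.ChatMessageRoleUser, Content: &user,
	})
	return messages
}
```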

}
req.Model = parsedModel.String()

for attempt := 0; attempt <= maxRetries; attempt++ {

Copilot AI Aug 2, 2025


[nitpick] The retry logic uses a magic number for maxRetries (3) and doesn't implement exponential backoff for non-rate-limit errors. Consider making maxRetries configurable and implementing proper backoff strategies for better resilience.

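By way of illustration, a configurable retry with exponential backoff might look like this (a sketch of the strategy the comment suggests, not the PR's code):

```go
package generate

import (
	"fmt"
	"time"
)

// retryWithBackoff retries call up to maxRetries+1 times, sleeping
// 1s, 2s, 4s, ... between attempts.
func retryWithBackoff(maxRetries int, call func() error) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = call(); err == nil {
			return nil
		}
		if attempt < maxRetries {
			// Exponential backoff before the next attempt.
			time.Sleep(time.Duration(1<<attempt) * time.Second)
		}
	}
	return fmt.Errorf("all %d attempts failed: %w", maxRetries+1, err)
}
```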

	Reasoning string `json:"reasoning,omitempty" yaml:"reasoning,omitempty"`
	Scenario  string `json:"scenario,omitempty" yaml:"scenario,omitempty"`
}


Copilot AI Aug 2, 2025


[nitpick] The PromptPexTest struct mixes input fields (Input, Scenario, Reasoning) with output fields (Expected, Predicted). This coupling could lead to confusion about which fields are for test definition vs. test results. Consider separating these concerns into distinct types or clearly documenting the field purposes.

Suggested change
// PromptPexTestCase represents the input definition of a single test case
type PromptPexTestCase struct {
	Input     string `json:"input" yaml:"input"`
	Reasoning string `json:"reasoning,omitempty" yaml:"reasoning,omitempty"`
	Scenario  string `json:"scenario,omitempty" yaml:"scenario,omitempty"`
}

// PromptPexTestResult represents the output/result of a test case
type PromptPexTestResult struct {
	Expected  string `json:"expected,omitempty" yaml:"expected,omitempty"`
	Predicted string `json:"predicted,omitempty" yaml:"predicted,omitempty"`
}

// PromptPexTest combines test case definition and result (if needed)
type PromptPexTest struct {
	PromptPexTestCase
	PromptPexTestResult
}

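If the embedded-struct suggestion were adopted, Go's field promotion would keep existing call sites compiling unchanged; a short usage sketch:

```go
test := PromptPexTest{
	PromptPexTestCase:   PromptPexTestCase{Input: "2+2", Scenario: "math"},
	PromptPexTestResult: PromptPexTestResult{Expected: "4"},
}
// Promoted fields read as before: test.Input == "2+2", test.Expected == "4".
```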
