generate command #79

Conversation
- Implement tests for Float32Ptr to validate pointer creation for float32 values.
- Create tests for ExtractJSON to ensure correct extraction of JSON from various input formats.
- Add tests for cleanJavaScriptStringConcat to verify string concatenation handling in JavaScript context.
- Introduce tests for StringSliceContains to check for string presence in slices.
- Implement tests for MergeStringMaps to validate merging behavior of multiple string maps, including overwrites and handling of nil/empty maps.
…ove unused ChatMessage type
…Pex context conversion
… tests in export_test.go
- Changed modelParams from pointer to value in toGitHubModelsPrompt function for better clarity and safety.
- Updated the assignment of ModelParameters to use the value directly instead of dereferencing a pointer.
- Introduced a new test suite in export_test.go to cover various scenarios for GitHub Models evaluation generation, including edge cases and expected outputs.
- Ensured that the tests validate the correct creation of files and their contents based on the provided context and options.
- Added NewPromptPex function to create a new PromptPex instance.
- Implemented Run method to execute the PromptPex pipeline with context management.
- Created context from prompt files or loaded existing context from JSON.
- Developed pipeline steps including intent generation, input specification, output rules, and tests.
- Added functionality for generating groundtruth outputs and evaluating test results.
- Implemented test expansion and rating features for improved test coverage.
- Introduced error handling and logging throughout the pipeline execution.
- Implemented TestCreateContext to validate various prompt YAML configurations and their expected context outputs.
- Added TestCreateContextRunIDUniqueness to ensure unique RunIDs are generated for multiple context creations.
- Created TestCreateContextWithNonExistentFile to handle cases where the prompt file does not exist.
- Developed TestCreateContextPromptValidation to check for valid and invalid prompt formats.
- Introduced TestGithubModelsEvalsGenerate to test the generation of GitHub Models eval files with various scenarios.
- Added TestToGitHubModelsPrompt to validate the conversion of prompts to GitHub Models format.
- Implemented TestExtractTemplateVariables and TestExtractVariablesFromText to ensure correct extraction of template variables.
- Created TestGetMapKeys and TestGetTestScenario to validate utility functions related to maps and test scenarios.
…tPex configuration
… summary generation
… improved summary reporting
…se and restore its implementation; remove obsolete promptpex.go and summary_test.go files
…covering various scenarios and error handling
…entiment analysis test prompt
…neFlags function and update flag parsing to use consistent naming
… in generate_test.go
…ck responses for sentiment analysis stages
…odology for test generation
…derMessagesToString for message formatting
README.md (outdated)

```sh
# Specify effort level (low, medium, high)
gh models generate --effort high my_prompt.prompt.yml

# Use a specific model for groundtruth generation
```
I haven't seen the term groundtruth before. Is there any good resource we could include a link to, to explain the concept?
It is vaguely used in ML for the annotated dataset with the 'expected' label. Apparently, it is borrowed from meteorology. Groundtruth may be a bit strong here; we could reword it to something else. It computes the "expected" field in the LLM eval.
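For instance (an illustrative sketch, not output from this PR), a generated eval entry would pair a test input with its groundtruth completion:

```yaml
# Hypothetical .prompt.yml test-data entry; field names and values
# are invented for illustration.
testData:
  - input: "I love this product"
    expected: "positive"
```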
```go
promptContent := RenderMessagesToString(context.Prompt.Messages)
rulesContent := strings.Join(context.Rules, "\n")

systemPrompt := fmt.Sprintf(`Your task is to very carefully and thoroughly evaluate the given output generated by a chatbot in <chatbot_output> to find out if it complies with its prompt and the output rules that are extracted from the description and provided to you in <output_rules>.
```
Might be nice to have this prompt template in its own file, one that this function reads in, so it's not mixed in with Go.
I'm not exactly sure how we handle resource files in Go, but yes, it would be much easier to keep the prompts in a separate file.
@sgoedecke do you have an idea about resources and rendering prompts from files? I did not see that in the initial promptpex-go impl.
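For what it's worth, Go's standard `embed` package can bundle a template file into the binary at build time; a minimal sketch, assuming a hypothetical file layout and Sprintf-style placeholders (none of this is in the PR):

```go
package generate

import (
	_ "embed"
	"fmt"
)

// The prompt template lives in its own file and is compiled into the
// binary via go:embed (the path here is hypothetical).
//go:embed prompts/evaluate_output.txt
var evaluateOutputPrompt string

// buildEvalSystemPrompt fills the embedded template with the
// extracted rules content.
func buildEvalSystemPrompt(rulesContent string) string {
	return fmt.Sprintf(evaluateOutputPrompt, rulesContent)
}
```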
```go
system := `Based on the following <output_rules>, generate inverse rules that describe what would make an INVALID output.
These should be the opposite or negation of the original rules.`
```
This description of what inverse rules are might be nice to have in the readme, too.
do you mean an example of prompt/rules/inverse rules in the readme?
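Something concrete for the readme could be (an invented illustration, riffing on the sentiment-analysis example prompt elsewhere in this PR): rule: "The output is exactly one of: positive, negative, or neutral." → inverse rule: "The output contains more than one word, or a label outside the allowed set."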
Co-authored-by: Sarah Vessels <[email protected]>
@copilot Analyze comments around documentation and apply fixes.
…clarify effort flag usage
I addressed a few comments.
Pull Request Overview

This PR implements a comprehensive PromptPex-based test generation system for the gh models CLI extension. The primary purpose is to add a new generate command that automatically creates systematic test cases for prompt files using AI-driven analysis and rule extraction.

- Adds a complete generate command with PromptPex methodology for automated prompt test generation
- Refactors template variable parsing into shared utility functions for reuse across commands
- Enhances prompt file structure with improved YAML serialization and test data handling

Reviewed Changes

Copilot reviewed 36 out of 37 changed files in this pull request and generated 4 comments.
File | Description
---|---
cmd/generate/*.go | Complete implementation of generate command with pipeline, parsing, rendering, and LLM integration
pkg/util/util.go | Extracted shared template variable parsing functionality from run command
pkg/prompt/prompt.go | Enhanced prompt file structure with omitempty tags and SaveToFile method
internal/azuremodels/*.go | Added HTTP logging context support for debugging API calls
examples/test_generate.yml | Example prompt file demonstrating generate command usage
cmd/run/run.go | Updated to use shared ParseTemplateVariables utility function
cmd/root.go | Registered new generate command in CLI
Comments suppressed due to low confidence (1)
cmd/generate/utils.go:9
- [nitpick] The function name 'ExtractJSON' is ambiguous - it's unclear whether it extracts JSON from text or converts content to JSON format. Consider renaming to 'ExtractJSONFromText' or 'CleanJSONContent' to clarify its purpose.
func ExtractJSON(content string) string {
```go
for _, rawTest := range rawTests {
	test := PromptPexTest{}

	for _, key := range []string{"testInput", "testinput", "input"} {
```
This hardcoded slice of field name variants creates maintenance overhead and is error-prone. Consider defining these as package-level constants or using reflection to make the field mapping more explicit and maintainable.
Suggested change:

```diff
-for _, key := range []string{"testInput", "testinput", "input"} {
+for _, key := range testInputFieldNames {
```
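Under that suggestion, the variants would live in one place, e.g. (the name `testInputFieldNames` comes from the suggestion; its placement is assumed):

```go
// testInputFieldNames lists the accepted JSON field names for a
// test's input, consolidated from the previously inlined slice.
var testInputFieldNames = []string{"testInput", "testinput", "input"}
```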
```go
messages = append(messages,
	azuremodels.ChatMessage{Role: azuremodels.ChatMessageRoleUser, Content: &prompt},
)
```
[nitpick] This message construction pattern is repeated multiple times throughout the pipeline. Consider extracting a helper method like 'buildMessagesWithCustomInstruction(systemPrompt, userPrompt, customInstruction)' to reduce code duplication and improve maintainability.
Suggested change:

```go
var customInstruction *string
if h.options.Instructions != nil && h.options.Instructions.Tests != "" {
	customInstruction = &h.options.Instructions.Tests
}
messages := buildMessagesWithCustomInstruction(system, prompt, customInstruction)
```
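A minimal sketch of what that helper could look like, meant to live in the same package as the snippet above (a `ChatMessageRoleSystem` constant is assumed to exist alongside the `ChatMessageRoleUser` shown in the diff; this code is not part of the PR):

```go
// buildMessagesWithCustomInstruction assembles the repeated message
// shape: system prompt, optional custom instruction, then user prompt.
func buildMessagesWithCustomInstruction(system, user string, custom *string) []azuremodels.ChatMessage {
	messages := []azuremodels.ChatMessage{
		{Role: azuremodels.ChatMessageRoleSystem, Content: &system},
	}
	if custom != nil {
		messages = append(messages, azuremodels.ChatMessage{
			Role: azuremodels.ChatMessageRoleSystem, Content: custom,
		})
	}
	messages = append(messages, azuremodels.ChatMessage{
		Role: azuremodels.ChatMessageRoleUser, Content: &user,
	})
	return messages
}
```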
```go
}
req.Model = parsedModel.String()

for attempt := 0; attempt <= maxRetries; attempt++ {
```
[nitpick] The retry logic uses a magic number for maxRetries (3) and doesn't implement exponential backoff for non-rate-limit errors. Consider making maxRetries configurable and implementing proper backoff strategies for better resilience.
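One way to address both points (a sketch under assumed names; `callWithRetry` and its shape are illustrative, not from this PR):

```go
package generate

import (
	"fmt"
	"time"
)

// callWithRetry retries do() with exponential backoff. maxRetries
// would come from configuration rather than a magic number.
func callWithRetry(do func() error, maxRetries int) error {
	backoff := time.Second
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = do(); err == nil {
			return nil
		}
		time.Sleep(backoff)
		backoff *= 2 // double the wait between attempts
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxRetries+1, err)
}
```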
```go
	Reasoning string `json:"reasoning,omitempty" yaml:"reasoning,omitempty"`
	Scenario  string `json:"scenario,omitempty" yaml:"scenario,omitempty"`
}
```
[nitpick] The PromptPexTest struct mixes input fields (Input, Scenario, Reasoning) with output fields (Expected, Predicted). This coupling could lead to confusion about which fields are for test definition vs. test results. Consider separating these concerns into distinct types or clearly documenting the field purposes.
Suggested change:

```go
// PromptPexTestCase represents the input definition of a single test case
type PromptPexTestCase struct {
	Input     string `json:"input" yaml:"input"`
	Reasoning string `json:"reasoning,omitempty" yaml:"reasoning,omitempty"`
	Scenario  string `json:"scenario,omitempty" yaml:"scenario,omitempty"`
}

// PromptPexTestResult represents the output/result of a test case
type PromptPexTestResult struct {
	Expected  string `json:"expected,omitempty" yaml:"expected,omitempty"`
	Predicted string `json:"predicted,omitempty" yaml:"predicted,omitempty"`
}

// PromptPexTest combines test case definition and result (if needed)
type PromptPexTest struct {
	PromptPexTestCase
	PromptPexTestResult
}
```
Implement PromptPex strategy to generate tests for prompts automatically.
🚀 PromptPex Power-Up: Smarter Prompt Test Generation, Output Cleaning, and Dev Experience Improvements

This PR supercharges the gh models CLI extension with a suite of new features and quality-of-life upgrades focused on prompt test generation, output handling, and developer workflow. Highlights include:

🧪 PromptPex-Based Test Generation
- generate command leveraging the PromptPex framework for systematic, rules-driven prompt test creation.

🧹 Model Output Cleaning Utilities

🧠 Context Persistence & Effort Configuration

🏆 Rules-Based Output Evaluation

🔬 Advanced Parsing & Robustness

🛠️ Utility Functions & CLI Enhancements

🧪 Expanded Unit Test Coverage

📝 Developer Experience & CI Improvements

🗂️ Prompt File Handling Upgrades
- omitempty struct tags.

These changes collectively deliver smarter, more reliable prompt test automation, improved output handling, and a better developer experience for working with LLM prompts and tests.
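For quick reference, invocation looks like this (the `--effort` example is taken from the README excerpt earlier in this thread; the bare form is assumed):

```sh
# generate tests for a prompt file using PromptPex
gh models generate my_prompt.prompt.yml

# specify effort level (low, medium, high)
gh models generate --effort high my_prompt.prompt.yml
```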