
generate command #79


Open · wants to merge 152 commits into main

Conversation


@pelikhan pelikhan commented Jul 24, 2025

Implement PromptPex strategy to generate tests for prompts automatically.


🚀 PromptPex Power-Up: Smarter Prompt Test Generation, Output Cleaning, and Dev Experience Improvements

This PR supercharges the gh models CLI extension with a suite of new features and quality-of-life upgrades focused on prompt test generation, output handling, and developer workflow. Highlights include:

  • 🧪 PromptPex-Based Test Generation

    • New generate command leveraging the PromptPex framework for systematic, rules-driven prompt test creation.
    • Extensive documentation and examples to guide users through advanced prompt testing workflows.
  • 🧹 Model Output Cleaning Utilities

    • Added helpers to normalize and sanitize LLM-generated outputs, removing formatting artifacts and detecting refusal responses.
    • Comprehensive unit tests ensure robust text cleaning.
  • 🧠 Context Persistence & Effort Configuration

    • Support for loading, saving, and merging prompt test contexts, enabling session persistence and collaborative editing.
    • Configurable "effort levels" to control the intensity and scope of test generation runs.
  • 🏆 Rules-Based Output Evaluation

    • Integrated rules compliance evaluator for automated, granular assessment of LLM outputs against prompt rules.
  • 🔬 Advanced Parsing & Robustness

    • Improved JSON extraction and normalization from LLM responses, handling markdown code blocks, JS-style concatenations, and malformed content.
    • Flexible rules parsing from varied list formats.
  • 🛠️ Utility Functions & CLI Enhancements

    • New helpers for string slicing, map merging, and template variable parsing.
    • Enhanced CLI output formatting and verbose logging for better feedback.
  • 🧪 Expanded Unit Test Coverage

    • Table-driven and scenario-based tests for flag parsing, context management, error handling, and end-to-end prompt rendering.
  • 📝 Developer Experience & CI Improvements

    • Updated Makefile with new linting targets and improved build workflows.
    • Rich project and AI agent instructions for onboarding and code consistency.
  • 🗂️ Prompt File Handling Upgrades

    • Cleaner YAML outputs with omitempty struct tags.
    • Round-trip file save support and improved type safety for test data.

These changes collectively deliver smarter, more reliable prompt test automation, improved output handling, and a better developer experience for working with LLM prompts and tests.

AI-generated content by prd may be incorrect.

pelikhan added 30 commits July 21, 2025 13:41
- Implement tests for Float32Ptr to validate pointer creation for float32 values.
- Create tests for ExtractJSON to ensure correct extraction of JSON from various input formats.
- Add tests for cleanJavaScriptStringConcat to verify string concatenation handling in JavaScript context.
- Introduce tests for StringSliceContains to check for string presence in slices.
- Implement tests for MergeStringMaps to validate merging behavior of multiple string maps, including overwrites and handling of nil/empty maps.
… tests in export_test.go

- Changed modelParams from pointer to value in toGitHubModelsPrompt function for better clarity and safety.
- Updated the assignment of ModelParameters to use the value directly instead of dereferencing a pointer.
- Introduced a new test suite in export_test.go to cover various scenarios for GitHub models evaluation generation, including edge cases and expected outputs.
- Ensured that the tests validate the correct creation of files and their contents based on the provided context and options.
- Added NewPromptPex function to create a new PromptPex instance.
- Implemented Run method to execute the PromptPex pipeline with context management.
- Created context from prompt files or loaded existing context from JSON.
- Developed pipeline steps including intent generation, input specification, output rules, and tests.
- Added functionality for generating groundtruth outputs and evaluating test results.
- Implemented test expansion and rating features for improved test coverage.
- Introduced error handling and logging throughout the pipeline execution.
- Implemented TestCreateContext to validate various prompt YAML configurations and their expected context outputs.
- Added TestCreateContextRunIDUniqueness to ensure unique RunIDs are generated for multiple context creations.
- Created TestCreateContextWithNonExistentFile to handle cases where the prompt file does not exist.
- Developed TestCreateContextPromptValidation to check for valid and invalid prompt formats.
- Introduced TestGithubModelsEvalsGenerate to test the generation of GitHub Models eval files with various scenarios.
- Added TestToGitHubModelsPrompt to validate the conversion of prompts to GitHub Models format.
- Implemented TestExtractTemplateVariables and TestExtractVariablesFromText to ensure correct extraction of template variables.
- Created TestGetMapKeys and TestGetTestScenario to validate utility functions related to maps and test scenarios.
…se and restore its implementation; remove obsolete promptpex.go and summary_test.go files
…covering various scenarios and error handling
…neFlags function and update flag parsing to use consistent naming
@pelikhan pelikhan marked this pull request as ready for review July 31, 2025 05:08
@pelikhan pelikhan requested a review from a team as a code owner July 31, 2025 05:08
@pelikhan pelikhan requested a review from Copilot July 31, 2025 05:19

@Copilot Copilot AI left a comment


Pull Request Overview

This PR implements advanced automated test generation for prompt files using the PromptPex methodology, adding a comprehensive generate CLI command with robust session management and extensive testing support.

  • Adds new generate command implementing the PromptPex strategy for systematic prompt test generation
  • Refactors template variable parsing from run command to shared utility function in pkg/util
  • Enhances prompt file structure with omitempty YAML tags and new test data types

Reviewed Changes

Copilot reviewed 40 out of 41 changed files in this pull request and generated 5 comments.

Summary per file:

  • cmd/generate/ — Complete implementation of PromptPex-based test generation with pipeline, utilities, and comprehensive tests
  • pkg/util/util.go — Moves template variable parsing to a shared utility function
  • pkg/prompt/prompt.go — Adds SaveToFile method and improves YAML serialization with omitempty tags
  • internal/azuremodels/ — Adds HTTP logging support for debugging API requests
  • cmd/run/run.go — Refactors to use the shared template variable parsing utility
Comments suppressed due to low confidence (1)

pkg/prompt/prompt.go:42

  • [nitpick] The type alias TestDataItem doesn't add clarity over the underlying type map[string]interface{}. Consider using the concrete type directly or adding meaningful methods to justify the alias.
type TestDataItem map[string]interface{}


@sgoedecke sgoedecke left a comment


I think this is looking great! Just some initial comments about files we may not need

README.md Outdated
# Specify effort level (low, medium, high)
gh models generate --effort high my_prompt.prompt.yml

# Use a specific model for groundtruth generation
Member

I haven't seen the term groundtruth before. Is there any good resource we could include a link to, to explain the concept?

Author

It is loosely used in ML for the annotated dataset carrying the 'expected' labels; apparently the term is borrowed from meteorology. Groundtruth may be a bit strong here and we could reword it to something else. It computes the "expected" field in the LLM eval.

promptContent := RenderMessagesToString(context.Prompt.Messages)
rulesContent := strings.Join(context.Rules, "\n")

systemPrompt := fmt.Sprintf(`Your task is to very carefully and thoroughly evaluate the given output generated by a chatbot in <chatbot_output> to find out if it complies with its prompt and the output rules that are extracted from the description and provided to you in <output_rules>.
Member

Might be nice to have this prompt template in its own file, one that this function reads in, so it's not mixed in with Go.

Author

I'm not exactly sure how we handle resource files in Go, but yes, it would be much easier to keep the prompts in a separate file.

Author

@sgoedecke do you have an idea about embedding resources and rendering prompts from files? I did not see that in the initial promptpex-go impl.
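For reference, Go 1.16+ supports compiling files into the binary via the standard embed package, which is the usual way to keep prompt templates out of .go source. A minimal sketch, assuming a hypothetical prompts/rules_evaluator.txt file alongside the package:

```go
package generate

import (
	_ "embed"
	"strings"
)

// The prompt text lives in its own file and is embedded at build time.
// (File name is hypothetical; it must exist for the package to compile.)
//
//go:embed prompts/rules_evaluator.txt
var rulesEvaluatorPrompt string

// renderRulesEvaluator fills a simple placeholder in the embedded template.
func renderRulesEvaluator(rules string) string {
	return strings.ReplaceAll(rulesEvaluatorPrompt, "{{rules}}", rules)
}
```

For multiple templates, `//go:embed prompts/*` with an `embed.FS` works the same way.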

Comment on lines +236 to +237
system := `Based on the following <output_rules>, generate inverse rules that describe what would make an INVALID output.
These should be the opposite or negation of the original rules.`
Member

This description of what inverse rules are might be nice to have in the readme, too.

Author

do you mean an example of prompt/rules/inverse rules in the readme?
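As a sketch of what such a README example could contain (illustrative content, not taken from the PR):

```text
Prompt: Summarize the article in valid JSON with keys "title" and "summary".

Rule:         The output must be valid JSON containing "title" and "summary".
Inverse rule: The output is not valid JSON, or omits "title" or "summary"
              (e.g. plain prose, or malformed/truncated JSON).
```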

pelikhan and others added 12 commits July 31, 2025 18:25
Co-authored-by: Sarah Vessels <[email protected]>

@pelikhan pelikhan left a comment


@copilot Analyze comments around documentation and apply fixes.

@pelikhan

I addressed a few of the comments.
