Skip to content

Fix flaky tests for TestQueryLog #16930

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

kapillamba4
Copy link
Contributor

@kapillamba4 kapillamba4 commented Jul 26, 2025

Closes #16843

  • util/testutil/port.go opens a listener on a random port, then immediately closes it.
  • The tests in TestQueryLog quickly attempt to bind to the same port.
  • TCP TIME_WAIT state keeps closed sockets unavailable for ~60 seconds by default
  • Tested this using act to simulate GitHub Actions workflow runs locally.

go-test:
name: Go tests - ${{ matrix.config.name }} - inst${{ matrix.instance }}
runs-on: ubuntu-latest
continue-on-error: true
strategy:
fail-fast: false
matrix:
config:
- name: "long-timeout"
tcp_fin_timeout: 1200
tcp_tw_reuse: 0
port_range: "60000 60999"
- name: "short-timeout"
tcp_fin_timeout: 1
tcp_tw_reuse: 0
port_range: "40000 60999"
instance: [1, 2, 3]
container:
image: quay.io/prometheus/golang-builder:1.24-base
options: >-
--sysctl net.ipv4.tcp_fin_timeout=${{ matrix.config.tcp_fin_timeout }}
--sysctl net.ipv4.tcp_tw_reuse=${{ matrix.config.tcp_tw_reuse }}
--sysctl net.ipv4.ip_local_port_range="${{ matrix.config.port_range }}"
--privileged
steps:
- uses: actions/checkout@v4
- name: Cache Go modules
uses: actions/cache@v4
with:
path: |
~/.cache/go-build
/go/pkg/mod
key: ${{ runner.os }}-go-${{ hashFiles('**/go.sum') }}
restore-keys: |
${{ runner.os }}-go-
- name: Download Go modules
run: go mod download
- name: Run Go tests - ${{ matrix.config.name }} - inst${{ matrix.instance }}
shell: bash
run: |
start_time=$(date +%s)
log_file="/tmp/test_output_${{ matrix.config.name }}_inst${{ matrix.instance }}.log"
touch "$log_file"
echo "Writing test logs to $log_file"
echo "TCP settings applied via Docker options:"
echo " FIN timeout: ${{ matrix.config.tcp_fin_timeout }}"
echo " TW reuse: ${{ matrix.config.tcp_tw_reuse }}"
echo " Port range: ${{ matrix.config.port_range }}"
for i in {1..10}; do
echo "Run #$i with ${{ matrix.config.name }}" | tee -a "$log_file"
go test --tags=dedupelabels -failfast -count=1 ./cmd/prometheus/... 2>&1 | tee -a "$log_file"
done
end_time=$(date +%s)
duration=$((end_time - start_time))
echo "Total test duration for ${{ matrix.config.name }}: ${duration} seconds" | tee -a "$log_file"

https://github.com/prometheus/prometheus/actions/runs/16550794764/job/46805296935
https://github.com/prometheus/prometheus/actions/runs/16550991675/job/46805798591
https://github.com/prometheus/prometheus/actions/runs/16551077064/job/46806002789

|         reload_test.go:194: time=2025-07-26T18:59:46.970Z level=INFO source=main.go:718 msg="No time or size retention was set so using the default time retention" duration=15d
|         reload_test.go:194: time=2025-07-26T18:59:46.970Z level=INFO source=main.go:769 msg="Starting Prometheus Server" mode=server version="(version=, branch=, revision=unknown)"
|         reload_test.go:194: time=2025-07-26T18:59:46.970Z level=INFO source=main.go:774 msg="operational information" build_context="(go=go1.24.5, platform=linux/amd64, user=, date=, tags=dedupelabels)" host_details="(Linux 6.8.0-76060800daily20240311-generic #202403110203~1714077665~22.04~4c8e9a0 SMP PREEMPT_DYNAMIC Thu A x86_64 pop-os (none))" fd_limits="(soft=1048576, hard=1048576)" vm_limits="(soft=unlimited, hard=unlimited)"
|         reload_test.go:194: time=2025-07-26T18:59:46.977Z level=INFO source=web.go:656 msg="Start listening for connections" component=web address=0.0.0.0:39871
|         reload_test.go:194: time=2025-07-26T18:59:46.977Z level=ERROR source=main.go:1054 msg="Unable to start web listener" err="listen tcp 0.0.0.0:39871: bind: address already in use"
|         main_test.go:777: 
|               Error Trace:    /home/kapil/git/prometheus/cmd/prometheus/main_test.go:777
|                                                       /home/kapil/git/prometheus/cmd/prometheus/main_test.go:793
|               Error:          Condition never satisfied
|               Test:           TestRuntimeGOGCConfig/empty_config_file_with_GOGC_env_var_set
| FAIL
| FAIL  github.com/prometheus/prometheus/cmd/prometheus 21.039s
| FAIL
|         reload_test.go:194: time=2025-07-27T04:50:49.292Z level=INFO source=main.go:774 msg="operational information" build_context="(go=go1.24.5, platform=linux/amd64, user=, date=, tags=dedupelabels)" host_details="(Linux 6.8.0-76060800daily20240311-generic #202403110203~1714077665~22.04~4c8e9a0 SMP PREEMPT_DYNAMIC Thu A x86_64 pop-os (none))" fd_limits="(soft=1048576, hard=1048576)" vm_limits="(soft=unlimited, hard=unlimited)"
|         reload_test.go:194: time=2025-07-27T04:50:49.299Z level=INFO source=web.go:656 msg="Start listening for connections" component=web address=0.0.0.0:40603
|         reload_test.go:194: time=2025-07-27T04:50:49.299Z level=ERROR source=main.go:1054 msg="Unable to start web listener" err="listen tcp 0.0.0.0:40603: bind: address already in use"
|         main_test.go:789: 
|               Error Trace:    /home/kapil/git/prometheus/cmd/prometheus/main_test.go:789
|                                                       /home/kapil/git/prometheus/cmd/prometheus/main_test.go:793
|               Error:          Not equal: 
|                               expected: 77
|                               actual  : 75
|               Test:           TestRuntimeGOGCConfig/gogc_set_through_config_and_env_var
| FAIL
| FAIL  github.com/prometheus/prometheus/cmd/prometheus 24.087s
| FAIL
time=2025-07-27T11:56:58.334Z level=INFO source=main.go:677 msg="Leaving GOMAXPROCS=4: CPU quota undefined" component=automaxprocs
            time=2025-07-27T11:56:58.334Z level=INFO source=memlimit.go:198 msg="GOMEMLIMIT is updated" component=automemlimit package=github.com/KimMachineGun/automemlimit/memlimit GOMEMLIMIT=15095321395 previous=9223372036854775807
            time=2025-07-27T11:56:58.334Z level=INFO source=main.go:[71](https://github.com/prometheus/prometheus/actions/runs/16550794764/job/46805296935#step:6:72)9 msg="No time or size retention was set so using the default time retention" duration=15d
            time=2025-07-27T11:56:58.334Z level=INFO source=main.go:770 msg="Starting Prometheus Server" mode=server version="(version=, branch=, revision=unknown)"
            time=2025-07-27T11:56:58.335Z level=INFO source=main.go:775 msg="operational information" build_context="(go=go1.24.5, platform=linux/amd64, user=, date=, tags=dedupelabels)" host_details="(Linux 6.11.0-1018-azure #18~24.04.1-Ubuntu SMP Sat Jun 28 04:46:03 UTC 2025 x86_64 ba1d7f3b3284 (none))" fd_limits="(soft=1048576, hard=1048576)" vm_limits="(soft=unlimited, hard=unlimited)"
            time=2025-07-27T11:56:58.336Z level=INFO source=web.go:659 msg="Start listening for connections" component=web address=[::1]:60359
            time=2025-07-27T11:56:58.336Z level=ERROR source=main.go:1055 msg="Unable to start web listener" err="listen tcp [::1]:60359: bind: address already in use"
            
        query_log_test.go:125: 
            	Error Trace:	/__w/prometheus/prometheus/cmd/prometheus/query_log_test.go:125
            	            				/__w/prometheus/prometheus/cmd/prometheus/query_log_test.go:329
            	            				/__w/prometheus/prometheus/cmd/prometheus/query_log_test.go:4[73](https://github.com/prometheus/prometheus/actions/runs/16550794764/job/46805296935#step:6:74)
            	Error:      	Received unexpected error:
            	            	Get "http://[::1]:60359/api/v1/rules": dial tcp [::1]:60359: connect: connection refused
            	Test:       	TestQueryLog/rule_queries,_[::1]:60359

@kapillamba4 kapillamba4 changed the title Mitigate port allocation race condition in tests [DRAFT] Mitigate port allocation race condition in tests Jul 26, 2025
@kapillamba4 kapillamba4 marked this pull request as draft July 26, 2025 20:19
@kapillamba4 kapillamba4 changed the title [DRAFT] Mitigate port allocation race condition in tests Mitigate port allocation race condition in tests Jul 26, 2025
container:
image: quay.io/prometheus/golang-builder:1.24-base
steps:
- uses: actions/checkout@v4

Check warning

Code scanning / Scorecard

Pinned-Dependencies Medium

score is 7: GitHub-owned GitHubAction not pinned by hash
Click Remediation section below to solve this issue
- uses: actions/checkout@v4

- name: Cache Go modules
uses: actions/cache@v4

Check warning

Code scanning / Scorecard

Pinned-Dependencies Medium

score is 7: GitHub-owned GitHubAction not pinned by hash
Click Remediation section below to solve this issue
Signed-off-by: Kapil Lamba <[email protected]>
Signed-off-by: Kapil Lamba <[email protected]>
Signed-off-by: Kapil Lamba <[email protected]>
@kapillamba4 kapillamba4 changed the title Mitigate port allocation race condition in tests Fix flaky tests for TestQueryLog Jul 27, 2025
Signed-off-by: Kapil Lamba <[email protected]>
Signed-off-by: Kapil Lamba <[email protected]>
Signed-off-by: Kapil Lamba <[email protected]>
Signed-off-by: Kapil Lamba <[email protected]>
@kapillamba4 kapillamba4 marked this pull request as ready for review July 27, 2025 12:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

TestQueryLog is flaky
1 participant