
Use `px/agent_status_diagnostics` script within px cli to detect missing kernel headers by ddelnano · Pull Request #2065 · pixie-io/pixie

@ddelnano mentioned this pull request

Dec 18, 2024
kernel headers

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

@ddelnano ddelnano marked this pull request as ready for review

December 24, 2024 14:18

ddelnano added a commit that referenced this pull request

Jan 6, 2025
Summary: Add px/agent_status_diagnostics pxl script

This PR adds a new pxl script that helps identify if Linux headers are
missing on any PEMs. This can be extended in the future, but the first
use case will be to execute the script during `px deploy` and `px
collect-logs`.

Relevant Issues: #2051

Type of change: /kind feature

Test Plan: Ran the script in the UI and tested it as part of the
validation for #2065

![Screen Shot 2024-12-18 at 10 51 15
AM](https://github.com/user-attachments/assets/bd4ab6de-ad92-4af3-8714-b044a7df7d65)


Changelog Message: Add `px/agent_status_diagnostics` pxl script for
checking common issues

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

aimichelle

@ddelnano ddelnano deleted the ddelnano/use-agent-diagnostics-px-deploy-collect-logs branch

January 6, 2025 23:10

ddelnano added a commit to ddelnano/pixie that referenced this pull request

Jan 16, 2025

ddelnano added a commit to ddelnano/pixie that referenced this pull request

Jan 16, 2025
…ect missing kernel headers (pixie-io#2065)"

This reverts commit 3c9c4bd.

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

aimichelle pushed a commit that referenced this pull request

Jan 16, 2025
…ect missing kernel headers (#2065)" (#2081)

Summary: Revert "Use `px/agent_status_diagnostics` script within px cli
to detect missing kernel headers (#2065)"

This reverts commit 3c9c4bd. While
testing the latest cloud release, I noticed that the functionality added
in #2065 can cause `px deploy` to hang indefinitely at the "Wait for
healthcheck" step. The deploy will finish successfully, but it's a poor
user experience.

There seems to be a goroutine blocking issue that is dependent on
cluster size. On the 1 and 2 node clusters that I tested #2065 on, the issue
doesn't surface. However, the issue reproduces reliably on the larger
clusters that have Pixie deployed. Let's revert and release a new cli
version once this is tracked down.

Relevant Issues: #2051

Type of change: /kind bugfix

Test Plan: N/A

Changelog Message: Reverted the recent advanced diagnostics added during
`px deploy`, as in some cases they caused `px deploy` to
hang

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

ddelnano added a commit to ddelnano/pixie that referenced this pull request

Jan 17, 2025
…ing kernel headers (pixie-io#2065)

Summary: Use `px/agent_status_diagnostics` script within px cli to
detect missing kernel headers

This PR leverages the script added in pixie-io#2064 to detect missing kernel
headers during cli deploys and `px collect-logs` commands. This solves
2/3 of the use cases I was hoping to identify for pixie-io#2051 (the last being
helm installs).

A recent example of this problem is
pixie-io#1986, where a Go TLS tracing
bug went undiagnosed for months (August to December). Amazon Linux
2023's headers are different enough that they break Go TLS tracing when
pixie's pre-packaged headers are used. The tooling in this PR would have
provided a few opportunities for this to be caught.

Relevant Issues: pixie-io#2051

Type of change: /kind feature

Test Plan: Verified the following scenarios
<details><summary>Test cases</summary>

- [x] `px collect-logs` works against a cloud that doesn't have a
`px/agent_status_diagnostics` script
```
$ bazel run -c opt  --stamp src/pixie_cli:px -- collect-logs

WARN[0006] healthcheck script detected the following warnings:  error="Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes."
Logs written to pixie_logs_20241223165214.zip

# zip file contains px/agent_status output
$ cat px_agent_diagnostics.txt
{"_tableName_":"output","agent_id":"07fb4d26-3b53-4ba7-9bb7-f2cb10a1e63d","asid":79,"hostname":"gke-dev-ddelnano1-default-pool-b099382d-30mu","ip_address":"","agent_state":"AGENT_STATE_HEALTHY","create_time":"2024-12-18T12:43:44.41952403Z","last_heartbeat_ns":4303060450,"kernel_headers_installed":true}
```
- [x] `px collect-logs` works against a cloud that does have a
`px/agent_status_diagnostics` script
```
$ bazel run  src/pixie_cli:px -- collect-logs
INFO: Analyzed target //src/pixie_cli:px (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //src/pixie_cli:px up-to-date:
  bazel-bin/src/pixie_cli/px_/px
INFO: Elapsed time: 4.240s, Critical Path: 3.89s
INFO: 3 processes: 1 internal, 2 linux-sandbox.
INFO: Build completed successfully, 3 total actions
INFO: Running command line: bazel-bin/src/pixie_cli/px_/px collect-logs
Pixie CLI
*******************************
* ENV VARS
*        PX_CLOUD_ADDR=testing.getcosmic.ai:443
*******************************
Logs written to pixie_logs_20241218164734.zip

$ cat px_agent_diagnostics.txt
{"_tableName_":"output","headers_installed_percent":1}
```
- [x] `px collect-logs` identifies when kernel headers are missing when
`px/agent_status_diagnostics` is present
```
$ Logs written to pixie_logs_20241223165214.zip
$ bazel run -c opt  --stamp src/pixie_cli:px -- --bundle https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json collect-logs
[ ... ]
WARN[0012] healthcheck script detected the following warnings:  error="Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes."

$ cat px_agent_diagnostics.txt
{"_tableName_":"output","headers_installed_percent":0.5}
```

- [x] Artificially forcing context deadline (timeout) results in an
error
```
$ git diff
diff --git a/src/pixie_cli/pkg/vizier/script.go b/src/pixie_cli/pkg/vizier/script.go
index 7d3b7e0..c957b8943 100644
--- a/src/pixie_cli/pkg/vizier/script.go
+++ b/src/pixie_cli/pkg/vizier/script.go
@@ -317,7 +317,7 @@ func RunSimpleHealthCheckScript(br *script.BundleManager, cloudAddr string, clus
                execScript = br.MustGetScript(script.AgentStatusScript)
        }

-       ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+       ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)

$ bazel run  src/pixie_cli:px -- collect-logs

WARN[0012]src/pixie_cli/pkg/vizier/logs.go:135 px.dev/pixie/src/pixie_cli/pkg/vizier.(*LogCollector).CollectPixieLogs() failed to run health check script             error="context deadline exceeded"
Logs written to pixie_logs_20241218165033.zip
```
- [x] `px collect-logs` prompts auth flow when credentials don't match
current cloud
```
$ PX_CLOUD_ADDR=new-cloud bazel run  src/pixie_cli:px -- collect-logs
*******************************
* ENV VARS
*        PX_CLOUD_ADDR=new-cloud
*******************************
Failed to authenticate. Please retry `px auth login`.
```

- [x] `px deploy` on pre v0.14.14 (older) vizier with existing bundle
warns that kernel headers should be installed
```
# Additional flags provided to speed up vizier bootstrapping
$ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key='<deploy key>' --deploy_olm=false --olm_namespace=olm --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json
```

- [x] `px deploy` on pre v0.14.14 (older) vizier with latest bundle
warns that kernel headers should be installed
```
# Additional flags provided to speed up vizier bootstrapping
$ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key='<deploy key>' --deploy_olm=false --olm_namespace=olm --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json

[ ... ]
Waiting for Pixie to pass healthcheck
 ✔    Wait for PEMs/Kelvin
 ✔    Wait for PEMs/Kelvin
 ✕    Wait for healthcheck  ERR: Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes.
Pixie healthcheck detected the following warnings: error=Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes.

[ ...]
```

- [x] `px deploy` on v0.14.14 vizier with latest bundle warns
appropriately when kernel headers are missing
```
$ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key=<deploy key> --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json -v 0.14.14-pre-r1.0

[ ... ]
Waiting for Pixie to pass healthcheck
 ✔    Wait for PEMs/Kelvin
 ✕    Wait for healthcheck  ERR: Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes.
Pixie healthcheck detected the following warnings: error=Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes.
```

</details>
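
The `px_agent_diagnostics.txt` rows in the test cases above suggest how the warning could be derived from the script output. The following is a minimal sketch, not the CLI's actual implementation: the names `diagnosticsRow`, `checkHeaders`, `errCheckUnavailable` and `errMissingHeaders` are hypothetical, and it assumes that a row without `headers_installed_percent` means the check is unavailable and that any value below 1 means at least one node is missing headers.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Hypothetical sentinel errors; the wording paraphrases the warnings
// shown in the test logs above.
var (
	errCheckUnavailable = errors.New("unable to detect if the cluster's nodes have the distro kernel headers installed")
	errMissingHeaders   = errors.New("detected missing kernel headers on your cluster's nodes")
)

// diagnosticsRow models one JSON line of script output. A pointer is used
// so that an absent field can be told apart from a value of 0.
type diagnosticsRow struct {
	HeadersInstalledPercent *float64 `json:"headers_installed_percent"`
}

// checkHeaders inspects a single JSON row and returns a warning error when
// the headers check is unavailable or reports missing headers.
func checkHeaders(line []byte) error {
	var row diagnosticsRow
	if err := json.Unmarshal(line, &row); err != nil {
		return err
	}
	if row.HeadersInstalledPercent == nil {
		// Assumption: rows from px/agent_status lack this field.
		return errCheckUnavailable
	}
	if *row.HeadersInstalledPercent < 1 {
		// Assumption: anything below 1 means some node lacks headers.
		return errMissingHeaders
	}
	return nil
}

func main() {
	rows := []string{
		`{"_tableName_":"output","headers_installed_percent":1}`,
		`{"_tableName_":"output","headers_installed_percent":0.5}`,
		`{"_tableName_":"output","agent_state":"AGENT_STATE_HEALTHY"}`,
	}
	for _, r := range rows {
		fmt.Printf("%v\n", checkHeaders([]byte(r)))
	}
}
```

Treating a missing field as a warning rather than a hard error matches the test cases above, where `px collect-logs` still writes the log archive after warning.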

Changelog Message: Enhanced the `px` cli's `deploy` and `collect-logs`
commands to surface when kernel headers aren't installed. This is a
common source of bugs that can only be addressed by installing your
distro's kernel headers.

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
(cherry picked from commit 3c9c4bd)

aimichelle pushed a commit that referenced this pull request

Jan 17, 2025
… missing kernel headers (#2091)

Summary: Re-introduce use of `px/agent_status_diagnostics` in px cli to
detect missing kernel headers

The original version of this was reverted since it resulted in hung `px
deploy` and `px collect-logs` commands on larger clusters. This PR
reintroduces the change with the fixes necessary to prevent the previous
issue and is best reviewed commit by commit as outlined below:

Commit 1: Cherry-pick of #2065
Commit 2: Fix for goroutine deadlock
Commit 3: Add bundle flag to the `px collect-logs` command
Commit 4: Introduce `PX_LOG_FILE` env var for redirecting `px` log
output to a file -- useful for debugging the cli since its terminal
spinners complicate logging to stdout

Commits 1, 3 and 4 should be self-explanatory. As for Commit 2, the
goroutine deadlock originated in the `streamCh` channel consumer.

The previous version read a single value from the `streamCh` channel,
parsed the result and
[terminated](2ec63c8#diff-4da8f48b4c664d330cff34e70f907d6015289797c832587b0b14004875ef0831R363)
its goroutine. Thus future sends to the `streamCh` channel could block
and prevent the pipe receiving the pxl script results from being fully
consumed. Since the stream adapter writes to the pipe, it couldn't flush
all of its results and the deadlock occurred.

The original testing was performed on clusters with 1 and 2 nodes -- max
of 2 PEMs and 2 results from `px/agent_status`. This deadlock issue
didn't surface in those situations because `streamCh` was a buffered
channel with capacity of 1 and the consumer would read a single record
before terminating. This meant that the pipe reader would hit EOF before
it would initiate a channel send that would deadlock as outlined below:

2 Node cluster situation:
1. `px` cli executes `px/agent_status` as `px/agent_status_diagnostics`
is not in the canonical bundle yet
2. streamCh producer sends 1st PEM's result -- streamCh at capacity
3. streamCh consumer reads the value and exits -- streamCh ready to
accept 1 value
4. streamCh producer sends 2nd and final PEM result -- streamCh at
capacity and future sends would block!
5. Program exits since pxl script is complete
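
The steps above can be reproduced in isolation. The following is a simplified simulation under stated assumptions, not the code from `script.go`: `sendResults` is a hypothetical function, the integers stand in for per-PEM results, and a 200ms timeout is used only so that the stalled send can be observed without hanging the program.

```go
package main

import (
	"fmt"
	"time"
)

// sendResults pushes n results into a buffered channel of capacity 1 while
// a consumer goroutine reads from it. With drainAll=false the consumer
// reads one value and exits, mirroring the reverted behaviour. It returns
// how many sends completed and whether a send stalled.
func sendResults(n int, drainAll bool) (sent int, stalled bool) {
	streamCh := make(chan int, 1)
	done := make(chan struct{})

	go func() {
		defer close(done)
		if !drainAll {
			<-streamCh // read a single result, then stop consuming
			return
		}
		for range streamCh {
			// keep consuming until the producer closes the channel
		}
	}()

	for i := 0; i < n; i++ {
		select {
		case streamCh <- i:
			sent++
		case <-time.After(200 * time.Millisecond):
			// Nobody is reading and the buffer is full: in the real
			// program this send would block forever.
			return sent, true
		}
	}
	close(streamCh)
	if drainAll {
		<-done
	}
	return sent, false
}

func main() {
	for _, tc := range []struct {
		nodes    int
		drainAll bool
	}{{2, false}, {3, false}, {6, true}} {
		sent, stalled := sendResults(tc.nodes, tc.drainAll)
		fmt.Printf("nodes=%d drainAll=%v sent=%d stalled=%v\n",
			tc.nodes, tc.drainAll, sent, stalled)
	}
}
```

With 2 results and a single-read consumer, both sends complete because the second one fits in the buffer. With 3 results the third send stalls, which matches the cluster-size dependence described above. A consumer that drains the channel until it is closed lets every send complete.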

Relevant Issues: #2051

Type of change: /kind feature

Test Plan: Verified that the deadlock no longer occurs on clusters with
3-6 nodes
- [x] Used the
[following](https://github.com/user-attachments/files/18457105/deadlocked-goroutines.txt)
pprof goroutine stack dump to understand the deadlock described above --
see blocked goroutine on `streamCh` channel send on `script.go:337`
- [x] Re-tested all of the scenarios from #2065

Changelog Message: Re-introduce enhanced diagnostics for `px deploy` and
`px collect-logs` commands used to detect common sources of environment
incompatibilities

---------

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

ddelnano added a commit to ddelnano/pixie that referenced this pull request

Aug 6, 2025
Summary: Add px/agent_status_diagnostics pxl script

GitOrigin-RevId: a1a1d0e

ddelnano added a commit to ddelnano/pixie that referenced this pull request

Aug 6, 2025
…ing kernel headers (pixie-io#2065)

Summary: Use `px/agent_status_diagnostics` script within px cli to
detect missing kernel headers

GitOrigin-RevId: 3c9c4bd

ddelnano added a commit to ddelnano/pixie that referenced this pull request

Aug 6, 2025
…ect missing kernel headers (pixie-io#2065)" (pixie-io#2081)

Summary: Revert "Use `px/agent_status_diagnostics` script within px cli
to detect missing kernel headers (pixie-io#2065)"

GitOrigin-RevId: fb9de19

ddelnano added a commit to ddelnano/pixie that referenced this pull request

Aug 6, 2025
… missing kernel headers (pixie-io#2091)

Summary: Re-introduce use of `px/agent_status_diagnostics` in px cli to
detect missing kernel headers

GitOrigin-RevId: 042e356

ddelnano added a commit to k8sstormcenter/pixie that referenced this pull request

Feb 25, 2026
Summary: Add px/agent_status_diagnostics pxl script

ddelnano added a commit to k8sstormcenter/pixie that referenced this pull request

Feb 25, 2026
…ing kernel headers (pixie-io#2065)

Summary: Use `px/agent_status_diagnostics` script within px cli to
detect missing kernel headers

ddelnano added a commit to k8sstormcenter/pixie that referenced this pull request

Feb 25, 2026
…ect missing kernel headers (pixie-io#2065)" (pixie-io#2081)

Summary: Revert "Use `px/agent_status_diagnostics` script within px cli
to detect missing kernel headers (pixie-io#2065)"

This reverts commit 3c9c4bd. While
testing the latest cloud release, I noticed that the functionality added
in pixie-io#2065 can cause `px deploy` to hang indefinitely at the "Wait for
healthcheck" step. The deploy will finish successfully, but it's a poor
user experience.

There seems to be a goroutine blocking issue that depends on
cluster size. On the 1 and 2 node clusters that I tested pixie-io#2065 on, the issue
doesn't surface. However, it reproduces reliably on the larger
clusters that have Pixie deployed. Let's revert and release a new cli
version once this is tracked down.

Relevant Issues: pixie-io#2051

Type of change: /kind bugfix

Test Plan: N/A

Changelog Message: Reverted the recent advanced diagnostics added during
`px deploy`, as in some cases they caused `px deploy` to
hang

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

ddelnano added a commit to k8sstormcenter/pixie that referenced this pull request

Feb 25, 2026
… missing kernel headers (pixie-io#2091)

Summary: Re-introduce use of `px/agent_status_diagnostics` in px cli to
detect missing kernel headers

The original version of this was reverted since it resulted in hung `px
deploy` and `px collect-logs` commands on larger clusters. This PR
reintroduces the change with the fixes necessary to prevent the previous
issue and is best reviewed commit by commit as outlined below:

Commit 1: Cherry-pick of pixie-io#2065
Commit 2: Fix for goroutine deadlock
Commit 3: Add bundle flag to the `px collect-logs` command
Commit 4: Introduce `PX_LOG_FILE` env var for redirecting `px` log
output to a file -- useful for debugging the cli since its terminal
spinners complicate logging to stdout

Commits 1, 3 and 4 should be self-explanatory. As for Commit 2, the
goroutine deadlock originated in the `streamCh` channel consumer.

The previous version read a single value from the `streamCh` channel,
parsed the result and
[terminated](pixie-io@2ec63c8#diff-4da8f48b4c664d330cff34e70f907d6015289797c832587b0b14004875ef0831R363)
its goroutine. Thus future sends to the `streamCh` channel could block
and prevent the pipe receiving the pxl script results from being fully
consumed. Since the stream adapter writes to the pipe, it couldn't flush
all of its results and the deadlock occurred.

The original testing was performed on clusters with 1 and 2 nodes -- max
of 2 PEMs and 2 results from `px/agent_status`. This deadlock issue
didn't surface in those situations because `streamCh` was a buffered
channel with capacity of 1 and the consumer would read a single record
before terminating. This meant that the pipe reader would hit EOF before
it would initiate a channel send that would deadlock as outlined below:

2 Node cluster situation:
1. `px` cli executes `px/agent_status` as `px/agent_status_diagnostics`
is not in the canonical bundle yet
2. streamCh producer sends 1st PEM's result -- streamCh at capacity
3. streamCh consumer reads the value and exits -- streamCh ready to
accept 1 value
4. streamCh producer sends 2nd and final PEM result -- streamCh at
capacity and future sends would block!
5. Program exits since pxl script is complete
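
The walkthrough above can be reproduced with a small standalone Go program. This is only a sketch under stated assumptions, not the code from `script.go`: `sendAll`, its parameters and the 200ms timeout are hypothetical and chosen for illustration. Only the buffered channel with capacity 1 and the read-once consumer come from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// sendAll is a hypothetical stand-in for the streamCh producer/consumer pair.
// It sends n results on a channel with capacity 1 and reports whether every
// send completed before the timeout fired.
func sendAll(n int, drain bool) bool {
	streamCh := make(chan int, 1) // buffered with capacity 1, as described above
	done := make(chan struct{})

	// Consumer.
	go func() {
		if drain {
			// Draining consumer: keep reading until the channel is closed.
			for range streamCh {
			}
			return
		}
		// Read-once consumer: take a single value and exit.
		<-streamCh
	}()

	// Producer.
	go func() {
		defer close(done)
		for i := 0; i < n; i++ {
			streamCh <- i // blocks once the buffer is full and nobody reads
		}
		close(streamCh)
	}()

	select {
	case <-done:
		return true
	case <-time.After(200 * time.Millisecond):
		return false // the producer is stuck on a channel send
	}
}

func main() {
	for _, n := range []int{1, 2, 3, 6} {
		fmt.Printf("results=%d read-once=%v drain=%v\n", n, sendAll(n, false), sendAll(n, true))
	}
}
```

With the read-once consumer, one value is consumed and one more fits in the buffer, so 1 or 2 results complete while a 3rd send blocks. The draining consumer completes for any number of results.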

Relevant Issues: pixie-io#2051

Type of change: /kind feature

Test Plan: Verified that the deadlock no longer occurs on clusters with
3-6 nodes
- [x] Used the
[following](https://github.com/user-attachments/files/18457105/deadlocked-goroutines.txt)
pprof goroutine stack dump to understand the deadlock described above --
see blocked goroutine on `streamCh` channel send on `script.go:337`
- [x] Re-tested all of the scenarios from pixie-io#2065

Changelog Message: Re-introduce enhanced diagnostics for `px deploy` and
`px collect-logs` commands used to detect common sources of environment
incompatibilities

---------

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

ddelnano added a commit to k8sstormcenter/pixie that referenced this pull request

Feb 25, 2026
Summary: Add px/agent_status_diagnostics pxl script

This PR adds a new pxl script that helps identify if linux headers are
missing on any PEMs. This can be extended in the future, but the first
use case will be to execute the script during `px deploy` and `px
collect-logs`.

Relevant Issues: pixie-io#2051

Type of change: /kind feature

Test Plan: Ran the script in the UI and tested it as part of the
validation for pixie-io#2065

![Screen Shot 2024-12-18 at 10 51 15
AM](https://github.com/user-attachments/assets/bd4ab6de-ad92-4af3-8714-b044a7df7d65)


Changelog Message: Add `px/agent_status_diagnostics` pxl script for
checking common issues

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

ddelnano added a commit to k8sstormcenter/pixie that referenced this pull request

Feb 25, 2026
…ing kernel headers (pixie-io#2065)

Summary: Use `px/agent_status_diagnostics` script within px cli to
detect missing kernel headers

This PR leverages the script added in pixie-io#2064 to detect missing kernel
headers during cli deploys and `px collect-logs` commands. This solves
2/3 of the use cases I was hoping to identify for pixie-io#2051 (the last being
helm installs).

A recent example of this problem is
pixie-io#1986, where a Go TLS tracing
bug went undiagnosed for months (August to December). Amazon Linux
2023's headers are different enough that Go TLS tracing breaks when
pixie's pre-packaged headers are used. The tooling in this PR would have
provided a few opportunities for this to be caught.

Relevant Issues: pixie-io#2051

Type of change: /kind feature

Test Plan: Verified the following scenarios
<details><summary>Test cases</summary>

- [x] `px collect-logs` works against a cloud that doesn't have a
`px/agent_status_diagnostics` script
```
$ bazel run -c opt  --stamp src/pixie_cli:px -- collect-logs

WARN[0006] healthcheck script detected the following warnings:  error="Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes."
Logs written to pixie_logs_20241223165214.zip

# zip file contains px/agent_status output
$ cat px_agent_diagnostics.txt
{"_tableName_":"output","agent_id":"07fb4d26-3b53-4ba7-9bb7-f2cb10a1e63d","asid":79,"hostname":"gke-dev-ddelnano1-default-pool-b099382d-30mu","ip_address":"","agent_state":"AGENT_STATE_HEALTHY","create_time":"2024-12-18T12:43:44.41952403Z","last_heartbeat_ns":4303060450,"kernel_headers_installed":true}
```
- [x] `px collect-logs` works against a cloud that does have a
`px/agent_status_diagnostics` script
```
$ bazel run  src/pixie_cli:px -- collect-logs
INFO: Analyzed target //src/pixie_cli:px (0 packages loaded, 0 targets configured).
INFO: Found 1 target...
Target //src/pixie_cli:px up-to-date:
  bazel-bin/src/pixie_cli/px_/px
INFO: Elapsed time: 4.240s, Critical Path: 3.89s
INFO: 3 processes: 1 internal, 2 linux-sandbox.
INFO: Build completed successfully, 3 total actions
INFO: Running command line: bazel-bin/src/pixie_cli/px_/px collect-logs
Pixie CLI
*******************************
* ENV VARS
*        PX_CLOUD_ADDR=testing.getcosmic.ai:443
*******************************
Logs written to pixie_logs_20241218164734.zip

$ cat px_agent_diagnostics.txt
{"_tableName_":"output","headers_installed_percent":1}
```
- [x] `px collect-logs` identifies when kernel headers are missing when
`px/agent_status_diagnostics` present
```
$ Logs written to pixie_logs_20241223165214.zip
$ bazel run -c opt  --stamp src/pixie_cli:px -- --bundle https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json collect-logs
[ ... ]
WARN[0012] healthcheck script detected the following warnings:  error="Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes."

$ cat px_agent_diagnostics.txt
{"_tableName_":"output","headers_installed_percent":0.5}
```

- [x] Artificially forcing context deadline (timeout) results in an
error
```
$ git diff
diff --git a/src/pixie_cli/pkg/vizier/script.go b/src/pixie_cli/pkg/vizier/script.go
index 7d3b7e0..c957b8943 100644
--- a/src/pixie_cli/pkg/vizier/script.go
+++ b/src/pixie_cli/pkg/vizier/script.go
@@ -317,7 +317,7 @@ func RunSimpleHealthCheckScript(br *script.BundleManager, cloudAddr string, clus
                execScript = br.MustGetScript(script.AgentStatusScript)
        }

-       ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
+       ctx, cancel := context.WithTimeout(context.Background(), 1*time.Second)

$ bazel run  src/pixie_cli:px -- collect-logs

WARN[0012]src/pixie_cli/pkg/vizier/logs.go:135 px.dev/pixie/src/pixie_cli/pkg/vizier.(*LogCollector).CollectPixieLogs() failed to run health check script             error="context deadline exceeded"
Logs written to pixie_logs_20241218165033.zip
```
- [x] `px collect-logs` prompts auth flow when credentials don't match
current cloud
```
$ PX_CLOUD_ADDR=new-cloud bazel run  src/pixie_cli:px -- collect-logs
*******************************
* ENV VARS
*        PX_CLOUD_ADDR=new-cloud
*******************************
Failed to authenticate. Please retry `px auth login`.
```

- [x] `px deploy` on pre v0.14.14 (older) vizier with existing bundle
warns that kernel headers should be installed
```
# Additional flags provided to speed up vizier bootstrapping
$ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key='<deploy key>' --deploy_olm=false --olm_namespace=olm --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json
```

- [x] `px deploy` on pre v0.14.14 (older) vizier with latest bundle
warns that kernel headers should be installed
```
# Additional flags provided to speed up vizier bootstrapping
$ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key='<deploy key>' --deploy_olm=false --olm_namespace=olm --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json

[ ... ]
Waiting for Pixie to pass healthcheck
 ✔    Wait for PEMs/Kelvin
 ✔    Wait for PEMs/Kelvin
 ✕    Wait for healthcheck  ERR: Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes.
Pixie healthcheck detected the following warnings: error=Unable to detect if the cluster's nodes have the distro kernel headers installed (vizier too old to perform this check). Please ensure that the kernel headers are installed on all nodes.

[ ...]
```

- [x] `px deploy` on v0.14.14 vizier with latest bundle warns
appropriately when kernel headers are missing
```
$ bazel run -c opt --stamp src/pixie_cli:px -- deploy --pem_flags='PL_STIRLING_SOURCES=kNone' --deploy_key=<deploy key> --bundle=https://csmc-io.github.io/pxl-scripts/pxl_scripts/bundle.json -v 0.14.14-pre-r1.0

[ ... ]
Waiting for Pixie to pass healthcheck
 ✔    Wait for PEMs/Kelvin
 ✕    Wait for healthcheck  ERR: Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes.
Pixie healthcheck detected the following warnings: error=Detected missing kernel headers on your cluster's nodes. This may cause issues with the Pixie agent. Please install kernel headers on all nodes.
```

</details>
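
The `px_agent_diagnostics.txt` contents in the test cases come in two shapes: an aggregated row with `headers_installed_percent`, or per-agent rows from `px/agent_status`. The sketch below shows one way such output could be mapped to the two warnings seen in the logs. It is an assumption-laden illustration, not the cli's implementation: `headersWarning`, `diagRow` and the abbreviated message constants are hypothetical, and treating any value below 1 as "missing" is inferred from the `0.5` example.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// Abbreviated forms of the warnings shown in the test output.
const (
	missingMsg = "Detected missing kernel headers on your cluster's nodes."
	unknownMsg = "Unable to detect if the cluster's nodes have the distro kernel headers installed."
)

// diagRow is a hypothetical type. The pointer distinguishes an absent
// field (px/agent_status rows) from a real value of 0.
type diagRow struct {
	HeadersInstalledPercent *float64 `json:"headers_installed_percent"`
}

// headersWarning scans newline-delimited JSON rows and returns a warning,
// or "" when the aggregated row reports headers on every node.
func headersWarning(output string) (string, error) {
	sawAggregate := false
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var r diagRow
		if err := json.Unmarshal([]byte(line), &r); err != nil {
			return "", fmt.Errorf("parsing row: %w", err)
		}
		if r.HeadersInstalledPercent == nil {
			continue // not an aggregated diagnostics row
		}
		sawAggregate = true
		if *r.HeadersInstalledPercent < 1 {
			return missingMsg, nil
		}
	}
	if err := sc.Err(); err != nil {
		return "", err
	}
	if !sawAggregate {
		return unknownMsg, nil
	}
	return "", nil
}

func main() {
	outputs := []string{
		`{"_tableName_":"output","headers_installed_percent":1}`,
		`{"_tableName_":"output","headers_installed_percent":0.5}`,
		`{"_tableName_":"output","agent_state":"AGENT_STATE_HEALTHY","kernel_headers_installed":true}`,
	}
	for _, out := range outputs {
		w, err := headersWarning(out)
		fmt.Printf("warning=%q err=%v\n", w, err)
	}
}
```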

Changelog Message: Enhanced the `px` cli's `deploy` and `collect-logs`
commands to surface when kernel headers aren't installed. This is a
common source of bugs that can only be addressed by installing your
distro's kernel headers.

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

k8sstormcenter-buildbot pushed a commit to k8sstormcenter/pixie that referenced this pull request

Feb 26, 2026
Summary: Add px/agent_status_diagnostics pxl script

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
GitOrigin-RevId: a1a1d0e

k8sstormcenter-buildbot pushed a commit to k8sstormcenter/pixie that referenced this pull request

Feb 26, 2026
…ing kernel headers (pixie-io#2065)

Summary: Use `px/agent_status_diagnostics` script within px cli to
detect missing kernel headers

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>
GitOrigin-RevId: 3c9c4bd

entlein pushed a commit to k8sstormcenter/pixie that referenced this pull request

Jun 2, 2026
Summary: Add px/agent_status_diagnostics pxl script

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

entlein pushed a commit to k8sstormcenter/pixie that referenced this pull request

Jun 2, 2026
…ing kernel headers (pixie-io#2065)

Summary: Use `px/agent_status_diagnostics` script within px cli to
detect missing kernel headers

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

entlein pushed a commit to k8sstormcenter/pixie that referenced this pull request

Jun 2, 2026
…ect missing kernel headers (pixie-io#2065)" (pixie-io#2081)

Summary: Revert "Use `px/agent_status_diagnostics` script within px cli
to detect missing kernel headers (pixie-io#2065)"

This reverts commit 4606b55. While
testing the latest cloud release, I noticed that the functionality added
in pixie-io#2065 can cause `px deploy` to hang indefinitely at the "Wait for
healthcheck" step. The deploy will finish successfully, but it's a poor
user experience.

There seems to be a goroutine blocking issue that is dependent on
cluster size. On 1 and 2 node clusters that I tested pixie-io#2065 on, the issue
doesn't surface. However, the issue reproduces reliably on the larger
clusters that I have pixie deployed to. Let's revert and release a new cli
version once this is tracked down.

Relevant Issues: pixie-io#2051

Type of change: /kind bugfix

Test Plan: N/A

Changelog Message: Reverted the recent advanced diagnostics added during
`px deploy`, as in some cases they caused `px deploy` to hang

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>

entlein pushed a commit to k8sstormcenter/pixie that referenced this pull request

Jun 2, 2026
… missing kernel headers (pixie-io#2091)

Summary: Re-introduce use of `px/agent_status_diagnostics` in px cli to
detect missing kernel headers

The original version of this was reverted since it resulted in hung `px
deploy` and `px collect-logs` commands on larger clusters. This PR
reintroduces the change with the fixes necessary to prevent the previous
issue and is best reviewed commit by commit as outlined below:

Commit 1: Cherry-pick of pixie-io#2065
Commit 2: Fix for goroutine deadlock
Commit 3: Add bundle flag to the `px collect-logs` command
Commit 4: Introduce `PX_LOG_FILE` env var for redirecting `px` log
output to a file -- useful for debugging the cli since its terminal
spinners complicate logging to stdout

Commits 1, 3 and 4 should be self-explanatory. As for Commit 2, the
goroutine deadlock originated in the `streamCh` channel consumer.

The previous version read a single value from the `streamCh` channel,
parsed the result and
[terminated](pixie-io@2ec63c8#diff-4da8f48b4c664d330cff34e70f907d6015289797c832587b0b14004875ef0831R363)
its goroutine. Thus future sends to the `streamCh` channel could block
and prevent the pipe receiving the pxl script results from being fully
consumed. Since the stream adapter writes to the pipe, it couldn't flush
all of its results and the deadlock occurred.

The original testing was performed on clusters with 1 and 2 nodes -- max
of 2 PEMs and 2 results from `px/agent_status`. This deadlock issue
didn't surface in those situations because `streamCh` was a buffered
channel with capacity of 1 and the consumer would read a single record
before terminating. This meant that the pipe reader would hit EOF before
it would initiate a channel send that would deadlock as outlined below:

2 Node cluster situation:
1. `px` cli executes `px/agent_status` as `px/agent_status_diagnostics`
is not in the canonical bundle yet
2. streamCh producer sends 1st PEM's result -- streamCh at capacity
3. streamCh consumer reads the value and exits -- streamCh ready to
accept 1 value
4. streamCh producer sends 2nd and final PEM result -- streamCh at
capacity and future sends would block!
5. Program exits since pxl script is complete

Relevant Issues: pixie-io#2051

Type of change: /kind feature

Test Plan: Verified that the deadlock no longer occurs on clusters with
3-6 nodes
- [x] Used the
[following](https://github.com/user-attachments/files/18457105/deadlocked-goroutines.txt)
pprof goroutine stack dump to understand the deadlock described above --
see blocked goroutine on `streamCh` channel send on `script.go:337`
- [x] Re-tested all of the scenarios from pixie-io#2065

Changelog Message: Re-introduce enhanced diagnostics for `px deploy` and
`px collect-logs` commands used to detect common sources of environment
incompatibilities

---------

Signed-off-by: Dom Del Nano <ddelnano@gmail.com>