Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(host_metrics source): add a new collector for tcp stats #22057

Merged
merged 6 commits into from
Jan 15, 2025

Conversation

aryan9600
Copy link
Contributor

@aryan9600 aryan9600 commented Dec 18, 2024

Summary

Add a new tcp collector to the host_metrics source to expose information about the systems's TCP stack. It exposes three metrics:

  • tcp_connections_total: The total number of TCP connections. It includes the state of the connection as a tag.
  • tcp_tx_queued_bytes_total: The sum of the number of bytes in the send queue across all connections.
  • tcp_rx_queued_bytes_total: The sum of the number of bytes in the receive queue across all connections.

These metrics are only available for Linux systems as it uses the netlink subsystem.

Change Type

  • Bug fix
  • New feature
  • Non-functional (chore, refactoring, docs)
  • Performance

Is this a breaking change?

  • Yes
  • No

How did you test this PR?

  • Unit tests for the code interacting with netlink and for the code writing to the metrics buffer.
  • Manual testing by running the change locally with the below config file and comparing the output with ss -a -t on both Linux x86 and arm systems.
sources:
  host_metrics:
    type: host_metrics
    collectors:
      - tcp
sinks:
  console:
    type: console
    inputs:
      - host_metrics
    encoding:
      codec: text

Does this PR include user facing changes?

  • Yes. Please add a changelog fragment based on our guidelines.
  • No. A maintainer will apply the "no-changelog" label to this PR.

Checklist

  • Please read our Vector contributor resources.
  • If this PR introduces changes Vector dependencies (modifies Cargo.lock), please
    run dd-rust-license-tool write to regenerate the license inventory and commit the changes (if any). More details here.

References

closes #21972

@aryan9600 aryan9600 requested review from a team as code owners December 18, 2024 17:43
@bits-bot
Copy link

bits-bot commented Dec 18, 2024

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions bot added domain: sources Anything related to the Vector's sources domain: external docs Anything related to Vector's external, public documentation labels Dec 18, 2024
Copy link
Member

@jszwedko jszwedko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aryan9600 ! Exposing these new metrics seem useful. What do you think of putting them under the existing network collector though? Exporting as network_tcp_*.

@aryan9600
Copy link
Contributor Author

What do you think of putting them under the existing network collector though?

i wanted to keep it as a separate collector, as these metrics are only available for linux. also, to avoid any changes to the output of users when they upgrade to the newer version with these changes. if its a blocker, then i can move it to the network collector, no issue. let me know :)

@jszwedko
Copy link
Member

What do you think of putting them under the existing network collector though?

i wanted to keep it as a separate collector, as these metrics are only available for linux. also, to avoid any changes to the output of users when they upgrade to the newer version with these changes. if its a blocker, then i can move it to the network collector, no issue. let me know :)

I think we already have a precedent of collectors having OS-specific metrics (e.g. filesystem_used_ratio only exists for Linux) so I think they'd be a good fit under collector even if they only exist for Linux.

@aryan9600 aryan9600 changed the title feat(host_metrics): add a new tcp collector feat(host_metrics): modify the network collector to expose TCP stats Dec 20, 2024
@aryan9600 aryan9600 requested a review from jszwedko December 20, 2024 19:09
@aryan9600 aryan9600 changed the title feat(host_metrics): modify the network collector to expose TCP stats feat(host_metrics source): modify the network collector to expose TCP stats Dec 21, 2024
Copy link
Member

@pront pront left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @aryan9600, thank you for this PR!

I noticed big change in Cargo.lock, I wonder if you can roll that file back and apply individual updates for the newly added crates only.

@aryan9600 aryan9600 requested a review from pront January 3, 2025 13:34
@pront pront self-assigned this Jan 3, 2025
@aryan9600
Copy link
Contributor Author

a problem with having these TCP metrics inside the network collector is that the latter allows to filter the output metrics using network devices, but the former doesn't support this filtering. this is due to the fact that netlink doesn't return the network interface in the response for SOCK_DIAG messages. we could modify the tests and docs to accommodate for this or create a new collector for these metrics. open to any other suggestions.
cc: @pront @jszwedko

@pront
Copy link
Member

pront commented Jan 6, 2025

Hi @aryan9600, one test is failing. I will take another look at this PR once that is fixed. You can run any test locally with cargo test.

Copy link
Member

@pront pront left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/ci-run-unit-mac

@pront pront force-pushed the host-metrics-tcp branch from 6419d0b to 9faccc7 Compare January 7, 2025 16:45
@pront
Copy link
Member

pront commented Jan 7, 2025

Running this locally on macOS I get multiple compilation errors originating netlink-packet-sock-diag.

error[E0425]: cannot find value `AF_AX25` in crate `libc`
  --> .../.cargo/registry/src/index.crates.io-6f17d22bba15001f/netlink-packet-sock-diag-0.4.2/src/constants.rs:10:31
   |
10 | pub const AF_AX25: u8 = libc::AF_AX25 as u8;
   |                               ^^^^^^^ not found in `libc`

@jszwedko
Copy link
Member

jszwedko commented Jan 7, 2025

a problem with having these TCP metrics inside the network collector is that the latter allows to filter the output metrics using network devices, but the former doesn't support this filtering. this is due to the fact that netlink doesn't return the network interface in the response for SOCK_DIAG messages. we could modify the tests and docs to accommodate for this or create a new collector for these metrics. open to any other suggestions.

Ah, good point, that could be actually be a reason to separate the collector if it is not filterable by network device.

@aryan9600
Copy link
Contributor Author

@jszwedko ack, moving these metrics back into its own dedicated collector.

@aryan9600
Copy link
Contributor Author

Running this locally on macOS I get multiple compilation errors originating netlink-packet-sock-diag.

@pront this should be fixed.

@aryan9600 aryan9600 requested a review from pront January 8, 2025 17:58
@aryan9600
Copy link
Contributor Author

tests are passing now, previous CI run failed because third party licenses were not up to date. this has been fixed now.

@aryan9600 aryan9600 changed the title feat(host_metrics source): modify the network collector to expose TCP stats feat(host_metrics source): add a new collector for tcp stats Jan 9, 2025
Copy link
Member

@jszwedko jszwedko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 on the UX. I'll leave the code review to @pront

Copy link
Member

@pront pront left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of comments on diagnostics but otherwise this is good to go 🚀

src/sources/host_metrics/tcp.rs Outdated Show resolved Hide resolved
9 => Ok(TcpState::LastAck),
10 => Ok(TcpState::Listen),
11 => Ok(TcpState::Closing),
_ => Err(TcpError::InvalidTcpState { state: value }),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: we could pass the string representation of state here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm a bit confused, what would be the string representation of state here, since value doesn't correspond to a valid tcp state?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah my bad, I see that this is a catch all for states that don't map to a string.

Copy link
Member

@pront pront left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing one last check on my side because last time we tried to generate linux specific platform docs it broke component generation on macOS. This something for us (Vector team) to figure out. PR looks great otherwise.

9 => Ok(TcpState::LastAck),
10 => Ok(TcpState::Listen),
11 => Ok(TcpState::Closing),
_ => Err(TcpError::InvalidTcpState { state: value }),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah my bad, I see that this is a catch all for states that don't map to a string.

@pront
Copy link
Member

pront commented Jan 10, 2025

make generate-component-docs on macOS:

image

Raw: diff.txt

@jszwedko: do you happen to recall how we dealt with this in the past?

@jszwedko
Copy link
Member

make generate-component-docs on macOS:

image Raw: [diff.txt](https://github.com/user-attachments/files/18378930/diff.txt)

@jszwedko: do you happen to recall how we dealt with this in the past?

I think maybe the cue files were manually modified rather than using make generate-component-docs. I believe the diff you have is accurate and should be applied 🤔

@jszwedko
Copy link
Member

Though I do see the "Only available on Linux" bit of the doc comment is dropped. That might be an issue in the scaffolding that generates the cue file based on the structs.

@pront
Copy link
Member

pront commented Jan 10, 2025

make generate-component-docs on macOS:
image
Raw: diff.txt
@jszwedko: do you happen to recall how we dealt with this in the past?

I think maybe the cue files were manually modified rather than using make generate-component-docs. I believe the diff you have is accurate and should be applied 🤔

Thanks @jszwedko. @aryan9600 please take a look at the above.

@pront
Copy link
Member

pront commented Jan 10, 2025

Though I do see the "Only available on Linux" bit of the doc comment is dropped. That might be an issue in the scaffolding that generates the cue file based on the structs.

I will take a look at this.

@aryan9600
Copy link
Contributor Author

make generate-comonent-docs fails for me on macOS:

❯ make generate-component-docs
cargo build 
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 1.11s
target/debug/vector generate-schema > /tmp/vector-config-schema.json 2>/dev/null
cargo vdev build component-docs /tmp/vector-config-schema.json \

/Users/sanskarjaiswal/Development/vector/scripts/generate-component-docs.rb:1753:in `<main>': undefined method `filter_map' for #<Hash:0x0000000102867628> (NoMethodError)
Did you mean?  filter
Error: command: cd "/Users/sanskarjaiswal/Development/vector" && "/Users/sanskarjaiswal/Development/vector/scripts/generate-component-docs.rb" "/tmp/vector-config-schema.js
on"
  failed with exit code: 1
make: *** [generate-component-docs] Error 1

should i just apply the diff manually? but the diff drops the note about the metrics about linux only. i'm a bit confused about how to resolve this cleanly.

@pront
Copy link
Member

pront commented Jan 13, 2025

make generate-comonent-docs fails for me on macOS:

❯ make generate-component-docs
cargo build 
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 1.11s
target/debug/vector generate-schema > /tmp/vector-config-schema.json 2>/dev/null
cargo vdev build component-docs /tmp/vector-config-schema.json \

/Users/sanskarjaiswal/Development/vector/scripts/generate-component-docs.rb:1753:in `<main>': undefined method `filter_map' for #<Hash:0x0000000102867628> (NoMethodError)
Did you mean?  filter
Error: command: cd "/Users/sanskarjaiswal/Development/vector" && "/Users/sanskarjaiswal/Development/vector/scripts/generate-component-docs.rb" "/tmp/vector-config-schema.js
on"
  failed with exit code: 1
make: *** [generate-component-docs] Error 1

should i just apply the diff manually? but the diff drops the note about the metrics about linux only. i'm a bit confused about how to resolve this cleanly.

I didn't have time to look into this but what Jesse mentioned is going to be a problem (even after we apply the patch).

@jszwedko
Copy link
Member

make generate-comonent-docs fails for me on macOS:

I believe that error occurs if you are running Ruby 2. Could try with Ruby 3?

@pront
Copy link
Member

pront commented Jan 13, 2025

should i just apply the diff manually? but the diff drops the note about the metrics about linux only. i'm a bit confused about how to resolve this cleanly.

I didn't have time to look into this but what Jesse mentioned is going to be a problem (even after we apply the patch).

I tried make generate-component-docs (macOS) on the latest master and I get:

image

Ref: #22120

This is the same issue, the options are removed when the command is ran on a platform that doesn't support them.

Solving this is out of scope for this PR. We will temporarily assume the website is the unix/linux features are ON. In the future, some ideas for addressing this properly:

  • generate docs per platform
  • process the configuration as text and generate docs during runtime

Add a new `tcp` collector  to the `host_metrics` source to expose
information about the systems's TCP stack. It exposes three metrics:
* `tcp_connections_total`: The total number of TCP connections. It
  includes the `state` of the connection as a tag.
* `tcp_tx_queued_bytes_total`: The sum of the number of bytes in the
   send queue across all connections.
* `tcp_rx_queued_bytes_total`: The sum of the number of bytes in the
  receive queue across all connections.

The collector is only enabled for Linux as it uses the netlink
subsystem.

Signed-off-by: Sanskar Jaiswal <[email protected]>
Signed-off-by: Sanskar Jaiswal <[email protected]>
@aryan9600
Copy link
Contributor Author

@pront @jszwedko i was able to fix the examples and formatting, but i wasn't able to get the "Only available in Linux" statement generated in the cue file.

@pront
Copy link
Member

pront commented Jan 13, 2025

@pront @jszwedko i was able to fix the examples and formatting, but i wasn't able to get the "Only available in Linux" statement generated in the cue file.

We will address this issue. You can just run make generate-component-docs and commit the result (if you already did, then no action needed on your side and consider this PR unblocked)

@pront pront added this pull request to the merge queue Jan 15, 2025
Merged via the queue into vectordotdev:master with commit 2d7cbd8 Jan 15, 2025
40 checks passed
@pront
Copy link
Member

pront commented Jan 15, 2025

Thank you @aryan9600 for driving this to completion despite the docs hiccups. I created #22198 to address the problems we discovered here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain: external docs Anything related to Vector's external, public documentation domain: sources Anything related to the Vector's sources
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a collector to track TCP connections in the host_metrics source
5 participants