# ask-metaflow
f
Hi, does anyone have any idea why sensors related to Argo Events would keep crashing? Looking at the sensor pod log, we see something like this in 🧵
{
  "level": "error",
  "ts": 1734597287.55633,
  "logger": "argo-events.sensor",
  "caller": "sensor/sensor_jetstream.go:65",
  "msg": "failed to Create Key/Value Store for sensor pdata-user-cschorn-syncdbtofiftyone, err: context deadline exceeded",
  "sensorName": "pdata-user-cschorn-syncdbtofiftyone",
  "stacktrace": "<http://github.com/argoproj/argo-events/eventbus/jetstream/sensor.(*SensorJetstream).Initialize|github.com/argoproj/argo-events/eventbus/jetstream/sensor.(*SensorJetstream).Initialize>\n\t/home/runner/work/argo-events/argo-events/eventbus/jetstream/sensor/sensor_jetstream.go:65\ngithub.com/argoproj/argo-events/sensors.(*SensorContext).listenEvents.func1\n\t/home/runner/work/argo-events/argo-events/sensors/listener.go:120\ngithub.com/argoproj/argo-events/common.DoWithRetry.func1\n\t/home/runner/work/argo-events/argo-events/common/retry.go:106\nk8s.io/apimachinery/pkg/util/wait.ConditionFunc.WithContext.func1\n\t/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:220\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtectionWithContext\n\t/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:233\nk8s.io/apimachinery/pkg/util/wait.runConditionWithCrashProtection\n\t/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:226\nk8s.io/apimachinery/pkg/util/wait.ExponentialBackoff\n\t/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.24.3/pkg/util/wait/wait.go:421\ngithub.com/argoproj/argo-events/common.DoWithRetry\n\t/home/runner/work/argo-events/argo-events/common/retry.go:105\ngithub.com/argoproj/argo-events/sensors.(*SensorContext).listenEvents\n\t/home/runner/work/argo-events/argo-events/sensors/listener.go:119\ngithub.com/argoproj/argo-events/sensors.(*SensorContext).Start.func1\n\t/home/runner/work/argo-events/argo-events/sensors/listener.go:76"
}
Not sure why it's timing out.
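That error means the sensor hit its timeout while asking JetStream to create its Key/Value bucket, so one thing worth checking is whether JetStream itself is responsive. A minimal sketch, assuming the eventbus pods expose the standard NATS monitoring port 8222 and the eventbus-default-js-* pod names:
# forward the NATS monitoring port from one eventbus pod
kubectl port-forward pod/eventbus-default-js-0 8222:8222
# in another terminal: JetStream-level view (streams, consumers, storage usage)
curl -s http://localhost:8222/jsz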
l
Is your eventbus healthy?
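A quick way to check, assuming the default EventBus name in the sensor's namespace, is to look at the CR's status conditions (they should all be True):
kubectl get eventbus default -o jsonpath='{.status.conditions}{"\n"}'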
f
Seems so
k get pods | grep eventbus
eventbus-default-js-0                                                                       3/3     Running                  7 (8m27s ago)     96d
eventbus-default-js-1                                                                       3/3     Running                  2 (37m ago)       23h
eventbus-default-js-2                                                                       3/3     Running                  3 (17m ago)       23h
No, you are right, I do see an issue in the pod events:
Normal   Created    17m (x4 over 23h)   kubelet  Created container main
Normal   Started    17m (x4 over 23h)   kubelet  Started container main
Normal   Pulled     17m (x3 over 64m)   kubelet  Container image "nats:2.9.15" already present on machine
Warning  Unhealthy  16m (x29 over 23h)  kubelet  Startup probe failed: HTTP probe failed with statuscode: 503
What would be the cause of an unhealthy eventbus?
We have about 46 sensors running
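The 503 from the startup probe can also be reproduced by hand, which at least tells you whether the server is merely slow to become ready or persistently unhealthy. A sketch, assuming the probe targets the usual NATS monitoring endpoint on port 8222:
kubectl port-forward pod/eventbus-default-js-1 8222:8222
# in another terminal: 200 means healthy, 503 means the server is up but not ready
# (e.g. JetStream or the cluster has not fully formed yet)
curl -si http://localhost:8222/healthz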
l
Hard to tell. Sometimes it's because of underprovisioning; other times we've seen it turn out to be some obscure bug in JetStream.
What do the logs say?
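On the underprovisioning angle, with ~46 sensors each holding connections and a KV bucket, it's worth comparing actual usage against the requests/limits on the eventbus pods. A sketch, assuming metrics-server is installed; the exact EventBus spec path for raising resources can differ by argo-events version, so check the CRD schema rather than trusting the field name here:
# current usage vs. configured requests/limits
kubectl top pod | grep eventbus
kubectl describe pod eventbus-default-js-0 | grep -iA4 'requests\|limits'
# where resources are set on the EventBus CR (field path is an assumption, verify via explain)
kubectl explain eventbus.spec.jetstream.containerTemplate.resources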
f
Now it seems the sensor is running
I see a lot of these eventbus logs:
Defaulted container "main" out of: main, reloader, metrics
[165] 2024/12/19 09:05:38.891511 [ERR] 10.10.109.111:6555 - cid:346 - TLS handshake error: tls: first record does not look like a TLS handshake
[165] 2024/12/19 09:05:39.868848 [ERR] 10.10.55.211:63135 - rid:347 - TLS route handshake error: tls: first record does not look like a TLS handshake
[165] 2024/12/19 09:05:39.868879 [INF] 10.10.55.211:63135 - rid:347 - Router connection closed: TLS Handshake Failure
We are actually still using NATS:
...
  Normal   Pulled     17m (x3 over 64m)   kubelet  Container image "nats:2.9.15" already present on machine
  ...
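Those "first record does not look like a TLS handshake" errors generally mean one side of a connection (a client, or a cluster route between eventbus pods) is speaking plaintext to a TLS-enabled listener, or the reverse. Since the pod runs a "reloader" container, the NATS config is presumably rendered into a ConfigMap, so you can see what TLS settings the server is actually running with. A sketch; the ConfigMap name below is a placeholder, so list first:
kubectl get configmap | grep eventbus
kubectl get configmap <eventbus-configmap> -o yaml | grep -iB2 -A5 tls
# and whether TLS is requested on the EventBus CR itself
kubectl get eventbus default -o yaml | grep -iB2 -A5 tls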