# dev-metaflow
s
Hi, is there a recommended way to raise alerts in CloudWatch when any of our Metaflow AWS Batch jobs fail? We don’t want to create new alarms one by one, but CloudWatch won’t allow an alarm to be created from aggregated metrics, such as a wildcard search of ExecutionsFailed across state machines. The problem is we have many Metaflow jobs and new ones are added constantly, and we want to make sure we’re alerted to failures.
a
Interesting. @User or @fresh-laptop-72652 may have some ideas around monitoring Step Functions state machines via CloudWatch alarms.
One approach (the hammer approach) would be to run a Lambda that periodically scans for failed executions.
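A rough boto3 sketch of that Lambda — the custom namespace, the 15-minute window, and scanning *all* state machines in the account are just assumptions for illustration:
```python
import boto3
from datetime import datetime, timedelta, timezone

sfn = boto3.client("stepfunctions")
cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    # Look back over a window slightly longer than the schedule interval
    # (15 minutes is a placeholder) so no failure slips between runs.
    since = datetime.now(timezone.utc) - timedelta(minutes=15)
    failed = 0
    for page in sfn.get_paginator("list_state_machines").paginate():
        for sm in page["stateMachines"]:
            # First page only (up to 100) -- fine for a sketch; paginate
            # list_executions too if you expect more failures per window.
            execs = sfn.list_executions(
                stateMachineArn=sm["stateMachineArn"],
                statusFilter="FAILED",
                maxResults=100,
            )
            failed += sum(1 for e in execs["executions"] if e["stopDate"] >= since)
    # Publish one aggregated metric, so a single alarm covers every flow,
    # including ones deployed after the alarm was created.
    cloudwatch.put_metric_data(
        Namespace="Custom/Metaflow",  # placeholder namespace
        MetricData=[{"MetricName": "FailedExecutions", "Value": failed}],
    )
```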
u
You can configure state machine transition events to go to CloudWatch Logs; from there, ingestion into whatever log aggregator you use (or the built-in CloudWatch Logs Insights, though I don't have much experience with it) should let you alert on the relevant events.
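For instance, a hypothetical sketch of that with a metric filter plus one alarm — the log group path, filter pattern, and SNS topic ARN are all assumptions about your setup:
```python
import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Assumes your state machines all log to one shared log group and that
# execution-level log events carry a "type" field like "ExecutionFailed".
LOG_GROUP = "/aws/vendedlogs/states/metaflow"  # placeholder

logs.put_metric_filter(
    logGroupName=LOG_GROUP,
    filterName="metaflow-execution-failed",
    filterPattern='{ $.type = "ExecutionFailed" }',
    metricTransformations=[{
        "metricName": "ExecutionsFailed",
        "metricNamespace": "Custom/Metaflow",
        "metricValue": "1",
    }],
)

# One alarm over the aggregated metric; new flows are covered automatically
# as long as they log to the same group.
cloudwatch.put_metric_alarm(
    AlarmName="metaflow-any-execution-failed",
    Namespace="Custom/Metaflow",
    MetricName="ExecutionsFailed",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",  # no failures logged == OK
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],  # placeholder
)
```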
💯 2
s
Thanks!
f
+1 for Matt’s idea. I’d also add that, personally, I think it could quickly become quite overwhelming depending on the size of the team and the number of jobs you’re running. As part of deploying SFNs to production, you could add CloudWatch alarms on the execution status (per SFN) so the alerts can be sent via Slack webhook to an individual/channel, or by email. There’s been some talk of releasing a `@notify` decorator to streamline the AWS bits at deployment time, and last I checked there’s some discussion around having flow-level alerts versus step-level ones (e.g. a Batch job). For what you’re describing, I’d be extra cautious since you’re talking about having alerts for individual step executions/batch jobs – that’s asking for spam, especially with how easy Metaflow makes it to horizontally scale with `foreach` fanouts 🙂
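The per-SFN alarm could look something like this sketch, hooked into your deploy step; it uses the built-in AWS/States ExecutionsFailed metric, while the alarm-name scheme and SNS topic are placeholders:
```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def add_failure_alarm(state_machine_arn: str, sns_topic_arn: str) -> None:
    """Create a per-state-machine alarm on the built-in AWS/States
    ExecutionsFailed metric. Call this when deploying each SFN."""
    name = state_machine_arn.split(":")[-1]
    cloudwatch.put_metric_alarm(
        AlarmName=f"sfn-{name}-executions-failed",  # placeholder scheme
        Namespace="AWS/States",
        MetricName="ExecutionsFailed",
        Dimensions=[{"Name": "StateMachineArn", "Value": state_machine_arn}],
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=0,
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=[sns_topic_arn],  # SNS topic fanning out to Slack/email
    )
```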