# dev-metaflow
s
Hi, is there a recommended way to raise alerts in CloudWatch when any of our Metaflow AWS Batch jobs fail? We don’t want to create new alarms one by one, but CloudWatch won’t allow an alarm to be created from aggregated metrics, such as a wildcard search of ExecutionsFailed across state machines. The problem is we have many Metaflow jobs and new ones are added constantly, and we want to make sure we’re alerted to failures.
a
Interesting. @User or @fresh-laptop-72652 may have some ideas around monitoring Step Functions state machines via CloudWatch alarms.
One approach (the hammer approach) would be to run a Lambda that periodically scans for failed executions.
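A rough boto3 sketch of that Lambda — the custom namespace, the 15-minute window, and scanning *all* state machines in the account are just assumptions for illustration:
```python
import boto3
from datetime import datetime, timedelta, timezone

sfn = boto3.client("stepfunctions")
cloudwatch = boto3.client("cloudwatch")

def handler(event, context):
    # Look back over a window slightly longer than the schedule interval
    # (15 minutes is a placeholder) so no failure slips between runs.
    since = datetime.now(timezone.utc) - timedelta(minutes=15)
    failed = 0
    for page in sfn.get_paginator("list_state_machines").paginate():
        for sm in page["stateMachines"]:
            # First page only (up to 100) -- fine for a sketch; paginate
            # list_executions too if you expect more failures per window.
            execs = sfn.list_executions(
                stateMachineArn=sm["stateMachineArn"],
                statusFilter="FAILED",
                maxResults=100,
            )
            failed += sum(1 for e in execs["executions"] if e["stopDate"] >= since)
    # Publish one aggregated metric, so a single alarm covers every flow,
    # including ones deployed after the alarm was created.
    cloudwatch.put_metric_data(
        Namespace="Custom/Metaflow",  # placeholder namespace
        MetricData=[{"MetricName": "FailedExecutions", "Value": failed}],
    )
```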
u
You can configure state machine transition events to go to CloudWatch Logs; from there, ingestion into whatever log aggregator you use (or the built-in CloudWatch Logs Insights, though I don't have much experience with it) should let you alert on the relevant events.
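For instance, a hypothetical sketch of that with a metric filter plus one alarm — the log group path, filter pattern, and SNS topic ARN are all assumptions about your setup:
```python
import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Assumes your state machines all log to one shared log group and that
# execution-level log events carry a "type" field like "ExecutionFailed".
LOG_GROUP = "/aws/vendedlogs/states/metaflow"  # placeholder

logs.put_metric_filter(
    logGroupName=LOG_GROUP,
    filterName="metaflow-execution-failed",
    filterPattern='{ $.type = "ExecutionFailed" }',
    metricTransformations=[{
        "metricName": "ExecutionsFailed",
        "metricNamespace": "Custom/Metaflow",
        "metricValue": "1",
    }],
)

# One alarm over the aggregated metric; new flows are covered automatically
# as long as they log to the same group.
cloudwatch.put_metric_alarm(
    AlarmName="metaflow-any-execution-failed",
    Namespace="Custom/Metaflow",
    MetricName="ExecutionsFailed",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",  # no failures logged == OK
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],  # placeholder
)
```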
💯 2
s
Thanks!
f
+1 for Matt’s idea. I’d also add that, personally, I think it could quickly become quite overwhelming depending on the size of the team and the number of jobs you’re running. As part of deploying SFNs to production, you could add CloudWatch alarms on the execution status (per SFN) so the alerts can be sent via Slack webhook to an individual/channel, or by email. There’s been some talk of releasing a `@notify` decorator to streamline the AWS bits at deployment time, and last I checked there’s some discussion around having flow-level alerts versus step-level ones (e.g. a Batch job). For what you’re describing, I’d be extra cautious since you’re talking about having alerts for individual step executions/batch jobs – that’s asking for spam, especially with how easy Metaflow makes it to horizontally scale with `foreach` fanouts 🙂
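The per-SFN alarm could look something like this sketch, hooked into your deploy step; it uses the built-in AWS/States ExecutionsFailed metric, while the alarm-name scheme and SNS topic are placeholders:
```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def add_failure_alarm(state_machine_arn: str, sns_topic_arn: str) -> None:
    """Create a per-state-machine alarm on the built-in AWS/States
    ExecutionsFailed metric. Call this when deploying each SFN."""
    name = state_machine_arn.split(":")[-1]
    cloudwatch.put_metric_alarm(
        AlarmName=f"sfn-{name}-executions-failed",  # placeholder scheme
        Namespace="AWS/States",
        MetricName="ExecutionsFailed",
        Dimensions=[{"Name": "StateMachineArn", "Value": state_machine_arn}],
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=0,
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=[sns_topic_arn],  # SNS topic fanning out to Slack/email
    )
```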