# dev-metaflow
d
Hey folks 👋 I was looking at the GitHub issue for a custom @notify decorator to send notifications by email/Slack. Wanted to see if other organisations have similar needs, and would love to hear requirements/thoughts around this.
👋 1
2
u
Seems like the more general answer to what is described in the linked issue would be the ability to create a step that is run only on failure/success of the entire flow; whether these steps notify people, and how, would be up to the flow implementer. Maybe the easiest way is by supporting some keyword step names (akin to `start` and `end` today) like `on_failure` and `on_success`. And they are just steps like any other, except like `end` they cannot themselves have subsequent steps, and also you never have to explicitly reference them in a `next`; if defined, the orchestrator knows where to include them in the DAG.
i
we do, but we've rolled our own solution. we just call an alert function in the end step
d
@User Yeah, great suggestion. I was thinking of something similar to what you suggested, where `on_failure` and `on_success` are steps that run at the end, and it's left to the orchestrator to include them in the DAG.
@important-jewelry-43189 Yeah that's a great idea. I was talking to Savin about how we can include this in Metaflow, since a lot of folks need this or are using their own custom solution. What kind of integrations do you have at the moment? Is it an email or Slack alert?
i
I wrote it using IFTTT webhooks. email, slack alerts channel, and android alerts through the IFTTT app.
👍 1
we include the pathspec and a link to the relevant cloudwatch logs. it would be great if this was flow-wide, though, so we could do things like report elapsed time and send alerts for failed pipelines. that's something I miss from Kedro
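A minimal sketch of the kind of alert helper described above, assuming the IFTTT Webhooks trigger URL format and its `value1` to `value3` JSON fields. The event name, key, and the helper names `build_ifttt_request` and `send_alert` are hypothetical, not the poster's actual code.

```python
import json
import urllib.request

# IFTTT Webhooks trigger endpoint; `event` and `key` are placeholders.
IFTTT_URL = "https://maker.ifttt.com/trigger/{event}/with/key/{key}"


def build_ifttt_request(event, key, pathspec, log_url, status="failed"):
    """Build (but do not send) a POST request for an IFTTT webhook."""
    # The webhook accepts up to three free-form fields: value1..value3.
    body = json.dumps(
        {"value1": pathspec, "value2": status, "value3": log_url}
    ).encode("utf-8")
    return urllib.request.Request(
        IFTTT_URL.format(event=event, key=key),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(request, opener=urllib.request.urlopen):
    """Send the request; `opener` is injectable so tests can skip the network."""
    with opener(request, timeout=10) as response:
        return response.status
```

Inside a Metaflow step, the pathspec could come from `current.pathspec`; the CloudWatch link would have to be assembled separately.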
d
We do something kind of similar with our data pipelines, and it would be great to have something similar in Metaflow. So with the flow-wide info, would you want things like CloudWatch log groups for each step or just failures? And do you want a detailed breakdown of how much time each step took, or just the start and end time of the flow (to monitor things like SLAs and report on that)?
u
@important-jewelry-43189 You are not guaranteed to hit your end step if one of your upstream steps fails.
💯 1
u
E.g., if there's an OOM.
u
In a DAG with `on_failure`, every step in the DAG would point to a single failure step that executes before ending the flow. In some sense, the current `end` step is sufficient as an "on success" step, provided you don't include other flow logic in it.
👍 1
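The "every step points to a single failure step" topology can be sketched on a plain adjacency map. This is only an illustration of the graph shape, not how Metaflow builds its DAG; `add_failure_step` and the dict representation are assumptions made for the example.

```python
def add_failure_step(dag, failure_step="on_failure"):
    """Return a copy of `dag` where every step also points to `failure_step`.

    `dag` maps each step name to the list of steps it transitions to.
    """
    if failure_step in dag:
        raise ValueError(f"{failure_step!r} is already a step in the DAG")
    # Every existing step gains a conditional edge to the failure step.
    augmented = {step: list(nexts) + [failure_step] for step, nexts in dag.items()}
    # Like `end`, the failure step is terminal: it has no successors.
    augmented[failure_step] = []
    return augmented


flow = {"start": ["train"], "train": ["end"], "end": []}
with_failure = add_failure_step(flow)
```

The number of edges grows by one per step, which is what makes the rendered graph busier.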
a
Our current design was to include a custom AWS Lambda together with Metaflow's infrastructure. This Lambda gets triggered via EventBridge whenever an AWS Batch job reports a failed status. The Lambda then sends the alert (Slack/email/…). Hence, no @notify decorator is needed. But loving the ideas suggested so far.
👍 2
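A rough sketch of what such a Lambda could look like, assuming the standard EventBridge "Batch Job State Change" event shape. The `notify` hook and the message format are hypothetical stand-ins for the real Slack/email delivery, which the thread does not describe.

```python
# EventBridge rule pattern that matches failed AWS Batch jobs.
EVENT_PATTERN = {
    "source": ["aws.batch"],
    "detail-type": ["Batch Job State Change"],
    "detail": {"status": ["FAILED"]},
}


def format_failure_message(event):
    """Turn a Batch job state change event into alert text."""
    detail = event.get("detail", {})
    return (
        f"Batch job {detail.get('jobName', '<unknown>')} "
        f"({detail.get('jobId', '<unknown>')}) ended with status "
        f"{detail.get('status', '<unknown>')}: "
        f"{detail.get('statusReason', 'no reason given')}"
    )


def notify(message):
    """Hypothetical delivery hook; replace with a Slack or email call."""
    print(message)


def handler(event, context, notify=notify):
    """Lambda entry point. Only FAILED jobs produce a notification."""
    if event.get("detail", {}).get("status") != "FAILED":
        return None
    message = format_failure_message(event)
    notify(message)
    return message
```

Because the trigger lives outside the flow, this also covers cases like an OOM where the flow never reaches its `end` step.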
i
@User right -- that's what I was saying I miss from Kedro. Kedro lets you define hooks that can run before/after/on failure and can be targeted to different pipeline/step phases. https://kedro.readthedocs.io/en/stable/07_extend_kedro/02_hooks.html
So instead of manually defining something like an on_failure step for each pipeline you could define an on_pipeline_error hook that would automatically modify the behavior of all your project's pipelines
@damp-lizard-48279 I'm a metaflow noob (still in the middle of porting from Kedro) but our kedro alerts just link to the logs for the whole pipeline and we drill in using the CloudWatch UI. We recorded total pipeline runtime and specific processing steps using tqdm. However, our tqdm progress bars are getting suppressed by metaflow. I haven't dug into how metaflow logging works yet
s
There are a few ways to achieve the semantics proposed by @User and @important-jewelry-43189 - we can have a flow-level `@notify` decorator which behaves similarly to the step-level `@catch` decorator - in case the step succeeds and is named `end`, we emit a success notification; in case the step fails, we can trigger a retry, execute the fallback code, and emit a failure notification.
👍 1
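To make the proposed semantics concrete, here is a plain-Python sketch of a notify-style wrapper: retry on failure, then emit either a success or a failure notification. It is not a Metaflow decorator and does not use Metaflow's decorator machinery; the `notify` signature, the `send` callback, and the `retries` parameter are all assumptions for illustration.

```python
import functools


def notify(send, retries=0):
    """Hypothetical decorator: report success or failure of the wrapped callable."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            last_error = None
            # One initial attempt plus `retries` further attempts.
            for _attempt in range(retries + 1):
                try:
                    result = func(*args, **kwargs)
                except Exception as error:
                    last_error = error
                else:
                    send(f"{func.__name__} succeeded")
                    return result
            # All attempts failed: notify, then surface the original error.
            send(f"{func.__name__} failed: {last_error!r}")
            raise last_error

        return wrapper

    return decorator
```

This sketch re-raises after notifying. A `@catch`-like variant would swallow the exception and run fallback code instead, which is one of the UX choices still open in the thread.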
A good question is: what kind of notification mechanisms do you use today? Should we start with Slack messages, which should be relatively easy to implement?
👍 1
Pinging @rough-terabyte-71304 as well, for his input.
@User https://outerbounds-community.slack.com/archives/C020U025QJK/p1623280683110900?thread_ts=1623277512.108500&cid=C020U025QJK That's a great idea - however it will change the structure of the DAG on Step Functions a bit, making it not so visually appealing in the console UI. Not a major issue though.
@important-jewelry-43189 That's an interesting idea - `@on_success` / `@on_failure` - however we will have to think through what other use cases it may be useful for in the future if we go down this path.
@damp-lizard-48279 Would you like to start a quick doc outlining the use cases (user stories)? Then we can quickly come to a conclusion regarding the UX that needs to be supported, and brainstorm how to best implement the proposed UX.
d
@square-wire-39606 Yes, I’ll collate all the use cases/ideas and share the document.
s
Awesome!
u
@square-wire-39606 yes, including a conditionally executed "error" step that all other steps point to does complicate the graph rendering a bit, but we have some internal DAGs like this (even some quite large ones) and it's not too bad.
s
Nice!
d
Here is the document with some options and UX questions that Savin has added as well, feel free to add more things and thoughts to it. https://docs.google.com/document/d/1QPX2JhiVf1XPAs2u5j3jdBCGzi_NWfDMjl-LhSvvEx8/edit?usp=sharing
i
example of our log messages. IFTTT automatically shortens the cloudwatch links.
d
Hey everyone, thanks for all your input. Savin and I have added some UX decisions, keeping in mind all the input you folks have provided. Please have a look and shout out if you think something could be improved/changed. 🙂
👍 2