Outerbounds #dev-metaflow

damp-lizard-48279

06/09/2021, 10:25 PM

Hey folks 👋 I was looking at the GitHub issue for Custom @notify decorator to send notifications by email/Slack. Wanted to see if other organisations have similar needs and would love to hear requirements/thoughts around this.

👋 1

✨ 2

user

06/09/2021, 10:35 PM

Seems like the more general answer to what is described in the linked issue would be the ability to create a step that is run only on failure/success of the entire flow; whether these steps notify people, and how, would be up to the flow implementer. Maybe the easiest way is by supporting some keyword step names (akin to `start` and `end` today) like `on_failure` and `on_success`. And they are just steps like any other, except that, like `end`, they cannot themselves have subsequent steps, and also you never have to explicitly reference them in a `next`; if defined, the orchestrator knows where to include them in the DAG.
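
A minimal sketch of those semantics in plain Python, not Metaflow: `run_flow` and the step functions are hypothetical names made up for illustration, and nothing here is an existing Metaflow API. It only shows the idea that the terminal steps are never referenced by other steps and that exactly one of them runs.

```python
from typing import Callable, List, Optional

Step = Callable[[dict], None]


def run_flow(steps: List[Step],
             on_success: Optional[Step] = None,
             on_failure: Optional[Step] = None) -> dict:
    """Run steps in order, then run exactly one terminal step (hypothetical runner)."""
    state: dict = {"completed": [], "error": None}
    try:
        for step in steps:
            step(state)
            state["completed"].append(step.__name__)
    except Exception as exc:
        # Any failing step routes to the single failure step.
        state["error"] = exc
        if on_failure is not None:
            on_failure(state)
        return state
    if on_success is not None:
        on_success(state)
    return state


def start(state: dict) -> None:
    state["rows"] = 3


def train(state: dict) -> None:
    raise MemoryError("simulated OOM")  # stands in for a failing step


def end(state: dict) -> None:
    pass


def notify_success(state: dict) -> None:
    state["notice"] = "ok"


def notify_failure(state: dict) -> None:
    state["notice"] = f"failed after {state['completed']}: {state['error']!r}"


result = run_flow([start, train, end],
                  on_success=notify_success,
                  on_failure=notify_failure)
print(result["notice"])
```

Note that a real out-of-memory kill takes the whole process down, so an in-process `try`/`except` like this would not see it; that is one reason for having the orchestrator schedule the failure step.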

important-jewelry-43189

06/09/2021, 10:46 PM

we do, but we've rolled our own solution. we just call an alert function in the end step

damp-lizard-48279

06/09/2021, 10:54 PM

@User Yeah, great suggestion. I was thinking of something similar to what you suggested: `on_failure` and `on_success` are steps that run at the end, and it is left to the orchestrator to include them in the DAG.

damp-lizard-48279

06/09/2021, 10:56 PM

@important-jewelry-43189 Yeah that’s a great idea. I was talking to Savin about how we can include this in Metaflow, since a lot of folks would need this or are using their own custom solution. What kind of integrations do you have at the moment? Is it an email or Slack alert?

important-jewelry-43189

06/09/2021, 10:57 PM

I wrote it using IFTTT webhooks. email, slack alerts channel, and android alerts through the IFTTT app.

👍 1

important-jewelry-43189

06/09/2021, 11:01 PM

we include the pathspec and a link to the relevant cloudwatch logs. it would be great if this was flow-wide, though, so we could do things like report elapsed time and send alerts for failed pipelines. that's something I miss from Kedro
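
A rough sketch of what such an alert helper could look like, using only the standard library. The URL pattern and the `value1`/`value2`/`value3` fields follow the IFTTT Webhooks service as I understand it; the function names, event name, and key are placeholders, not the code from this thread.

```python
import json
import urllib.request

# IFTTT Webhooks trigger URL; event name and key are placeholders.
IFTTT_URL = "https://maker.ifttt.com/trigger/{event}/with/key/{key}"


def build_alert_request(event: str, key: str, status: str,
                        pathspec: str, log_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the alert fields."""
    payload = {"value1": status, "value2": pathspec, "value3": log_url}
    return urllib.request.Request(
        IFTTT_URL.format(event=event, key=key),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status


# Build a request with made-up values; nothing is sent here.
req = build_alert_request(
    event="flow_alert",
    key="YOUR_IFTTT_KEY",
    status="success",
    pathspec="MyFlow/12/end/345",
    log_url="https://example.com/cloudwatch/logs",
)
print(req.full_url)
```

Inside a Metaflow step the pathspec would presumably come from `current.pathspec`, with `send_alert(req)` called as the last line of the `end` step.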

damp-lizard-48279

06/09/2021, 11:07 PM

We do something kind of similar with our data pipelines, and it would be great to have something similar in Metaflow. So with the flow-wide info, would you want things like CloudWatch log groups for each step or just for failures? And do you want a detailed breakdown of how much time each step took, or just the start and end time of the flow (to monitor things like SLAs and report on that)?

user

06/09/2021, 11:15 PM

@important-jewelry-43189 You are not guaranteed to hit your end step if one of your upstream steps fails.

💯 1

user

06/09/2021, 11:16 PM

E.g., if there's an OOM.

user

06/09/2021, 11:18 PM

In a DAG with `on_failure`, every step in the DAG would point to a single failure step that executes before ending the flow. In some sense, the current `end` step is sufficient as an "on success" step, provided you don't include other flow logic in it.

👍 1

adventurous-gigabyte-81428

06/10/2021, 6:21 AM

Our current design was to include a custom AWS Lambda together with Metaflow’s infrastructure. This Lambda gets triggered via EventBridge whenever an AWS Batch job reports a failed status. The Lambda then sends the alert (Slack/email/…). Hence, there is no @notify decorator needed. But loving the ideas suggested so far.

👍 2
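
A hedged sketch of what such a Lambda could look like; it is not the poster's actual code. It assumes an EventBridge rule matching AWS Batch "Batch Job State Change" events with status `FAILED`, and a Slack incoming-webhook URL stored in an environment variable whose name (`SLACK_WEBHOOK_URL`) is made up here.

```python
import json
import os
import urllib.request

# EventBridge rule pattern that would route failed Batch jobs to this Lambda.
EVENT_PATTERN = {
    "source": ["aws.batch"],
    "detail-type": ["Batch Job State Change"],
    "detail": {"status": ["FAILED"]},
}


def format_failure(event: dict) -> str:
    """Turn a Batch job state-change event into a one-line alert message."""
    detail = event.get("detail", {})
    return (
        f"AWS Batch job {detail.get('jobName', '<unknown>')} "
        f"({detail.get('jobId', '<unknown>')}) "
        f"{detail.get('status', '<unknown>')}: "
        f"{detail.get('statusReason', 'no reason given')}"
    )


def handler(event: dict, context: object) -> dict:
    """Lambda entry point: post the alert to Slack if a webhook is configured."""
    text = format_failure(event)
    webhook = os.environ.get("SLACK_WEBHOOK_URL")  # hypothetical variable name
    if webhook:
        req = urllib.request.Request(
            webhook,
            data=json.dumps({"text": text}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            resp.read()
    return {"sent": bool(webhook), "text": text}


# Made-up sample event, formatted locally without any network call.
sample_event = {
    "source": "aws.batch",
    "detail-type": "Batch Job State Change",
    "detail": {
        "jobName": "train-step",
        "jobId": "abc-123",
        "status": "FAILED",
        "statusReason": "OutOfMemoryError: Container killed due to memory usage",
    },
}
print(format_failure(sample_event))
```

Because the alert is driven by the Batch job status rather than by flow code, it still fires when a step dies before reaching `end`.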

important-jewelry-43189

06/10/2021, 4:15 PM

@User right -- that's what I was saying I miss from Kedro. Kedro lets you define hooks that can run before/after/on failure and can be targeted to different pipeline/step phases. https://kedro.readthedocs.io/en/stable/07_extend_kedro/02_hooks.html

important-jewelry-43189

06/10/2021, 4:17 PM

So instead of manually defining something like an on_failure step for each pipeline you could define an on_pipeline_error hook that would automatically modify the behavior of all your project's pipelines

important-jewelry-43189

06/10/2021, 4:33 PM

@damp-lizard-48279 I'm a metaflow noob (still in the middle of porting from Kedro) but our kedro alerts just link to the logs for the whole pipeline and we drill in using the Cloudwatch ui. We recorded total pipeline runtime and specific processing steps using tqdm. However, our tqdm progress bars are getting suppressed by metaflow. I haven't dug into how metaflow logging works yet

square-wire-39606

06/14/2021, 8:44 PM

There are a few ways to achieve the semantics proposed by @User and @important-jewelry-43189. We can have a flow-level `@notify` decorator which behaves similarly to the step-level `@catch` decorator: in case the step succeeds and is named `end`, we emit a success notification; in case the step fails, we can trigger a retry, execute the fallback code, and emit a failure notification.

👍 1

square-wire-39606

06/14/2021, 8:45 PM

A good question is: what kind of notification mechanisms do you use today? Should we start with Slack messages, which should be relatively easy to implement?

👍 1

square-wire-39606

06/14/2021, 8:46 PM

Pinging @rough-terabyte-71304 as well, for his input.

square-wire-39606

06/14/2021, 8:49 PM

@User https://outerbounds-community.slack.com/archives/C020U025QJK/p1623280683110900?thread_ts=1623277512.108500&cid=C020U025QJK That's a great idea - however it will change the structure of the DAG on Step Functions a bit making it not so visually appealing in the console UI. Not a major issue though.

square-wire-39606

06/14/2021, 8:51 PM

@important-jewelry-43189 That's an interesting idea (`@on_success`, `@on_failure`); however, we will have to think through what other use cases it may be useful for in the future if we go down this path.

square-wire-39606

06/14/2021, 8:52 PM

@damp-lizard-48279 Would you like to start a quick doc outlining the use cases (user stories)? Then we can quickly come to a conclusion regarding the UX that needs to be supported, and brainstorm how to best implement the proposed UX.

damp-lizard-48279

06/14/2021, 10:42 PM

@square-wire-39606 Yes, I’ll collate all the use cases/ideas and share the document.

square-wire-39606

06/14/2021, 10:49 PM

Awesome!

user

06/14/2021, 10:51 PM

@square-wire-39606 yes, including a conditionally executed "error" step that all other steps point to does complicate that graph rendering a bit, but we have some internal DAGs like this (even some quite large ones) and it's not too bad.

square-wire-39606

06/14/2021, 10:57 PM

Nice!

damp-lizard-48279

06/16/2021, 12:27 AM

Here is the document with some options and UX questions that Savin has added as well. Feel free to add more thoughts to it. https://docs.google.com/document/d/1QPX2JhiVf1XPAs2u5j3jdBCGzi_NWfDMjl-LhSvvEx8/edit?usp=sharing

important-jewelry-43189

06/16/2021, 1:55 AM

example of our log messages. IFTTT automatically shortens the cloudwatch links.

damp-lizard-48279

06/21/2021, 12:34 AM

Hey everyone, thanks for all your input. Savin and I have added some UX decisions, keeping in mind all the input you folks have provided. Please have a look and shout out if you think something could be improved/changed. 🙂

👍 2
