# dev-metaflow
a
Hi @User, I've been working in a Metaflow branch on problems maybe related to your msg above: https://github.com/celsiustx/metaflow/tree/dsl/metaflow/api The README has more detail, but the tl;dr is:
• Run flows from the command-line: `metaflow flow <file>:<class> run …`; the CLI is reworked so that is the entrypoint for a flow (rather than `python <file>` + a `__main__` handler). Other subcommands (`help`, `show`, etc.) also work.
• Define multiple flows in one file (making them more like regular Python classes)
• Test them using regular `pytest` unit tests (including end-to-end tests that shell out to the CLI to run the flow, then verify results from the store afterward)
• Compose flows: right now inheritance works (you can put a group of steps in one `FlowSpec` class and "mix it in" to other flows before or after it), which is one way to compose flows. I have a few ideas for composing via imperative commands in the `FlowSpec` class definition.
• Get rid of some {step,flow}-definition boilerplate, especially `self.start`, `self.next`, and `self.end`. "next" info is moved into the next step's decorator, to separate `@step` logic from higher-level control/data flow.
Check out the examples. I am very new to this project, and really just started down this path because my company (Celsius) asked me to help them with 4 flows they wrote and have "pipelined" together in an outer Python file that just `check_call`s each one in sequence (which loses some of the good parts of Metaflow, like `resume`, and generally seems like not a good setup). As such, I'm interested in all feedback on whether these are good ideas, whether any of them are already implemented, whether I should keep working in these directions, etc. Thanks!
❤️ 3
👍 2
Also, I discussed a slightly earlier snapshot of this work with @average-beach-28850 a few weeks ago. I've been meaning to post here or on GitHub about it, so your message/doc were a good prompt 🙂 thanks
s
woah, lots of great ideas here! 🌈
🙌 1
it'll take a while to digest it all 🙂
👍 1
I'm definitely curious to hear what other folks on this channel think about these ideas too. Feel free to chime in on this thread
👍 1
I will take a deeper look towards the end of the week and get back to you
👍 1
l
Why remove the `start` and `end`?
I really love the `__main__`-independent flow runs. It is really neat to call flows that way.
👍 1
I am a little confused about how to read a flow without `start` and `end`? Maybe I am missing something?
a
@little-apartment-49355 my main impetus for removing start/end as explicit, required steps was to support composition via inheritance. The `Flow` metaclass synthesizes trivial start/end, and the actual steps are just arranged in the order they're declared (but you can pass arguments to the step decorator to explicitly set or override which upstream step(s) it depends on). Does that make sense?
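(For illustration, the idea can be sketched roughly like this; `Flow`, `step`, and `_graph` are made-up names for this sketch, not the branch's actual implementation:)

```python
# Hypothetical sketch: a metaclass that records @step methods in
# declaration order and synthesizes trivial start/end steps around them.
def step(fn=None, *, prev=None):
    """Mark a method as a flow step; `prev` optionally names its upstream step."""
    def wrap(f):
        f._is_step = True
        f._prev = prev
        return f
    return wrap(fn) if fn is not None else wrap

class Flow(type):
    def __new__(mcls, name, bases, ns):
        cls = super().__new__(mcls, name, bases, ns)
        # class-body namespaces preserve declaration order in Python 3
        declared = [k for k, v in ns.items() if getattr(v, "_is_step", False)]
        cls._graph = ["start"] + declared + ["end"]  # synthesized endpoints
        return cls

class MyFlow(metaclass=Flow):
    @step
    def one(self):
        self.n = 111

    @step(prev="one")
    def double_n(self):
        self.n2 = self.n * 2
```

Here `MyFlow._graph` comes out as `["start", "one", "double_n", "end"]` without the user ever writing `start` or `end`.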
h
I like the composition, unit testability, and getting rid of `self.next`
👍 1
a
yea, it's possible that composing flows inline inside other flows (rather than via inheritance) will be the preferred route for reusing flows, and in that case the `start`/`end` semantics/requirements can be left alone. Those reqs feel a bit overbearing today, to me; at a minimum, synthesizing a trivial

```python
def end(self): pass
```

for the user seems like an ergonomics improvement. Whether the first step must always be named `start`, my feelings are less clear on
Or, a different take on it: should a flow be allowed to have <2 steps? Currently there must be ≥2, and the first and last have fixed names (`start`, `end`). That seems like a strange constraint to put on an otherwise simple data model, where:
1. a `@step` is a Python function
2. a flow is a DAG of steps, represented by a Python class
My instinct is to peel back stuff that gets in the way of working with that simpler model. That was also a motivation for decoupling a flow from the file it's defined in (i.e. getting rid of the required `__main__`, etc.).
l
@agreeable-toddler-27168: Yes, it does make sense! It's a very interesting way to have composition while transitioning everything to an entirely decorator-based design. But I was curious how to make parallel branches when we write this way? Ideally `self.next(self.a, self.b)` allows calling `n` separately defined `@step`s while maintaining the semantics of the branches in the code. Here the order of arranging `@step`s defines the execution hierarchy, making it a little confusing how to do branching.
a
yes, good q, I don't think I've implemented it yet, but my idea was that the succeeding steps would be declared like

```python
class MyFlow(metaclass=Flow):
  @step
  def one(self):
    # initialize something that multiple steps will then operate on
    self.n = 111

  @step(prev='one')
  def doubleN(self):
    self.n2 = self.n * 2

  @step(prev='one')
  def tripleN(self):
    self.n3 = self.n * 3

  @join
  def join(self):
    print(f'Computed: {self.n} * 2 == {self.n2}, {self.n} * 3 == {self.n3}')
```

maybe that join would need to declare its dependencies, like `@join('doubleN','tripleN')`, or maybe MF can "just figure it out". Too much magic can be bad, ofc, but from where we are starting I am really interested in how far we can go in the "just figure it out" direction 🙂
that same `@step(prev='…')` formulation would also be usable by regular linear steps, if you didn't want to (or couldn't, for some reason) declare them in execution order
I guess a key thing I should have mentioned is that I am only trying to intercept and mess with things during graph construction. The graphs themselves are not that complicated, generally speaking, but the current way of constructing them involves these hurdles and limitations I feel are unnecessary (or at least, there should be multiple interfaces for constructing them). Other approaches (e.g. more decorator-based like this one, or even wilder stuff like a DSL that does away with using classes to represent flows, and just operates on Python functions directly) should not have much trouble letting users generate any graph that is possible today. I'm additionally trying to make the decorators infer graph structure, to reduce boilerplate, but as a baseline it should be possible to be more explicit+verbose with the decorators, to avoid confusion.
l
@agreeable-toddler-27168: Wow! This is quite detailed and comprehensive. I really like how `@join` becomes a simple decorator. The stringified naming has one small caveat, which is that I can't rely heavily on my IDE's code completion when typing out graph transitions, but that's not that big a tradeoff. I was thinking about how to write flows using this methodology and wanted your insights on how far it can be pushed, because this convention requires dealing with some level of ambiguity. I also think your `@join` proposal can potentially make some constructs easier to use when building large pipelines. Consider the example flow below:
```python
class MyFlow(metaclass=Flow):
    @step(prev='start')
    def one(self):
        # initialize something that multiple steps will then operate on
        self.n = 111

    @step(prev='start')
    def two(self):
        self.k = 12

    @step(prev='one')
    def doubleN(self):
        self.n2 = self.n * 2

    @step(prev='two')
    def tripleK(self):
        self.k3 = self.k * 3

    @step(prev='tripleK')
    def tripleK2(self):
        self.k3 = self.k3 * 3

    @step(prev='tripleK')
    def tripleK3(self):
        self.k3 = self.k3 * 3

    @join('tripleK3', 'tripleK2', 'doubleN')
    def join(self, inputs):
        pass  # Can this be possible?
```
In the above flow, `one` and `two` are branches from `start`, but as there is no `prev` associated, it needs to be inferred from the entire DAG.
The image attached shows the DAG. The main question is that currently I have to do a join for every branch I made before I can join the main branches made by `one` and `two`, i.e. I have to join `tripleK2` and `tripleK3` first. With this convention, we can explicitly make it very clear what artifacts we need to join. This makes it convenient to skip the step in red I made in the figure and directly join all branches stemming from the same base. Do I make sense? Do you think it would be a good tweak? I am asking because with many `foreach`'s I have to keep joining the branches. This convention makes it simple to specify the scope of the join and allows me to merge artifacts in one go. This is currently not possible with MF's design, as it will throw an error: the step in red is needed due to design constraints.
a
interesting, I'm parsing this, but one note: I think you want:

```python
@step(prev='start')
def two(self):
    self.k = 12
```

toward the beginning, right? As written, `two` will follow `one` linearly (if so, feel free to edit your msg and I'll delete this one 😉)
l
Oh! I assumed that because I wrote `prev=one` and `prev=two` in the `tripleK` and `tripleN` steps, the ambiguity of `one` and `two` being branches from `start` would be taken care of. 😛
a
I think those later parts are fine. I'm just noting that, at least the way I have implemented it now, the graph-parser assumes that a no-args `@step` means "linear dep from the previous step in the file", so in this case it would go `start` → `one` → `two`, where you want `start` → {`one`, `two`}
l
Should `prev` be like a keyword that has to be defined or be left completely undefined?
a
I don't think Metaflow can automatically infer that `one` and `two` can be siblings here (as opposed to `one` → `two`). That would require knowing which `self` attrs get set in the body of each function, which in general I don't think we can do (I think it reduces to the halting problem 🙂)
To try to be rigorous, here is the way I imagine it working:
1. start by assuming every `@step` decorator explicitly has a `prev='…'` keyword which specifies the step it follows
2. whenever a `@step`'s `prev` is just the step before it in the file (as in simple, linear flows), it can be omitted, and MF will infer the `prev` value
It's possible there are other/smarter ways to go, but this is what I had in mind, and I think it's unambiguous what should happen in any given case. Apologies if I'm not understanding your question.
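(Those two rules are small enough to sketch executably; `infer_prevs` is a hypothetical helper for illustration, not part of Metaflow or the branch:)

```python
# Sketch of the two inference rules: an explicit prev is taken as-is
# (rule 1); an omitted prev defaults to the step declared immediately
# before it in the file (rule 2).
def infer_prevs(steps):
    """steps: ordered (name, prev_or_None) pairs, in declaration order.
    Returns {step_name: inferred_upstream_step}."""
    resolved = {}
    previous = None
    for name, prev in steps:
        resolved[name] = prev if prev is not None else previous
        previous = name
    return resolved

graph = infer_prevs([
    ("start", None),   # first step: nothing upstream
    ("one", "start"),
    ("two", "start"),  # explicit: a sibling branch of `one`, not its successor
    ("three", None),   # omitted: inferred to follow `two`, the step above it
])
```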
This is all unrelated to your actual main question, which was whether we can effectively have a `@join` collapse multiple "levels" of steps where currently multiple levels of joins are required. I think the answer to that is "yes": that is not so hard to do from a graph-construction PoV, as your illustration indicates.
l
I thought that because MF first compiles the graph before scheduling, we could infer the `prev`'s in the decorators and use those to infer whether `one` and `two` are siblings. Was I understanding it incorrectly?
The rigorous explanation actually answers my question on the definition of `one` and `two`. My question on `one` and `two` was only to test the level of ambiguity this methodology of typing can support. And I think it's totally fine if `prev=start` needs to be mentioned.
👍 1
a
You are understanding mostly correctly. Consider these two flows:

```python
class MyFlow1(metaclass=Flow):
    @step
    def one(self):
        self.n = 111

    @step
    def two(self):
        self.k = 12

class MyFlow2(metaclass=Flow):
    @step
    def one(self):
        self.n = 111

    @step
    def two(self):
        self.n = self.n * 2
```

It seems like you want MF to notice that the graph in `MyFlow1` can be `start` → {`one`, `two`}, while the graph in `MyFlow2` must be `start` → `one` → `two`, right? I am saying that is hard (and maybe impossible). I know that MF can infer `start` → `one` → `two` from both of these. Inferring `start` → {`one`, `two`} from `MyFlow1` requires knowing which data attrs get set on `self` in `one` and `two`. I believe I can design flows where that can't be done statically, for example:

```python
from datetime import datetime

class MyFlow3(metaclass=Flow):
    @step
    def one(self):
        self.n = 111

    @step
    def two(self):
        if datetime.now().second % 2 == 0:
            self.n = self.n * 2  # has to happen after `one`
        else:
            self.k = 12  # can happen concurrently with `one`
```

Maybe using the current time is cheating… I have flows that fetch stuff over the internet, or have other side effects. I can imagine MF parsing this and being conservative about what attrs might be modified, but then we're biting off a whole static-analysis problem that can get really tricky.
l
Ah, I see! That makes total sense about how it reduces to a halting problem 😅. I didn't consider the artifacts themselves and just saw the step abstraction for the DAG.
a
cool, sorry that was such a digression. I think it's an important design observation though: we are not inferring (cannot infer?) graph shape from anything in the body of the steps
I'm actually still injecting a `self.next` call into the end of every function body, to mimic what already happens, but I believe MF should not require that. Some higher level, which is already calling each step, should know which step is next and call it, without relying on the end of each step to call the next one.
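(That higher level could look something like this hypothetical driver; linear case only, and `run_flow`/`TwoStep` are illustrative names, not real Metaflow API:)

```python
# Sketch: a driver that owns the step-to-step transition, so no step
# body ever calls self.next itself.
def run_flow(flow_cls, order):
    """Instantiate the flow and invoke each named step in sequence."""
    flow = flow_cls()
    for name in order:
        getattr(flow, name)()  # the driver, not the step, advances the flow
    return flow

class TwoStep:
    def one(self):
        self.n = 111

    def two(self):
        self.n2 = self.n * 2

finished = run_flow(TwoStep, ["one", "two"])  # finished.n2 == 222
```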
l
This was actually quite insightful and I think your rigorous explanation should be noted as an important design constraint when building DAGs using this flavor of typing.
a
great, thanks for hashing it out with me
u
These ideas are interesting. I am not sure how I feel about the implicit start/end steps and the implicit linear ordering (I generally hate magic). I do like the idea of lifting the graph specification out of the user step code entirely, and I do like being able to define sub-DAGs as mix-ins.
👍 1