# dev-metaflow
a
Hi @User, I've been working in a Metaflow branch on problems maybe related to your msg above: https://github.com/celsiustx/metaflow/tree/dsl/metaflow/api The README has more detail, but the tl;dr is:
• Run flows from the command-line: `metaflow flow <file>:<class> run …`; the CLI is reworked so that is the entrypoint for a flow (rather than `python <file>` + a `__main__` handler). Other subcommands (`help`, `show`, etc.) also work.
• Define multiple flows in one file (making them more like regular Python classes)
• Test them using regular `pytest` unit tests (including end-to-end tests that shell out to the CLI to run the flow, then verify results from the store afterward)
• Compose flows: right now inheritance works (you can put a group of steps in one `FlowSpec` class and "mix it in" to other flows before or after it), which is one way to compose flows. I have a few ideas for composing via imperative commands in the `FlowSpec` class definition.
• Get rid of some {step,flow}-definition boilerplate, especially `self.start`, `self.next`, and `self.end`. "next" info is moved into the next step's decorator, to separate `@step` logic from higher-level control/data flow.
Check out the examples. I am very new to this project, and really just started down this path because my company (Celsius) asked me to help them with 4 flows they wrote and have "pipelined" together in an outer Python file that just `check_call`s each one in sequence (which loses some of the good parts of Metaflow, like `resume`, and generally seems like not a good setup). As such, I'm interested in all feedback on whether these are good ideas, whether any of them are already implemented, whether I should keep working in these directions, etc. Thanks!
❤️ 3
👍 2
Also, I discussed a slightly earlier snapshot of this work with @average-beach-28850 a few weeks ago. I've been meaning to post here or on GitHub about it, so your message/doc were a good prompt 🙂 thanks
s
woah, lots of great ideas here! 🌈
🙌 1
it'll take a while to digest it all 🙂
👍 1
I'm definitely curious to hear what other folks on this channel think about these ideas too. Feel free to chime in on this thread
👍 1
I will take a deeper look towards the end of the week and get back to you
👍 1
l
Why remove the `start` and `end`?
I really love the `__main__`-independent flow runs. It is really neat to call flows that way.
👍 1
I am a little confused about how to read a flow without `start` and `end`? Maybe I am missing something?
a
@little-apartment-49355 my main impetus for removing start/end as explicit, required steps was to support composition via inheritance. The `Flow` metaclass synthesizes trivial start/end, and the actual steps are just arranged in the order they're declared (but you can pass arguments to the step decorator to explicitly set or override which upstream step(s) it depends on). Does that make sense?
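(For illustration, the idea can be sketched roughly like this; `Flow`, `step`, and `_graph` are made-up names for this sketch, not the branch's actual implementation:)

```python
# Hypothetical sketch: a metaclass that records @step methods in
# declaration order and synthesizes trivial start/end steps around them.
def step(fn=None, *, prev=None):
    """Mark a method as a flow step; `prev` optionally names its upstream step."""
    def wrap(f):
        f._is_step = True
        f._prev = prev
        return f
    return wrap(fn) if fn is not None else wrap

class Flow(type):
    def __new__(mcls, name, bases, ns):
        cls = super().__new__(mcls, name, bases, ns)
        # class-body namespaces preserve declaration order in Python 3
        declared = [k for k, v in ns.items() if getattr(v, "_is_step", False)]
        cls._graph = ["start"] + declared + ["end"]  # synthesized endpoints
        return cls

class MyFlow(metaclass=Flow):
    @step
    def one(self):
        self.n = 111

    @step(prev="one")
    def double_n(self):
        self.n2 = self.n * 2
```

Here `MyFlow._graph` comes out as `["start", "one", "double_n", "end"]` without the user ever writing `start` or `end`.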
h
I like the composition, unit testability, and getting rid of `self.next`
👍 1
a
yea, it's possible that composing flows inline inside other flows (rather than via inheritance) will be the preferred route for reusing flows, and in that case the `start`/`end` semantics/requirements can be left alone. Those reqs feel a bit overbearing today, to me; at a minimum, synthesizing a trivial

```python
def end(self): pass
```

for the user seems like an ergonomics improvement. Whether the first step must always be named `start`, my feelings are less clear on
Or, a different take on it: should a flow be allowed to have <2 steps? Currently there must be ≥2, and the first and last have fixed names (`start`, `end`). That seems like a strange constraint to put on an otherwise simple data model, where:
1. a `@step` is a Python function
2. a flow is a DAG of steps, represented by a Python class
My instinct is to peel back stuff that gets in the way of working with that simpler model. That was also a motivation for decoupling a flow from the file it's defined in (i.e. getting rid of the required `__main__`, etc.).
l
@agreeable-toddler-27168: Yes, it does make sense! It's a very interesting way to have composition while transitioning everything to an entirely decorator-based design. But I was curious how to make parallel branches when we write this way? Ideally `self.next(self.a, self.b)` allows calling `n` separately defined `@step`s while maintaining the semantics of the branches in the code. Here the order of arranging `@step`s defines the execution hierarchy, making it a little confusing how to do branching.
a
yes, good q, I don't think I've implemented it yet, but my idea was that the succeeding steps would be declared like

```python
class MyFlow(metaclass=Flow):
  @step
  def one(self):
    # initialize something that multiple steps will then operate on
    self.n = 111

  @step(prev='one')
  def doubleN(self):
    self.n2 = self.n * 2

  @step(prev='one')
  def tripleN(self):
    self.n3 = self.n * 3

  @join
  def join(self):
    print(f'Computed: {self.n} * 2 == {self.n2}, {self.n} * 3 == {self.n3}')
```

maybe that join would need to declare its dependencies, like `@join('doubleN','tripleN')`, or maybe MF can "just figure it out". Too much magic can be bad, ofc, but from where we are starting I am really interested in how far we can go in the "just figure it out" direction 🙂
that same `@step(prev='…')` formulation would also be usable by regular linear steps, if you didn't want to (or couldn't, for some reason) declare them in execution order
I guess a key thing I should have mentioned is that I am only trying to intercept and mess with things during graph construction. The graphs themselves are not that complicated, generally speaking, but the current way of constructing them involves these hurdles and limitations I feel are unnecessary (or at least, there should be multiple interfaces for constructing them). Other approaches (e.g. more decorator-based like this one, or even wilder stuff like a DSL that does away with using classes to represent flows, and just operates on Python functions directly) should not have much trouble letting users generate any graph that is possible today. I'm additionally trying to make the decorators infer graph structure, to reduce boilerplate, but as a baseline it should be possible to be more explicit+verbose with the decorators, to avoid confusion.
l
@agreeable-toddler-27168: Wow! This is quite detailed and comprehensive. I really like how `@join` becomes a simple decorator. The stringified naming has one small caveat, which is that I can't rely heavily on my IDE's code completion when typing out graph transitions, but that's not that big a tradeoff. I was thinking about how to write flows using this methodology and wanted your insights on how far it can be pushed, because this convention requires dealing with some level of ambiguity. I also think your `@join` proposal can potentially make some constructs easier to use when building large pipelines. Consider the example flow below:
```python
class MyFlow(metaclass=Flow):
    @step(prev='start')
    def one(self):
        # initialize something that multiple steps will then operate on
        self.n = 111

    @step(prev='start')
    def two(self):
        self.k = 12

    @step(prev='one')
    def doubleN(self):
        self.n2 = self.n * 2

    @step(prev='two')
    def tripleK(self):
        self.k3 = self.k * 3

    @step(prev='tripleK')
    def tripleK2(self):
        self.k3 = self.k3 * 3

    @step(prev='tripleK')
    def tripleK3(self):
        self.k3 = self.k3 * 3

    @join('tripleK3', 'tripleK2', 'doubleN')
    def join(self, inputs):
        pass  # Can this be possible?
```
In the above flow, `one` and `two` are branches from `start`, but as there is no `prev` associated, it needs to be inferred from the entire DAG.
The image attached shows the DAG. The main question is that currently I have to do a join for every branch I made before I can join the main branches made by `one` and `two`, i.e. I have to join `tripleK2` and `tripleK3` first. With this convention, we can explicitly make it very clear what artifacts we need to join. This makes it convenient to skip the step in red I made in the figure and directly join all branches stemming from the same base. Do I make sense? Do you think it would be a good tweak? I am asking because with many `foreach`'s I have to keep joining the branches. This convention makes it simple to specify the scope of the join and allows me to merge artifacts in one go. This is currently not possible with MF's design, as it will throw an error: the step in red is needed due to design constraints.
a
interesting, I'm parsing this, but one note: I think you want:

```python
@step(prev='start')
def two(self):
    self.k = 12
```

toward the beginning, right? As written, `two` will follow `one` linearly (if so, feel free to edit your msg and I'll delete this one 😉)
l
Oh! I assumed that because I wrote `prev=one` and `prev=two` in the `tripleK` and `tripleN` steps, the ambiguity of `one` and `two` being branches from `start` would be taken care of. 😛
a
I think those later parts are fine. I'm just noting that, at least the way I have implemented it now, the graph-parser assumes that a no-args `@step` means "linear dep from the previous step in the file", so in this case it would go `start` → `one` → `two`, where you want `start` → {`one`, `two`}
l
Should `prev` be like a keyword that has to be defined or be left completely undefined?
a
I don't think Metaflow can automatically infer that `one` and `two` can be siblings here (as opposed to `one` → `two`). That would require knowing which `self` attrs get set in the body of each function, which in general I don't think we can do (I think it reduces to the halting problem 🙂)
To try to be rigorous, here is the way I imagine it working:
1. start by assuming every `@step` decorator explicitly has a `prev='…'` keyword which specifies the step it follows
2. whenever a `@step`'s `prev` is just the step before it in the file (as in simple, linear flows), it can be omitted, and MF will infer the `prev` value
It's possible there are other/smarter ways to go, but this is what I had in mind, and I think it's unambiguous what should happen in any given case. Apologies if I'm not understanding your question.
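(Those two rules are small enough to sketch executably; `infer_prevs` is a hypothetical helper for illustration, not part of Metaflow or the branch:)

```python
# Sketch of the two inference rules: an explicit prev is taken as-is
# (rule 1); an omitted prev defaults to the step declared immediately
# before it in the file (rule 2).
def infer_prevs(steps):
    """steps: ordered (name, prev_or_None) pairs, in declaration order.
    Returns {step_name: inferred_upstream_step}."""
    resolved = {}
    previous = None
    for name, prev in steps:
        resolved[name] = prev if prev is not None else previous
        previous = name
    return resolved

graph = infer_prevs([
    ("start", None),   # first step: nothing upstream
    ("one", "start"),
    ("two", "start"),  # explicit: a sibling branch of `one`, not its successor
    ("three", None),   # omitted: inferred to follow `two`, the step above it
])
```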
This is all unrelated to your actual main question, which was whether we can effectively have a `@join` collapse multiple "levels" of steps where currently multiple levels of joins are required. I think the answer to that is "yes": that is not so hard to do from a graph-construction PoV, as your illustration indicates.
l
I thought that because MF first compiles the graph before scheduling, we could infer the `prev`'s in the decorators and use those to infer whether `one` and `two` are siblings. Was I understanding it incorrectly?
The rigorous explanation actually answers my question on the definition of `one` and `two`. My question on `one` and `two` was only to test the level of ambiguity this methodology of typing can support. And I think it's totally fine if `prev=start` needs to be mentioned.
👍 1
a
You are understanding mostly correctly. Consider these two flows:

```python
class MyFlow1(metaclass=Flow):
    @step
    def one(self):
        self.n = 111

    @step
    def two(self):
        self.k = 12

class MyFlow2(metaclass=Flow):
    @step
    def one(self):
        self.n = 111

    @step
    def two(self):
        self.n = self.n * 2
```

It seems like you want MF to notice that the graph in `MyFlow1` can be `start` → {`one`, `two`}, while the graph in `MyFlow2` must be `start` → `one` → `two`, right? I am saying that is hard (and maybe impossible). I know that MF can infer `start` → `one` → `two` from both of these. Inferring `start` → {`one`, `two`} from `MyFlow1` requires knowing which data attrs get set on `self` in `one` and `two`. I believe I can design flows where that can't be done statically, for example:

```python
from datetime import datetime

class MyFlow3(metaclass=Flow):
    @step
    def one(self):
        self.n = 111

    @step
    def two(self):
        if datetime.now().second % 2 == 0:
            self.n = self.n * 2  # has to happen after `one`
        else:
            self.k = 12  # can happen concurrently with `one`
```

Maybe using the current time is cheating… I have flows that fetch stuff over the internet, or have other side effects. I can imagine MF parsing this and being conservative about what attrs might be modified, but then we're biting off a whole static-analysis problem that can get really tricky.
l
Ah, I see! That makes total sense about how it reduces to a halting problem 😅. I didn't consider the artifacts themselves and just saw the step abstraction for the DAG.
a
cool, sorry that was such a digression. I think it's an important design observation though: we are not inferring (cannot infer?) graph shape from anything in the body of the steps
I'm actually still injecting a `self.next` call into the end of every function body, to mimic what already happens, but I believe MF should not require that. Some higher level, which is already calling each step, should know which step is next and call it, without relying on the end of each step to call the next one.
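(That higher level could look something like this hypothetical driver; linear case only, and `run_flow`/`TwoStep` are illustrative names, not real Metaflow API:)

```python
# Sketch: a driver that owns the step-to-step transition, so no step
# body ever calls self.next itself.
def run_flow(flow_cls, order):
    """Instantiate the flow and invoke each named step in sequence."""
    flow = flow_cls()
    for name in order:
        getattr(flow, name)()  # the driver, not the step, advances the flow
    return flow

class TwoStep:
    def one(self):
        self.n = 111

    def two(self):
        self.n2 = self.n * 2

finished = run_flow(TwoStep, ["one", "two"])  # finished.n2 == 222
```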
l
This was actually quite insightful and I think your rigorous explanation should be noted as an important design constraint when building DAGs using this flavor of typing.
a
great, thanks for hashing it out with me
u
These ideas are interesting. I am not sure how I feel about the implicit start/end steps and the implicit linear ordering (I generally hate magic). I do like the idea of lifting the graph specification out of the user step code entirely, and I do like being able to define sub-DAGs as mix-ins.
👍 1