agreeable-toddler-27168
05/25/2021, 2:52 AM
`python <file>` + a `__main__` handler). Other subcommands (`help`, `show`, etc.) also work.
• Define multiple flows in one file (make them more like regular Python classes)
• Test them using regular `pytest` unit tests (including end-to-end tests that shell out to the CLI to run the flow, and verify results from the store afterward)
• Compose flows: right now inheritance works (you can put a group of steps in one `FlowSpec` class and "mix it in" to other flows before or after it), which is one way to compose flows. I have a few ideas for composing via imperative commands in the `FlowSpec` class definition.
• Get rid of some {step,flow}-definition boilerplate, especially: `self.start`, `self.next`, `self.end`. "next" info is moved into the next step's decorator, to separate `@step` logic from higher-level control/data flow. Check out the examples.
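
The end-to-end testing idea in the list above might look roughly like the sketch below. `run_cli` and the stand-in command are illustrative names, not part of Metaflow or of this fork. A real test would shell out to the flow's own CLI (for example `python my_flow.py run`, where `my_flow.py` is a hypothetical file) and then read results back from the store.

```python
import subprocess
import sys


def run_cli(*args):
    """Run a command, raise CalledProcessError on a non-zero exit, return its stdout."""
    result = subprocess.run(list(args), check=True, capture_output=True, text=True)
    return result.stdout


def test_cli_smoke():
    # Stand-in command so the sketch runs anywhere. In a real end-to-end test
    # this would be the flow's CLI invocation (assumed, not shown here),
    # followed by assertions on the artifacts the run stored.
    out = run_cli(sys.executable, "-c", "print('flow finished')")
    assert "flow finished" in out
```

Because `check=True` turns a failed run into an exception, pytest reports a broken flow as a failing test without any extra bookkeeping.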
I am very new to this project, and really just started on this path because my company (Celsius) asked me to help them with 4 flows they wrote and have "pipelined" together in an outer python file that just `check_call`s each one in sequence (which loses some of the good parts of Metaflow, like `resume`, and seems generally like not a good setup).
As such, I'm interested in all feedback on whether these are good ideas, if any of them are already implemented, whether I should keep working in these directions, etc. Thanks!
agreeable-toddler-27168
05/25/2021, 2:53 AM
straight-shampoo-11124
05/25/2021, 4:58 AM
straight-shampoo-11124
05/25/2021, 4:59 AM
straight-shampoo-11124
05/25/2021, 5:00 AM
straight-shampoo-11124
05/25/2021, 5:00 AM
little-apartment-49355
05/25/2021, 5:00 AM
little-apartment-49355
05/25/2021, 5:01 AM
`__main__` independent flow runs. It is really neat to call flows that way.
little-apartment-49355
05/25/2021, 5:03 AM
`start` and `end`? Maybe I am missing something?
agreeable-toddler-27168
05/25/2021, 12:28 PM
`Flow` metaclass synthesizes trivial start/end, and the actual steps are just arranged in the order they're declared (but you can pass arguments to the step decorator to explicitly set or override which upstream step(s) it depends on). Does that make sense?
hallowed-wolf-39595
05/25/2021, 1:42 PM
agreeable-toddler-27168
05/25/2021, 2:05 PM
`def end(self): pass` for the user seems like an ergonomics improvement.
Whether the first step must always be named `start`, my feelings are less clear on
agreeable-toddler-27168
05/25/2021, 2:17 PM
`start`, `end`). That seems like a strange constraint to put on an otherwise simple data model, where:
1. a `@step` is a Python function
2. a flow is a DAG of steps, represented by a Python class
My instinct is to peel back stuff that gets in the way of working with that simpler model. That was also a motivation for decoupling a flow from [the file it's defined in] (i.e. getting rid of required `__main__`, etc.).
little-apartment-49355
05/25/2021, 9:10 PM
`self.next(self.a, self.b)` allows calling `n` separately defined `@step`s while maintaining the semantics in the code about the branches. Here the order in which the `@step`s are arranged defines the execution hierarchy, which makes it a little confusing how to do branching.
agreeable-toddler-27168
05/25/2021, 9:15 PM
class MyFlow(metaclass=Flow):
    @step
    def one(self):
        # initialize something that multiple steps will then operate on
        self.n = 111

    @step(prev='one')
    def doubleN(self):
        self.n2 = self.n * 2

    @step(prev='one')
    def tripleN(self):
        self.n3 = self.n * 3

    @join
    def join(self):
        print(f'Computed: {self.n} * 2 == {self.n2}, {self.n} * 3 == {self.n3}')

maybe that join would need to declare its dependencies, like `@join('doubleN','tripleN')`, or maybe MF can "just figure it out".
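
One possible reading of "just figure it out", sketched under the assumption that the flow graph is available as a plain dict from step name to its upstream step names: an undecorated `@join` could default to every step that nothing else depends on. `infer_join_inputs` is a hypothetical helper, not an existing Metaflow function.

```python
def infer_join_inputs(graph):
    """Return the steps that no other step lists as upstream, in declaration order.

    `graph` maps each step name to the list of step names it follows.
    """
    has_downstream = {p for upstream in graph.values() for p in upstream}
    return [name for name in graph if name not in has_downstream]


# The flow from the message above, minus the join itself.
graph = {
    "one": [],
    "doubleN": ["one"],
    "tripleN": ["one"],
}
print(infer_join_inputs(graph))  # ['doubleN', 'tripleN']
```

This only covers the case where the join is meant to close every open branch. Anything narrower would still need the explicit `@join('doubleN','tripleN')` form.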
Too much magic can be bad, ofc, but from where we are starting I am really interested in how far we can go in the "just figure it out" direction 🙂
agreeable-toddler-27168
05/25/2021, 9:16 PM
`@step(prev='…')` formulation would also be usable by regular linear steps, if you didn't want to (or couldn't, for some reason) declare them in execution order
agreeable-toddler-27168
05/25/2021, 9:20 PM
little-apartment-49355
05/26/2021, 4:35 PM
`@join` becomes a simple decorator. The stringified naming has one small caveat, which is that I can't rely heavily on my IDE's code completions when typing out graph transitions, but that's not that big a tradeoff.
I was thinking about how to type using this methodology and wanted your insights on how far it can be pushed, because this convention requires dealing with some level of ambiguity. I also think your `@join` proposal can potentially make some constructs easier to use when building large pipelines. Consider the example flow below:
class MyFlow(metaclass=Flow):
    @step(prev='start')
    def one(self):
        # initialize something that multiple steps will then operate on
        self.n = 111

    @step(prev='start')
    def two(self):
        self.k = 12

    @step(prev='one')
    def doubleN(self):
        self.n2 = self.n * 2

    @step(prev='two')
    def tripleK(self):
        self.k3 = self.k * 3

    @step(prev='tripleK')
    def tripleK2(self):
        self.k3 = self.k3 * 3

    @step(prev='tripleK')
    def tripleK3(self):
        self.k3 = self.k3 * 3

    @join('tripleK3','tripleK2','doubleN')
    def join(self,inputs):
        # Can this be possible?
        pass

`one` and `two` are branches from `start` but as there is no `prev` associated, it needs to be inferred from the entire DAG.
`one` and `two`
i.e. I have to join `tripleK2` and `tripleK3` first.
With this convention, we can explicitly make it very clear what artifacts we need to join. This makes it convenient to skip the step in red I made in the figure and directly join all branches stemming from the same base. Do I make sense? Do you think it would be a good tweak?
I am asking about this because with many `foreach`'s I have to keep joining the branches. This convention makes it simple to specify the scope of the join and allows me to merge artifacts in one go. This is currently not possible with MF's design: it throws an error, because design constraints require the step in red.
agreeable-toddler-27168
05/26/2021, 4:40 PM
    @step(prev='start')
    def two(self):
        self.k = 12
toward the beginning, right? As written, `two` will follow `one` linearly
(if so, feel free to edit your msg and i'll delete this one 😉)
little-apartment-49355
05/26/2021, 4:42 PM
`prev=one` and `prev=two` in the `tripleK` and `tripleN` steps, I thought we take care of the ambiguity of `one` and `two` being branches from `start`. 😛
agreeable-toddler-27168
05/26/2021, 4:44 PM
`@step` means "linear dep from the previous step in the file", so in this case it would go `start` → `one` → `two`, where you want `start` → {`one`,`two`}
little-apartment-49355
05/26/2021, 4:46 PM
`prev` be like a keyword that has to be defined or be left completely undefined?
agreeable-toddler-27168
05/26/2021, 4:50 PM
`one` and `two` can be siblings here (as opposed to `one` → `two`). That would require knowing which `self` attrs get set in the body of each function, which in general I don't think we can do (I think it reduces to the halting problem 🙂)
agreeable-toddler-27168
05/26/2021, 4:54 PM
`@step` decorator explicitly has `prev='…'` keyword which specifies the step it follows
2. whenever a `@step`'s `prev` is just the step before it in the file (as in simple, linear flows), it can be omitted, and MF will infer the `prev` value
It's possible there are other/smarter ways to go, but this is what I had in mind, and I think it's unambiguous what should happen in any given case.
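
A minimal sketch of the two rules above, assuming steps are handed over as `(name, prev)` pairs in file order and that a trivial `start` step is synthesized. `resolve_prev` is a hypothetical helper for illustration, not the fork's actual implementation. The ordering at the end uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter


def resolve_prev(declared):
    """Map each step name to its list of upstream step names.

    `declared` is a list of (name, prev) pairs in file order. `prev` is None
    (infer from position), one step name, or a list of step names.
    """
    graph = {"start": []}  # synthesized trivial start step
    last = "start"
    for name, prev in declared:
        if prev is None:
            graph[name] = [last]      # omitted: follow the step declared just before
        elif isinstance(prev, str):
            graph[name] = [prev]      # explicit single upstream step
        else:
            graph[name] = list(prev)  # explicit list of upstream steps
        last = name
    # Validate afterwards so that `prev` may name a step declared later in the file.
    for name, upstream in graph.items():
        for p in upstream:
            if p not in graph:
                raise ValueError(f"step {name!r} follows unknown step {p!r}")
    return graph


linear = resolve_prev([("one", None), ("two", None)])
branched = resolve_prev([("one", "start"), ("two", "start")])
print(linear)    # {'start': [], 'one': ['start'], 'two': ['one']}
print(branched)  # {'start': [], 'one': ['start'], 'two': ['start']}

# A driver can derive a valid execution order from the graph alone.
print(list(TopologicalSorter(linear).static_order()))  # ['start', 'one', 'two']
```

Under these rules, two bare `@step` declarations always produce a linear chain, and siblings only appear when both name the same `prev`.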
Apologies if I'm not understanding your question.
agreeable-toddler-27168
05/26/2021, 4:55 PM
`@join` collapse multiple "levels" of steps, where currently multiple levels of joins are required. I think the answer to that is "yes": that is not so hard to do from a graph-construction PoV, as your illustration indicates.
little-apartment-49355
05/26/2021, 4:56 PM
`one` and `two` are siblings. Was I understanding it incorrectly?
little-apartment-49355
05/26/2021, 5:00 PM
`one` and `two`. My question on `one` and `two` was only to test the level of ambiguity the methodology of typing can support. And I think it's totally fine if `prev=start` needs to be mentioned.
agreeable-toddler-27168
05/26/2021, 5:04 PM
class MyFlow1(metaclass=Flow):
    @step
    def one(self):
        self.n = 111

    @step
    def two(self):
        self.k = 12

class MyFlow2(metaclass=Flow):
    @step
    def one(self):
        self.n = 111

    @step
    def two(self):
        self.n = self.n * 2

It seems like you want MF to notice that the graph in `MyFlow1` can be `start` → {`one`,`two`}, while the graph in `MyFlow2` must be `start` → `one` → `two`, right?
I am saying that is hard (and maybe impossible). I know that MF can infer `start` → `one` → `two` from both of these. Inferring `start` → {`one`,`two`} from `MyFlow1` requires knowing which data attrs get set on `self` in `one` and `two`. I believe I can design flows where that can't be done statically, for example:
from datetime import datetime

class MyFlow3(metaclass=Flow):
    @step
    def one(self):
        self.n = 111

    @step
    def two(self):
        if datetime.now().second % 2 == 0:
            self.n = self.n * 2  # has to happen after `one`
        else:
            self.k = 12  # can happen concurrently with `one`

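
For illustration, a conservative pass over the body of `two` in `MyFlow3` could be sketched with the standard `ast` module. `self_attr_usage` is a hypothetical helper, not something Metaflow provides. It only sees literal `self.<name>` accesses, so `setattr`, aliases of `self`, and helper calls would all escape it.

```python
import ast
import textwrap


def self_attr_usage(source):
    """Return (reads, writes): attribute names accessed as `self.<name>` in `source`.

    Every branch is counted, so the result over-approximates what one run touches.
    """
    reads, writes = set(), set()
    for node in ast.walk(ast.parse(textwrap.dedent(source))):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == "self"
        ):
            if isinstance(node.ctx, ast.Store):
                writes.add(node.attr)
            elif isinstance(node.ctx, ast.Load):
                reads.add(node.attr)
    return reads, writes


TWO = """
def two(self):
    if datetime.now().second % 2 == 0:
        self.n = self.n * 2
    else:
        self.k = 12
"""

reads, writes = self_attr_usage(TWO)
print(sorted(reads), sorted(writes))  # ['n'] ['k', 'n']
```

Since `two` may read `n`, a conservative scheduler would have to place it after any step that writes `n`, which gives `one` → `two` again even on runs where only `self.k` is set.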
Maybe using current time is cheating… I have flows that fetch stuff over the internet, or have other side effects. I can imagine MF parsing this and being conservative about what attrs might be modified, but then we're biting off a whole static-analysis problem that can get really tricky.
little-apartment-49355
05/26/2021, 5:07 PM
agreeable-toddler-27168
05/26/2021, 5:08 PM
agreeable-toddler-27168
05/26/2021, 5:10 PM
`self.next` call into the end of every function body, to mimic what already happens, but I believe MF should not require that. Some higher level, which is already calling each step, should know which step is next and call it, without relying on the end of each step to call the next one.
little-apartment-49355
05/26/2021, 5:16 PM
agreeable-toddler-27168
05/26/2021, 5:16 PM
user
06/04/2021, 4:48 AM