I’ve sent draft PR <#612> of the changes I’ve been...
# dev-metaflow
a
I’ve sent draft PR #612 of the changes I’ve been working on (mostly recently discussed here): new CLI and SDK, and simple flow composition via inheritance
👍 5
I’ve updated it since the discussions above, so hoping to move most discussion to the draft PR (and smaller PRs that I will factor out of it)
Also happy to discuss it further here.
s
thanks!
there are great ideas in this PR. Some ideas like flow composition and better support for
pytest
are definitely something we want to support in the future. Other ideas like getting rid of
self.next
are more controversial and they would break all existing flows or add complexity, if both styles were supported. Hence it is hard to consider merging this PR as is
we could cherry pick some diffs from the PR over time
a
OK, sorry if my explanation wasn’t clear, but currently nothing here is backwards-incompatible; you can still write flows the existing way, or you can import the new stuff from
metaflow.api
and write flows the new way. I’m also open to a different name for that package.
self.next
hasn’t gone away, it’s just hidden from the user when they use the new
metaflow.api
version of `FlowSpec`/`@step` (and new `@foreach`/`@join` that also live in the new package)
Check out test_simple_foreach.py to see one example of an old-style and new-style flow defined side-by-side, and a test that runs each one and asserts their data artifacts
lmk if this addresses some of your concerns, and I’ll update the PR description to make some of this more clear.
s
Hi @straight-shampoo-11124, I'm working with @agreeable-toddler-27168 and have very happily been using his metaflow additions. It'd be great if you and your team would consider including some or all of the changes in this PR, as I think it offers a refined experience while still maintaining the core philosophy and functionality that originally attracted us to metaflow. As Ryan has mentioned, this doesn't replace the existing metaflow, it's built on top of it and so is fully compatible with existing flows.
s
thanks for the +1! It is a big change so apologies that we haven’t been able to do more with it this far
I’ll update this thread soon re: the next steps
a
Thanks @straight-shampoo-11124 and @strong-secretary-3596 for bumping this. I still have some AIs around cleaning it up / splitting out pieces, but yea hopefully in the meantime we can discuss the changes at a high level as well 🙏
👍 1
s
would you be open to chatting about this over a Zoom call? It might be faster
a
definitely!
Noah and I are on ET, I’m generally pretty free, feel free to suggest some times that work for you.
We could even record it, maybe some good documentation will come out of it
👍 1
s
great! would 1pm ET work for you Tue or Wed next week?
a
yes, both wfm, if @strong-secretary-3596 can make one or the other let’s do that, otherwise you can pick one!
s
invite sent for Wednesday ☎️ Looking forward to it!
👍 1
c
Could someone please update this thread with what was decided about this PR ?
s
we had to move the meeting so no updates yet. I’ll update this thread once we know more
🙏 1
@creamy-magician-26843 is there a specific part of this PR (it contains many changes) that is of interest to you?
c
Yes, I am specifically interested in DAG reusability (flow inheritence)
s
cool! It is a common request so it is a high priority for us (listed as “flow composition” here https://docs.metaflow.org/introduction/roadmap)
c
Thanks !
🙌 1
a
Hey @straight-shampoo-11124, are we chatting now?
I don’t think we recorded the meeting; some of my belated notes/takeaways: • @straight-shampoo-11124 is open to
pytest
tests existing alongside existing
test/core
tests •
metaflow flow …
CLI seems like a good thing overall; some concern about letting there be 2 ways to invoke flows (
python <file>
still works, when
__main__
handler is present, as is always the case today). Will that confuse users? • Need to drill into which bits of #612 actually depend on one another; right now a lot of things are separate but in sequence, though some of them could be completely factored out and not require any of the other bits to go in first. I’ve filed #661, #662, #663, #665, and #666 today, to start that process. Composing Flows As currently written in #612, inheritance only works with “new-style” `metaflow.api.FlowSpec`s that omit
start
and
end
steps, because I punted on trying to handle steps with the same name being mixed in from multiple flows. An open question is whether/how to namespace steps (in a composed flow, after composition) based on the flow they came from. For example:
Copy code
class Flow1(FlowSpec):
  @step
  def start(self): self.next(self.end)
  @step
  def end(self): pass

class Flow2(FlowSpec):
  @step
  def start(self): self.next(self.end)
  @step
  def end(self): pass

class Flow12(Flow1, Flow2): pass
What should the 4 steps in
Flow12
be named? •
Flow1_start
, …,
Flow2_end
(underscore-delimited): probably easiest to do, but a bit hacky/brittle; steps can have underscores in their names already, so it’s not a robust way to namespace steps. •
Flow1/start
, …,
Flow2/end
(slash-delimited): seems cleanest, but might interact undesirably with existing assumptions around metastore directory structure. Or, perhaps it will Just Work, and introduce a new level of directory hierarchy that is never ambiguous? •
Flow1.start
, …,
Flow2.end
(dot-delimited): might give the best of all worlds, but might also break assumptions about step names. • or, maybe `Flow12`'s steps should actually appear to Metaflow as just {
Flow1
,
Flow2
} x {
start
,
end
}, and the
Flow12
metadata should just point to invocations of those `Flow1`/`Flow2` steps, somehow?
👍 2
Metaclasses, Python 2 In any case, my current approach (and most other versions of it I can imagine) rely on metaclasses (that I added to both new-style and existing `FlowSpec`s), which give a useful hook for computing things about flows at declaration time (as opposed to at run-time, as happens today). However, I think using metaclasses requires dropping Python 2 support, which introduces a potentially big blocker for this work. I guess I saw a Python2 metaclass shim in the code-base, but I can’t justify spending time making things work on Python2. I filed #668 about dropping Python 2, and #662 which drops it from the CI. Perhaps we can start a
3.x
branch and I can focus on landing things there 🤷. Next Steps In general, @straight-shampoo-11124 felt he and the other maintainers have some homework to do in thinking through the large possible design space around composing flows. There’s also a fair amount of stuff in #612 that can go in first (either because later stuff depends on it, so it has to go in first, or because it’s orthogonal, and can go in separately whenever). As mentioned above, I’ve started filing PRs with some of that stuff: #661, #662, #663, #665, #666. lmk if I missed or misrepresented anything @straight-shampoo-11124, and as always I’m interested in folks’ thoughts about these topics more generally. Perhaps we can move them to more focused GitHub issues as well.
(jfyi: slack broke up my message above and only put the first ≈half into the main channel, for those not following the thread directly)
a
@agreeable-toddler-27168 re: https://github.com/Netflix/metaflow/pull/663 - what would be the use case for including
.ipynb
files by default?
a
maybe it’s borne of a misunderstanding on my end about the way notebooks are typically used in Metaflow. We have a step that uses
papermill
to execute some notebooks on data artifacts (computed by previous steps in the flow); the notebooks in that case are “source code” analogous to
.py
files that would be included in the source package sent to Batch. We are able to get them included by augmenting the
METAFLOW_DEFAULT_PACKAGE_SUFFIXES
env var, but the tutorials’ dealings with notebooks seemed similar to this, so I thought it might be necessary there as well, and useful to add by default. Let me know if that’s not the case.
a
The tutorials use notebooks as a scratch space to access the state of the workflow - outside of the workflow execution. That's the reason why we don't package
.ipynb
files. Also, these files can be rather voluminous (given that they may contain execution state). Unless there is a different pressing concern for including notebooks by default, we should err on the side of caution and tightly limit what gets packaged by default.
a
gotcha, I see that now. I assumed they just mirrored the
.py
files as an alternate way to run the flows. My mistake, I’ll close that PR
a
No worries. Let me look into other PRs.
s
thanks for the summary, @agreeable-toddler-27168 - looks accurate! We are currently busy with a few big PR (
@card
#659 and
@kubernetes
#644 as well as refactorings originating from Netflix). After these get merged, thinking through the flow composition is a high priority which directly relates to your PR. Thanks for your patience 🙏