cuddly-rocket-69327
06/23/2021, 10:02 PM
foreach using firecracker.
This could be a solution to:
• consuming a stream (e.g., listening to a stream and producing some inferences to another stream).
• binding to a hyperparameter stream.
• running a task/step on millions of items, where the user doesn't have to think about partitions (example: the Spark front-page word count, counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)), although Metaflow only supports "map/foreach" and "join", not reduceByKey.
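For reference, the reduceByKey step in that Spark example is just a keyed sum; a minimal plain-Python sketch of the same map/reduce shape (function name here is illustrative):

```python
from collections import Counter

def word_count(lines):
    # map: split each line into (word, 1) pairs, as in the Spark example
    pairs = [(word, 1) for line in lines for word in line.split(" ")]
    # reduceByKey(_ + _): sum the counts per word
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

print(word_count(["to be or", "not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Metaflow terms, the per-line splitting would happen in foreach tasks and the keyed sum would be done in a join step.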
Curious, is there a work item or story for this?
We may tackle this soon! cc @User @User @User
ancient-application-36103
06/23/2021, 11:30 PM
unbounded foreach allows us to offload the Metaflow foreach functionality to an external subsystem. In the future, we can offer integrations with various HPO tools (Optuna, Ray Tune, etc.) using this mechanism, for example.
There isn't any integration with firecracker yet, but it would be fun to prototype one using firecracker to guarantee resource isolation on laptops.
Can you expand on what use case you have in mind?
cuddly-rocket-69327
06/24/2021, 6:39 PM
unbounded foreach is very cool!
We'd like one where we could do a foreach on millions of items without launching a container for each item.
Do you have an example Flow using unbounded foreach?
cuddly-rocket-69327
06/24/2021, 6:47 PM
class HelloFlow(FlowSpec):
    @step
    def start(self):
        self.items = metaflow.input(path)  # SparkDF / DaskDF / Ray / streaming DF
        self.next(self.mapper, self.items)

    @mapper  # maybe uses firecracker
    @step
    def mapper(self):
        do_map(self.input)
Where Spark or Dask handles the partitions/splits (or mini-batches in streaming) that launch N parallel containers, and each item is lightweight (firecracker?) instead of a new container for each item.
Where path could be a Hive table, an S3 path, or a stream name that the data provider (Spark, Dask, Ray) supports.
cuddly-rocket-69327
06/24/2021, 6:51 PM
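The sketch above maps closely onto Metaflow's existing (bounded) foreach. As a point of comparison, its fan-out/join semantics can be illustrated without Metaflow itself using a thread pool; the flow structure and step names below are illustrative, not Metaflow API:

```python
from concurrent.futures import ThreadPoolExecutor

def start():
    # fan-out list, analogous to self.next(self.mapper, foreach="items")
    return ["alpha", "beta", "gamma"]

def mapper(item):
    # runs once per item, like a foreach task where self.input is the item
    return len(item)

def join(results):
    # aggregate the per-item results, like a Metaflow join step
    return sum(results)

items = start()
with ThreadPoolExecutor() as pool:
    results = list(pool.map(mapper, items))
print(join(results))  # 14
```

The difference discussed in this thread is where the mapper tasks run: today each foreach split becomes its own container task, whereas unbounded foreach would let an external subsystem (Spark, Dask, Ray, or perhaps firecracker microVMs) own the fan-out.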