cuddly-rocket-69327
06/23/2021, 10:02 PM
foreach using firecracker.
This could be a solution to:
• consuming a stream (e.g., listening to a stream and producing some inferences to another stream).
• binding to a hyperparameter stream.
• running a task/step on millions of items, where the user doesn't have to think about partitions (example: the Spark front-page word count, counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)), although Metaflow only supports "map/foreach" and "join", not reduceByKey.
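For reference, the reduceByKey step in that Spark example is just a keyed sum; a minimal plain-Python sketch of the same map/reduce shape (function name here is illustrative):

```python
from collections import Counter

def word_count(lines):
    # map: split each line into (word, 1) pairs, as in the Spark example
    pairs = [(word, 1) for line in lines for word in line.split(" ")]
    # reduceByKey(_ + _): sum the counts per word
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

print(word_count(["to be or", "not to be"]))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Metaflow terms, the per-line splitting would happen in foreach tasks and the keyed sum would be done in a join step.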
Curious, is there a work item or story for this?
We may tackle this soon! cc @User @User @User
ancient-application-36103
06/23/2021, 11:30 PM
unbounded foreach allows us to offload the Metaflow foreach functionality to an external subsystem. In the future, we can offer integrations with various HPO tools (Optuna, Ray Tune, etc.) using this mechanism, for example.
There isn't any integration with firecracker yet, but it would be fun to prototype one using firecracker to guarantee resource isolation on laptops.
Can you expand on what use case you have in mind?
cuddly-rocket-69327
06/24/2021, 6:39 PM
unbounded foreach is very cool!
We'd like one where we could do a foreach on millions of items without launching a container for each item.
Do you have an example Flow using unbounded foreach?
cuddly-rocket-69327
06/24/2021, 6:47 PM
class HelloFlow(FlowSpec):
    @step
    def start(self):
        self.items = metaflow.input(path)  # SparkDF / DaskDF / Ray / streaming DF
        self.next(self.mapper, self.items)

    @mapper  # maybe uses firecracker
    @step
    def mapper(self):
        do_map(self.input)
Where Spark or Dask handles the partitions/splits (or mini-batches in streaming) that launch N parallel containers, and each item is lightweight (firecracker?) instead of a new container for each item.
Where path could be a Hive table, an S3 path, or a stream name that the data provider (Spark, Dask, Ray) supports.
cuddly-rocket-69327
06/24/2021, 6:51 PM
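The sketch above maps closely onto Metaflow's existing (bounded) foreach. As a point of comparison, its fan-out/join semantics can be illustrated without Metaflow itself using a thread pool; the flow structure and step names below are illustrative, not Metaflow API:

```python
from concurrent.futures import ThreadPoolExecutor

def start():
    # fan-out list, analogous to self.next(self.mapper, foreach="items")
    return ["alpha", "beta", "gamma"]

def mapper(item):
    # runs once per item, like a foreach task where self.input is the item
    return len(item)

def join(results):
    # aggregate the per-item results, like a Metaflow join step
    return sum(results)

items = start()
with ThreadPoolExecutor() as pool:
    results = list(pool.map(mapper, items))
print(join(results))  # 14
```

The difference discussed in this thread is where the mapper tasks run: today each foreach split becomes its own container task, whereas unbounded foreach would let an external subsystem (Spark, Dask, Ray, or perhaps firecracker microVMs) own the fan-out.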