# ask-metaflow
Howdy everyone! We are getting started with a PoC of metaflow and I'm curious if there are any user guides, tutorials, or general content specifically targeted at data scientists. From an MLE perspective, I think the pitch is clear, but I'm having difficulty articulating to our data scientists what their workflow changes would look like with metaflow or why they should care aside from automation of current manual functionality. Today, the primary DS workflow is connecting Jupyter notebooks to public cloud VMs via VSCode for running experiments.
I think I'm in a similar position, or perhaps slightly ahead as far as getting buy-in and setting up a PoC for the team to get going. Happy to compare notes, if helpful? We just set up Metaflow in our AWS cloud using the CloudFormation script to enable Batch, and I feel like the team is aware of the use cases for Metaflow and how it fits with our data and application. I highly recommend checking out the Outerbounds sandbox if you haven't already!
I'm in a relatively similar boat, finishing up a POC and now having to demonstrate value to the company and the DS's within our team. My 2 cents: if your company is moving to Metaflow, then it's been decided that orchestration has become a priority. One key to their adoption of the tool is to make clear what the advantages of orchestration are to the organization as a whole. Just by using the tool they are contributing to the org's ML maturity. Next, the thing to make clear is that onboarding DS's to Metaflow is a real shift in their typical workflow. They'll need to spend time learning a new tool, a new API, and perhaps a slightly new way of working, but there are a lot of advantages in doing so. I would highlight here that Metaflow's API is much easier to learn than other orchestrators'. The first comment in this reddit thread is a good example of why Metaflow is better than Airflow for a DS. To answer directly what the benefits of moving from "connecting Jupyter notebooks to public cloud VMs via VSCode for running experiments" to using Metaflow are (not an exhaustive list):
• Dependency management: Metaflow helps manage your dependencies with the `@conda` and `@pip` decorators.
• ML artifact management: it's very easy to save anything and fetch it later, e.g. data at any step of your pipeline, configs, hyperparams, models, etc.
◦ This feature can act as a basic experiment-management tool as well, removing the need to install and manage another tool in this vertical.
• Observability into your runs: instead of a run being just the cell outputs of a notebook - which is pretty ephemeral, or only bound to someone's EC2 instance - Metaflow runs are persisted, available to anyone, and contain a lot of useful info & metadata, especially if you use the Metaflow UI. You can also attach visualizations & tables to your runs/experiments via `cards`, which is useful for making things interpretable.
◦ Another way of seeing it: you can now share your runs, and people can "plug" into them and fetch any data they need from them.
• Scaling: scaling to large compute and GPUs is just one decorator away. You can also use branching as an easy way to run large compute in parallel.
• While there's additional upfront work required for building Flows compared to purely staying in a Jupyter notebook env, once your dev work and Flows are done, two things happen:
◦ Your code is now packaged and much easier to run and reproduce across the org. You instantly solve a lot of the reproducibility problems inherent to data science.
◦ Deploying runs to production is seamless and effortless. Dev Flow to prod Flow is so low effort that you can empower your DS's to take care of that themselves, as opposed to handing off the code to an MLE, for instance.
• Continuous training/deployment: this one is generalizable to any orchestrator, but once you create a training Flow, you can turn it into a continuous training flow - fetching new data on an interval & comparing champion vs. challenger models - quite easily.
Let me know if you have feedback, as I'm still new to Metaflow/Outerbounds, and if there are other issues/pain points you or your team are experiencing. Also @brave-camera-54148 - would love to share some notes!
Excellent! I will follow up next week, it's short for me so if I get distracted, then perhaps in 2 weeks. Thanks so much, this is a great beginning to the conversation!
We have recently implemented a low-maintenance Metaflow solution using AWS Batch and Step Functions due to the absence of an EKS setup. As part of our modernization efforts, we've transitioned our data warehouse to Snowflake, but we're still developing the structure that would let us benefit from it optimally; the goal is to build a Data Lake alongside exploring alternative data warehousing solutions like Databricks. This has led to discussions about the ML capabilities bundled with the data warehouse (i.e. Snowflake, and we are considering moving to Databricks), but the conversation has somewhat stalled there, since there are many perspectives and varying levels of understanding about the needs and functions of the systems involved. I have personally tried to keep awareness focused on the need for a Data Lake, which would benefit the ML/AI team as well as the application needs, although there is some pushback since much of the marketing material around ML/AI presents solutions like RAG as a commodity. Although it is clear that there may be some low-hanging fruit, it's not the consensus of the team that most solutions from the larger players like AWS are at the level of commodity where you can just drop whatever random files you have into a folder and you're done. Beyond this point, we have found that many of the solutions that tout this magical no-work setup tend to be very narrowly focused and costly (as we found while exploring AWS Kendra).
This has caused some discussion and revealed a bit of a paradox that I have yet to overcome: for the out-of-the-box magical solutions that promise the moon, there seems to be an understanding that some amount of expense is acceptable, including solutions like Snowpark ML or Databricks. However, when the discussion turns to Metaflow as a tool for ML/AI engineers, its cost - which seems to be lower or on par - does not warrant similar concessions. This is not uncommon when comparing open source to the more heavily marketed solutions and IMHO falls into the "no one ever got fired for recommending the Cadillac" strategy in tech. However, with some determination and convincing from the ML/AI team, we have set up Metaflow internally and started initial work on a RAG implementation to POC with our application. Beyond the year-long history of how we got to this point, the implementation we used was to run the CloudFormation script from Outerbounds. We have all spent a little time in the Outerbounds sandbox demo environment, understand the potential, and have worked through a couple of minor permissions issues already. We have a few odd scenarios, and when posting to this Slack channel we have found exceptional support, even though we are still only using the free installation (still keeping this plate in the air, but not a priority). The shift in development to use FlowSpec has been very minor for those experienced with contemporary solutions; however, it is looking to be a bit steeper a learning curve for others more familiar with R. In all, we are still pretty early in the induction process and expect to ramp up through the end of the year as we start to automate the solutions in production.
In all, my thought is that an unexpected hurdle was selling the advantages of an ML orchestration tool that allows developers to transition from local environments to production with minimal to no effort, while at the same time allowing various IDEs and ML platforms to be used. Having been in this field for around a decade, it's surprising to me that this is still even a value conversation. This point is clear to me from my own recommendation to simply purchase Outerbounds to further minimize the already extremely low maintenance burden on our resources and to expand the support, which is already impressive, especially compared to the alternative solutions that seem to find favor when discussing expenses. Technically, we are using pgvector and LanceDB for our vector database (LanceDB is currently on the back-burner as a serverless, or potentially cost-minimizing, solution if needed). We have explored other options here (i.e. Pinecone) but have not found a compelling reason to move in that direction. We have put TimescaleDB on the radar since they developed pgvectorscale with impressive performance, and they offer a managed solution. We are also using unstructured to pull apart our PDFs (and other content, eventually), and this has caused an interesting workflow with Metaflow: since there is no wheel package, we cannot simply include it with the @pypi (or @conda) decorator. This led us to build a primary image for this work that we pass via the @batch(image=...) decorator parameter, which works well but took a little more time to set up. Now that we have this workflow, we have a good solution that we can move to production. At this point, while there is significant progress in implementing Metaflow, we've encountered challenges in conveying its value compared to more heavily marketed solutions. Our focus remains on leveraging tools that minimize maintenance and maximize support, ultimately moving toward a production-ready solution.
By the end of the year, we aim to fully automate our processes, overcoming hurdles related to team consensus and resource allocation.