We recently implemented a low-maintenance Metaflow deployment using AWS Batch and Step Functions, since we do not have an EKS setup. We have also recently begun modernizing our data warehouse, which has been moved to Snowflake; however, we don't yet have a structure that lets us benefit from it fully, and we are working toward a Data Lake in addition to an alternative data warehouse solution. This has led us to consider the ML offerings bundled with the warehouse vendors (Snowflake today, with Databricks under consideration), but the conversation has somewhat stalled here, since there are many perspectives and varying levels of understanding about the needs and functions of the systems involved.
I have personally tried to keep attention focused on the need for a Data Lake, which would benefit both the ML/AI team and our application needs, although there is some pushback, since much of the marketing material around ML/AI presents solutions like RAG as a commodity. While there is clearly some low-hanging fruit, the team does not agree that most solutions from the larger players like AWS have reached the level of commodity where you can drop whatever random files you have into a folder and be done.
Beyond this point, we have found that many of the solutions touting this magical no-work experience tend to be narrowly focused and costly (e.g., as we found while exploring AWS Kendra).
This has prompted some discussion and revealed a paradigm I have yet to overcome: for the out-of-the-box, magical solutions that promise the moon, there seems to be an understanding that some amount of expense is acceptable, including options like Snowpark ML or Databricks. However, when the discussion turns to Metaflow as a tool for ML/AI engineers, its cost (which appears to be lower than or on par with those options) does not warrant similar concessions. This is not uncommon when comparing open source to more heavily marketed solutions, and IMHO it falls into the "no one ever got fired for recommending the Cadillac" strategy in tech. With some determination and convincing from the ML/AI team, however, we have set up Metaflow internally and started initial work on a RAG implementation to POC with our application.
Setting aside the year-long history of how we got to this point, our implementation was to run the CloudFormation template from Outerbounds. We have all spent a little time in the Outerbounds sandbox demo environment, understand the potential, and have already worked through a couple of minor permissions issues. We have a few odd scenarios, and when posting to this Slack channel we have found exceptional support, even though we are still only using the free installation (still keeping this plate in the air, but it's not a priority).
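For anyone curious what "run the CloudFormation template" amounts to, a minimal sketch using boto3 is below. It assumes the template (the one published in the Outerbounds metaflow-tools repository) has already been uploaded to S3; the stack name, region, and S3 URL are placeholders, not our actual values.

```python
import boto3

REGION = "us-east-1"        # placeholder region
STACK_NAME = "metaflow"     # hypothetical stack name

cfn = boto3.client("cloudformation", region_name=REGION)

cfn.create_stack(
    StackName=STACK_NAME,
    # Placeholder URL; the template must be staged in S3 first.
    TemplateURL="https://s3.amazonaws.com/our-bucket/metaflow-cfn-template.yml",
    # The template provisions IAM roles, so these capabilities must be acknowledged.
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
)

# Block until the stack is fully created before configuring Metaflow
# against its outputs (S3 bucket, Batch job queue, Step Functions role, etc.).
cfn.get_waiter("stack_create_complete").wait(StackName=STACK_NAME)
```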
The shift in development to use FlowSpec has been very minor for those experienced with contemporary solutions; however, it is looking to be a somewhat steeper learning curve for others more familiar with R. In all, we are still fairly early in the induction process and expect to ramp up through the end of the year as we start to automate these solutions in production.
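For context on what that shift looks like, the FlowSpec model is small: a class, @step methods, and self.next() transitions, with anything assigned to self persisted as an artifact between steps. A minimal sketch (the flow name and artifact are illustrative):

```python
from metaflow import FlowSpec, step

class HelloFlow(FlowSpec):

    @step
    def start(self):
        # Anything assigned to self is persisted and available in later steps.
        self.message = "hello from Metaflow"
        self.next(self.end)

    @step
    def end(self):
        print(self.message)

if __name__ == "__main__":
    HelloFlow()
```

The same file runs locally with `python hello_flow.py run` and deploys with `python hello_flow.py step-functions create`, which is exactly the local-to-production transition we are trying to sell internally.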
In all, my takeaway is that an unexpected hurdle was selling the advantages of an ML orchestration tool that lets developers move from local environments to production with minimal to no effort, while still allowing a choice of IDEs and ML platforms. Having been in this field for around a decade, it's surprising to me that this is still even a conversation about value. The point is underscored by my own recommendation to simply purchase Outerbounds: it would further reduce our already extremely low maintenance burden and expand the support, which is already impressive, especially compared to the alternative solutions that seem to find favor when expenses are discussed.
Technically, we are using pgvector for our vector database, with LanceDB on the back burner as a serverless (or potentially cost-minimizing) option if needed. We have explored other solutions here (e.g., Pinecone) but have not found a compelling reason to move in that direction. We have put TimescaleDB on the radar, since they developed pgvectorscale with impressive performance and offer a managed solution. We are also using the unstructured library to pull apart our PDFs (and, eventually, other content), and this has led to an interesting workflow with Metaflow: since there is no wheel package, we cannot simply include it with the @pypi (or @conda) decorator. This pushed us to build a primary Docker image for this work and pass it via the @batch(image=...) decorator parameter, which works well but took a little more time to set up. Now that we have this workflow, we have a good solution that we can move to production.
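To make the image workflow concrete, here is a trimmed-down sketch of the pattern (the flow, ECR image URI, and file path are illustrative, not our actual pipeline):

```python
from metaflow import FlowSpec, batch, step

class PdfPartitionFlow(FlowSpec):

    @step
    def start(self):
        self.pdf_path = "/data/example.pdf"  # placeholder; ours come from S3
        self.next(self.partition)

    # unstructured is baked into the custom image, so this step requests that
    # image instead of relying on @pypi/@conda to resolve the dependency.
    @batch(image="123456789012.dkr.ecr.us-east-1.amazonaws.com/rag-ingest:latest")
    @step
    def partition(self):
        from unstructured.partition.pdf import partition_pdf  # provided by the image
        self.elements = [str(el) for el in partition_pdf(filename=self.pdf_path)]
        self.next(self.end)

    @step
    def end(self):
        print(f"Extracted {len(self.elements)} elements")

if __name__ == "__main__":
    PdfPartitionFlow()
```

The only real cost of this approach is maintaining the image alongside the flow code; once built, @batch(image=...) makes the dependency available without any per-run package resolution.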
At this point, while we have made significant progress implementing Metaflow, we've encountered challenges conveying its value relative to more heavily marketed solutions. Our focus remains on tools that minimize maintenance and maximize support as we move toward a production-ready solution. By the end of the year, we aim to fully automate our processes, overcoming the hurdles around team consensus and resource allocation.