# ask-metaflow
c
Hi Team, we recently started our Metaflow journey and we're excited! Before adding all the bells and whistles, we're trying to integrate a basic version of Metaflow (S3, AWS Batch) into our existing ML platform. After setup and configuration, we're hitting a job name error when triggering a test job. We couldn't find much online or in this Slack about it, apart from a potential "dot in the user name" issue (link).
2025-02-13 20:11:03.681 [1739448660858353/start/1 (pid 90836)] botocore.errorfactory.ClientException: An error occurred (ClientException) when calling the SubmitJob operation: Error executing request, Exception : Job name should match valid pattern, RequestId: 6c2102ce-3c0b-4f10-8e6a-37011dba3a35
2025-02-13 20:11:03.681 [1739448660858353/start/1 (pid 90836)] Data store error:
2025-02-13 20:11:03.735 [1739448660858353/start/1 (pid 90836)] No completed attempts of the task was found for task 'BranchFlow/1739448660858353/start/1'
2025-02-13 20:11:03.736 [1739448660858353/start/1 (pid 90836)] 
2025-02-13 20:11:03.771 [1739448660858353/start/1 (pid 90836)] Task failed.
It seems Batch is not happy with the way Metaflow names the job (BranchFlow/1739448660858353/start/1), but we can't find a way to make it work. Has anyone experienced the same error? Thanks for the support.
Test flow script:
from metaflow import FlowSpec, step, batch

class BranchFlow(FlowSpec):

    @batch()
    @step
    def start(self):
        self.next(self.a, self.b)

    @batch()
    @step
    def a(self):
        self.x = 1
        self.next(self.join)

    @batch()
    @step
    def b(self):
        self.x = 2
        self.next(self.join)

    @batch()
    @step
    def join(self, inputs):
        print('a is %s' % inputs.a.x)
        print('b is %s' % inputs.b.x)
        # Use `inp` rather than `input` to avoid shadowing the built-in.
        print('total is %d' % sum(inp.x for inp in inputs))
        self.next(self.end)

    @batch()
    @step
    def end(self):
        pass

if __name__ == '__main__':
    BranchFlow()
Basic config, with an additional IAM role and S3 buckets:
# Batch compute environment
resource "aws_batch_compute_environment" "main" {
  compute_environment_name = "metaflow-compute-env"

  compute_resources {
    max_vcpus = 16
    min_vcpus = 0

    security_group_ids = [var.private_ml_platform_sg_id]
    subnets            = var.private_subnet_ids

    allocation_strategy = "BEST_FIT_PROGRESSIVE"
    type = "EC2"
    instance_type = ["optimal"]

    instance_role = aws_iam_instance_profile.metaflow_ecs_task.arn
  }

  service_role = aws_iam_role.metaflow_batch_execution.arn
  type         = "MANAGED"
  depends_on   = [aws_iam_role_policy_attachment.aws_batch_service_role]

  tags = {
    CostCenter = var.tags_cost_center
    Project    = var.tags_project
    Service    = var.metaflow_tags_service
  }
}

# Batch job queue
resource "aws_batch_job_queue" "main" {
  name     = "metaflow-job-queue"
  state    = "ENABLED"
  priority = 1

  compute_environment_order {
    compute_environment = aws_batch_compute_environment.main.arn
    order = 1
  }

  tags = {
    CostCenter = var.tags_cost_center
    Project    = var.tags_project
    Service    = var.metaflow_tags_service
  }
}
a
Can you help me with the full terminal logs?
c
Here are the full logs:
(ds-churn-prediction-py3.11) ➜  ds-churn-prediction git:(develop) ✗ make run-metaflow-pipeline-test-batch
AWS_PROFILE=sandbox-remi-singapore METAFLOW_DEBUG=1 python -m src.pipeline.metaflow_pipeline.test_flow run
Metaflow 2.12.39 executing BranchFlow for user:remi.moise@ascenda.com
Validating your flow...
    The graph looks good!
Running pylint...
    Pylint not found, so extra checks are disabled.
2025-02-13 20:11:00.859 Workflow starting (run-id 1739448660858353):
2025-02-13 20:11:01.890 [1739448660858353/start/1 (pid 90836)] Task is starting.
2025-02-13 20:11:03.018 [1739448660858353/start/1 (pid 90836)] Traceback (most recent call last):
2025-02-13 20:11:03.018 [1739448660858353/start/1 (pid 90836)] File "/Users/remi.moise@ascenda.com/Library/Caches/pypoetry/virtualenvs/ds-churn-prediction-6VoP5-AC-py3.11/lib/python3.11/site-packages/metaflow/plugins/aws/batch/batch_cli.py", line 316, in step
2025-02-13 20:11:03.680 [1739448660858353/start/1 (pid 90836)] batch.launch_job(
2025-02-13 20:11:03.680 [1739448660858353/start/1 (pid 90836)] File "/Users/remi.moise@ascenda.com/Library/Caches/pypoetry/virtualenvs/ds-churn-prediction-6VoP5-AC-py3.11/lib/python3.11/site-packages/metaflow/plugins/aws/batch/batch.py", line 407, in launch_job
2025-02-13 20:11:03.680 [1739448660858353/start/1 (pid 90836)] self.job = job.execute()
2025-02-13 20:11:03.680 [1739448660858353/start/1 (pid 90836)] ^^^^^^^^^^^^^
2025-02-13 20:11:03.680 [1739448660858353/start/1 (pid 90836)] File "/Users/remi.moise@ascenda.com/Library/Caches/pypoetry/virtualenvs/ds-churn-prediction-6VoP5-AC-py3.11/lib/python3.11/site-packages/metaflow/plugins/aws/batch/batch_client.py", line 141, in execute
2025-02-13 20:11:03.680 [1739448660858353/start/1 (pid 90836)] response = self._client.submit_job(**self.payload)
2025-02-13 20:11:03.680 [1739448660858353/start/1 (pid 90836)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-13 20:11:03.680 [1739448660858353/start/1 (pid 90836)] File "/Users/remi.moise@ascenda.com/Library/Caches/pypoetry/virtualenvs/ds-churn-prediction-6VoP5-AC-py3.11/lib/python3.11/site-packages/botocore/client.py", line 569, in _api_call
2025-02-13 20:11:03.680 [1739448660858353/start/1 (pid 90836)] return self._make_api_call(operation_name, kwargs)
2025-02-13 20:11:03.680 [1739448660858353/start/1 (pid 90836)] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-02-13 20:11:03.680 [1739448660858353/start/1 (pid 90836)] File "/Users/remi.moise@ascenda.com/Library/Caches/pypoetry/virtualenvs/ds-churn-prediction-6VoP5-AC-py3.11/lib/python3.11/site-packages/botocore/client.py", line 1023, in _make_api_call
2025-02-13 20:11:03.681 [1739448660858353/start/1 (pid 90836)] raise error_class(parsed_response, operation_name)
2025-02-13 20:11:03.681 [1739448660858353/start/1 (pid 90836)] botocore.errorfactory.ClientException: An error occurred (ClientException) when calling the SubmitJob operation: Error executing request, Exception : Job name should match valid pattern, RequestId: 6c2102ce-3c0b-4f10-8e6a-37011dba3a35
2025-02-13 20:11:03.681 [1739448660858353/start/1 (pid 90836)] Data store error:
2025-02-13 20:11:03.735 [1739448660858353/start/1 (pid 90836)] No completed attempts of the task was found for task 'BranchFlow/1739448660858353/start/1'
2025-02-13 20:11:03.736 [1739448660858353/start/1 (pid 90836)] 
2025-02-13 20:11:03.771 [1739448660858353/start/1 (pid 90836)] Task failed.
2025-02-13 20:11:03.846 Workflow failed.
2025-02-13 20:11:03.846 Terminating 0 active tasks...
2025-02-13 20:11:03.846 Flushing logs...
    Step failure:
    Step start (task-id 1) failed.

make: *** [run-metaflow-pipeline-test-batch] Error 1
Slightly more digestible
a
Can you set METAFLOW_USER=remi in your environment?
It seems that email addresses as usernames are not well tolerated.
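For anyone landing here later, a rough sketch of why the email-style username trips Batch. The AWS Batch SubmitJob docs say a job name must start with an alphanumeric character and may contain up to 128 letters, digits, hyphens and underscores. The `build_job_name` layout below is an assumption about how the user name ends up in the job name, and `sanitize_user` is a hypothetical helper for picking a safe `METAFLOW_USER` value; neither is Metaflow's actual API.

```python
import re

# Documented AWS Batch jobName rule: first character alphanumeric, then
# letters, digits, hyphens and underscores, 128 characters at most.
BATCH_JOB_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]{0,127}")


def is_valid_batch_job_name(name: str) -> bool:
    """Return True if `name` satisfies the documented jobName pattern."""
    return BATCH_JOB_NAME.fullmatch(name) is not None


def build_job_name(user, flow, run_id, step, task_id, retry=0):
    # ASSUMPTION: the job name is the user name joined with the task
    # identifiers by hyphens. The exact layout inside Metaflow may differ.
    return f"{user}-{flow}-{run_id}-{step}-{task_id}-{retry}"


def sanitize_user(user: str) -> str:
    # HYPOTHETICAL helper: keep the local part of an email address and
    # replace every character Batch would reject with a hyphen.
    local_part = user.split("@", 1)[0]
    return re.sub(r"[^A-Za-z0-9_-]", "-", local_part)


if __name__ == "__main__":
    ids = ("BranchFlow", "1739448660858353", "start", "1")
    for user in ("remi.moise@ascenda.com", "remi"):
        name = build_job_name(user, *ids)
        print(name, is_valid_batch_job_name(name))
```

Under that assumption, the "." and "@" in the email address are what make the name invalid, which would be why overriding the user with METAFLOW_USER gets the job submitted.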
c
It's working, thanks! 🙇