# ask-metaflow
Hi! We have a Metaflow set-up (using your Terraform template), and we have also configured an EKS cluster provided by our company (not your Terraform set-up) to work with the Terraform-provided Metaflow service. I can run Metaflow jobs natively in EKS and everything works well (see the job example below), and I can see in the CloudWatch logs for the service that everything runs.
apiVersion: batch/v1
kind: Job
metadata:
  name: veo-ai-metaflow-native-jobs
  namespace: metaflow
spec:
  ttlSecondsAfterFinished: 100
  template:
    spec:
      serviceAccountName: metaflow
      containers:
        - name: <NAME>
          image: <IMAGE>
          command: ["bash", "-c"]
          args:
            - pip install kubernetes boto3 --ignore-installed &&
              python path/to/flow.py run --with kubernetes --max-num-splits 3000;
          env:
            - name: AWS_DEFAULT_REGION
              value: <REGION>
            - name: METAFLOW_SERVICE_URL
              value: <METAFLOW_SERVICE_URL>
            - name: METAFLOW_SERVICE_AUTH_KEY
              value: <METAFLOW_SERVICE_AUTH_KEY>
            - name: METAFLOW_DEFAULT_METADATA
              value: service
            - name: METAFLOW_ECS_S3_ACCESS_IAM_ROLE
              value: <METAFLOW_ECS_S3_ACCESS_IAM_ROLE>
            - name: METAFLOW_DATASTORE_SYSROOT_S3
              value: <METAFLOW_DATASTORE_SYSROOT_S3>
            - name: METAFLOW_DATATOOLS_SYSROOT_S3
              value: <METAFLOW_DATATOOLS_SYSROOT_S3>
            - name: METAFLOW_DEFAULT_DATASTORE
              value: s3
            - name: USERNAME
              value: alex
            - name: METAFLOW_RUNTIME_IN_CLUSTER
              value: 'yes' # Important environment variable.
            - name: METAFLOW_KUBERNETES_NAMESPACE
              value: metaflow
            - name: METAFLOW_KUBERNETES_SERVICE_ACCOUNT
              value: metaflow
      restartPolicy: Never
  backoffLimit: 4
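As an aside, the environment variables above are the ones the job needs before launching the flow. A minimal pre-flight check for them could look like the sketch below (this is a hypothetical helper, not something Metaflow provides; the variable list is just the one from the manifest above):

```python
import os

# Variables the Job manifest above sets for the flow container.
REQUIRED_VARS = [
    "METAFLOW_SERVICE_URL",
    "METAFLOW_SERVICE_AUTH_KEY",
    "METAFLOW_DEFAULT_METADATA",
    "METAFLOW_DEFAULT_DATASTORE",
    "METAFLOW_DATASTORE_SYSROOT_S3",
    "METAFLOW_RUNTIME_IN_CLUSTER",
    "METAFLOW_KUBERNETES_NAMESPACE",
    "METAFLOW_KUBERNETES_SERVICE_ACCOUNT",
]

def missing_vars(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]
```

Running `missing_vars()` at the top of the container command would make it obvious when one of the values was never templated in.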
However, if I try running `python path/to/flow.py run --with kubernetes ...` directly, I get this:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='metaflow-nlb-pn0ancjz-460897aea31b08ac.elb.eu-west-1.amazonaws.com', port=80): Max retries exceeded with url: /flows/SportBasicFlow/runs/287/steps/start/tasks/39602/metadata (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa11f6bec70>: Failed to establish a new connection: [Errno 110] Connection timed out'))
The job is created in EKS, and I can see that the steps of setting up the environment, downloading the code package, etc. complete, but it then fails at "task is starting"... I'm probably missing a port configuration or something. If you have any ideas on how to solve this, it would be highly appreciated. Thanks a lot for the great work 💪
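For reference, the endpoint in the traceback appears to be the service URL from `METAFLOW_SERVICE_URL` joined with a per-task metadata path. A rough sketch of that shape (the actual client internals may differ; the hostname is just the one from the error above):

```python
# NLB hostname taken from the traceback above.
service_url = "http://metaflow-nlb-pn0ancjz-460897aea31b08ac.elb.eu-west-1.amazonaws.com"

def metadata_url(flow, run_id, step, task_id):
    """Build the per-task metadata endpoint the client tries to reach."""
    return f"{service_url}/flows/{flow}/runs/{run_id}/steps/{step}/tasks/{task_id}/metadata"
```

So the timeout happens while POSTing task metadata to the NLB, which is why everything up to "task is starting" works (those steps don't hit the metadata service from the task pod).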