# ask-metaflow
h
Hi all, we have recently started experiencing this issue at large (about 10 devs at the same time) and even after having doubled the memory of all the deployed services (metadata, backend and UI) it is still behaving as such. We have been on v1.3.11 of the UI image since April 9 and v2.4.11 of the metadata image since June 13 and we just started seeing this behavior over the last few days. https://www.loom.com/share/2be90482964c4695acb6904625cc41d1?sid=3da09cfa-76e0-4784-b4dc-3e4450551e1d
s
Interesting. @brave-lion-15961 any ideas?
👀 1
h
I counted the number of flows that would scroll past, 3 times, each after refreshing the page anew, and every time it seems to fail at 26. Looks like there may be a length issue somewhere in js/main.4b6e41f8.js2836704
[Error] TypeError: null is not an object (evaluating 'u.length') — index.tsx:41
	fl (main.4b6e41f8.js:2:104151)
	(anonymous function) (main.4b6e41f8.js:2:104668)
	Hi (main.4b6e41f8.js:2:85793)
	Ss (main.4b6e41f8.js:2:133610)
	ws (main.4b6e41f8.js:2:133178)
	bs (main.4b6e41f8.js:2:132714)
	(anonymous function) (main.4b6e41f8.js:2:144551)
	Eu (main.4b6e41f8.js:2:145064)
	iu (main.4b6e41f8.js:2:137524)
	E (main.4b6e41f8.js:2:175659)
	P (main.4b6e41f8.js:2:176193)
this is the first error that shows up in the javascript console
b
❤️ obfuscated code. That is helpful though, @hundreds-midnight-75494
h
lmk if I can help in any way @brave-lion-15961!
we have a few folks frustrated atm haha
b
I assume I can't get access to your page?
h
no, the service is on a private k8s cluster and we access it via tunneling. I am happy to jump on a huddle or zoom with you though
you can guide me through anything and I am happy to provide logs/artifacts of any sorts
b
I think, if I can find that line where the error is occurring, that I can reverse-engineer where `u` is coming from.
h
PI = function(t) {
            var n = t.triggerEventsValue
              , r = t.className
              , o = t.showToolTip
              , i = void 0 === o || o
              , a = n.id
              , l = n.type
              , s = n.name
              , u = "run" === l ? a.split("/").slice(0, 2).join("/") : s
              , c = "run" === l ? "/" + u : void 0
              , d = Boolean(c)
              , f = u;
            u.length > 20 && (f = u.slice(0, 10) + "..." + u.slice(-10));
            var p = "label-tooltip-".concat(a);
            return e.createElement(OI, {
                key: a,
                "data-tip": !0,
                "data-for": p,
                className: r
            }, c && e.createElement(TI, {
                to: c
            }, e.createElement(RI, {
                name: "arrow",
                linkToRun: d
            }), i ? f : a), i && e.createElement(CI, {
                id: p
            }, u))
clicking through the error stack sent me there
b
That's very helpful
Are you using events?
Here is the unobfuscated code
const Trigger: React.FC<Props> = ({ triggerEventsValue, className, showToolTip = true }) => {
  const { id, type, name } = triggerEventsValue;

  // Only handles triggers from runs
  const label = type === 'run' ? id.split('/').slice(0, 2).join('/') : name;
  const link = type === 'run' ? '/' + label : undefined;
  const linkToRun = Boolean(link);
  let displayLabel = label;

  // Truncate the label in the middle to fit to about MAX_LABEL_WIDTH characters.
  if (label.length > MAX_LABEL_WIDTH) {
    displayLabel = label.slice(0, MAX_LABEL_WIDTH / 2) + '...' + label.slice((-1 * MAX_LABEL_WIDTH) / 2);
  }
  const tooltipId = `label-tooltip-${id}`;
where `label` is the `u` referenced in your error
h
yes we are using events
b
I assume you have a trigger `id` that makes the `label` null
h
hmmm
how would I diagnose that?
b
I can make a patch to make that code more robust. If you can find that rogue `id` we can come up with a plan on how to show it
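A minimal sketch of what a more defensive label computation could look like, assuming the same `id`, `type` and `name` fields as the component above. `TriggerValue`, `safeTriggerLabel` and the fallback text are hypothetical names for illustration, not the patch that was actually released; `MAX_LABEL_WIDTH` is set to 20 only because the minified bundle checks `u.length > 20`.

```typescript
// Hypothetical shape: any of the fields may be null in practice.
interface TriggerValue {
  id: string | null;
  type: string | null;
  name: string | null;
}

const MAX_LABEL_WIDTH = 20; // mirrors `u.length > 20` in the minified code

// Returns a display label without ever calling .length or .split on null.
function safeTriggerLabel(trigger: TriggerValue, fallback = "(unknown trigger)"): string {
  const { id, type, name } = trigger;

  // Same rule as the original component, but null is coerced to '' first.
  const raw = type === "run" ? (id ?? "").split("/").slice(0, 2).join("/") : (name ?? "");

  // An empty label is not useful to show, so substitute the fallback text.
  const label = raw.length > 0 ? raw : fallback;
  if (label.length <= MAX_LABEL_WIDTH) {
    return label;
  }

  // Truncate in the middle, keeping the first and last halves.
  const half = MAX_LABEL_WIDTH / 2;
  return label.slice(0, half) + "..." + label.slice(-half);
}
```

With this shape, a trigger whose `id` and `name` are both null renders the fallback text instead of throwing the `TypeError` from the console output.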
h
(I am mainly the infra guy supporting the devs creating those flows)
b
I would take a look at the network calls that have the triggers in them and find the one (that probably has id==='') at about 26*3 in
h
so I was able to pause on uncaught exception, assuming right before it fails and I can see the last one that shows up, but how would I go about finding the next one?
actually, I think I see the request URL in the network calls
{
  "data": [
    {
      "id": 1915772,
      "flow_id": "AsanaFlow",
      "run_number": 6467,
      "run_id": null,
      "step_name": "start",
      "task_id": 99300,
      "task_name": null,
      "attempt_id": 0,
      "field_name": "runtime",
      "value": "dev",
      "type": "runtime",
      "user_name": "zoey",
      "ts_epoch": 1721751779286,
      "tags": [
        "attempt_id:0"
      ],
      "system_tags": null
    },
    {
      "id": 1915776,
      "flow_id": "AsanaFlow",
      "run_number": 6467,
      "run_id": null,
      "step_name": "start",
      "task_id": 99300,
      "task_name": null,
      "attempt_id": 0,
      "field_name": "runtime",
      "value": "dev",
      "type": "runtime",
      "user_name": "zoey",
      "ts_epoch": 1721751781601,
      "tags": [
        "attempt_id:0"
      ],
      "system_tags": null
    },
    {
      "id": 1915774,
      "flow_id": "AsanaFlow",
      "run_number": 6467,
      "run_id": null,
      "step_name": "start",
      "task_id": 99300,
      "task_name": null,
      "attempt_id": 0,
      "field_name": "user",
      "value": "zoey",
      "type": "user",
      "user_name": "zoey",
      "ts_epoch": 1721751779294,
      "tags": [
        "attempt_id:0"
      ],
      "system_tags": null
    },
    {
      "id": 1915778,
      "flow_id": "AsanaFlow",
      "run_number": 6467,
      "run_id": null,
      "step_name": "start",
      "task_id": 99300,
      "task_name": null,
      "attempt_id": 0,
      "field_name": "user",
      "value": "zoey",
      "type": "user",
      "user_name": "zoey",
      "ts_epoch": 1721751781607,
      "tags": [
        "attempt_id:0"
      ],
      "system_tags": null
    }
  ],
  "status": 200,
  "links": {
    "self": "<http://metaflow-ui-backend-service.metaflow.svc.cluster.local:8083/api/flows/AsanaFlow/runs/6467/metadata?step_name=start&_page=2>",
    "first": "<http://metaflow-ui-backend-service.metaflow.svc.cluster.local:8083/api/flows/AsanaFlow/runs/6467/metadata?step_name=start&_page=1>",
    "prev": "<http://metaflow-ui-backend-service.metaflow.svc.cluster.local:8083/api/flows/AsanaFlow/runs/6467/metadata?step_name=start&_page=1>",
    "next": null,
    "last": null
  },
  "pages": {
    "self": 2,
    "first": 1,
    "prev": 1,
    "next": null,
    "last": null
  },
  "query": {
    "step_name": "start",
    "_page": "2"
  }
}
that seems to be the raw json from that next api call that fails
would that be the run_id null stuff?
b
The data will come from a call of the form `/flows/${run.flow_id}/runs/${run.run_number}/metadata`, and will be in an object called 'execution-triggers'
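One way to hunt for the rogue trigger is to paste the `data` array from such a response into a small script. The sketch below assumes each row has the `field_name` and JSON-encoded `value` fields seen in the responses in this thread; `MetadataRow`, `TriggerEntry` and `findRogueTriggers` are hypothetical names, not part of metaflow-ui or metaflow-service.

```typescript
// Assumed subset of one row in the /metadata response.
interface MetadataRow {
  id: number;
  flow_id: string;
  run_number: number;
  field_name: string;
  value: string;
}

// Assumed shape of one entry inside the JSON-encoded `value` string.
interface TriggerEntry {
  id?: string | null;
  name?: string | null;
  type?: string | null;
}

interface RogueTrigger {
  rowId: number;
  run: string;
  entry: TriggerEntry;
}

// Collects trigger entries whose id or name is missing, null, or ''.
function findRogueTriggers(rows: MetadataRow[]): RogueTrigger[] {
  const rogue: RogueTrigger[] = [];
  for (const row of rows) {
    if (row.field_name !== "execution-triggers") continue;

    let entries: unknown;
    try {
      entries = JSON.parse(row.value);
    } catch {
      continue; // value is not valid JSON; skip it rather than guess
    }
    if (!Array.isArray(entries)) continue;

    for (const entry of entries as (TriggerEntry | null)[]) {
      if (!entry) continue; // ignore null placeholders in the array
      if (!entry.id || !entry.name) {
        rogue.push({ rowId: row.id, run: `${row.flow_id}/${row.run_number}`, entry });
      }
    }
  }
  return rogue;
}
```

Calling `findRogueTriggers(response.data)` on each page of the metadata response returns the row ids and runs worth a closer look; an empty array means no trigger on that page has an empty `id` or `name`.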
h
yep
b
The network request should have returned before the js error occurs
h
it looks like I am getting a 200 from it yeah
what are you needing from it?
b
Something like
{
      id: 'bob',
      name: 'table_updated',
      type: 'trigger',
      description: 'Table update trigger',
    },
I am guessing that `id` will be '' or null or undefined for one of them
In summary: I can push out a new release of MFGUI that is more robust and doesn't bork on certain trigger IDs. If you are able to find the rogue ID, I can be more clever about what to show when we get that particular ID; otherwise the label will be '', which is probably not ideal.
Also, the rogue ID might uncover an issue in your system
h
I am unable to find the null id, happy to jump on a call with you and share what I am seeing though
b
OK - I am in another mtg now, I will huddle in 10 mins or so
👍 1
h
the run that seems to always be causing the issue is the one I showed above where we have some of the run_id and task_name as null but nothing with an id of null
is there a way to remove a run from the database (assuming this is where those are coming from)
"value": "[{\"timestamp\": null, \"id\": null, \"name\": null, \"type\": \"event\"}]",
{
  "id": 1914119,
  "flow_id": "AnalysisFlow",
  "run_number": 6458,
  "run_id": "argo-regiontest-analysis-flow-6h4ng",
  "step_name": "start",
  "task_id": 99213,
  "task_name": "t-94de5ded",
  "attempt_id": 0,
  "field_name": "execution-triggers",
  "value": "[{\"timestamp\": null, \"id\": null, \"name\": null, \"type\": \"event\"}]",
  "type": "execution-triggers",
  "user_name": "argo-workflows",
  "ts_epoch": 1721748707227,
  "tags": [
    "attempt_id:0"
  ],
  "system_tags": null
},
This is the latest metaflow-service. It references v1.3.13 of metaflow-ui. Let me know how you go
h
oh, the images are now on docker hub? 😮 we had been using the ecr repo
we have been using a separate ui image
b
ok
Are you able to get v1.3.13 of metaflow-ui now?
h
public.ecr.aws/outerbounds/metaflow_ui
let me see if it's available
doesn't look like it's been pushed to ecr
b
I'll make that happen
h
I was under the impression that it was better to use a separate image for the deployment of the UI, did that change? we have 3 separate services deployed based on the original gcp examples I found
b
You're doing it right. There are many other ways of setting this up too.
Can you check now?
h
I see it now!! deploying, will report back!
deployed successfully and no longer seeing the issue!!!
thank you so much @brave-lion-15961