Hi there,
I’d like to clarify two constraints mentioned in the documentation:
1. 10-second timeout after the first token is generated:
Is this timeout applied per LLM call (i.e., starting each time we get the first token from an LLM call), or is it measured from the first LLM call within a loop, even if the loop includes multiple LLM calls? (I've sketched both readings below.)
2. “Only answer texts generated within 30 seconds are considered”:
Is this 30-second limit applied per individual query or per batch?
- If it’s per query, does that mean with a batch size of 8, we effectively have up to 8 × 30 = 240 seconds to process the batch?
- If it’s per batch, can we adjust the batch size (e.g., from the default of 8 in the starter toolkit) to optimize the runtime and parallelism during testing?
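To make the first question concrete, here is a minimal sketch of the two readings I have in mind, assuming a plain Python loop; `llm_call` and the timing logic are placeholders I made up, not the actual evaluation harness:

```python
import time

TIMEOUT_S = 10

def llm_call(prompt: str) -> str:
    # Placeholder for the real LLM client; pretend the first token arrives shortly after the call starts.
    time.sleep(0.2)
    return f"response to {prompt!r}"

prompts = ["plan", "retrieve", "answer"]

# Reading A: a fresh 10-second window opens at each call's first token.
for p in prompts:
    call_start = time.monotonic()  # stand-in for "first token of this call"
    llm_call(p)
    print("within limit:", (time.monotonic() - call_start) <= TIMEOUT_S)

# Reading B: one 10-second window opens at the first call's first token
# and covers every later call in the loop.
loop_start = time.monotonic()      # stand-in for "first token of the first call"
for p in prompts:
    llm_call(p)
    print("within limit:", (time.monotonic() - loop_start) <= TIMEOUT_S)
```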
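And for the second question, a similar sketch of per-query vs. per-batch accounting of the 30-second limit; `generate_answer`, `BATCH_SIZE`, and the bookkeeping are again hypothetical, not the starter toolkit's actual API:

```python
import time

BATCH_SIZE = 8       # hypothetical stand-in for the toolkit's default batch size
TIME_LIMIT_S = 30

def generate_answer(query: str) -> str:
    # Placeholder for the real per-query work.
    time.sleep(0.1)
    return f"answer to {query!r}"

queries = [f"q{i}" for i in range(BATCH_SIZE)]

# Reading A: 30 s per individual query -> up to 8 * 30 = 240 s for the whole batch.
per_query_answers = []
for q in queries:
    start = time.monotonic()
    ans = generate_answer(q)
    kept = (time.monotonic() - start) <= TIME_LIMIT_S  # only answers produced within 30 s count
    per_query_answers.append(ans if kept else None)

# Reading B: 30 s shared across the whole batch of 8 queries.
batch_start = time.monotonic()
per_batch_answers = []
for q in queries:
    if time.monotonic() - batch_start > TIME_LIMIT_S:
        per_batch_answers.append(None)  # shared budget already exhausted
    else:
        per_batch_answers.append(generate_answer(q))
```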
Thanks for your help!