Hi there,
I’d like to clarify two constraints mentioned in the documentation:
1. 10-second timeout after the first token is generated:
Is this timeout applied per LLM call (i.e., starting each time we get the first token from an LLM call), or is it measured from the first LLM call within a loop, even if the loop includes multiple LLM calls? (I've sketched both readings below.)
2. “Only answer texts generated within 30 seconds are considered”:
Is this 30-second limit applied per individual query or per batch?
- If it’s per query, does that mean with a batch size of 8, we effectively have up to 8 × 30 = 240 seconds to process the batch?
- If it’s per batch, can we adjust the batch size (e.g., from the default of 8 in the starter toolkit) to optimize the runtime and parallelism during testing?
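To make the first question concrete, here is a minimal sketch of the two readings I have in mind, assuming a plain Python loop; `llm_call` and the timing logic are placeholders I made up, not the actual evaluation harness:

```python
import time

TIMEOUT_S = 10

def llm_call(prompt: str) -> str:
    # Placeholder for the real LLM client; pretend the first token arrives shortly after the call starts.
    time.sleep(0.2)
    return f"response to {prompt!r}"

prompts = ["plan", "retrieve", "answer"]

# Reading A: a fresh 10-second window opens at each call's first token.
for p in prompts:
    call_start = time.monotonic()  # stand-in for "first token of this call"
    llm_call(p)
    print("within limit:", (time.monotonic() - call_start) <= TIMEOUT_S)

# Reading B: one 10-second window opens at the first call's first token
# and covers every later call in the loop.
loop_start = time.monotonic()      # stand-in for "first token of the first call"
for p in prompts:
    llm_call(p)
    print("within limit:", (time.monotonic() - loop_start) <= TIMEOUT_S)
```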
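And for the second question, a similar sketch of per-query vs. per-batch accounting of the 30-second limit; `generate_answer`, `BATCH_SIZE`, and the bookkeeping are again hypothetical, not the starter toolkit's actual API:

```python
import time

BATCH_SIZE = 8       # hypothetical stand-in for the toolkit's default batch size
TIME_LIMIT_S = 30

def generate_answer(query: str) -> str:
    # Placeholder for the real per-query work.
    time.sleep(0.1)
    return f"answer to {query!r}"

queries = [f"q{i}" for i in range(BATCH_SIZE)]

# Reading A: 30 s per individual query -> up to 8 * 30 = 240 s for the whole batch.
per_query_answers = []
for q in queries:
    start = time.monotonic()
    ans = generate_answer(q)
    kept = (time.monotonic() - start) <= TIME_LIMIT_S  # only answers produced within 30 s count
    per_query_answers.append(ans if kept else None)

# Reading B: 30 s shared across the whole batch of 8 queries.
batch_start = time.monotonic()
per_batch_answers = []
for q in queries:
    if time.monotonic() - batch_start > TIME_LIMIT_S:
        per_batch_answers.append(None)  # shared budget already exhausted
    else:
        per_batch_answers.append(generate_answer(q))
```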
Thanks for your help!