In the Overview, it’s mentioned that "Each example will have a time-out limit of 10 seconds." Does this mean that, after calling generate_answer() for each query input, the final answer must be returned within 10 seconds?
In the Rules, it’s mentioned that "A time-out is applied after the first token was generated." However, the baseline isn’t stream-based, so what does "first token" mean in this context?
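For reference, here is how I would interpret "first token" if generation were streamed. This is only a sketch using Hugging Face's TextIteratorStreamer; the model id and generation settings are placeholders, not the actual baseline (which, as noted, is not stream-based):

```python
# Sketch only: how "time to first token" could be measured IF generation
# were streamed. Model id and settings are placeholders, not the baseline.
import time
from threading import Thread

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder 7B model (gated on the Hub)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def stream_generate(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # run generation in a background thread so we can consume the stream
    thread = Thread(
        target=model.generate,
        kwargs={**inputs, "streamer": streamer, "max_new_tokens": 256},
    )
    start = time.perf_counter()
    thread.start()

    first_token_time = None
    chunks = []
    for chunk in streamer:
        if first_token_time is None:
            # presumably the moment the Rules' timeout would start counting
            first_token_time = time.perf_counter() - start
        chunks.append(chunk)
    thread.join()

    print(f"time to first token: {first_token_time:.2f}s")
    return "".join(chunks)
```

With the non-streaming baseline there is no such intermediate point to observe, which is why the Rules' wording is unclear to me.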
Additionally, will the 10-second limit be reconsidered and possibly relaxed? I tested the full-precision llama7b baseline for task 3 on an RTX 3090: the embedding + retrieval step takes 1-2 seconds, while the entire generate_answer() call takes around 7 seconds, as timed in the sketch below.
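The numbers above came from simple wall-clock timing around the call. A minimal sketch of the measurement, where generate_answer is just a stub standing in for the baseline's real entry point:

```python
# Minimal sketch of how the timings above were taken. `generate_answer`
# is a stub standing in for the baseline's real function; in the real
# baseline, embedding + retrieval happens inside this call (~1-2 s of
# the ~7 s total on the RTX 3090).
import time

def generate_answer(query: str) -> str:
    return "stub answer"  # placeholder for the baseline entry point

query = "example query"
start = time.perf_counter()
answer = generate_answer(query)
elapsed = time.perf_counter() - start
print(f"generate_answer() wall-clock time: {elapsed:.2f}s")
```

If the 7 seconds I measured is counted against the full 10-second limit, the margin for larger inputs seems quite tight, so clarification on when the clock starts would help a lot.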