Actual context length?
README says 256k but config.json has "max_position_embeddings": 393216, so e.g. vLLM infers that as the max length. This is not a small difference.
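For reference, this is roughly how the discrepancy shows up when loading the config with transformers; a minimal sketch, and the model id below is just a placeholder for this repository:

```python
from transformers import AutoConfig

# Placeholder model id; substitute the actual repository name.
config = AutoConfig.from_pretrained("org/model-name")

print(config.max_position_embeddings)  # 393216, not the 262144 (256k) stated in the README
print(config.rope_scaling)             # the YaRN block that vLLM uses when deriving the max length
```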
Hi! Thanks for bringing this up.
max_position_embeddings is definitely correct here, as it's the value you can compute from the YaRN config (scaling factor * original max position embeddings).
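Roughly, the arithmetic looks like the sketch below. The concrete original length and factor are assumptions for illustration; the real numbers live in the rope_scaling block of config.json, but the relationship is the point:

```python
# Illustrative values only -- read the actual ones from the rope_scaling
# section of config.json.
original_max_position_embeddings = 131072  # assumed pre-YaRN context length
yarn_scaling_factor = 3.0                  # assumed YaRN "factor"

# max_position_embeddings is derived as factor * original length
max_position_embeddings = int(yarn_scaling_factor * original_max_position_embeddings)
print(max_position_embeddings)  # 393216, matching the value in config.json
```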
I've started investigating the issue: we also pass max_seq_len = 256k in the config, which is the value we recommend and the one you found in the model card. However, vLLM seems to enforce 393k based on the computation from the YaRN config, not the value specified by max_position_embeddings or max_seq_len. I'll continue investigating to determine whether this is intended behavior or a bug, and will open a PR/discussion with the vLLM team tomorrow morning.
Hey:
So after discussing with the vLLM team, I realized I had misunderstood the scope of max_seq_len inside the codebase. The two parameters have the same semantics there, so having two different values makes little sense.
The good news is that this misconfiguration has zero effect on the model's performance. It does, however, affect memory allocation, which is not what we intended.
I'm discussing with the vLLM team the best approach to set a default value (if possible) and will update params.json / README.md accordingly soon. In the meantime, you can pass --max-model-len 262144 to the serving command.
Edit: I updated the README and params.json; please make sure to pass --max-model-len 262144.
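If you serve through the Python API rather than the CLI, the equivalent cap is the max_model_len argument. A minimal sketch, with a placeholder model id:

```python
from vllm import LLM, SamplingParams

# Placeholder model id; replace with the actual repository name.
# max_model_len limits the context (and KV-cache allocation) to the recommended 256k.
llm = LLM(model="org/model-name", max_model_len=262144)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```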
All clear, thanks for the instructions and for investigating this.