♻️ [Tasks] JSON spec: text-generation #468
Conversation
 * Best of
 */
doSample?: boolean;
bestOf?: number;
@Narsil @OlivierDehaene I'm not sure what this parameter represents
do_sample: bool -> Whether generation should be greedy or not (it cannot be set to True if any other sampling param is set, but sometimes you do not want to specify temp, top_p, or anything else, so setting do_sample=True is for exactly that case).
best_of: int -> Run n sampling queries and return the best one (in terms of total logprob). Clashes with streaming.
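For illustration only, here is a minimal sketch of how these two fields could be documented in a `TextGenerationParameters`-style interface; the interface name and doc comments are assumptions based on the explanation above, not the final spec:

```ts
/**
 * Illustrative sketch only (not the final spec): one way `do_sample` / `best_of`
 * could be documented in a TextGenerationParameters-style interface.
 */
interface SamplingParametersSketch {
  /**
   * Whether to sample instead of using greedy decoding. Useful when no other
   * sampling parameter (temperature, top_p, ...) is specified.
   */
  doSample?: boolean;
  /**
   * Number of sampled sequences to run; the one with the highest total
   * log-probability is returned. Clashes with streaming.
   */
  bestOf?: number;
}
```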
Given https://github.com/huggingface/moon-landing/pull/8723, do we really need the conversational spec?
@osanseviero I think speccing the conversational API can still be useful for the widget and inference clients, wdyt?
Sorry for being late to the show here. I'm currently reviewing this PR :) (don't merge it too quickly 🙏)
Tried to list all the differences I've spotted between these specs, the TGI documentation, and the Python client implementation. I think we should stick to the existing TGI parameters as they are already in use + "newer" than the transformers-based parameters.
@@ -3,7 +3,6 @@
 *
(commenting on `inference.ts` but more related to `./spec`)
The TGI server has OpenAPI documentation describing its inputs and outputs (see the Swagger UI and openapi.json). I've spotted a few differences compared to the specs here. Some of these properties are currently specific to TGI and not available in the "transformers-based inference" endpoints.
In `TextGenerationParameters`:
- `max_new_tokens` (integer) => Maximum number of generated tokens.
- `repetition_penalty` (number) => The parameter for repetition penalty. A value of 1.0 means no penalty. See this paper for more details.
- `seed` (integer) => Random sampling seed.
- `stop` (array of string) => Stop generating tokens if a member of `stop` is generated. (currently named `stop_sequences` in these specs / `stop` in TGI => same purpose => to harmonize?)
- `top_n_tokens` (integer) => ??? (no idea what this is. We don't have it in the Python client either)
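To make the gap concrete, here is a rough sketch of what these missing fields could look like on the spec side. Field names and types are assumptions derived from the list above (and TGI's OpenAPI schema), not the final spec:

```ts
/**
 * Sketch of the TGI parameters listed above that are missing from the spec.
 * Names and types are assumptions, not the final spec.
 */
interface TgiExtraParametersSketch {
  /** Maximum number of generated tokens. */
  maxNewTokens?: number;
  /** Repetition penalty; a value of 1.0 means no penalty. */
  repetitionPenalty?: number;
  /** Random sampling seed. */
  seed?: number;
  /** Stop generating when one of these sequences is produced (TGI `stop` / spec `stop_sequences`). */
  stopSequences?: string[];
  /** Present in TGI's schema; its exact purpose is questioned in this thread. */
  topNTokens?: number;
}
```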
In the Python client, along with the `inputs` (string) and `parameters` (TextGenerationParameters) values in the request payload, we also have a `stream` (bool) value. If True, the output type is different (same as `/generate_stream` instead of `/generate` in TGI). I don't see it documented here, so I wonder if these docs might be a bit outdated (ping @Narsil, you might know more?). In any case, this option is currently working and in use, so it is worth keeping.
In `TextGenerationOutputDetails`:
- `best_of_sequences` (array of BestOfSequence) => Additional sequences when using the `best_of` parameter.

`BestOfSequence` (currently missing):
- `generated_text` (string) => The generated text.
- `finish_reason` (FinishReason) => The reason for the generation to finish, represented by a FinishReason value.
- `generated_tokens` (integer) => The number of generated tokens in the sequence.
- `seed` (integer) => The sampling seed if sampling was activated.
- `prefill` (array of InputToken) => The decoder input tokens. Empty if `decoder_input_details` is False.
- `tokens` (array of Token) => The generated tokens.
The `BestOfSequence` object is actually very similar to `TextGenerationOutputDetails`, but one level of nesting below.
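A rough sketch of the missing `BestOfSequence` shape, mirroring the field list above; the camelCase names and the inline token shapes are assumptions, not the final spec:

```ts
/**
 * Sketch of the missing BestOfSequence object, mirroring the field list above.
 * Names, nesting, and the inline token shapes are assumptions, not the final spec.
 */
interface BestOfSequenceSketch {
  /** The generated text. */
  generatedText: string;
  /** Why generation finished (a FinishReason value). */
  finishReason: string;
  /** Number of generated tokens in the sequence. */
  generatedTokens: number;
  /** The sampling seed, if sampling was activated. */
  seed?: number;
  /** Decoder input tokens; empty if `decoder_input_details` is false. */
  prefill: Array<{ id: number; text: string; logprob?: number }>;
  /** The generated tokens. */
  tokens: Array<{ id: number; text: string; logprob: number; special: boolean }>;
}
```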
I deliberately left out the `stream` parameter because it is API/transport-specific (and not inference-specific). I'm happy to revisit that if there's a consensus, though.
Thanks for making the changes @SBrandeis! Re-reviewed it, and it now looks good to me to use in the Python client. Fair enough about not adding the `stream` parameter (and let's revisit if we realize we really need it at some point).
TL;DR:
- Update `text-generation` spec to match the TGI API
- Add `conversational` spec, heavily inspired by the TGI messages API (cc @Wauplin @osanseviero @Narsil)
- Relevant related work: Update conversational widget to use text-generation (+ remove `conversational` task) #457 & https://github.com/huggingface/moon-landing/pull/8723