Tokenizer APIs Update (#7190)

* ToknizersAPIsUpdate
* Address the feedback
dotnet · Jul 15, 2024 · 579fe03
1 parent f5abe6a
commit 579fe03

Showing 38 changed files with 4,831 additions and 2,817 deletions.
diff --git a/docs/code/microsoft-ml-tokenizers-migration-guide.md b/docs/code/microsoft-ml-tokenizers-migration-guide.md
@@ -8,13 +8,13 @@ This guide provides general guidance on how to migrate from various tokenizer li
 
 | Microsoft.DeepDev.TokenizerLib | Microsoft.ML.Tokenizers
 | --- | --- |
-| [TikTokenizer](https://github.com/microsoft/Tokenizer/blob/2c9ba5d343de52eb27521afef7c0c2f0f76c9c52/Tokenizer_C%23/TokenizerLib/TikTokenizer.cs#L20) | [Tokenizer](https://github.com/dotnet/machinelearning/blob/4d5317e8090e158dc7c3bc6c435926ccf1cbd8e2/src/Microsoft.ML.Tokenizers/Model/Tiktoken.cs#L41) |
-| [ITokenizer](https://github.com/microsoft/Tokenizer/blob/2c9ba5d343de52eb27521afef7c0c2f0f76c9c52/Tokenizer_C%23/TokenizerLib/ITokenizer.cs#L7) | [Tokenizer](https://github.com/dotnet/machinelearning/blob/4d5317e8090e158dc7c3bc6c435926ccf1cbd8e2/src/Microsoft.ML.Tokenizers/Tokenizer.cs#L29) |
-| [TokenizerBuilder](https://github.com/microsoft/Tokenizer/blob/2c9ba5d343de52eb27521afef7c0c2f0f76c9c52/Tokenizer_C%23/TokenizerLib/TokenizerBuilder.cs#L14) | [Tokenizer.CreateTiktokenForModel](https://github.com/dotnet/machinelearning/blob/4d5317e8090e158dc7c3bc6c435926ccf1cbd8e2/src/Microsoft.ML.Tokenizers/Tokenizer.cs#L324) embedded<br> [Tokenizer.CreateTiktokenForModel(Async/Stream)](https://github.com/dotnet/machinelearning/blob/4d5317e8090e158dc7c3bc6c435926ccf1cbd8e2/src/Microsoft.ML.Tokenizers/Tokenizer.cs#L241-L315) user provided file stream |
+| [TikTokenizer](https://github.com/microsoft/Tokenizer/blob/2c9ba5d343de52eb27521afef7c0c2f0f76c9c52/Tokenizer_C%23/TokenizerLib/TikTokenizer.cs#L20) | Tokenizer |
+| [ITokenizer](https://github.com/microsoft/Tokenizer/blob/2c9ba5d343de52eb27521afef7c0c2f0f76c9c52/Tokenizer_C%23/TokenizerLib/ITokenizer.cs#L7) | Tokenizer |
+| [TokenizerBuilder](https://github.com/microsoft/Tokenizer/blob/2c9ba5d343de52eb27521afef7c0c2f0f76c9c52/Tokenizer_C%23/TokenizerLib/TokenizerBuilder.cs#L14) | TiktokenTokenizer.CreateForModel <br> TiktokenTokenizer.CreateForModel(Async/Stream) user provided file stream |
 
 ### General Guidance
 
-- To avoid embedding the tokenizer's vocabulary files in the code assembly or downloading them at runtime when using one of the standard Tiktoken vocabulary files, utilize the [`CreateTiktokenForModel`](https://github.com/dotnet/machinelearning/blob/4d5317e8090e158dc7c3bc6c435926ccf1cbd8e2/src/Microsoft.ML.Tokenizers/Tokenizer.cs#L324) function. The [table](https://github.com/dotnet/machinelearning/blob/4d5317e8090e158dc7c3bc6c435926ccf1cbd8e2/src/Microsoft.ML.Tokenizers/Model/Tiktoken.cs#L683-L734) lists the mapping of model names to the corresponding vocabulary files used with each model. This table offers clarity regarding the vocabulary file linked with each model, alleviating users from the concern of carrying or downloading such vocabulary files if they utilize one of the models listed.
-- Avoid hard-coding tiktoken regexes and special tokens.  Instead use the appropriate Tiktoken.`CreateTiktokenForModel/Async` method to create the tokenizer using the model name, or a provided stream.
-- Avoid doing encoding if you need the token count or encoded Ids. Instead use `Tokenizer.CountTokens` for getting the token count and `Tokenizer.EncodeToIds` for getting the encode ids.
-- Avoid doing encoding if all you need is to truncate to a token budget.  Instead use `Tokenizer.IndexOfCount` or `LastIndexOfCount` to find the index to truncate from the start or end of a string, respectively.
+- To avoid embedding the tokenizer's vocabulary files in the code assembly or downloading them at runtime when using one of the standard Tiktoken vocabulary files, utilize the `TiktokenTokenizer.CreateForModel` function. The [table](https://github.com/dotnet/machinelearning/blob/4d5317e8090e158dc7c3bc6c435926ccf1cbd8e2/src/Microsoft.ML.Tokenizers/Model/Tiktoken.cs#L683-L734) lists the mapping of model names to the corresponding vocabulary files used with each model. This table offers clarity regarding the vocabulary file linked with each model, alleviating users from the concern of carrying or downloading such vocabulary files if they utilize one of the models listed.
+- Avoid hard-coding tiktoken regexes and special tokens.  Instead use the appropriate Tiktoken.`TiktokenTokenizer.CreateForModel/Async` method to create the tokenizer using the model name, or a provided stream.
+- Avoid doing encoding if you need the token count or encoded Ids. Instead use `TiktokenTokenizer.CountTokens` for getting the token count and `TiktokenTokenizer.EncodeToIds` for getting the encode ids.
+- Avoid doing encoding if all you need is to truncate to a token budget.  Instead use `TiktokenTokenizer.GetIndexByTokenCount` or `GetIndexByTokenCountFromEnd` to find the index to truncate from the start or end of a string, respectively.
diff --git a/src/Microsoft.ML.Tokenizers/EncodeResults.cs b/src/Microsoft.ML.Tokenizers/EncodeResults.cs
@@ -0,0 +1,30 @@
+// Licensed to the .NET Foundation under one or more agreements.
+// The .NET Foundation licenses this file to you under the MIT license.
+// See the LICENSE file in the project root for more information.
+
+using System.Collections.Generic;
+
+namespace Microsoft.ML.Tokenizers
+{
+    /// <summary>
+    /// The result of encoding a text.
+    /// </summary>
+    /// <typeparam name="T">The type of the tokens.</typeparam>
+    public struct EncodeResults<T>
+    {
+        /// <summary>
+        /// Gets or sets the list of tokens generated from the encoded text.
+        /// </summary>
+        public IReadOnlyList<T> Tokens { get; set; }
+
+        /// <summary>
+        /// Gets or sets the normalized text generated during the encoding process. This can be <see langword="null"/> if the encoding process does not normalize the input text.
+        /// </summary>
+        public string? NormalizedText { get; set; }
+
+        /// <summary>
+        /// Gets or sets the count of characters consumed from the input text.
+        /// </summary>
+        public int CharsConsumed { get; set; }
+    }
+}
diff --git a/src/Microsoft.ML.Tokenizers/EncodeSettings.cs b/src/Microsoft.ML.Tokenizers/EncodeSettings.cs
@@ -0,0 +1,29 @@
+// Licensed to the .NET Foundation under one or more agreements.
+// The .NET Foundation licenses this file to you under the MIT license.
+// See the LICENSE file in the project root for more information.
+
+namespace Microsoft.ML.Tokenizers
+{
+    /// <summary>
+    /// The settings used to encode a text.
+    /// </summary>
+    public struct EncodeSettings
+    {
+        public EncodeSettings() { MaxTokenCount = int.MaxValue; }
+        /// <summary>
+        /// Gets or sets a value indicating whether to consider the input normalization during encoding.
+        /// </summary>
+        public bool ConsiderNormalization { get; set; }
+
+        /// <summary>
+        /// Gets or sets a value indicating whether to consider the pre-tokenization during encoding.
+        /// </summary>
+        public bool ConsiderPreTokenization { get; set; }
+
+        /// <summary>
+        /// Gets or sets the maximum number of tokens to generate.
+        /// </summary>
+        public int MaxTokenCount { get; set; }
+    }
+}
+
diff --git a/src/Microsoft.ML.Tokenizers/Token.cs → src/Microsoft.ML.Tokenizers/EncodedToken.cs b/src/Microsoft.ML.Tokenizers/Token.cs → src/Microsoft.ML.Tokenizers/EncodedToken.cs
@@ -12,7 +12,7 @@ namespace Microsoft.ML.Tokenizers
     /// Represent the token produced from the tokenization process containing the token substring,
     /// the id associated to the token substring, and the offset mapping to the original string.
     /// </summary>
-    public readonly struct Token
+    public readonly struct EncodedToken
     {
         /// <summary>
         /// Gets the Id value associated to the token.
@@ -35,7 +35,7 @@ public readonly struct Token
         /// <param name="id">The Id value associated to the token.</param>
         /// <param name="value">The token string value.</param>
         /// <param name="offset">The offset mapping to the original string.</param>
-        public Token(int id, string value, (int, int) offset)
+        public EncodedToken(int id, string value, (int, int) offset)
         {
             Id = id;
             Offset = offset;