Describe the bug
The original regular expression captures tag content with ([^<]+)</llmlingua>, which fails to match when the text inside llmlingua tags contains other tags (such as <tag>...</tag>).
I suggest replacing it with something like ((?:[^<]*(?:<(?!/llmlingua>)[^>]*>)?)*?)</llmlingua>.
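To isolate the difference, here is a minimal sketch that compares only the two content-capturing groups on a short string. The attribute-parsing prefix of the full pattern is left out, and the names ORIGINAL, SUGGESTED, and text are made up for this illustration.

```python
import re

# Content sub-patterns only; the attribute-parsing prefix of the full
# pattern is omitted to keep the sketch short.
ORIGINAL = r"<llmlingua>([^<]+)</llmlingua>"
SUGGESTED = r"<llmlingua>((?:[^<]*(?:<(?!/llmlingua>)[^>]*>)?)*?)</llmlingua>"

# Two segments: one with plain text, one containing an unrelated nested tag.
text = "<llmlingua>plain</llmlingua><llmlingua>has <tag>inner</tag> text</llmlingua>"

original_matches = re.findall(ORIGINAL, text)
suggested_matches = re.findall(SUGGESTED, text)

print(original_matches)   # the segment with the nested tag is silently dropped
print(suggested_matches)  # both segments are captured, nested tag kept as text
```

The original group stops at the first "<" it sees, so any segment with an inner tag never reaches its closing </llmlingua> and is skipped rather than reported as an error.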
Steps to reproduce
Confronting pattern matching:
import re
from llmlingua import PromptCompressor

llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
    device_map='cpu',
)

structured_prompt_repo_example = """<llmlingua, compress=False>Speaker 4:</llmlingua><llmlingua, rate=0.4> Thank you. And can we do the functions for content? </llmlingua><llmlingua, compress=False>
Speaker 0:</llmlingua><llmlingua, rate=0.4> Item 11 is a communication from Council on Price recommendation to increase appropriation in the general fund group </llmlingua><llmlingua, compress=False>
Speaker 4:</llmlingua><llmlingua, rate=0.6> We have a promotion and a second time as councilman served Councilman Ringa and customers and they have any comments.</llmlingua>"""

original_pattern = r"<llmlingua\s*(?:,\s*rate\s*=\s*([\d\.]+))?\s*(?:,\s*compress\s*=\s*(True|False))?\s*(?:,\s*rate\s*=\s*([\d\.]+))?\s*(?:,\s*compress\s*=\s*(True|False))?\s*>([^<]+)</llmlingua>"

matches = re.findall(original_pattern, structured_prompt_repo_example)
print(matches)
# output: [('', 'False', '', '', 'Speaker 4:'), ('0.4', '', '', '', ' Thank you. And can we do the functions for content? '),
# ('', 'False', '', '', '\nSpeaker 0:'), ('0.4', '', '', '', ' Item 11 is a communication from Council on Price recommendation to increase appropriation in the general fund group '),
# ('', 'False', '', '', '\nSpeaker 4:'), ('0.6', '', '', '', ' We have a promotion and a second time as councilman served Councilman Ringa and customers and they have any comments.')]

structured_prompt_with_nested_tags = """<llmlingua, compress=False>Speaker 4:</llmlingua><llmlingua, rate=0.4> Thank you. And can we do the functions for content? </llmlingua><llmlingua, compress=False>
Speaker 0:</llmlingua><llmlingua, rate=0.4> Item 11 is a communication from Council on Price recommendation to increase appropriation in the general fund group </llmlingua><llmlingua, compress=False>
Speaker 4:</llmlingua><llmlingua, rate=0.6> We have a promotion <tag> and a second time as councilman served Councilman Ringa </tag> and customers and they have any comments.</llmlingua>"""

matches = re.findall(original_pattern, structured_prompt_with_nested_tags)
print(matches)
# output: [('', 'False', '', '', 'Speaker 4:'), ('0.4', '', '', '', ' Thank you. And can we do the functions for content? '),
# ('', 'False', '', '', '\nSpeaker 0:'), ('0.4', '', '', '', ' Item 11 is a communication from Council on Price recommendation to increase appropriation in the general fund group '),
# ('', 'False', '', '', '\nSpeaker 4:')]

new_pattern = r"<llmlingua\s*(?:,\s*rate\s*=\s*([\d\.]+))?\s*(?:,\s*compress\s*=\s*(True|False))?\s*(?:,\s*rate\s*=\s*([\d\.]+))?\s*(?:,\s*compress\s*=\s*(True|False))?\s*>((?:[^<]*(?:<(?!/llmlingua>)[^>]*>)?)*?)</llmlingua>"

matches = re.findall(new_pattern, structured_prompt_with_nested_tags)
print(matches)
# output: [('', 'False', '', '', 'Speaker 4:'), ('0.4', '', '', '', ' Thank you. And can we do the functions for content? '),
# ('', 'False', '', '', '\nSpeaker 0:'), ('0.4', '', '', '', ' Item 11 is a communication from Council on Price recommendation to increase appropriation in the general fund group '),
# ('', 'False', '', '', '\nSpeaker 4:'), ('0.6', '', '', '', ' We have a promotion <tag> and a second time as councilman served Councilman Ringa </tag> and customers and they have any comments.')]

Confronting "compressions" with/without nested non-llmlingua-related tags:
compressed_prompt_repo_example = llm_lingua.structured_compress_prompt(structured_prompt_repo_example, instruction="", question="", rate=0.5)
print(compressed_prompt_repo_example['compressed_prompt'])
# output: Speaker 4: Thank you. And can we do the functions for content?
# Speaker 0: Item 11 is a communication from Council on Price recommendation to increase appropriation in the general fund group
# Speaker 4: We have a promotion and a second time as councilman served Councilman Ringa and customers and they have any comments.

compressed_prompt_with_nested_tags = llm_lingua.structured_compress_prompt(structured_prompt_with_nested_tags, instruction="", question="", rate=0.5)
print(compressed_prompt_with_nested_tags['compressed_prompt'])
# output: Speaker 4: Thank you. And can we do the functions for content?
# Speaker 0: Item 11 is a communication from Council on Price recommendation to increase appropriation in the general fund group
# Speaker 4:
Expected Behavior
The regular expression should handle nested tags within the prompt without disrupting the matching process, treating them as plain text content.
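One edge case worth checking against this expectation is content with a lone "<" that is not part of any tag. The sketch below is only an illustration, not the maintainers' fix. It compares the suggested content group with a hypothetical alternative, a lazy (.*?) compiled with re.DOTALL, which treats everything up to the next </llmlingua> as plain text. It assumes llmlingua tags are never nested inside each other; the names SUGGESTED, ALTERNATIVE, and text are invented for the example.

```python
import re

# Content sub-pattern suggested in this issue (attribute prefix omitted).
SUGGESTED = r"<llmlingua>((?:[^<]*(?:<(?!/llmlingua>)[^>]*>)?)*?)</llmlingua>"

# Hypothetical alternative: lazily take anything up to the next closing tag.
# re.DOTALL lets "." match the newlines that appear inside the prompts.
ALTERNATIVE = re.compile(r"<llmlingua>(.*?)</llmlingua>", re.DOTALL)

# The first segment contains a lone "<" that is never closed by its own ">".
text = "<llmlingua>a < b</llmlingua><llmlingua>c <tag>d</tag></llmlingua>"

suggested_matches = re.findall(SUGGESTED, text)
alternative_matches = ALTERNATIVE.findall(text)

print(len(suggested_matches))  # segments found by the suggested pattern
print(alternative_matches)     # segments found by the lazy alternative
```

With the suggested group, the lone "<" is read as the start of a tag that runs to the next ">", which swallows the first closing </llmlingua> and merges both segments into one match. The lazy alternative keeps the two segments separate.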
Logs
No response
Additional Information
No response
Commit on my fork: 73baf3f