[Bug]: regex pattern does not handle nested tags in prompt (structured compression) #201

Open
Imatgay opened this issue Nov 20, 2024 · 0 comments
Labels
bug Something isn't working

Describe the bug

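The original regular expression captures the body of each block with ([^<]+)</llmlingua>, so matching fails whenever the text inside llmlingua tags contains other tags (like <tag>...</tag>).
I suggest replacing that group with something like ((?:[^<]*(?:<(?!/llmlingua>)[^>]*>)?)*?)</llmlingua>, which accepts any tag other than the closing </llmlingua> as part of the body.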
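One case the suggested group may not cover is a lone `<` that is not part of a tag (for example `a < b`). The sketch below compares it with a hypothetical alternative that consumes any character as long as it does not start `</llmlingua>`. The names `PREFIX`, `PROPOSED_BODY`, `TEMPERED_BODY` and `bodies` are illustrative only and are not part of llmlingua; the attribute prefix is copied from the pattern in this issue.

```python
import re

# Attribute prefix copied unchanged from the pattern in this issue.
PREFIX = (
    r"<llmlingua\s*(?:,\s*rate\s*=\s*([\d\.]+))?\s*(?:,\s*compress\s*=\s*(True|False))?"
    r"\s*(?:,\s*rate\s*=\s*([\d\.]+))?\s*(?:,\s*compress\s*=\s*(True|False))?\s*>"
)

# Body group suggested in this issue.
PROPOSED_BODY = r"((?:[^<]*(?:<(?!/llmlingua>)[^>]*>)?)*?)"

# Hypothetical alternative: take one character at a time, stopping
# right before the closing tag. Needs re.DOTALL so "." spans newlines.
TEMPERED_BODY = r"((?:(?!</llmlingua>).)*)"


def bodies(body_group, prompt):
    """Return the captured body (group 5) of every <llmlingua> block."""
    pattern = re.compile(PREFIX + body_group + r"</llmlingua>", re.DOTALL)
    return [m.group(5) for m in pattern.finditer(prompt)]


nested = "<llmlingua, rate=0.6> a <tag> b </tag> c</llmlingua>"
stray = "<llmlingua, rate=0.6> if a < b then stop</llmlingua>"

print(bodies(PROPOSED_BODY, nested))  # nested tag kept as plain text
print(bodies(TEMPERED_BODY, nested))  # same result
print(bodies(PROPOSED_BODY, stray))   # no match: the lone "<" never closes
print(bodies(TEMPERED_BODY, stray))   # body captured, "<" included
```

Either group fixes the nested-tag case reported here; which one is preferable depends on whether prompts are expected to contain bare `<` characters.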

Commit on my fork: 73baf3f

Steps to reproduce

Comparing pattern matching:

import re

from llmlingua import PromptCompressor

llm_lingua = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True, 
    device_map='cpu',
)

structured_prompt_repo_example ="""<llmlingua, compress=False>Speaker 4:</llmlingua><llmlingua, rate=0.4> Thank you. And can we do the functions for content? </llmlingua><llmlingua, compress=False>
Speaker 0:</llmlingua><llmlingua, rate=0.4> Item 11 is a communication from Council on Price recommendation to increase appropriation in the general fund group </llmlingua><llmlingua, compress=False>
Speaker 4:</llmlingua><llmlingua, rate=0.6> We have a promotion and a second time as councilman served Councilman Ringa and customers and they have any comments.</llmlingua>"""

original_pattern = r"<llmlingua\s*(?:,\s*rate\s*=\s*([\d\.]+))?\s*(?:,\s*compress\s*=\s*(True|False))?\s*(?:,\s*rate\s*=\s*([\d\.]+))?\s*(?:,\s*compress\s*=\s*(True|False))?\s*>([^<]+)</llmlingua>"

matches = re.findall(original_pattern, structured_prompt_repo_example)
print(matches)

# output: [('', 'False', '', '', 'Speaker 4:'), ('0.4', '', '', '', ' Thank you. And can we do the functions for content? '), 
#          ('', 'False', '', '', '\nSpeaker 0:'), ('0.4', '', '', '', ' Item 11 is a communication from Council on Price recommendation to increase appropriation in the general fund group '), 
#          ('', 'False', '', '', '\nSpeaker 4:'), ('0.6', '', '', '', ' We have a promotion and a second time as councilman served Councilman Ringa and customers and they have any comments.')]

structured_prompt_with_nested_tags = """<llmlingua, compress=False>Speaker 4:</llmlingua><llmlingua, rate=0.4> Thank you. And can we do the functions for content? </llmlingua><llmlingua, compress=False>
Speaker 0:</llmlingua><llmlingua, rate=0.4> Item 11 is a communication from Council on Price recommendation to increase appropriation in the general fund group </llmlingua><llmlingua, compress=False>
Speaker 4:</llmlingua><llmlingua, rate=0.6> We have a promotion <tag> and a second time as councilman served Councilman Ringa  </tag> and customers and they have any comments.</llmlingua>"""

matches = re.findall(original_pattern, structured_prompt_with_nested_tags)
print(matches)

# output: [('', 'False', '', '', 'Speaker 4:'), ('0.4', '', '', '', ' Thank you. And can we do the functions for content? '), 
#          ('', 'False', '', '', '\nSpeaker 0:'), ('0.4', '', '', '', ' Item 11 is a communication from Council on Price recommendation to increase appropriation in the general fund group '), 
#          ('', 'False', '', '', '\nSpeaker 4:')] 

new_pattern = r"<llmlingua\s*(?:,\s*rate\s*=\s*([\d\.]+))?\s*(?:,\s*compress\s*=\s*(True|False))?\s*(?:,\s*rate\s*=\s*([\d\.]+))?\s*(?:,\s*compress\s*=\s*(True|False))?\s*>((?:[^<]*(?:<(?!/llmlingua>)[^>]*>)?)*?)</llmlingua>"
matches = re.findall(new_pattern, structured_prompt_with_nested_tags)
print(matches)

# output: [('', 'False', '', '', 'Speaker 4:'), ('0.4', '', '', '', ' Thank you. And can we do the functions for content? '), 
#          ('', 'False', '', '', '\nSpeaker 0:'), ('0.4', '', '', '', ' Item 11 is a communication from Council on Price recommendation to increase appropriation in the general fund group '), 
#          ('', 'False', '', '', '\nSpeaker 4:'), ('0.6', '', '', '', ' We have a promotion <tag> and a second time as councilman served Councilman Ringa  </tag> and customers and they have any comments.')]

Comparing compression output with and without nested, non-llmlingua tags:

compressed_prompt_repo_example = llm_lingua.structured_compress_prompt(structured_prompt_repo_example, instruction="", question="", rate=0.5)
print(compressed_prompt_repo_example['compressed_prompt'])
# output: Speaker 4: Thank you. And can we do the functions for content? 
#         Speaker 0: Item 11 is a communication from Council on Price recommendation to increase appropriation in the general fund group 
#         Speaker 4: We have a promotion and a second time as councilman served Councilman Ringa and customers and they have any comments.


compressed_prompt_with_nested_tags = llm_lingua.structured_compress_prompt(structured_prompt_with_nested_tags, instruction="", question="", rate=0.5)
print(compressed_prompt_with_nested_tags['compressed_prompt'])
# output:  Speaker 4: Thank you. And can we do the functions for content? 
#          Speaker 0: Item 11 is a communication from Council on Price recommendation to increase appropriation in the general fund group 
#          Speaker 4:

Expected Behavior

The regular expression should handle nested tags within the prompt without disrupting the matching process, treating them as plain text content.
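The expected behavior could be pinned down with a small regression check that needs only the standard `re` module, with no model download. This is a minimal sketch: `ORIGINAL`, `PROPOSED` and `segments` are hypothetical names for illustration, and the prompt is a shortened version of the example above.

```python
import re

# Attribute prefix shared by both patterns, copied from this issue.
TAG_OPEN = (
    r"<llmlingua\s*(?:,\s*rate\s*=\s*([\d\.]+))?\s*(?:,\s*compress\s*=\s*(True|False))?"
    r"\s*(?:,\s*rate\s*=\s*([\d\.]+))?\s*(?:,\s*compress\s*=\s*(True|False))?\s*>"
)
ORIGINAL = re.compile(TAG_OPEN + r"([^<]+)</llmlingua>")
PROPOSED = re.compile(
    TAG_OPEN + r"((?:[^<]*(?:<(?!/llmlingua>)[^>]*>)?)*?)</llmlingua>"
)


def segments(pattern, prompt):
    """Return the body text (group 5) of every matched <llmlingua> block."""
    return [m.group(5) for m in pattern.finditer(prompt)]


prompt = (
    "<llmlingua, compress=False>Speaker 4:</llmlingua>"
    "<llmlingua, rate=0.6> a promotion <tag> and a second time </tag>"
    " and customers.</llmlingua>"
)

# The original pattern silently drops the block that contains <tag>.
print(segments(ORIGINAL, prompt))

# The proposed pattern keeps both blocks and treats <tag> as plain text.
print(segments(PROPOSED, prompt))
```

A check like this fails on the original pattern because the second segment is missing, which mirrors the truncated compression output shown above.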

Logs

No response

Additional Information

No response

@Imatgay Imatgay added the bug Something isn't working label Nov 20, 2024