To get this working in more places, more experimentation with token reduction is needed. How stripped down/minified can we get the HTML without causing reliability issues?
This isn't as straightforward as it seems, and many off-the-shelf tools are focused on different problems:
Minifiers seem to confuse GPT-4 a fair bit, so using off-the-shelf obfuscators/minifiers isn't the right solution here.
A lot of tools exist to sanitize HTML, but they often remove class names, etc. that are important to keep as hints (and that will be important if we get to the point of generating XPath).
It seems like the right approach is going to be an allow/disallow-list-based approach that extends what's already been done in lxml.clean.
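As a rough illustration of the allow/disallow list idea, here is a minimal sketch. It uses only the standard library's html.parser rather than lxml, to stay dependency-free, and the specific lists (DROP_TAGS, KEEP_ATTRS) and the reduce_html helper are hypothetical names chosen for the example, not anything that exists in the project.

```python
import html
import re
from html.parser import HTMLParser

# Hypothetical lists for illustration; real ones would need experimentation.
DROP_TAGS = {"script", "style", "svg", "noscript"}  # dropped along with their content
KEEP_ATTRS = {"class", "id", "href"}                # attributes kept as hints
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TokenReducer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped tag

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        # Keep only allow-listed attributes that actually carry a value.
        kept = "".join(
            f' {name}="{html.escape(value, quote=True)}"'
            for name, value in attrs
            if name in KEEP_ATTRS and value
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth or not data.strip():
            return  # drop content of removed tags and whitespace-only text
        # Collapse whitespace runs; every tag is kept, so structure is unchanged.
        self.out.append(html.escape(re.sub(r"\s+", " ", data), quote=False))


def reduce_html(source: str) -> str:
    parser = TokenReducer()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)


sample = (
    '<div class="item" style="color:red" onclick="x()">'
    "<script>var a = 1;</script>"
    '<a href="/p/1" data-id="9">Widget</a></div>'
)
print(reduce_html(sample))
```

Unlike a general sanitizer, this keeps class and id because they are on the allow list, and it never removes a container tag that is not on the disallow list.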
jamesturk changed the title from "Automatic Token Reduction" to "Better Automatic Token Reduction" on Jun 6, 2023
Leaving myself a note that it is probably desirable to have a mode that does not modify the page structure (i.e. no deletion of container tags), so that XPath can remain valid if we go down that route. This isn't a hard and fast rule, but ideally it should be configurable.
This means that
<div>
  <div>
    <div>
      Content
    </div>
  </div>
</div>
can't be simplified.
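A minimal sketch of what such a configurable mode could look like, assuming well-formed markup so that the standard library's xml.etree.ElementTree can parse it. The simplify function and its preserve_structure flag are hypothetical names for illustration only.

```python
import xml.etree.ElementTree as ET


def _is_bare_wrapper(elem):
    # A wrapper here means: exactly one child element, no attributes,
    # and no text of its own before or after that child.
    return (
        len(elem) == 1
        and not elem.attrib
        and not (elem.text or "").strip()
        and not (elem[0].tail or "").strip()
    )


def simplify(elem, preserve_structure=True):
    """Collapse attribute-less single-child wrappers unless structure must be kept."""
    if preserve_structure:
        return elem  # leave every container in place so XPath stays valid
    while _is_bare_wrapper(elem):
        tail = elem.tail
        elem = elem[0]    # replace the wrapper with its only child
        elem.tail = tail  # keep any text that followed the wrapper
    for index, child in enumerate(list(elem)):
        elem[index] = simplify(child, preserve_structure=False)
    return elem


src = "<div><div><div>Content</div></div></div>"
kept = ET.tostring(simplify(ET.fromstring(src)), encoding="unicode")
collapsed = ET.tostring(
    simplify(ET.fromstring(src), preserve_structure=False), encoding="unicode"
)
print(kept)
print(collapsed)
```

With the default preserve_structure=True the nested divs come back untouched; with it turned off they collapse to a single div, which saves tokens but invalidates any XPath that relied on the original nesting.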
Positional counts should be avoided in generated XPath at almost any cost (e.g. never generate //table[4]), but they could run into similar issues, so maybe a second toggle for this?
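To make the "no counts" idea concrete, here is a small sketch that builds a selector from id or class and refuses to fall back to a positional index. The attribute_xpath helper is a hypothetical name, the input is assumed to be well-formed, and the expressions use the limited XPath subset supported by the standard library's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET


def attribute_xpath(elem):
    """Return an XPath selecting by id or class, or None if neither is present.

    Assumes attribute values contain no single quotes (not handled in this sketch).
    """
    if elem.get("id"):
        return f".//{elem.tag}[@id='{elem.get('id')}']"
    if elem.get("class"):
        return f".//{elem.tag}[@class='{elem.get('class')}']"
    return None  # deliberately no fallback to something like //table[4]


doc = ET.fromstring("<body><table/><table class='results'><tr/></table></body>")
first, second = doc.findall("table")
print(attribute_xpath(second))
print(attribute_xpath(first))
```

This is also why keeping class names during token reduction matters: without them, the only way to address the second table would be by position.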