Better Automatic Token Reduction #49

jamesturk · 2023-06-06T22:05:28Z

To get this working in more places, more experimentation with token reduction is needed. How stripped down/minified can we get the HTML without causing reliability issues?

This isn't as straightforward as it seems, and many off-the-shelf tools focus on different problems:

Minifiers seem to confuse GPT-4 a fair bit, so using off-the-shelf obfuscators/minifiers isn't the right solution here.
Many tools exist to sanitize HTML, but they often remove class names and similar attributes that are worth keeping as hints (and that will matter if we get to the point of generating XPath).

It seems like the right approach is going to be an allow/disallow-list-based one that extends what's already been done with lxml.html.clean.
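As a rough illustration of the allow/disallow idea, here is a minimal sketch that uses only the standard library's `html.parser` rather than lxml, so it is not the project's implementation. The names `DROP_TAGS`, `KEEP_ATTRS`, `Reducer`, and `reduce_html` are hypothetical, and the lists themselves are placeholder assumptions that would need tuning against real pages. It drops disallowed tags together with their contents, strips every attribute not on the allow list, discards comments, and collapses whitespace runs, while leaving container tags in place.

```python
import html
import re
from html.parser import HTMLParser

# Hypothetical defaults for illustration only; real lists would need tuning.
DROP_TAGS = {"script", "style", "svg", "noscript"}  # removed along with their contents
KEEP_ATTRS = {"class", "id", "href"}                # every other attribute is stripped


class Reducer(HTMLParser):
    """Re-emits HTML, skipping disallowed tags and non-allow-listed attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped tag

    def _open_tag(self, tag, attrs):
        parts = []
        for name, value in attrs:
            if name not in KEEP_ATTRS:
                continue
            if value is None:
                parts.append(f" {name}")
            else:
                parts.append(f' {name}="{html.escape(value, quote=True)}"')
        return f"<{tag}{''.join(parts)}>"

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
        elif not self.skip_depth:
            self.out.append(self._open_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> get no separate end tag.
        if tag not in DROP_TAGS and not self.skip_depth:
            self.out.append(self._open_tag(tag, attrs))

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            if self.skip_depth:
                self.skip_depth -= 1
        elif not self.skip_depth:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Collapse whitespace runs; note this would also affect <pre> content.
            self.out.append(html.escape(re.sub(r"\s+", " ", data), quote=False))

    # Comments and doctype declarations are dropped because their handlers
    # are left as the no-op defaults.


def reduce_html(html_text: str) -> str:
    parser = Reducer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


sample = '<div class="card" style="color:red"><script>var a = 1;</script><p onclick="f()">Hi   there</p></div>'
print(reduce_html(sample))  # <div class="card"><p>Hi there</p></div>
```

Keeping `class` and `id` is deliberate here: those are the hints that generic sanitizers tend to throw away.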


jamesturk · 2023-06-07T01:15:10Z

Leaving myself a note that it is probably desirable to have a mode that does not modify the page structure (i.e. no deletion of container tags), so that XPath can remain valid if we go down that route. This isn't a hard and fast rule, but ideally it should be configurable.

This means

<div>
   <div>  
   <div>  
     Content
   </div>
   </div>
</div>

Can't be simplified.
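To make the trade-off concrete, here is a small sketch of what such a toggle could look like. It uses the standard library's `xml.etree.ElementTree` on a well-formed snippet purely for illustration; `simplify`, `preserve_structure`, and `is_bare_wrapper` are hypothetical names, and the `class="price"` attribute is added to the example so there is something to select by. With the toggle off, attribute-less wrapper divs are unwrapped and a structural path stops matching, while an attribute-based path still works.

```python
import copy
import xml.etree.ElementTree as ET

PAGE = '<body><div><div><div class="price">Content</div></div></div></body>'


def is_bare_wrapper(el):
    """True for a <div> with no attributes, no text of its own, and one child element."""
    return (
        el.tag == "div"
        and not el.attrib
        and len(el) == 1
        and not (el.text or "").strip()
        and not (el[0].tail or "").strip()
    )


def _unwrap(parent):
    # Work bottom-up so nested wrappers collapse in a single pass.
    for i, child in enumerate(list(parent)):
        _unwrap(child)
        if is_bare_wrapper(child):
            inner = child[0]
            inner.tail = child.tail
            parent[i] = inner


def simplify(root, preserve_structure=True):
    """Return a copy of root; unwrap bare wrapper divs only when allowed to."""
    root = copy.deepcopy(root)
    if not preserve_structure:
        _unwrap(root)
    return root


original = ET.fromstring(PAGE)
structural = "./div/div/div"            # depends on the exact nesting
by_class = ".//div[@class='price']"     # depends only on the class hint

kept = simplify(original, preserve_structure=True)
flat = simplify(original, preserve_structure=False)

print(len(kept.findall(structural)), len(flat.findall(structural)))
print(len(kept.findall(by_class)), len(flat.findall(by_class)))
print(ET.tostring(flat, encoding="unicode"))
```

The flattened version is cheaper in tokens, but any path generated against it would not apply to the original page, which is the argument for making this configurable.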

Counts (positional indexes) should be avoided in generated XPath at almost any cost (e.g. never generate //table[4]), but they could run into similar issues whenever elements are removed. Maybe a second toggle for this?
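A quick sketch of why positional predicates are fragile under any reduction step that deletes elements. This again uses the standard library's `xml.etree.ElementTree` on a made-up snippet; `drop_empty_tables` is a hypothetical stand-in for whatever cleanup pass ends up removing nodes.

```python
import xml.etree.ElementTree as ET

PAGE = (
    "<body>"
    '<table id="nav"><tr><td>menu</td></tr></table>'
    "<table></table>"
    '<table id="data"><tr><td>42</td></tr></table>'
    "</body>"
)


def drop_empty_tables(root):
    """Hypothetical cleanup step: remove direct <table> children with no content."""
    for table in root.findall("table"):
        if len(table) == 0 and not (table.text or "").strip():
            root.remove(table)


root = ET.fromstring(PAGE)
print(root.find("./table[3]").get("id"))         # positional path, before cleanup

drop_empty_tables(root)
print(root.find("./table[3]"))                   # positional path, after cleanup
print(root.find("./table[@id='data']").get("id"))  # attribute path, after cleanup
```

After the empty table is dropped, the third-table path no longer resolves, while the `id`-based path is unaffected.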

jamesturk changed the title ~~Automatic Token Reduction~~ Better Automatic Token Reduction Jun 6, 2023

jamesturk added the planned enhancement New feature or request label Jun 6, 2023

jamesturk added this to the 0.6.0 milestone Jun 7, 2023
