We know they have similar overall accuracy on some of the scenarios.

What is the "per-arg-class" accuracy for each? For instance, there are different classes/batches of Dominoes trials -- some with 2 dominoes, some with 4, some with occluders, some with missing dominoes, etc. Are there particular arg classes where people and models do especially well, or differ from each other?

This analysis is critical for improving the benchmark: we want to make sure that models are not doing well simply by classifying which arg set was used to generate a trial and then guessing the average outcome for that class (which may not be 50%, since we didn't "microbalance" the scenarios).
For example, compare humans, DPI, and FitVid.
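The check described above can be sketched in a few lines. This is a minimal, hypothetical example (the trial record format and the `arg_class`/`label`/`pred` field names are assumptions, not the benchmark's actual schema): for each arg class, it computes a model's accuracy, the class's base rate of positive outcomes, and the accuracy a model would get by merely recognizing the arg class and guessing its majority outcome.

```python
# Sketch of the proposed per-arg-class analysis. The trial schema here is
# hypothetical: each trial is a dict with an 'arg_class' tag (e.g.
# "2_dominoes", "occluder"), a ground-truth boolean 'label', and a model's
# boolean 'pred'.
from collections import defaultdict


def per_class_report(trials):
    """Return per-arg-class accuracy, base rate, and majority baseline."""
    by_class = defaultdict(list)
    for t in trials:
        by_class[t["arg_class"]].append(t)

    report = {}
    for cls, ts in by_class.items():
        n = len(ts)
        acc = sum(t["pred"] == t["label"] for t in ts) / n
        pos_rate = sum(t["label"] for t in ts) / n
        # A model that only classifies the arg set and guesses that class's
        # average outcome would score max(pos_rate, 1 - pos_rate).
        baseline = max(pos_rate, 1 - pos_rate)
        report[cls] = {
            "n": n,
            "acc": acc,
            "base_rate": pos_rate,
            "majority_baseline": baseline,
        }
    return report
```

A model whose per-class accuracy hovers near `majority_baseline` in every class is consistent with the class-guessing strategy we want to rule out; genuine physical prediction should beat that baseline within classes.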