Automated statement labeling #10
Originally posted by @amirrr in Watts-Lab/commonsense-platform#86 (comment)

Dimensions of a statement and their definitions:
- behavior
- everyday
- figure_of_speech
- judgment
- opinion
- reasoning

Also: category
Which knowledge category or categories describe this claim? (choose all that apply)
|
Note: we should check labels against the original version of the statement, because cleaned statements might need different labels. Once we have a good labeling strategy, we should label all the new clean statements freshly (#9). |
Non-GPT-based approach
- Feature-based model
- Category-based model
- Could try mean or mode for a baseline. Model type can be random forest or XGBoost (see the sketch below).

GPT approach
- Just ask GPT the questions? |
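As a rough illustration of the feature-based option above, here is a minimal sketch that fits a mode baseline and a random forest on statement embeddings for one dimension. The column names, embedding format, and the XGBoost swap-in are assumptions, not the project's actual code.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def evaluate_dimension(df: pd.DataFrame, dimension: str, seed: int = 0) -> dict:
    """Compare a mode baseline with a random forest on sentence embeddings
    for one binary dimension (e.g. 'behavior')."""
    X = np.vstack(df["embeddings"].to_numpy())  # assumed: one embedding vector per row
    y = df[dimension].to_numpy()
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.1, random_state=seed, stratify=y
    )

    baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)  # mode baseline
    model = RandomForestClassifier(n_estimators=300, random_state=seed).fit(X_tr, y_tr)
    # To try XGBoost instead: model = xgboost.XGBClassifier().fit(X_tr, y_tr)

    return {
        "baseline_f1": f1_score(y_te, baseline.predict(X_te)),
        "model_f1": f1_score(y_te, model.predict(X_te)),
    }
```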
Here's the data for the non-GPT approach (leaving out the category "Society and social sciences" since it resulted in the most accurate model).
|
Great, so we seem to need to do better, hahaha. Also, I think it's fine to trim these to 3 decimal places (e.g., -0.333), and we probably only need … Perhaps we can look at columns like … (where each one shows the …) |
I edited your comment a bit more to indicate what I was thinking. (For some reason GitHub doesn't send notifications for edits.) |
The table is complete now. I ran the GPT labeling against the first 2,000 statements. Refer to this issue for more details about the prompt. |
Thanks. Interesting. We're not doing very well. Just so I understand, how are you doing the score calculation for GPT? Would you mind making a second table that shows F1 scores for each of these as well? |
|
Interesting! Would you mind doing that for the others too? Just to see if our scores there get a lot better? |
These are the Jaccard accuracy, F1, and global $R^2$ scores:
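For reference, a minimal sketch of how per-dimension Jaccard accuracy and F1 can be computed with scikit-learn; the label arrays here are made up for illustration and are not the project's actual results.

```python
import numpy as np
from sklearn.metrics import f1_score, jaccard_score

# Illustrative binary labels for one dimension (not real project data).
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

print("Jaccard:", jaccard_score(y_true, y_pred))  # intersection over union of positives
print("F1     :", f1_score(y_true, y_pred))       # harmonic mean of precision and recall
```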
|
How interesting. So none of these is really good enough for everything, though most are OK on some of the features. One more way we could look at this: for each of these samples, can you balance the data so that the training data have an equal number of each category for each feature? After taking out a test split, take the smaller group and downsample the larger group to the number of items the smaller group has. |
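A minimal sketch of the balancing step described above, assuming a pandas DataFrame with one binary column per feature (the column names and layout are assumptions):

```python
import pandas as pd


def balance_training_data(train_df: pd.DataFrame, feature: str, seed: int = 0) -> pd.DataFrame:
    """After the test split has been taken out, downsample the larger group
    so both classes of `feature` have equal counts."""
    pos = train_df[train_df[feature] == 1]
    neg = train_df[train_df[feature] == 0]
    smaller, larger = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    matched = larger.sample(n=len(smaller), random_state=seed)
    return pd.concat([smaller, matched]).sample(frac=1, random_state=seed)  # shuffle rows
```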
@joshnguyen99 — can you also put updates here when you have them? Also, we have a project for this, https://github.com/orgs/Watts-Lab/projects/27/views/5, and we can start tracking some of our efforts there. |
Sorry for the late message—there were some bugs in my training scripts but I managed to fix them.
Below is the performance of the best model for each feature. I held out 10% of the dataset, stratified by the predicted feature.
We have a minor improvement in the F1 score for … Compared to GPT it's pretty much the same. But I don't think we used the same test set, so it might be worth syncing up. I have also added you to the repo for my fine-tuning scripts (https://github.com/joshnguyen99/commonsense-playground), and also to my wandb project to keep track of fine-tuning if you're interested. |
Great, thanks! Can you move that repo into the Watts-Lab org — we like to keep stuff centralized where possible. Interesting that the simple models still seem to be doing best overall. I think RF is random forest with embeddings as features. Why don't we set up common train-test split code, so we can do repeatable splits? I think @amirrr was working on running RF and XGBoost with a balanced training set, which we think will dramatically help on the … |
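As a sketch of what shared, repeatable split code could look like (the statement ID column, file paths, and seed are assumptions, not an existing part of the repo):

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def make_common_split(df: pd.DataFrame, label: str, test_size: float = 0.1, seed: int = 42):
    """Produce the same train/test split every time: fixed seed, stratified
    on the label, with statement IDs written to disk so everyone reuses them."""
    train_ids, test_ids = train_test_split(
        df["statement_id"],           # hypothetical ID column
        test_size=test_size,
        random_state=seed,            # fixed seed -> repeatable
        stratify=df[label],           # keep label proportions equal across splits
    )
    train_ids.to_csv(f"splits/{label}_train_ids.csv", index=False)
    test_ids.to_csv(f"splits/{label}_test_ids.csv", index=False)
    return set(train_ids), set(test_ids)
```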
One more thing: we are doing a lot of LM-related stuff in … |
@markwhiting — Sure, I can move the code to Watts-Lab! For now, I will commit it to a folder within …
OK, I might have found something in @amirrr's code that led to very different results from mine. In …:

```python
for outcome in outcomes:
    # Training set: every statement NOT in "Society and social sciences";
    # test set: exactly the statements in that one category.
    X_train = merged_df[merged_df['category'] != 'Society and social sciences'].embeddings
    y_train = merged_df[merged_df['category'] != 'Society and social sciences'][outcome]
    X_test = merged_df[merged_df['category'] == 'Society and social sciences'].embeddings
    y_test = merged_df[merged_df['category'] == 'Society and social sciences'][outcome]
```

For example, if …
This is not entirely random, and we have very different percentages of positive examples in the training and test sets:

| Feature | Train positive | Test positive |
| --- | --- | --- |
| behavior | 63.4% (2503/3950) | 95.4% (436/457) |
| everyday | 63.7% (2518/3950) | 57.1% (261/457) |
| figure_of_speech | 20.7% (818/3950) | 7.9% (36/457) |
| judgment | 69.9% (2763/3950) | 93.7% (428/457) |
| opinion | 60.6% (2394/3950) | 89.9% (411/457) |
| reasoning | 63.1% (2494/3950) | 52.7% (241/457) |

In other words, the training and test subsets aren't stratified. |
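A small sketch of the kind of check behind the table above: compare the positive-example rate for each dimension in the train and test subsets (column names are assumed):

```python
import pandas as pd

OUTCOMES = ["behavior", "everyday", "figure_of_speech", "judgment", "opinion", "reasoning"]


def positive_rates(train_df: pd.DataFrame, test_df: pd.DataFrame) -> pd.DataFrame:
    """Positive-example rate per dimension in each subset; large train/test
    gaps indicate the split is not stratified."""
    rows = []
    for outcome in OUTCOMES:
        rows.append({
            "feature": outcome,
            "train": f"{train_df[outcome].mean():.1%} ({int(train_df[outcome].sum())}/{len(train_df)})",
            "test": f"{test_df[outcome].mean():.1%} ({int(test_df[outcome].sum())}/{len(test_df)})",
        })
    return pd.DataFrame(rows)
```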
Got it. We are going to fix this by sorting statements into two groups based on their embeddings and cosine similarity, and then matching them according to whatever we are trying to model. This should make sure the training and testing groups are more balanced, which could improve accuracy. I will share the train and test splits in the same repository. |
Results with the balanced dataset test:
|
@amirrr and @markwhiting, I have uploaded the 6 roberta-large models (for predicting dimensions) to HuggingFace. They can be found on our lab's HF page: https://huggingface.co/CSSLab. You can try the Inference API on the right-hand side. |
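For anyone who wants to query these programmatically, a minimal sketch with the transformers pipeline; the exact model IDs under the CSSLab page aren't listed in this thread, so the name below is a placeholder:

```python
from transformers import pipeline

# "CSSLab/<dimension-model>" is a placeholder; substitute one of the
# actual model IDs from https://huggingface.co/CSSLab.
classifier = pipeline("text-classification", model="CSSLab/<dimension-model>")
print(classifier("People usually eat breakfast in the morning."))
```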
Here's the performance of a multi-label classifier. I used the non-chat version of TinyLlama (1.1B params) and fine-tuned it once using the multi-label version of our dataset.
Overall this looks better than the RoBERTa models used before, but not significantly. The perk here is that this is only one multi-label model instead of six binary classifiers. (I also fine-tuned LLaMA-2 7B for this task, but it actually performs worse than TinyLlama. I suspect it's mostly because the dataset is small relative to the model size, as evidenced by the relatively high variance during fine-tuning.) |
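For context, a minimal sketch of how a single multi-label classifier over the six dimensions can be set up with transformers; the model ID is a placeholder and none of this reflects the exact fine-tuning configuration described above:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

DIMENSIONS = ["behavior", "everyday", "figure_of_speech", "judgment", "opinion", "reasoning"]
MODEL_ID = "TinyLlama/..."  # placeholder for the non-chat TinyLlama checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID,
    num_labels=len(DIMENSIONS),
    problem_type="multi_label_classification",  # sigmoid + BCE loss, one logit per dimension
)

# One statement in, six independent probabilities out (one per dimension).
inputs = tokenizer("People usually eat breakfast in the morning.", return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)[0]
print(dict(zip(DIMENSIONS, probs.tolist())))
```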
Thanks, any proposals to get a better result? I feel like at this point using something like that for some properties and using @amirrr's latest models for others gets us the best overall rate, though it would be great if we could put it all into a single model like the design you have, while still achieving winning quality on every dimension. |
Moving this to statements repo |
Check how GPT labels statements on our labeling task. Use $\text{Global } R^2 = 1 - \frac{\mathrm{MSE}(\text{prediction},\, \text{actual})}{\mathrm{MSE}(\text{baseline},\, \text{actual})}$ to score, and we can visualize in Observable.
Would be nice to see how we do on each question.
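A direct transcription of the Global $R^2$ formula above; the baseline here is taken to be the mean of the actual labels, which is an assumption since the issue doesn't pin the baseline down:

```python
import numpy as np


def global_r2(prediction, actual, baseline) -> float:
    """Global R^2 = 1 - MSE(prediction, actual) / MSE(baseline, actual)."""
    mse = lambda a, b: float(np.mean((np.asarray(a, dtype=float) - np.asarray(b, dtype=float)) ** 2))
    return 1 - mse(prediction, actual) / mse(baseline, actual)


# Toy example (not real project numbers).
actual = np.array([1, 0, 1, 1, 0, 1])
prediction = np.array([1, 0, 0, 1, 0, 1])
baseline = np.full(actual.shape, actual.mean())  # mean-of-labels baseline (assumed)
print(global_r2(prediction, actual, baseline))
```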