
Automated statement labeling #10

markwhiting opened this issue Oct 3, 2023 · 24 comments
@markwhiting
Member

markwhiting commented Oct 3, 2023

Check how GPT labels statements on our labeling task. Use $\text{Global } R^2 = 1 - \frac{\mathrm{MSE}(\text{prediction},\,\text{actual})}{\mathrm{MSE}(\text{baseline},\,\text{actual})}$ to score, and we can visualize it in Observable.

Would be nice to see how we do on each question.
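For reference, here's a minimal sketch of that scoring in Python (assuming binary 0/1 labels and a baseline that predicts a constant such as the training-set mode; purely illustrative, not the actual script):

```python
import numpy as np

def global_r2(prediction, actual, baseline):
    """Global R^2 = 1 - MSE(prediction, actual) / MSE(baseline, actual)."""
    prediction, actual, baseline = (np.asarray(x, dtype=float) for x in (prediction, actual, baseline))
    return 1 - np.mean((prediction - actual) ** 2) / np.mean((baseline - actual) ** 2)

# Toy example: the baseline predicts the majority label (1) everywhere.
actual = np.array([1, 0, 1, 1, 0])
prediction = np.array([1, 0, 0, 1, 0])
baseline = np.ones_like(actual)
print(global_r2(prediction, actual, baseline))  # 1 - 0.2 / 0.4 = 0.5
```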

@markwhiting changed the title from "Statement labeling" to "GPT Statement labeling" on Oct 5, 2023
@markwhiting
Member Author

markwhiting commented Oct 11, 2023

Originally posted by @amirrr in Watts-Lab/commonsense-platform#86 (comment)

Dimensions of a statement and their definitions:

behavior

  • Social: it refers to beliefs, perceptions, preferences, and socially constructed rules that govern human experience; it can be “real” or opinion, but is intrinsically of human origins. e.g., I exist and am the same person I was yesterday. He yelled at me because he was angry. There are seven days in the week.
  • Physical: it refers to objective features of the world as described by, say, physics, biology, engineering, mathematics, or other natural rules; it can be measured empirically, or derived logically. e.g., Men on average are taller than women. The Earth is the third planet from the Sun. Ants are smaller than elephants.

everyday

  • Everyday: people encounter, or could encounter, situations like this in the course of their ordinary, everyday experiences, e.g., Touching a hot stove will burn you. Commuting at rush hour takes longer. It is rude to jump the line.
  • Abstract: this claim refers to regularities or conclusions that cannot be observed or arrived at solely through individual experience, e.g., Capitalism is a better economic system than Communism. Strict gun laws save lives. God exists.

figure_of_speech

  • Figure of speech: it contains an aphorism, metaphor, or hyperbole, e.g., Birds of a feather flock together. A friend to all is a friend to none.
  • Literal language: it is plain and ordinary language that means exactly what it says. e.g., The sky is blue. Elephants are larger than dogs. Abraham Lincoln was a great president.

judgment

  • Normative: it refers to a judgment, belief, value, social norm, or convention. e.g., If you are going to the office, you should wear business attire, not a bathing suit. Treat others how you want them to treat you. Freedom is a fundamental human right.
  • Positive: it refers to something in the world such as an empirical regularity or scientific law, e.g., hot things will burn you; the sun rises in the east and sets in the west.

opinion

  • Opinion: it is something that someone might think is true, or wants others to think is true, but can’t be demonstrated to be objectively correct or incorrect; it is inherently subjective. e.g., FDR was the greatest US president of the 20th Century. The Brooklyn Bridge is prettier than the Golden Gate. Vaccine mandates are a tolerable imposition on individual freedom.
  • Factual: it is something that can be demonstrated to be correct or incorrect, independently of anyone’s opinion, e.g., the earth is the third planet from the sun (this is correct and we know it is correct), Obama was the 24th president of the United States (this is incorrect, but we know it’s incorrect). It will be sunny next Tuesday (we don’t yet know if this is correct, but we will be able to check in the future).

reasoning

  • Knowledge: the claim refers to some observation about the world; it may be true or false, opinion or fact, subjective or objective e.g., The sun rises in the east and sets in the west. Dogs are nicer than cats. Glasses break when they are dropped.
  • Reasoning: the claim presents a conclusion that is arrived at by combining knowledge and logic, e.g., The sun is in the east, therefore it is morning. My dog is wagging its tail, therefore it is happy. The glass fell off the table, therefore it will break and the floor will become wet.

Also:

category

Which knowledge category or categories describe this claim? (choose all that apply)

  • General reference
  • Culture and the arts
  • Geography and places
  • Health and fitness
  • History and events
  • Human activities
  • Mathematics and logic
  • Natural and physical sciences
  • People and self
  • Philosophy and thinking
  • Religion and belief systems
  • Society and social sciences
  • Technology and applied sciences

@markwhiting
Member Author

Note, we should check labels against the original version of the statement, because cleaned statements might need different labels.

Once we have a good labeling strategy, we should label all the new clean statements freshly (#9).

@markwhiting
Member Author

markwhiting commented Oct 11, 2023

Non-GPT-based approach

Feature based model

```
# leave one category out
for each feature in [behavior, ...]:
    training_data = data[category != LOO_category]
    test_data     = data[category == LOO_category]
    model:    feature ~ embedding on training_data
    predict:  feature ~ embedding on test_data
    baseline: mode(feature) on training_data
```

Category based model

leave one design point out (a particular combo of features) 
multinomial regression: category ~ embedding
(same style testing regime)

Could try mean or mode for baseline.

Model type can be random forest or XGBoost
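A minimal runnable sketch of the feature-based variant (the `embedding`/`category` column names, the held-out category, and the random forest settings are assumptions, not the actual pipeline):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["behavior", "everyday", "figure_of_speech", "judgment", "opinion", "reasoning"]
LOO_CATEGORY = "Society and social sciences"  # assumed held-out knowledge category

def mse(a, b):
    return np.mean((np.asarray(a) - np.asarray(b)) ** 2)

def loo_global_r2(data: pd.DataFrame) -> dict:
    """Leave-one-category-out: train on all other categories, test on LOO_CATEGORY."""
    train = data[data["category"] != LOO_CATEGORY]
    test = data[data["category"] == LOO_CATEGORY]
    X_train, X_test = np.vstack(train["embedding"]), np.vstack(test["embedding"])
    scores = {}
    for feature in FEATURES:
        y_train, y_test = train[feature].to_numpy(), test[feature].to_numpy()
        model = RandomForestClassifier(n_estimators=500, random_state=0)  # or an XGBoost classifier
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        baseline = np.full_like(y_test, pd.Series(y_train).mode().iloc[0])  # mode baseline; mean also possible
        scores[feature] = 1 - mse(pred, y_test) / mse(baseline, y_test)
    return scores
```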

GPT approach

Just ask GPT the questions?
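At its simplest, something like the sketch below (the prompt wording and model name are placeholders, not a settled design):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def label_statement(statement: str, dimension: str, option_a: str, option_b: str) -> str:
    prompt = (
        f'Statement: "{statement}"\n'
        f"For the dimension '{dimension}', answer with exactly one word: "
        f"'{option_a}' or '{option_b}'."
    )
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# e.g. label_statement("Touching a hot stove will burn you.", "everyday", "Everyday", "Abstract")
```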

@amirrr
Collaborator

amirrr commented Oct 12, 2023

Here's the data for the non-GPT approach (leaving out the category "Society and social sciences" since it resulted in the most accurate model).

| Feature | Random Forest | XGBoost | GPT |
| --- | --- | --- | --- |
| behavior | -0.333 | -1.475 | 0.097 |
| everyday | 0.092 | 0.476 | 0.056 |
| figure_of_speech | 0.0 | 0.156 | 0.105 |
| judgment | 0.034 | -0.487 | -0.097 |
| opinion | -0.022 | -0.126 | 0.165 |
| reasoning | 0.028 | 0.502 | -0.128 |

@markwhiting
Member Author

markwhiting commented Oct 12, 2023

Great, so we seem to need to do better, hahaha. Also, I think it's fine to trim these to 3 decimal places (e.g., -0.333), and we probably only need $R^2$.

Perhaps we can look at columns like this (where each one shows the $R^2$ for that model for each feature):

  • Random forest
  • XGBoost
  • GPT

@markwhiting
Member Author

I edited your comment a bit more to indicate what I was thinking. (For some reason GitHub doesn't send notifications for edits.)

@amirrr
Collaborator

amirrr commented Oct 15, 2023

The table is complete now. I ran the GPT labeling against the first 2,000 statements. Refer to this issue for more details about the prompt.

@markwhiting
Member Author

Thanks. Interesting. We're not doing very well.

Just so I understand, how are you doing the score calculation for GPT?

Would you mind making a second table that shows F1 scores for each of these as well?

@amirrr
Collaborator

amirrr commented Oct 16, 2023

| Feature | GPT Mean | GPT Mode |
| --- | --- | --- |
| behavior | 0.498 | 0.097 |
| everyday | 0.509 | 0.056 |
| figure_of_speech | 0.105 | 0.000 |
| judgment | 0.529 | 0.035 |
| opinion | 0.497 | -0.022 |
| reasoning | 0.585 | 0.028 |

@markwhiting
Member Author

Interesting! Would you mind doing that for the others too? Just to see if our scores there get a lot better?

@amirrr
Collaborator

amirrr commented Oct 16, 2023

These are the Jaccard accuracy, F1, and global $R^2$ scores (with the baseline being the average of scores) for the Random Forest (RF) and XGBoost methods on labeling statements.

| Feature | RF Jaccard | RF F1 | RF Global $R^2$ | XGBoost Jaccard | XGBoost F1 | XGBoost Global $R^2$ | GPT F1 | GPT Global $R^2$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| behavior | 0.934 | 0.966 | -0.497 | 0.950 | 0.950 | -1.594 | 0.794 | 0.498 |
| everyday | 0.511 | 0.677 | -0.733 | 0.674 | 0.674 | 0.083 | 0.791 | 0.509 |
| figure_of_speech | 0.028 | 0.054 | -0.055 | 0.182 | 0.182 | 0.084 | 0.402 | 0.105 |
| judgment | 0.939 | 0.968 | -0.031 | 0.963 | 0.963 | -0.588 | 0.772 | 0.529 |
| opinion | 0.891 | 0.943 | -0.184 | 0.940 | 0.940 | -0.252 | 0.769 | 0.497 |
| reasoning | 0.438 | 0.609 | -0.826 | 0.623 | 0.623 | 0.056 | 0.776 | 0.585 |

@markwhiting
Member Author

How interesting. So none of these is really good enough for everything, though most are OK on some of the features.

One more way we could look at this: for each of these samples, can you balance the data so that the training data have an equal number of each label value for each feature?

After taking out a test split, take the smaller group and downsample the larger group to the number of items the smaller group has.
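A sketch of that balancing step (assuming binary 0/1 feature columns and a fixed seed for repeatability; the names here are illustrative):

```python
import pandas as pd

def balance_training_data(train: pd.DataFrame, feature: str, seed: int = 0) -> pd.DataFrame:
    """After the test split is removed, downsample the majority class of `feature`
    so both classes appear equally often in the training data."""
    groups = [group for _, group in train.groupby(feature)]
    n = min(len(group) for group in groups)
    balanced = pd.concat(group.sample(n=n, random_state=seed) for group in groups)
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)  # shuffle rows
```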

@markwhiting
Member Author

@joshnguyen99 — can you also put updates here when you have them. Also, we have a project for this https://github.com/orgs/Watts-Lab/projects/27/views/5 and we can start tracking some of our efforts there.

@markwhiting changed the title from "GPT Statement labeling" to "Automated statement labeling" on Oct 25, 2023
@joshnguyen99

Sorry for the late message—there were some bugs in my training scripts but I managed to fix them.

  • I tried two models, RoBERTa-base (124M) and RoBERTa-large (355M), and added a binary classification module on top of each. RoBERTa-large unsurprisingly outperformed in all cases.
  • For each model and each feature, I searched among 4 learning rates and chose the one with the lowest validation loss.
  • Each model was trained for 10 epochs, and I employed early stopping to avoid overfitting (roughly the setup sketched below).
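Roughly, the fine-tuning setup looks like this with Hugging Face `transformers` (`train_ds`/`val_ds` are placeholders for the tokenized splits, and the hyperparameters shown are just one point of the search, not the final ones):

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

model_name = "roberta-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)  # used to tokenize train_ds / val_ds (not shown)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

args = TrainingArguments(
    output_dir="roberta-large-behavior",   # one run per feature
    num_train_epochs=10,
    learning_rate=2e-5,                    # one of the learning rates in the search
    eval_strategy="epoch",                 # `evaluation_strategy` in older transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,                # placeholder: tokenized training split
    eval_dataset=val_ds,                   # placeholder: tokenized validation split
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```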

Below is the performance by the best model for each feature. I held out 10% of the dataset, stratified by the predicted feature.

| Feature | Precision | Recall | F1 | AUROC |
| --- | --- | --- | --- | --- |
| behavior | 0.768 | 0.833 | 0.799 | 0.725 |
| everyday | 0.630 | 1.000 | 0.773 | 0.474 |
| figure_of_speech | 0.641 | 0.294 | 0.403 | 0.741 |
| judgment | 0.790 | 0.897 | 0.840 | 0.749 |
| opinion | 0.635 | 1.000 | 0.777 | 0.564 |
| reasoning | 0.619 | 1.000 | 0.765 | 0.608 |

We have a minor improvement in the F1 score for figure_of_speech and reasoning compared to RF (what model was this again, @amirrr?).

Compared to GPT it's pretty much the same. But I don't think we used the same test set, so it might be worth it to sync up.

I have also added you to the repo for my finetuning scripts (https://github.com/joshnguyen99/commonsense-playground). Also added you to my wandb project to keep track of finetuning if you're interested.

@markwhiting
Member Author

Great, thanks! Can you move that repo into the Watts-Lab org — we like to keep stuff centralized where possible.

Interesting that the simple models seem to be doing best overall still.

I think RF is random forest with embedding as features.

Why don't we set up common train-test split code, so we can do repeatable splits. I think @amirrr was working on running RF and XGBoost with a balanced training set, which we think will dramatically help on the figure_of_speech $F_1$. Can you share those results too, @amirrr?
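Something like this could serve as the shared split code (a sketch; the 10% test fraction and seed are assumptions):

```python
from sklearn.model_selection import train_test_split

def make_split(df, feature, test_size=0.1, seed=42):
    """Repeatable train-test split, stratified on the feature being predicted."""
    return train_test_split(df, test_size=test_size, random_state=seed, stratify=df[feature])

# train_df, test_df = make_split(statements_df, "figure_of_speech")
```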

@markwhiting
Member Author

One more thing: we are doing a lot of LM-related stuff in commonsense-lm. Not sure if we want to share that space between our GPT explorations and other models, but I think ultimately we probably want a single place for it all. We can talk through the logistics of that next week.

@joshnguyen99

@markwhiting — Sure, I can move the code to Watts-Lab! For now, I will commit it to a folder within commonsense-statements, next to Amir's training scripts. Let's talk about how LLM-related code can be organized under one big repo when we meet.

@joshnguyen99

OK, I might have found something in @amirrr's code that led to very different results from mine.

In dimension-prediction.ipynb, you used this to perform train-test-split:

```python
for outcome in outcomes:
    X_train = merged_df[merged_df['category'] != 'Society and social sciences'].embeddings
    y_train = merged_df[merged_df['category'] != 'Society and social sciences'][outcome]

    X_test = merged_df[merged_df['category'] == 'Society and social sciences'].embeddings
    y_test = merged_df[merged_df['category'] == 'Society and social sciences'][outcome]
```

For example, if outcome == "behavior", then

  • The training set contains all statements not in the "Society and social sciences" bucket
  • The test set contains only "Society and social sciences" statements.

This is not entirely random, and we have very different percentages of positive examples in the training and test sets:

| Feature | Train positive | Test positive |
| --- | --- | --- |
| behavior | 63.4% (2503/3950) | 95.4% (436/457) |
| everyday | 63.7% (2518/3950) | 57.1% (261/457) |
| figure_of_speech | 20.7% (818/3950) | 7.9% (36/457) |
| judgment | 69.9% (2763/3950) | 93.7% (428/457) |
| opinion | 60.6% (2394/3950) | 89.9% (411/457) |
| reasoning | 63.1% (2494/3950) | 52.7% (241/457) |

In other words, the training and test subsets aren't stratified.

@amirrr
Collaborator

amirrr commented Nov 8, 2023

Got it. We are going to fix this by sorting statements into two groups based on their embeddings and cosine similarity, and then matching them according to whatever we are trying to model. This should make sure the training and testing groups are more balanced, which should help accuracy. I will share the train and test splits in the same repository.
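Concretely, something along these lines (a sketch of the idea, not the final implementation): pair each statement with its most similar unused neighbour by cosine similarity, then send one member of each pair to train and the other to test.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def matched_split(embeddings: np.ndarray, seed: int = 0):
    """Pair each statement with its most similar unused neighbour (cosine similarity),
    then put one member of each pair into train and the other into test."""
    sim = cosine_similarity(embeddings)
    np.fill_diagonal(sim, -np.inf)
    rng = np.random.default_rng(seed)
    unused = set(range(len(embeddings)))
    train_idx, test_idx = [], []
    while len(unused) > 1:
        i = unused.pop()
        j = max(unused, key=lambda k: sim[i, k])  # nearest remaining neighbour
        unused.remove(j)
        a, b = (i, j) if rng.random() < 0.5 else (j, i)
        train_idx.append(a)
        test_idx.append(b)
    return train_idx, test_idx  # an odd leftover statement, if any, is simply dropped
```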

@amirrr
Collaborator

amirrr commented Nov 13, 2023

Results with the balanced dataset test:

| Feature | XGBoost accuracy | XGBoost F1 | XGBoost $R^2$ | RF accuracy | RF F1 | RF $R^2$ |
| --- | --- | --- | --- | --- | --- | --- |
| behavior | 0.588 | 0.740 | 0.691 | 0.571 | 0.727 | 0.514 |
| everyday | 0.522 | 0.686 | 0.656 | 0.499 | 0.665 | 0.426 |
| figure_of_speech | 0.295 | 0.455 | -0.053 | 0.304 | 0.466 | -0.456 |
| judgment | 0.563 | 0.720 | 0.695 | 0.541 | 0.702 | 0.491 |
| opinion | 0.570 | 0.726 | 0.671 | 0.553 | 0.712 | 0.476 |
| reasoning | 0.501 | 0.667 | 0.650 | 0.462 | 0.631 | 0.393 |

@joshnguyen99

@amirrr and @markwhiting, I have uploaded the six RoBERTa-large models (for predicting dimensions) to Hugging Face.

They can be found on our lab's HF page: https://huggingface.co/CSSLab

You can try the Inference API on the right-hand side.
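For running them locally instead of through the Inference API, something like this should work (the model id below is a placeholder; substitute one of the six dimension checkpoints listed on the CSSLab page):

```python
from transformers import pipeline

# Placeholder id: pick one of the six dimension models under https://huggingface.co/CSSLab
classifier = pipeline("text-classification", model="CSSLab/<dimension-model>")
print(classifier("Touching a hot stove will burn you."))
```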

@joshnguyen99

Here's the performance of a multi-label classifier. I used the non-chat version of TinyLlama (1.1B params) and fine-tuned it once using the multi-label version of our dataset.

| Feature | Precision | Recall | F1 | AUROC |
| --- | --- | --- | --- | --- |
| behavior | 0.843 | 0.548 | 0.664 | 0.744 |
| everyday | 0.711 | 0.507 | 0.592 | 0.669 |
| figure_of_speech | 0.667 | 0.291 | 0.405 | 0.769 |
| judgment | 0.774 | 0.533 | 0.631 | 0.594 |
| opinion | 0.762 | 0.552 | 0.641 | 0.675 |
| reasoning | 0.691 | 0.786 | 0.735 | 0.645 |
| Micro | 0.747 | 0.564 | 0.643 | 0.724 |
| Macro | 0.741 | 0.536 | 0.611 | 0.683 |

Overall this looks better than the RoBERTa models used before, but not significantly. The perk here is that this is only one multi-label model instead of six binary classifiers.

(I also fine-tuned LLaMA-2 7B for this task, but it actually performs worse than TinyLlama. I suspect it's mostly because the dataset is small relative to the model size, evidenced by the relatively high variance during fine-tuning.)
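For reference, the multi-label head is just `problem_type="multi_label_classification"` in `transformers`. Here is a sketch of the setup (not the fine-tuned model itself; the checkpoint name and label order are assumptions):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["behavior", "everyday", "figure_of_speech", "judgment", "opinion", "reasoning"]
model_name = "TinyLlama/TinyLlama_v1.1"  # assumed name of the non-chat 1.1B checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # BCE loss, one sigmoid per dimension
)

inputs = tokenizer("Birds of a feather flock together.", return_tensors="pt")
with torch.no_grad():
    probs = torch.sigmoid(model(**inputs).logits)[0]
print(dict(zip(LABELS, probs.tolist())))  # untrained head here; fine-tuning supplies the real scores
```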

@markwhiting
Member Author

Thanks, any proposals for getting a better result? I feel like at this point using something like that for some properties and @amirrr's latest models for others gets us the best overall rate, though it would be great if we could put it all into a single model like the design you have, while achieving winning quality on every dimension.

@markwhiting
Member Author

Moving this to statements repo

@markwhiting markwhiting transferred this issue from Watts-Lab/commonsense-platform Jun 6, 2024