Handling Data with Interventions #181
Hi, Yujia asked if I would give you an answer. This is the data we used for
this tech report:
https://arxiv.org/abs/1805.03108
There is another data file here with the experimental variables given for
each row:
https://github.com/cmu-phil/example-causal-datasets/blob/main/real/sachs/data/sachs.2005.with.jittered.experimental.continuous.txt
The experimental variables are actually all 0/1; they have simply been
jittered with a small amount of Gaussian noise to allow matrix inversions
using them to not yield singularity exceptions. You can recover 0/1 by
thresholding them with a threshold of 0.5.
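
A minimal sketch of the thresholding step described above, assuming the jittered experimental values have already been read into a plain Python list. The column values here are synthetic stand-ins, not rows from the actual file, and `recover_binary` is a hypothetical helper name:

```python
import random


def recover_binary(values, threshold=0.5):
    """Map jittered 0/1 values back to exact 0/1 by thresholding."""
    return [1 if v > threshold else 0 for v in values]


# Synthetic stand-in for one jittered experimental column (not real data):
# true 0/1 flags plus a small amount of Gaussian noise.
rng = random.Random(0)
true_flags = [0, 1, 1, 0, 1]
jittered = [flag + rng.gauss(0.0, 0.01) for flag in true_flags]

recovered = recover_binary(jittered)
print(recovered)
```

Because the noise is small relative to the gap between 0 and 1, a cutoff of 0.5 separates the two groups cleanly.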
I think the issue with the number of rows is that one of the datasets in
the Sachs paper was not used for their analysis, so we omitted it as well.
If you can't figure out which one that was, let me know; I'll go through my
notes. I think it was...10?
Let me know if that helps.
Best,
Joe
On Mon, Apr 22, 2024 at 5:13 PM chrisquatjr ***@***.***> wrote:
Hi,
Thank you for the excellent repository! It has been exciting exploring the
discovery tools in this repository lately. I was using the data from Sachs
et al. 2005 before I realized the data was already implemented as an
internal dataset here. I am running into some confusion, however, and
thought I would ask here.
For clarity's sake, here is how I am loading in the internal dataset:
import pandas as pd
from causallearn.utils.Dataset import load_dataset

# Load the built-in Sachs dataset and wrap it in a DataFrame
data, labels = load_dataset(dataset_name="sachs")
df_internal = pd.DataFrame(data=data, columns=labels)
From what I can tell, the internal implementation of the dataset is some
subset of the 14 Excel tables one gets by downloading the data from the
paper directly. First, I noticed the internal dataset contains exactly the
same columns as all 14 of the Excel tables I have from the paper. The rows, by
contrast, differ substantially. There are on the order of 11 thousand rows
present across all 14 tables but there are only around 7 thousand rows
present in the internal dataset. (I did also confirm that the first 5 rows
of the internal dataset match exactly to those of the 1. cd3cd28.xls
file, so it does not look like any normalization/processing has altered the
values themselves).
Taken together, it seems the internal dataset is a row-joined subset of
the original Sachs dataset. Is this a correct assessment? If so, what
subset of the tables are included? Why aren't all conditions included?
Please let me know if I have simply missed some tutorial or documentation
somewhere. Any assistance would be greatly appreciated.
Overall, my goal is to reproduce the graph seen in Figure 3A. I know the
authors used a simulated annealing approach, but I want to try more current
approaches.
Thank you for the great explanation! I have been following the paper you suggested and have been able to reproduce everything using Tetrad's GUI, which I switched over to because I do not see an implementation of FASK in this library (let me know if I simply missed it). I followed the paper up to this point: "After running FASK, we deleted the intervention variables from the resulting graph keeping only the graph over the measured variables." I am not sure how to do this in Tetrad. I do not see anything in the manual about deleting or removing variables in this way. The data file also does not appear to conform to Tetrad's "status and value" convention. If I adjust the data to conform to that convention, would Tetrad automatically know not to include these variables in the graph output?
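
For readers working outside the Tetrad GUI, the step quoted from the paper (deleting the intervention variables and keeping only the graph over the measured variables) can be sketched on a plain edge list. This is not Tetrad's or causal-learn's API; `drop_nodes`, the node names, and the edges below are all hypothetical and purely illustrative, not results from FASK:

```python
def drop_nodes(edges, nodes_to_drop):
    """Return only the edges whose endpoints are both kept.

    `edges` is a list of (source, target) pairs; any edge touching a
    node in `nodes_to_drop` is removed along with that node.
    """
    drop = set(nodes_to_drop)
    return [(src, dst) for src, dst in edges
            if src not in drop and dst not in drop]


# Made-up example graph: two experimental nodes pointing into a
# three-node chain of measured variables.
edges = [("exp_1", "raf"), ("raf", "mek"), ("mek", "erk"), ("exp_2", "erk")]
measured_edges = drop_nodes(edges, ["exp_1", "exp_2"])
print(measured_edges)
```

The search itself still runs on all columns; only the reported graph is restricted to the measured variables afterwards.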
Oh my gosh, I missed your message! Let me think about how to respond.
Ah, I see. Here's the data: The intervention variables are all the variables after 'jnk'; these are experimental variables that have been jittered with a small amount of Gaussian noise.
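
A small sketch of separating the two groups of columns by position, assuming the header has been read into a list and that 'jnk' is the last measured variable as described above. The header below is shortened and the `exp_*` names are hypothetical placeholders, not the names used in the actual file:

```python
def split_measured_and_experimental(columns, last_measured="jnk"):
    """Split a header into measured columns (up to and including
    `last_measured`) and the experimental columns that follow it."""
    cut = columns.index(last_measured) + 1
    return columns[:cut], columns[cut:]


# Hypothetical, shortened header; the real file has more columns and
# its own names for the experimental variables.
header = ["raf", "mek", "erk", "jnk", "exp_1", "exp_2"]
measured, experimental = split_measured_and_experimental(header)
print(measured)
print(experimental)
```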