Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error in rule makefit_combined #4

Open
colin986 opened this issue Apr 30, 2021 · 15 comments
Open

Error in rule makefit_combined #4

colin986 opened this issue Apr 30, 2021 · 15 comments

Comments

@colin986
Copy link

colin986 commented Apr 30, 2021

Hi Cyril,

Thanks for your work - codonDT has the potential to be quite useful for our work.

I have matched riboseq and rnaseq data. I've run a test sample with only the riboseq to completion. Now I'm trying to run both the RNA and Riboseq. And hit this R error.

Error in sparse.model.matrix(data = ncount, formu) : object 'formu' not found Calls: MakeFit -> sparse.model.matrix -> terms Execution halted [Fri Apr 30 19:44:11 2021] Error in rule makefit_combined: jobid: 138 output: Data/Fit/CHX7_RIBO_fit_24:25.RData, Data/Fit/CHX7_RIBO_fit_24:25.RData.pred, Data/Fit/CHX7_RIBO_fit_24:25.RData.cor

Best,
Colin

@cgob
Copy link
Owner

cgob commented Apr 30, 2021

Dear Colin,
Thanks for you comment. You are right. This has been corrected in my current local version with other improvements and new options. This should be pushed on GitHub by Monday. Don’t hesitate to pull the last version as it improves small details in the pipeline and allows more options for the user.
Best,
Cédric

@colin986
Copy link
Author

Thanks Cèdric, I’ll hold off rerunning until your next version is pushed.

I have another question about the experimental design. In our data we quadruplicate CHX/RNASeq pairs for two conditions. When I try include all 8 samples - snakemake throws and error I presume this is because there is a limitation for triplicates?

Also for the first column of the samples tsv how should this be set and should the sample I’d match between the chx and RNASeq or is it sufficient to designate the replicate and columns

Thank you,
Colin

@cgob
Copy link
Owner

cgob commented May 4, 2021

Dear Colin,
The new version should be pushed by the end of the day.
For the samples .tsv file, we are finally using a three column file with SAMPLES, Type and SRR as header.
Ribo-seq and RNA-Seq should have the same "SAMPLES" name. In case you want replicates to be fitted together (one output), you can give them the same name. Otherwise, I advice you to run them separately with different sample names in order to have an idea about the variance in the data. Tell me if it's not clear.

Best,

Cédric

Example 1 (one fit per sample, 2 output):
SAMPLES Type SRR
cond_A1 RNA SRRX1
cond_A2 RNA SRRX2
cond_A1 RIBO SRRX3
cond_A2 RIBO SRRX4

Example 2 (merged replicates, 1 output):
SAMPLES Type SRR
cond_A RNA SRRX1
cond_A RNA SRRX2
cond_A RIBO SRRX3
cond_A RIBO SRRX4

@colin986
Copy link
Author

colin986 commented May 4, 2021

Thanks Cedric, that's clear on the replicates!

Looking forward to working with the new version.

Colin

@colin986
Copy link
Author

colin986 commented May 5, 2021

Thanks Cedric, I've pulled the latest version. Just a note on the definition of the A site position in config.yaml - the comments says:

position of the A site from the 5'end of the read (set to 5 or 6)

I presume this should be 15 or 16?

I'm also having a problem when downloading from local fastq files; looking at the Snakemake file - it seems to alway expect the data to be on SRA.

Colin

@cgob
Copy link
Owner

cgob commented May 6, 2021

Hi Colin,
I have updated the readme file and hopefully this might help to answer your questions. I will try to make it clearer in the next update.

Position of the A site: Because the reads are shifted according to the CDS reading frame, we are using codon coordinates and not nucleotide. Depending of the libary type and species, it is either set to 5 or 6. The best is to run the pipeline and check that the pattern matches the position of the A site.

Fastq files (see readme): In case fastq files are directly provided, files names must be specified under the "SRR" field without the ".fastq" extension and placed in the "/workdir/data/raw/" output folder. Tell me if it's working.

I hope this was helpful.

Best,

Cédric

@colin986
Copy link
Author

colin986 commented May 6, 2021

Thanks Cedric, sorry for confusion on the A site.

I have included the fastq files in the data/raw output folder - the pipe line still tries to download from sra using the sample name. on my side when I look at the Snakemake file there's no option to bypass SRA

image

I also see the following error when I include more than 1 line in the sample.tsv file.

image

Colin

@cgob
Copy link
Owner

cgob commented May 6, 2021

There is no defined rule in the snakefile to bypass SRA but because snakemake rules are based on input file names and wildcards, it's working. Snakemake will identify that the fastq files are already in the folder and will proceed with the mergefastq rule. You have to be sure that the SRR field is not the SRA anymore but the fastq name without extension.

Example
In my workdir I have two fastq files:
Leu1_S2_R1_001.fastq, Leu2_S6_R1_001.fastq

Here is my samples.tsv:
SAMPLES Type SRR
Leu1 RIBO Leu1_S2_R1_001
Leu2 RIBO Leu2_S6_R1_001

I hope this is working.
Best,

Cédric

@colin986
Copy link
Author

colin986 commented May 6, 2021

Hi Cedric,

No change - tried to simplify things

In my workdir/Data/Raw directory I have:
ntsr1.fastq

samples.tsv
SAMPLES Type SRR
ntsr1 RIBO ntsr1

Error:

image

Best,
Colin

@cgob
Copy link
Owner

cgob commented May 6, 2021

Which command do you use to run snakemake ? Can you try to add "--reason" to understand while snakemake is triggering the download rule while ntsr1.fastq is existing. And can you try "--touch" as well - when the output file is newer that the input file, snakemake will rerun the rule.

@colin986
Copy link
Author

colin986 commented May 7, 2021

Thanks Cedric, you were right - the mistake was on my side I used "--touch" and everything runs smoothy now up until the combined fit of the Ribo and RNA.

I hit another error there - the glm doesn't converge after 200 iterations. I've increased the number to 1,000 and started again.

I suspect this is specific to my data though and it won't converge with the increased MAXITER.

For this run, I've preprocessed my Ribo data by removing reads mapping to rRNA, snoRNA and tRNA. I then filter the reads to the lengths that are (>60%) in phase.

Do you recommend using raw data or preprocessed?

@colin986
Copy link
Author

Hi Cedric,

Just an update, the problem with the MAXITER was on my side, I restricted the upper and lower bounds of the read length too stringently, when I used the defaults everything worked smoothly. I was wondering could you explain a little more about those please?

A minor error occurs when loading the Ensembl.cds.parse.fa.rData file - when running multiple jobs that overlap that file is incomplete. When I run the pipeline there is no problem.

Colin

@cgob
Copy link
Owner

cgob commented May 11, 2021

Hi Colin,
I'm happy to hear that you were able to run the pipeline. As the number of parameters in the model is high, the read coverage has to be quite deep to achieve proper convergence of the model. In order to select reads with correct size (i.e. corresponding to a real ribosome footprint), the upper and lower bounds read length parameters are used. I usually look at the fragment size distribution plot to set them correctly. Reads with size between the boundaries are selected and used for the fit. How are the results?

I'm not recommending to preprocess the data but to directly start with the raw data. Which organism are you analysing?

Best,

Cédric

@colin986
Copy link
Author

Hi Cedric,

The read coverage is > 100million RPFs (28-31nt) post filtering. For run I completed, I didn't select read lengths by phasing - I used the default values in the config.yaml. When I restricted the by read length to the narrow region the glm did not converge.

I haven't gone through the results in detail as of yet, but intend to in the coming days.

The organism is a cell line from the Chinese hamster. At the moment I'm using an Ensembl GTF, however a better annotation is available through NCBI -do you plan on incorporating different GTFs in the future?

Regards,
Colin

@cgob
Copy link
Owner

cgob commented May 11, 2021

Your coverage is indeed very high.
I haven't planned to add different GTFs but I can have a look at your NCBI GTF and try to adapt the code if you want.
Best,
Cédric

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants