Error in rule makefit_combined #4

colin986 · 2021-04-30T18:34:53Z

Hi Cyril,

Thanks for your work - codonDT has the potential to be quite useful for our work.

I have matched riboseq and rnaseq data. I've run a test sample with only the riboseq to completion. Now I'm trying to run both the RNA and Riboseq. And hit this R error.

Error in sparse.model.matrix(data = ncount, formu) : object 'formu' not found Calls: MakeFit -> sparse.model.matrix -> terms Execution halted [Fri Apr 30 19:44:11 2021] Error in rule makefit_combined: jobid: 138 output: Data/Fit/CHX7_RIBO_fit_24:25.RData, Data/Fit/CHX7_RIBO_fit_24:25.RData.pred, Data/Fit/CHX7_RIBO_fit_24:25.RData.cor

Best,
Colin

The text was updated successfully, but these errors were encountered:

cgob · 2021-04-30T20:32:13Z

Dear Colin,
Thanks for you comment. You are right. This has been corrected in my current local version with other improvements and new options. This should be pushed on GitHub by Monday. Don’t hesitate to pull the last version as it improves small details in the pipeline and allows more options for the user.
Best,
Cédric

colin986 · 2021-04-30T20:48:20Z

Thanks Cèdric, I’ll hold off rerunning until your next version is pushed.

I have another question about the experimental design. In our data we quadruplicate CHX/RNASeq pairs for two conditions. When I try include all 8 samples - snakemake throws and error I presume this is because there is a limitation for triplicates?

Also for the first column of the samples tsv how should this be set and should the sample I’d match between the chx and RNASeq or is it sufficient to designate the replicate and columns

Thank you,
Colin

cgob · 2021-05-04T06:30:29Z

Dear Colin,
The new version should be pushed by the end of the day.
For the samples .tsv file, we are finally using a three column file with SAMPLES, Type and SRR as header.
Ribo-seq and RNA-Seq should have the same "SAMPLES" name. In case you want replicates to be fitted together (one output), you can give them the same name. Otherwise, I advice you to run them separately with different sample names in order to have an idea about the variance in the data. Tell me if it's not clear.

Best,

Cédric

Example 1 (one fit per sample, 2 output):
SAMPLES Type SRR
cond_A1 RNA SRRX1
cond_A2 RNA SRRX2
cond_A1 RIBO SRRX3
cond_A2 RIBO SRRX4

Example 2 (merged replicates, 1 output):
SAMPLES Type SRR
cond_A RNA SRRX1
cond_A RNA SRRX2
cond_A RIBO SRRX3
cond_A RIBO SRRX4

colin986 · 2021-05-04T07:42:17Z

Thanks Cedric, that's clear on the replicates!

Looking forward to working with the new version.

Colin

colin986 · 2021-05-05T21:39:22Z

Thanks Cedric, I've pulled the latest version. Just a note on the definition of the A site position in config.yaml - the comments says:

position of the A site from the 5'end of the read (set to 5 or 6)

I presume this should be 15 or 16?

I'm also having a problem when downloading from local fastq files; looking at the Snakemake file - it seems to alway expect the data to be on SRA.

Colin

cgob · 2021-05-06T08:09:02Z

Hi Colin,
I have updated the readme file and hopefully this might help to answer your questions. I will try to make it clearer in the next update.

Position of the A site: Because the reads are shifted according to the CDS reading frame, we are using codon coordinates and not nucleotide. Depending of the libary type and species, it is either set to 5 or 6. The best is to run the pipeline and check that the pattern matches the position of the A site.

Fastq files (see readme): In case fastq files are directly provided, files names must be specified under the "SRR" field without the ".fastq" extension and placed in the "/workdir/data/raw/" output folder. Tell me if it's working.

I hope this was helpful.

Best,

Cédric

colin986 · 2021-05-06T08:14:46Z

Thanks Cedric, sorry for confusion on the A site.

I have included the fastq files in the data/raw output folder - the pipe line still tries to download from sra using the sample name. on my side when I look at the Snakemake file there's no option to bypass SRA

I also see the following error when I include more than 1 line in the sample.tsv file.

Colin

cgob · 2021-05-06T08:25:23Z

There is no defined rule in the snakefile to bypass SRA but because snakemake rules are based on input file names and wildcards, it's working. Snakemake will identify that the fastq files are already in the folder and will proceed with the mergefastq rule. You have to be sure that the SRR field is not the SRA anymore but the fastq name without extension.

Example
In my workdir I have two fastq files:
Leu1_S2_R1_001.fastq, Leu2_S6_R1_001.fastq

Here is my samples.tsv:
SAMPLES Type SRR
Leu1 RIBO Leu1_S2_R1_001
Leu2 RIBO Leu2_S6_R1_001

I hope this is working.
Best,

Cédric

colin986 · 2021-05-06T10:09:34Z

Hi Cedric,

No change - tried to simplify things

In my workdir/Data/Raw directory I have:
ntsr1.fastq

samples.tsv
SAMPLES Type SRR
ntsr1 RIBO ntsr1

Error:

Best,
Colin

cgob · 2021-05-06T12:36:00Z

Which command do you use to run snakemake ? Can you try to add "--reason" to understand while snakemake is triggering the download rule while ntsr1.fastq is existing. And can you try "--touch" as well - when the output file is newer that the input file, snakemake will rerun the rule.

colin986 · 2021-05-07T07:59:26Z

Thanks Cedric, you were right - the mistake was on my side I used "--touch" and everything runs smoothy now up until the combined fit of the Ribo and RNA.

I hit another error there - the glm doesn't converge after 200 iterations. I've increased the number to 1,000 and started again.

I suspect this is specific to my data though and it won't converge with the increased MAXITER.

For this run, I've preprocessed my Ribo data by removing reads mapping to rRNA, snoRNA and tRNA. I then filter the reads to the lengths that are (>60%) in phase.

Do you recommend using raw data or preprocessed?

colin986 · 2021-05-10T07:55:27Z

Hi Cedric,

Just an update, the problem with the MAXITER was on my side, I restricted the upper and lower bounds of the read length too stringently, when I used the defaults everything worked smoothly. I was wondering could you explain a little more about those please?

A minor error occurs when loading the Ensembl.cds.parse.fa.rData file - when running multiple jobs that overlap that file is incomplete. When I run the pipeline there is no problem.

Colin

cgob · 2021-05-11T09:09:39Z

Hi Colin,
I'm happy to hear that you were able to run the pipeline. As the number of parameters in the model is high, the read coverage has to be quite deep to achieve proper convergence of the model. In order to select reads with correct size (i.e. corresponding to a real ribosome footprint), the upper and lower bounds read length parameters are used. I usually look at the fragment size distribution plot to set them correctly. Reads with size between the boundaries are selected and used for the fit. How are the results?

I'm not recommending to preprocess the data but to directly start with the raw data. Which organism are you analysing?

Best,

Cédric

colin986 · 2021-05-11T09:20:09Z

Hi Cedric,

The read coverage is > 100million RPFs (28-31nt) post filtering. For run I completed, I didn't select read lengths by phasing - I used the default values in the config.yaml. When I restricted the by read length to the narrow region the glm did not converge.

I haven't gone through the results in detail as of yet, but intend to in the coming days.

The organism is a cell line from the Chinese hamster. At the moment I'm using an Ensembl GTF, however a better annotation is available through NCBI -do you plan on incorporating different GTFs in the future?

Regards,
Colin

cgob · 2021-05-11T09:24:56Z

Your coverage is indeed very high.
I haven't planned to add different GTFs but I can have a look at your NCBI GTF and try to adapt the code if you want.
Best,
Cédric

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error in rule makefit_combined #4

Error in rule makefit_combined #4

colin986 commented Apr 30, 2021 •

edited

Loading

cgob commented Apr 30, 2021

colin986 commented Apr 30, 2021

cgob commented May 4, 2021

colin986 commented May 4, 2021

colin986 commented May 5, 2021 •

edited

Loading

cgob commented May 6, 2021 •

edited

Loading

colin986 commented May 6, 2021 •

edited

Loading

cgob commented May 6, 2021

colin986 commented May 6, 2021

cgob commented May 6, 2021

colin986 commented May 7, 2021

colin986 commented May 10, 2021

cgob commented May 11, 2021

colin986 commented May 11, 2021

cgob commented May 11, 2021

Error in rule makefit_combined #4

Error in rule makefit_combined #4

Comments

colin986 commented Apr 30, 2021 • edited Loading

cgob commented Apr 30, 2021

colin986 commented Apr 30, 2021

cgob commented May 4, 2021

colin986 commented May 4, 2021

colin986 commented May 5, 2021 • edited Loading

cgob commented May 6, 2021 • edited Loading

colin986 commented May 6, 2021 • edited Loading

cgob commented May 6, 2021

colin986 commented May 6, 2021

cgob commented May 6, 2021

colin986 commented May 7, 2021

colin986 commented May 10, 2021

cgob commented May 11, 2021

colin986 commented May 11, 2021

cgob commented May 11, 2021

colin986 commented Apr 30, 2021 •

edited

Loading

colin986 commented May 5, 2021 •

edited

Loading

cgob commented May 6, 2021 •

edited

Loading

colin986 commented May 6, 2021 •

edited

Loading