Case 3: Polished

This describes Case 3 of the three input cases:

Case 1: An unresolved primary assembly with associated contigs (the output of FALCON 2-asm) or without (e.g., the output of Canu or wtdbg2).
Case 2: A haplotype-resolved but unpolished set (e.g., the output of FALCON-Unzip 3-unzip).
Case 3: IDEAL! A haplotype-resolved, CLR long-read, Arrow-polished set of primary and alternate contigs (e.g., the output of FALCON-Unzip 4-polish).

Example Dataset

Example files for testing the pipeline, based on chromosome 30 of Helicoverpa zea (Hzea), are available at https://data.nal.usda.gov/dataset/data-polishclr-example-input-genome-assemblies.

Case 3 takes the primary assembly from the FALCON/4-polish folder.

Param	Files	Download link
`--primary_assembly`	"cns_p_ctg.fasta"	cns_p_ctg.fasta
`--alternate_assembly`	"cns_h_ctg.fasta"	cns_h_ctg.fasta
`--mitochondrial_assembly`	"GCF_022581195.2_ilHelZeax1.1_mito.fa"	GenBank download fasta
`--illumina_reads`	"testpolish_{R1,R2}.fastq"	testpolish_R1.fastq, testpolish_R2.fastq
`--pacbio_reads`	"test.1.filtered.bam"	test.1.filtered.bam_.gz

Note: The PacBio Reads (test.1.filtered.bam_.gz) must be decompressed before running the pipeline.

gunzip -dc test.1.filtered.bam_.gz > test.1.filtered.bam
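
A quick sanity check after decompressing can catch a truncated download or a file that is still gzip-wrapped. The sketch below relies on the fact that a BAM file is BGZF-compressed and therefore starts with the bytes 1f 8b 08 04; a plain gzip file typically differs in the fourth byte. The function name `looks_like_bam` is a hypothetical helper, not part of polishCLR, and it only inspects the first four bytes rather than validating the whole file.

```bash
# Hypothetical helper (not part of polishCLR): succeed if the file starts
# with the BGZF magic bytes 1f 8b 08 04 that a BAM file begins with.
looks_like_bam() {
  # Read the first four bytes and print them as a compact hex string.
  magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
  [ "$magic" = "1f8b0804" ]
}
```

For example, `looks_like_bam test.1.filtered.bam || echo "not a BAM file"` prints a warning when the header does not match.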

Recommended parameters

nextflow run isugifNF/polishCLR -r main \
  --primary_assembly "cns_p_ctg.fasta" \
  --alternate_assembly "cns_h_ctg.fasta" \
  --mitochondrial_assembly "GCF_022581195.2_ilHelZeax1.1_mito.fa" \
  --illumina_reads "*_{R1,R2}.fastq" \
  --pacbio_reads "test.1.filtered.bam" \
  --step 1 \
  -profile slurm \
  -resume
  

Note: Some browsers copy dashes (-) and underscores (_) incorrectly. If the pipeline reports that a parameter is not valid, try retyping those parameters manually.
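
If the command is saved to a file first, one way to spot a badly copied character is to search for bytes outside printable ASCII, such as an en dash pasted in place of a hyphen. This is a minimal sketch; `find_non_ascii` is a hypothetical helper name, and the file it scans (for example a `run_step1.sh` you created yourself) is an assumption, not something the pipeline provides.

```bash
# Hypothetical helper: print, with line numbers, every line of a saved
# command file that contains a byte outside printable ASCII or whitespace.
# Returns 0 if at least one such line is found, 1 if the file is clean.
find_non_ascii() {
  # LC_ALL=C makes grep treat the input byte by byte; -a forces text mode.
  LC_ALL=C grep -an '[^[:print:][:space:]]' "$1"
}
```

Running `find_non_ascii run_step1.sh` lists the offending lines, which can then be retyped by hand.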

Step 2 runs another round of Arrow polishing with the PacBio reads, then polishes with the short reads in two rounds of FreeBayes. We split the pipeline into two separate steps to allow for manual scaffolding between them.

Provide the purged primary contigs (primary_purged.fa) and alternate contigs (haps_purged.fa) from purge_dups, along with the mitochondrial genome (mitochondrial.fasta), as input to step 2.
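
Before launching step 2, it can help to confirm that each of these inputs exists and is not empty. The following is a sketch only; `check_inputs` is a hypothetical helper, not a polishCLR feature, and the paths passed to it should be whatever you give the pipeline.

```bash
# Hypothetical helper: report every argument that is not a non-empty
# regular file. Returns 0 only when all files pass the check.
check_inputs() {
  status=0
  for f in "$@"; do
    if [ ! -f "$f" ] || [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      status=1
    fi
  done
  return "$status"
}
```

A call such as `check_inputs primary_purged.fa haps_purged.fa mitochondrial.fasta` prints one line per problem file and returns a non-zero status if any input is unusable.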

If scaffolding data such as Hi-C are available, scaffold primary_purged.fa and provide the scaffolded assembly as the --primary_assembly input.

Either way, remember to add the --step 2 and -resume flags to this command.

nextflow run isugifNF/polishCLR -r main \
  --primary_assembly "primary_purged.fa" \
  --alternate_assembly "haps_purged.fa" \
  --mitochondrial_assembly "data/mitochondrial.fasta" \
  --illumina_reads "*_{R1,R2}.fastq" \
  --pacbio_reads "test.1.filtered.bam" \
  --step 2 \
  -profile slurm \
  -resume
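
Both commands pass the Illumina reads as the quoted pattern "*_{R1,R2}.fastq". The quotes keep the shell from expanding the pattern, so a missing mate file is not noticed until the pipeline runs. A rough pre-check, assuming the reads sit in one directory and follow the `_R1.fastq` / `_R2.fastq` naming shown in the example dataset, could look like this (`check_read_pairs` is a hypothetical helper):

```bash
# Hypothetical helper: for every *_R1.fastq in a directory, check that the
# matching *_R2.fastq exists. Prints each unpaired R1 file and returns 1 if
# any file is unpaired or if no R1 files are found at all.
check_read_pairs() {
  dir=${1:-.}
  found=0
  status=0
  for r1 in "$dir"/*_R1.fastq; do
    [ -e "$r1" ] || continue          # the glob matched nothing
    found=1
    r2="${r1%_R1.fastq}_R2.fastq"     # swap the R1 suffix for R2
    if [ ! -e "$r2" ]; then
      echo "no mate for: $r1" >&2
      status=1
    fi
  done
  [ "$found" -eq 1 ] || status=1
  return "$status"
}
```

Calling `check_read_pairs .` in the directory that holds testpolish_R1.fastq and testpolish_R2.fastq should return success without printing anything.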