Usage
The plugin adds one new function, bam(), which parses SAM/BAM/CRAM files.
bam() function
The function has one mandatory argument:
- The SAM/BAM/CRAM file produced as output by the process/workflow
The function has one optional argument:
- The reference FASTA file. This is necessary for some operations on CRAM files.
The FASTA can be either a local file or a file URL (HTTP and HTTPS are supported, as is s3://, which is described further below).
The function can also be called as sam() or cram(). These are simply aliases of the bam() function and do exactly the same thing.
Additionally, the stringency option sets the validation stringency of the HTSJDK library. This can be used to silence the validation errors emitted when an alignment file isn't correct. This option accepts three possible values: lenient, silent and strict (default).
Calling bam() creates an AlignmentFile object, which has several methods to access the content of the SAM/BAM/CRAM file.
.getHeader() method
The .getHeader() method returns a list of all header lines:
[
"@HD\tVN:1.6\tSO:unsorted",
"@SQ\tSN:MT192765.1\tLN:29829",
"@RG\tID:1\tLB:lib1\tPL:ILLUMINA\tSM:test\tPU:barcode1",
"@PG\tID:minimap2\tPN:minimap2\tVN:2.17-r941\tCL:minimap2 -ax sr tests/data/fasta/sarscov2/GCA_011545545.1_ASM1154554v1_genomic.fna tests/data/fastq/dna/sarscov2_1.fastq.gz tests/data/fastq/dna/sarscov2_2.fastq.gz",
"@PG\tID:samtools\tPN:samtools\tPP:minimap2\tVN:1.11\tCL:samtools view -Sb sarscov2_aln.sam"
]
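When asserting on individual header values rather than on the whole list, it can help to split each line into its record type and TAG:VALUE fields. The following is a minimal Python sketch of that idea, independent of the plugin; the function name parse_header_line is made up for illustration, and the input is a subset of the header shown above.

```python
def parse_header_line(line):
    """Split one SAM header line into its record type and TAG:VALUE fields."""
    record_type, *fields = line.split("\t")
    if record_type == "@CO":
        # Comment lines carry free text instead of TAG:VALUE pairs.
        return record_type, {"text": "\t".join(fields)}
    # Split on the first colon only, because values (e.g. CL) may contain colons.
    tags = dict(field.split(":", 1) for field in fields)
    return record_type, tags


header = [
    "@HD\tVN:1.6\tSO:unsorted",
    "@SQ\tSN:MT192765.1\tLN:29829",
    "@RG\tID:1\tLB:lib1\tPL:ILLUMINA\tSM:test\tPU:barcode1",
]

parsed = [parse_header_line(line) for line in header]
# Collect the reference sequence names declared in @SQ lines.
sequence_names = [tags["SN"] for record_type, tags in parsed if record_type == "@SQ"]
print(sequence_names)
```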
.getHeaderMD5() method
The .getHeaderMD5() method returns the MD5 checksum of the header.
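A checksum is convenient in snapshots because it stays short even when the header is long. As a rough illustration of the idea, here is a Python sketch that hashes a list of header lines. How the plugin serialises the header before hashing is not stated here, so joining the lines with newlines is an assumption, and the resulting values should not be expected to match the plugin's output.

```python
import hashlib


def md5_of_lines(lines):
    """Return the hex MD5 digest of the given lines.

    Assumption: each line is UTF-8 encoded and followed by a newline.
    """
    digest = hashlib.md5()
    for line in lines:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


header = ["@HD\tVN:1.6\tSO:unsorted", "@SQ\tSN:MT192765.1\tLN:29829"]
checksum = md5_of_lines(header)
reordered = md5_of_lines(list(reversed(header)))

# The digest depends on both content and order of the lines.
print(checksum != reordered)
```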
.getReads() method
A reference is needed here for CRAM files.
The .getReads() method returns a list of all raw reads from the alignment file:
[
"ACGCACATTGCTAACTAAGGGCACACTAGAACCAGAATATTTCAATTCAGTGTGTAGACTTATGAAAACTATAGGTCCAGACATGTTCCTCGGAACTTGTCGGCGTTGTCCTGCTGAAATTGTTGACACTGTGAGTGCTTTGGTTTATGA",
"ATGTGTACATTGGCGACCCTGCTCAATTACCTGCACCACGCACATTGCTAACTAAGGGCACACTAGAACCAGAATATTTCAATTCAGTGTGTAGACTTATGAAAACTATAGGTCCAGACATGTTCCTCGGAACTTGTCGGCGTTGTCCTG",
"GCATAGACGGTGCTTTACTTACAAAGTCCTCAGAATACAAAGGTCCTATTACGGATGTTTTCTACAAAGAAAACAGT",
"GCATAGACGGTGCTTTACTTACAAAGTCCTCAGAATACAAAGGTCCTATTACGGATGTTTTCTACAAAGAAAACAGT",
"TAGGTGAGTTAGGTGATGTTAGAGAAACAATGAGTTACTTGTTTCAACATGCCAATTTAGATTCTTGCAAAAGAGTCTTGAACGTGGTGTGTAAAACTTGTGGACAACAGCAGACAACCCTTAAGGGTGTAGAAGCTGTTATGTAC",
"TTACAGAGCAAGGGCTGGTGAAGCTGCTAACTTTTGTGCACTTATCTTAGCCTACTGTAATAAGACAGTAGGTGAGTTAGGTGATGTTAGAGAAACAATGAGTTACTTGTTTCAACATGCCAATTTAGATTCTTGCAAAAGAGTCTTGAA",
"GTCTACAAGCTGGTAATGCAACAGAAGTGCCTGCCAATTCAACTGTATTATCTTTCTGTGCTTTTGCTGTAGATGCTGCTAAAGCTTACAAAGATTATCTAGCTAGTGGGGGACAACCAATCACTAATTGTG",
"GTCTACAAGCTGGTAATGCAACAGAAGTGCCTGCCAATTCAACTGTATTATCTTTCTGTGCTTTTGCTGTAGATGCTGCTAAAGCTTACAAAGATTATCTAGCTAGTGGGGGACAACCAATCACTAATTGTG",
"AACCAGAATATTTCAATTCAGTGTGTAGACTTATGAAAACTATAGGTCCAGACATGTTCCTCGGAACTTGTCGGCGTTGT",
"AACCAGAATATTTCAATTCAGTGTGTAGACTTATGAAAACTATAGGTCCAGACATGTTCCTCGGAACTTGTCGGCGTTGTCC",
"ACTTTCCAAAGTGCAGTCAAAAGAACAATCACGGGTACACACCACTGGTTGTTACTCACAATTTTGACTTCACTTTTAG",
...
]
You can also pass an integer to the method to limit the number of reads it returns:
[
"ACGCACATTGCTAACTAAGGGCACACTAGAACCAGAATATTTCAATTCAGTGTGTAGACTTATGAAAACTATAGGTCCAGACATGTTCCTCGGAACTTGTCGGCGTTGTCCTGCTGAAATTGTTGACACTGTGAGTGCTTTGGTTTATGA",
"ATGTGTACATTGGCGACCCTGCTCAATTACCTGCACCACGCACATTGCTAACTAAGGGCACACTAGAACCAGAATATTTCAATTCAGTGTGTAGACTTATGAAAACTATAGGTCCAGACATGTTCCTCGGAACTTGTCGGCGTTGTCCTG"
]
.getReadsMD5() method
The .getReadsMD5() method returns the MD5 checksum of the raw reads.
.getSamLines() method
A reference is needed here for CRAM files.
The .getSamLines() method returns a list of all lines from the alignment file:
[
"ERR5069949.2151832\t83\tMT192765.1\t17453\t60\t150M\t=\t17416\t-187\tACGCACATTGCTAACTAAGGGCACACTAGAACCAGAATATTTCAATTCAGTGTGTAGACTTATGAAAACTATAGGTCCAGACATGTTCCTCGGAACTTGTCGGCGTTGTCCTGCTGAAATTGTTGACACTGTGAGTGCTTTGGTTTATGA\tAAAA<EEEEEEAEEEAEAAAAEEEEEEEEEAAAEE<EEEEEAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAAA\ts1:i:183\ts2:i:0\tRG:Z:1\tAS:i:300\tde:f:0.0\trl:i:0\tcm:i:13\tnn:i:0\ttp:A:P\tms:i:300\n",
"ERR5069949.2151832\t163\tMT192765.1\t17416\t60\t150M\t=\t17453\t187\tATGTGTACATTGGCGACCCTGCTCAATTACCTGCACCACGCACATTGCTAACTAAGGGCACACTAGAACCAGAATATTTCAATTCAGTGTGTAGACTTATGAAAACTATAGGTCCAGACATGTTCCTCGGAACTTGTCGGCGTTGTCCTG\tAAAAAEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEAEEEEEAAEEEEEEEEEAAEAAA<<EAAEEEEEEEAAA<<<AE\ts1:i:183\ts2:i:47\tRG:Z:1\tAS:i:300\tde:f:0.0\trl:i:0\tcm:i:14\tnn:i:0\ttp:A:P\tms:i:300\n",
"ERR5069949.576388\t83\tMT192765.1\t5798\t50\t77M\t=\t5798\t-77\tGCATAGACGGTGCTTTACTTACAAAGTCCTCAGAATACAAAGGTCCTATTACGGATGTTTTCTACAAAGAAAACAGT\tEA/AEEE/<EEEEEEEEEEEAA<EEEEEEEEEEEEEEEEEEEEEAEEEEEAEEEAEE6/EEEAEEEEEEEEEA6AAA\ts1:i:62\ts2:i:0\tRG:Z:1\tAS:i:154\tde:f:0.0\trl:i:0\tcm:i:1\tnn:i:0\ttp:A:P\tms:i:154\n",
"ERR5069949.576388\t163\tMT192765.1\t5798\t60\t77M\t=\t5798\t77\tGCATAGACGGTGCTTTACTTACAAAGTCCTCAGAATACAAAGGTCCTATTACGGATGTTTTCTACAAAGAAAACAGT\tAAAAA6EEAEEEEEAEEAEEAEEEEEEA6EEEEAEEAEEEEE6EEEEEEAEEEEA///A<<EEEEEEEEEAEEEEEE\ts1:i:62\ts2:i:0\tRG:Z:1\tAS:i:154\tde:f:0.0\trl:i:0\tcm:i:10\tnn:i:0\ttp:A:P\tms:i:154\n",
...
]
You can also pass an integer to the method to limit the number of lines it returns:
[
"ERR5069949.2151832\t83\tMT192765.1\t17453\t60\t150M\t=\t17416\t-187\tACGCACATTGCTAACTAAGGGCACACTAGAACCAGAATATTTCAATTCAGTGTGTAGACTTATGAAAACTATAGGTCCAGACATGTTCCTCGGAACTTGTCGGCGTTGTCCTGCTGAAATTGTTGACACTGTGAGTGCTTTGGTTTATGA\tAAAA<EEEEEEAEEEAEAAAAEEEEEEEEEAAAEE<EEEEEAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAAAA\ts1:i:183\ts2:i:0\tRG:Z:1\tNM:i:0\tAS:i:300\tde:f:0.0\trl:i:0\tcm:i:13\tnn:i:0\ttp:A:P\tms:i:300\n",
"ERR5069949.2151832\t163\tMT192765.1\t17416\t60\t150M\t=\t17453\t187\tATGTGTACATTGGCGACCCTGCTCAATTACCTGCACCACGCACATTGCTAACTAAGGGCACACTAGAACCAGAATATTTCAATTCAGTGTGTAGACTTATGAAAACTATAGGTCCAGACATGTTCCTCGGAACTTGTCGGCGTTGTCCTG\tAAAAAEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEEEEEEEAEEEEEAAEEEEEEEEEAAEAAA<<EAAEEEEEEEAAA<<<AE\ts1:i:183\ts2:i:47\tRG:Z:1\tNM:i:0\tAS:i:300\tde:f:0.0\trl:i:0\tcm:i:14\tnn:i:0\ttp:A:P\tms:i:300\n"
]
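Each returned line is a tab-separated SAM record: eleven mandatory columns followed by optional TAG:TYPE:VALUE fields. The raw reads listed for .getReads() correspond to the SEQ column of these records. The Python sketch below illustrates that relationship and the effect of a limit. It is not the plugin's implementation; the function names and the two toy records are invented for the example.

```python
from itertools import islice

# The eleven mandatory SAM columns, in order.
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line):
    """Turn one SAM record line into a dict of columns plus optional tags."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    record["tags"] = {}
    for field in fields[11:]:
        # Optional fields look like TAG:TYPE:VALUE; the value may contain colons.
        tag, value_type, value = field.split(":", 2)
        record["tags"][tag] = (value_type, value)
    return record


def first_sequences(sam_lines, limit=None):
    """Return the SEQ column of the first `limit` records (all if None)."""
    records = (parse_sam_line(line) for line in sam_lines)
    return [record["SEQ"] for record in islice(records, limit)]


# Toy records made up for this example (not taken from a real file).
toy_lines = [
    "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tFFFFFFFF\tNM:i:0\n",
    "read2\t16\tchr1\t200\t60\t4M\t*\t0\t0\tTTGA\tFFFF\tNM:i:1\n",
]

print(first_sequences(toy_lines, 1))
print(parse_sam_line(toy_lines[1])["tags"])
```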
.getSamLinesMD5() method
The .getSamLinesMD5() method returns the MD5 checksum of the plain SAM lines.
.getFileType() method
The .getFileType() method returns the type ("SAM", "BAM" or "CRAM") of the input file.
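For background, the three formats can be told apart by their leading bytes: CRAM files start with the magic string CRAM, BAM files are BGZF (gzip-compatible) streams whose decompressed content starts with BAM\1, and SAM files are plain text. The Python sketch below illustrates that distinction. How the plugin or HTSJDK actually determines the type is not described here, so this is an independent illustration with a made-up function name.

```python
import gzip
import zlib


def sniff_alignment_type(data):
    """Guess "SAM", "BAM" or "CRAM" from the raw bytes of a file, else None."""
    if data[:4] == b"CRAM":
        return "CRAM"
    if data[:2] == b"\x1f\x8b":
        # gzip magic: decompress only the first four bytes (wbits=31 means gzip).
        try:
            start = zlib.decompressobj(31).decompress(data, 4)
        except zlib.error:
            return None
        return "BAM" if start == b"BAM\x01" else None
    first_line = data.split(b"\n", 1)[0]
    # SAM text: a header line starting with "@", or a record with 11+ columns.
    if first_line.startswith(b"@") or first_line.count(b"\t") >= 10:
        return "SAM"
    return None


# Synthetic inputs: plain gzip stands in for BGZF, which is gzip-compatible.
print(sniff_alignment_type(gzip.compress(b"BAM\x01")))
print(sniff_alignment_type(b"@HD\tVN:1.6\tSO:unsorted\n"))
```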
.getStatistics() method
The .getStatistics() method returns a Map structure containing several statistics for the input file.
This method accepts some additional options:
- include: this option takes a list of statistic names (the names in the returned map) and returns a map containing only these values.
- exclude: this option takes a list of statistic names and returns a map without these values.
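The effect of the two options can be pictured as a filter over the keys of the returned map. The Python sketch below shows that behaviour on a plain dictionary. It is an illustration only: the statistic names are hypothetical placeholders, and applying include before exclude when both are given is an assumption, since the combined behaviour is not specified above.

```python
def filter_statistics(stats, include=None, exclude=None):
    """Return a copy of `stats` restricted by include and/or exclude lists."""
    result = dict(stats)
    if include is not None:
        # Keep only the requested names.
        result = {name: value for name, value in result.items() if name in include}
    if exclude is not None:
        # Drop the unwanted names.
        result = {name: value for name, value in result.items() if name not in exclude}
    return result


# Hypothetical statistic names and values, for illustration only.
stats = {"readCount": 4, "meanReadLength": 113.5, "duplicateCount": 1}

print(filter_statistics(stats, include=["readCount"]))
print(filter_statistics(stats, exclude=["duplicateCount"]))
```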
Supplying an s3:// reference
The plugin also supports s3:// for the reference FASTA file. By default, nft-bam will look at the region us-east-1 unless another region is specified in the ~/.aws/credentials file or in system environment variables. Non-public buckets will need to have the correct permissions set.
Examples
Have a look at the plugin tests to see some example implementations.