Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The .bam file refers to a chromosome 'MT+' not present in the annotation (.gtf) file. #1545

Closed
hyjforesight opened this issue Nov 18, 2021 · 3 comments

Comments

@hyjforesight
Copy link

hyjforesight commented Nov 18, 2021

Are you using the latest version of samtools and HTSlib? If not, please specify.

(run samtools --version)
samtools 1.14

Please describe your environment.

  • OS (run uname -sr on Linux/Mac OS or wmic os get Caption, Version on Windows)
    Ubuntu 20.4 LTS
  • machine architecture (run uname -m on Linux/Mac OS or wmic os get OSArchitecture on Windows)
    x86_64
  • compiler (run gcc --version or clang --version)
    gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

Please specify the steps taken to generate the issue, the command you are running and the relevant output.

Hello samtools,
Sorry for reporting this bug. It's a little complicated.

I'm using velocyto.py to convert bam file to loom file.

  1. bam file contains single cell RNA-seq data, generated by CellRanger pipeline (10X) by mapping fastq to the reference genome (refdata-gex-mm10-2020-A/gene.gtf). In this gene.gtf file, the mitochondrial chromosome name is annotated as "chrM". See below.
    image
  2. loom file is the converted file.
  3. velocyto.py is a package to convert bam to loom. It calls samtools to align bam file to the reference genome, internally, when running.

The bug is:
When samtools is making up chromosomes, it can make up chromosomes 1-22 well, but cannot make up chromosome MT (see below), resulting in no mitochondrial genes information in the final loom files.
image

Because the bam file is generated from fastq by aligning to the reference "chrM", so the mito chromosome name inside of bam is chrM. Checked by below.
image

Also the mito chromosome name inside of gene.gtf is chrM.

So we're thinking that the bug may be caused by samtools using 'chrMT' to search the mito chromosome inside of bam to do alignment? So no chrMT is founded, leading to no mito genes alignment. We're writing to confirm whether our understanding is right?

Thanks!
Best,
YJ

Here is the illustration of my description.
image

@jkbonfield
Copy link
Contributor

Samtools is totally unaware of specifics of organism or reference sequences - it just uses the data presented to it in the header and binary file. If the header claims MT+ it's because that's what was given to the aligner. (I also recommend that you use the appropriate SQ header tags, such as DS description, UR and M5 fields to annotate references so there is clear data provenance, along with PG lines.)

What you're seeing in those "marking up" lines isn't Samtools output. I suspect it's your conversion script. It looks like that has a baked in notion of which reference is used and isn't honouring what the sam header contains.

It's possible you can rename references using samtools reheader, but I really wouldn't advise this unless you're 100% certain they're identical bar the name (so check lengths, and preferably MD5sums if present). In this case I suspect with MT+ and MT- that this wouldn't be possible.

Regardless, this looks to be a bug with something other than samtools itself, so closing the issue.

@daviesrob
Copy link
Member

This is definitely a velocyto.py problem. It renames chrM to MT, and for added confusion puts + or - to indicate the strand when reporting the missing entry in the .gtf file. So that explains where MT+/MT- come from.

You may be able to fix it by changing chrM to MT in your .gtf file, and possibly also in your reference fasta file so the program gets the name it expects. Unfortunately it looks like velocyto.py hasn't been touched for a while, so I'm afraid you may have to debug it yourself...

@hyjforesight
Copy link
Author

hyjforesight commented Nov 18, 2021

Hello samtools,
Yes, it's the BUG caused by velocyto.py.
Agree with @daviesrob. I solved it by changing naming of mito chromosome in reference genome and rmsk.gtf, and then remake the BAM file. Check this. velocyto-team/velocyto.py#318

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants