Entering edit mode

16 months ago

drowl1
▴
30

Hello!

It looks like the GATK forums site is down this whole week so I am unable to get an answer to a somewhat basic question.

I have generated several VCF files of individual haploid samples using GATK 4.1.7.0 and the variants have their allele frequencies as 1. I would like to calculate their empirical allele frequencies for subsequent plotting of their site frequency spectrum. How do I go about this?

Thanks in advance!

If I understood your request correctly, I'd merge VCF files and sum up 0 if the variant is in reference state (0/0), 1 if it is heterozygous (0/1) and 2 if it is homozygous. Important to not count (./.) variants since they were not called. Then divide by double number of samples where the variant was called. / should be replaced with | for phased data.

Hi German,

I actually want to see how the allele frequency spectrum looks like for every individual sample before any further processing and thus I wish to calculate the empirical alternate allele frequencies of the variants in the individual samples/VCF file.

For example, in Freebayes, one may use the AO and RO values to do so(AO/(AO + RO), but I cannot figure out the equivalent in GATK. This is what I want to know.

Hi @drowl1

in the results I have from GATK I have them in format GT:AD:DP:GQ:PL. So the 2nd column should contain depths from both alleles, I believe. So it should be something like 0/1:45,37:82:etc which shows you number of reads supporting reference (45) and alternative (37).

However it is not mandatory to have AD in the output of GATK, as far as I know, so if you don't have this field, you have to re-call your samples...

Well noted, German, and I now understand that I can use the Alt values in the AD and DP.

But going by GATK's definitions of annotations, whatever I get from this ( Alt value in AD/DP) is the allele fraction. It's great that you have worked with GATK so perhaps you may be able to explain to me if this value is the same as the alternate allele frequency?

These two terms are quite confusing.

Oh, this one I think is easy. DP contains one number - it is a coverage depth of this position. AD is number of reads supporting each variant - several numbers, divided by commas. Sum of AD should, ideally, be equal to DP (I guess does not work for indels but not sure about that). So I'd take a value from AD and divide it by the sum of AD - that's how you get allele frequency.

I also worked with FreeBayes - indeed, the name of fields is different there, but the principle is the same.

I see. In the case of GATK, most of the time, their DP doesn't match the sum of the AD. I am looking at haploid SNPs so, in the AD field, I only have two values. I read somewhere some time ago, can't remember the exact place, but the GATK team explain that the DP value may not sometimes match the sum of AD because the algorithm considers all the reads when computing the DP, while it only takes account of useful/informative reads when computing the AD values.

In my SNP set, this is true and the sum of AD is not equal to DP. for example:

GT:AD:DP:GQ:PL. 1:33,49:88:99:720,0 -AD ≠ DP

So in such a case, how would you get allele frequency?

Personally I would use 33 / ( 33 + 49 ). There are also multi-allelic sites - if e.g. one allele was not called, you will see that sum(AD) != DP. Same for errors in calling/mapping (maybe there are several soft-clipped reads - not sure if GATK counts them for DP). I would use AD and check the distribution - is the median of heterozygous variants AF will be much different from 0.47-0.49 - then there is a mistake and it has to be revisited.

Hi German,

Thanks a lot for your suggestions and explanation, much appreciated. In a previous post in the GATK forums, I have seen that the authors recommend using AD/DP to calculate empirical allele frequency so I guess I'll go with that.

The discussion is here: https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/2015-11-25-2015-07-16/6202-VCF-file-and-allele-frequency

Hi, I agree, in this case (as I understood from the description) DP is equal to sum of AD

and then of curse AD/DP is the right way as well as AD / sum AD