qPCRforum.com : Discuss GenEx and qPCR
Forum for discussion about GenEx and qPCR (quantitative real-time pcr)
Here we discuss the problem of missing data
Moderator: MultiD Support
6 posts • Page 1 of 1
I have a few questions regarding missing data. I believe that when data are missing within a set of technical replicates (for example, after applying a cut-off of Cq = 35), I can use the Missing Data button to fill the cell with the average of the replicates. Also, when the cut-off eliminates all replicates of a biological sample (i.e., gene expression is very low in a particular animal), I can enter "0" as the value to complete the pre-processing and then, in the Data Manager in GenEx, remove that sample from its dose group for the downstream statistics. Is this correct?
When should I apply the Grubbs test to identify outliers? (I am working with biological samples in replicates.) Should it be after all the pre-processing of replicates (qPCR, reference genes, RTs), when only the animal identifier and dose are left, so that the test identifies the biological outlier (the actual animal)? If I apply the Grubbs test before the pre-processing, it seems to identify only one of the technical replicates (which I would then fill in as missing data), and so it does not really identify a biological outlier. I don't believe I should then redo the Grubbs test, because of the increased error?
I followed the tutorials and the forum but found little information on the timing of the Grubbs test and on how to deal with biological variation.
You are asking two very important questions that are by no means trivial. Data can be missing for two very different reasons, which must be handled differently.
1) The reading is missing due to technical failure. In this case, the sample did contain target molecules but we failed to measure them. The correct approach is to replace this reading with the mean value of the technical replicates. In GenEx this can be done using the “Missing Data” function, but you can also leave the cell blank. GenEx will then automatically handle it as the mean of its technical replicates.
2) The reading is missing because the number of target molecules is too low. In this case, the sample contained fewer molecules than we can detect. It is not necessarily blank, but the number of targets is below the limit of detection (LOD) of our assay. Although it is hardly a catastrophe to replace it with the mean reading of the positive technical replicates, such handling introduces a bias. The reason is that if a sample contains very few molecules, a fraction of the technical replicates is expected to be negative. In fact, this is the principle used to determine the limit of detection of qPCR assays (see the GenEx manual for LOD). For example, let us say a sample contains on average 1 molecule per aliquot. If three aliquots are measured as technical replicates, they may contain one molecule each. But, due to random sampling (the so-called Poisson distribution), there may be two targets in one aliquot, one target in a second aliquot, and the third aliquot may be blank. The last one will not give a qPCR reading. If this “missing data” is replaced by the average of the two positive aliquots, the estimated target concentration in the sample is artificially increased! To avoid introducing this bias, GenEx offers an alternative way to handle missing data that is appropriate when data are missing due to too low a target concentration. The approach is to replace the missing data with the Cq measured at the LOD +1. If you have not determined the LOD of your assay, you can usually use the highest Cq measured for a truly positive sample +1. In GenEx “Missing Data” this option is called “Fill with column’s maximum +1”.
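The Poisson argument above can be checked with a few lines of arithmetic. The following Python sketch is only an illustration and is not part of GenEx; the function names are made up for this example, and it works in molecule counts rather than in Cq values.

```python
import math


def prob_blank(mean_molecules):
    """Poisson probability that an aliquot contains zero target molecules."""
    return math.exp(-mean_molecules)


def prob_any_blank(mean_molecules, replicates):
    """Probability that at least one of several independent aliquots is blank."""
    return 1.0 - (1.0 - prob_blank(mean_molecules)) ** replicates


def mean_of_positive_aliquots(mean_molecules):
    """Expected molecules per aliquot when blank aliquots are ignored.

    This is the mean of a zero-truncated Poisson distribution,
    lambda / (1 - exp(-lambda)), and is always larger than lambda.
    """
    return mean_molecules / (1.0 - prob_blank(mean_molecules))


if __name__ == "__main__":
    lam = 1.0  # average of 1 molecule per aliquot, as in the example above
    print(f"P(one aliquot is blank)        = {prob_blank(lam):.3f}")
    print(f"P(at least one of 3 is blank)  = {prob_any_blank(lam, 3):.3f}")
    print(f"mean over positive aliquots    = {mean_of_positive_aliquots(lam):.3f}")
```

With an average of 1 molecule per aliquot, about 37% of aliquots are expected to be blank, and averaging only the positive aliquots gives about 1.58 molecules per aliquot, i.e. an overestimate of roughly 58%.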
If you have large data sets and want to automate the process, go for option 1 (average of technical replicates). It introduces only a small (usually negligible) error if there are readings below the LOD. Using option 2 for failed reactions could introduce a large error.
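For readers who pre-process data outside GenEx, the two options could be sketched roughly as follows. This is not GenEx's implementation; the function names and the convention of marking a missing Cq with `None` are assumptions made for this example.

```python
def fill_with_replicate_mean(replicates):
    """Option 1: replace missing Cq values (None) with the mean of the
    technical replicates that were actually measured."""
    measured = [cq for cq in replicates if cq is not None]
    if not measured:
        raise ValueError("no measured replicate to average")
    mean_cq = sum(measured) / len(measured)
    return [mean_cq if cq is None else cq for cq in replicates]


def fill_with_column_max_plus(column, offset=1.0):
    """Option 2: replace missing Cq values (None) with the highest
    measured Cq in the column plus an offset (default +1)."""
    measured = [cq for cq in column if cq is not None]
    if not measured:
        raise ValueError("column has no measured Cq")
    fill = max(measured) + offset
    return [fill if cq is None else cq for cq in column]


if __name__ == "__main__":
    print(fill_with_replicate_mean([30.0, None, 31.0]))
    print(fill_with_column_max_plus([30.0, None, 34.0]))
```

Here `column` stands for all readings of one gene across samples, matching the "column's maximum" wording of the GenEx option.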
Any set of three or more replicates can be tested for outliers with Grubbs' test. It is straightforward. If you have a nested design, you perform an outlier test before each averaging of technical replicates (test for qPCR outliers; average qPCR replicates; test for RT outliers; average RT replicates; etc.). There is a risk in applying the standard Grubbs' test too extensively because of multiple-testing complications. If the test is performed at 95% confidence, there is a 5% probability that a normal sample will accidentally have a deviant reading and be counted as an outlier. If an outlier test is performed once, this is usually an acceptable error rate. However, when an outlier test is performed on every sample for each gene, and furthermore on several levels, the number of tests may be very large and several normal readings may be eliminated. The standard test is therefore not recommended for extensive use. GenEx offers a modified Grubbs' test, which requires that, in addition to fulfilling Grubbs' criterion (deviating from the mean by a number of standard deviations (SD) that depends on the number of replicates), the SD is larger than a predefined value. The default in GenEx is 0.25 cycles. This additional criterion removes most false outliers and so allows for multiple testing.
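The idea of combining Grubbs' criterion with a minimum SD can be sketched as below. This is a guess at the logic as described above, not GenEx's actual code; the function names are hypothetical. The sketch is restricted to triplicates, because with n = 3 the t distribution in the critical-value formula has one degree of freedom and its quantile has a closed form, so no statistics library is needed. Other replicate numbers would need a t quantile from such a library.

```python
import math
import statistics


def grubbs_critical_triplicate(alpha=0.05):
    """Two-sided Grubbs critical value for n = 3 replicates.

    Uses G_crit = (n-1)/sqrt(n) * sqrt(t^2 / (n-2+t^2)), where t is the
    upper alpha/(2n) quantile of the t distribution with n-2 degrees of
    freedom. For n-2 = 1 this is a Cauchy quantile: tan(pi * (p - 0.5)).
    """
    n = 3
    t = math.tan(math.pi * (0.5 - alpha / (2 * n)))
    return (n - 1) / math.sqrt(n) * math.sqrt(t * t / (n - 2 + t * t))


def modified_grubbs_triplicate(cqs, min_sd=0.25, alpha=0.05):
    """Return the index of the outlying replicate, or None.

    A replicate is flagged only if (a) the SD of the triplicate exceeds
    min_sd (in cycles) and (b) its Grubbs statistic exceeds the critical value.
    """
    if len(cqs) != 3:
        raise ValueError("this sketch handles triplicates only")
    sd = statistics.stdev(cqs)
    if sd <= min_sd:
        return None  # replicates agree well enough; skip the test
    mean = statistics.mean(cqs)
    deviations = [abs(cq - mean) for cq in cqs]
    suspect = max(range(3), key=deviations.__getitem__)
    g = deviations[suspect] / sd
    return suspect if g > grubbs_critical_triplicate(alpha) else None


if __name__ == "__main__":
    print(modified_grubbs_triplicate([24.0, 24.01, 26.0]))   # large SD, one deviant
    print(modified_grubbs_triplicate([24.0, 24.001, 24.2]))  # SD below 0.25 cycles
```

Note that for triplicates the standard criterion alone would flag the third reading in both example calls; only the minimum-SD requirement keeps the second, tightly grouped triplicate intact.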
Whether an outlier test should be performed on biological replicates depends on the situation and context. Biological systems often show wide variation, and by removing an extreme reading you may lose the most exciting sample. Also, Grubbs' test assumes a normal distribution. Technical replicates usually show a normal distribution (in Cq scale), but biological replicates may not. In GenEx, non-parametric tests are available to compare data that do not show a normal distribution. In most cases it is therefore advisable to keep all the biological replicates. One situation in which removal of a biological outlier can be considered is when rather large groups are compared and the data show a normal distribution. The outlier is then most likely real, and caused by a rare condition in a factor that the study does not address.
How should missing data in biological replicates in an array be handled? In array analysis there are often no technical replicates but only biological replicates, so there is no option to replace missing data on the basis of technical replicates. How can one solve this problem? Imagine two test groups to be compared (1 and 2), one of which contains several subgroups such as disease states (subgroups 11, 12 and 14 in test group 2), where subgroup 11 has missing data in variable 1 and subgroup 12 has missing data in variable 2. If, for example, the LOD CT cut-off is 27, you can't just replace the missing data with LOD +1. This wouldn't reflect the "true" values in this case and would lead to false conclusions about the mean of test group 2 and therefore about statistical significance. When a CT value has been measured for only one or two cases within a subgroup, as in the example, is it appropriate to replace the missing data with the mean of the subgroup or by interpolation? Should one leave the missing data blank, replace them with "0", or regard the variable as not detected in this subgroup because too few values were measured for it? I am not sure whether to replace these missing data or not. If leaving missing data in biological replicates blank is the right way, how can one handle missing data when there are redundant data and lots of variables? Inactivating a sample/case in the Data Manager is not appropriate, since there may be valid CT values for other variables in the same array.
Missing data are, for obvious reasons, a major problem in data analysis and should always be handled with great care to avoid reaching erroneous conclusions. As you point out, restoring missing data, no matter by what approach, influences the mean value and the standard deviation, both of which enter parametric tests. Leaving out the missing data is not correct either, because that will bias the mean towards higher expression and artificially narrow the standard deviation. The most appropriate way to treat data sets with missing data in univariate analysis (i.e., analyzing one gene at a time) is to use non-parametric statistics, such as the Mann-Whitney and Wilcoxon tests. The missing data should then be replaced with the highest Cq + offset. Note that the value of the offset does not influence non-parametric testing.
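That the offset does not matter for a rank-based test can be seen from the Mann-Whitney U statistic, which only counts how often a value in one group exceeds a value in the other. The sketch below is a plain-Python illustration with made-up Cq values and hypothetical function names; it computes only U, not the p-value, and assumes a positive offset so that imputed values are the highest in the column.

```python
def mann_whitney_u(group_a, group_b):
    """U statistic for group_a: each pair with a > b counts 1, each tie 0.5."""
    u = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def impute_highest_plus_offset(groups, offset):
    """Replace None with (highest measured Cq over all groups) + offset."""
    measured = [cq for group in groups for cq in group if cq is not None]
    fill = max(measured) + offset
    return [[fill if cq is None else cq for cq in group] for group in groups]


if __name__ == "__main__":
    # Made-up Cq values for one gene; None marks a missing reading.
    group_1 = [25.1, 26.0, None, 25.5]
    group_2 = [28.2, None, None, 29.0]
    for offset in (1.0, 4.0):
        g1, g2 = impute_highest_plus_offset([group_1, group_2], offset)
        print(offset, mann_whitney_u(g2, g1))
```

Because the imputed readings rank highest (and tie with each other) whatever positive offset is used, U is identical for +1 and +4.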
If the expression of many genes has been measured, multivariate methods are usually more appropriate than univariate ones: multiple-testing problems are avoided and correlation between the genes' expression is exploited. Most multivariate methods require a full data set, so missing data must be imputed. When using Cq + offset, the offset now becomes a weight in the statistics, reflecting the significance of a data point being missing. Based on experience, we have found that when analyzing rather homogeneous data a small offset (+1) is a good choice, since a missing data point is usually due to sampling ambiguity (i.e., the Poisson distribution). When distinct groups are compared, data for a particular gene being missing predominantly in some of the groups is probably biologically very significant. It should then be given a high weight, which is equivalent to using a large offset (e.g., +4).
Note that when working with offsets you can always test the impact of the offset on the results and conclusions by repeating the analysis with a somewhat different offset (e.g., +1 compared to +2, or +4 compared to +5). If changing the offset a little has a negligible influence on the results, you can be confident that the way you handled the missing data has not influenced your conclusions appreciably.
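One way to picture the offset acting as a weight is the Euclidean distance between two samples' expression profiles, a quantity that many clustering methods build on. The following sketch uses made-up numbers and hypothetical function names, and is not a GenEx procedure; it simply repeats the same distance calculation for several offsets, in the spirit of the sensitivity check described above.

```python
import math


def impute_profile(profile, highest_cqs, offset):
    """Replace None in a sample's profile with that gene's highest Cq + offset."""
    return [hi + offset if cq is None else cq
            for cq, hi in zip(profile, highest_cqs)]


def euclidean_distance(p, q):
    """Euclidean distance between two complete expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_by_offset(sample_a, sample_b, highest_cqs, offsets):
    """Distance between two samples for each candidate offset."""
    return {
        offset: euclidean_distance(
            impute_profile(sample_a, highest_cqs, offset),
            impute_profile(sample_b, highest_cqs, offset),
        )
        for offset in offsets
    }


if __name__ == "__main__":
    highest_cqs = [26.0, 32.0, 33.0]   # highest measured Cq per gene (made up)
    sample_a = [24.0, 27.0, 30.0]
    sample_b = [24.5, None, 30.5]      # gene 2 not detected in sample b
    for offset, dist in distance_by_offset(
            sample_a, sample_b, highest_cqs, [1, 2, 4, 5]).items():
        print(f"offset +{offset}: distance = {dist:.2f}")
```

In this toy case the missing gene dominates the distance, and the larger the offset, the further apart the two samples appear, so here the choice of offset clearly does matter.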
Thanks for your post! As you indicated in your answer, the best way to handle missing data in univariate analysis such as the Mann-Whitney test is to replace them with the highest Cq + a certain offset. But if you replace missing data in that way and then, during pre-processing, calculate RQ based on groups (e.g., with group 1 as the baseline), the calculated fold change is based on the mean, which in turn is based on the imputed data. So the missing data will have an effect on the fold change, or am I wrong? I want to point out that using off-scale data in univariate non-parametric testing can give significant signals (expression). But how can you compare such a significant signal, where the off-scale data do not influence the (non-parametric) statistics, with the fold change for the same signal, where the off-scale data will have an effect, as I mentioned before?
You are right on several points. Correction of off-scale data using an offset affects:
1. Calculation of fold change
2. p-value from t-test
and these calculations should not be performed on data that have been corrected.
It does not affect:
1. Calculation of p-value with non-parametric methods.
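The effect on fold change is easy to see with a toy calculation. The sketch below uses made-up Cq values and hypothetical function names; it assumes 100% PCR efficiency and, for brevity, skips reference-gene normalization, so it is not the full RQ pre-processing.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def fold_change(test_cqs, control_cqs):
    """Fold change of test relative to control from mean Cq values,
    assuming perfect doubling per cycle: 2 ** (mean control - mean test)."""
    return 2.0 ** (mean(control_cqs) - mean(test_cqs))


def impute(cqs, highest_cq, offset):
    """Replace None with the highest measured Cq + offset."""
    return [highest_cq + offset if cq is None else cq for cq in cqs]


if __name__ == "__main__":
    control = [25.0, 25.0, 25.0, 25.0]
    test = [27.0, 27.0, None, None]   # two readings off-scale
    highest_cq = 27.0                 # highest Cq measured for this gene
    for offset in (1.0, 4.0):
        fc = fold_change(impute(test, highest_cq, offset), control)
        print(f"offset +{offset:g}: fold change = {fc:.4f}")
```

With these numbers the fold change drops from about 0.18 with an offset of +1 to about 0.06 with +4, although the measured readings are identical in both cases.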
Hope this helps!