How Much Training Data Do I Need?
"How much training data do I need?"
The above question comes up quite a lot when using xrate. Here's a back-of-envelope calculation to guide such decisions.
The amount of data you need is determined by the "slowest event rate", i.e. the rate at which the slowest event occurs per site at equilibrium. For a rate matrix $R$ with equilibrium distribution $\pi$, the rate at which mutation $i \to j$ occurs at equilibrium is $\pi_i R_{ij}$, so the slowest event rate is $\lambda = \min_{i \neq j} \pi_i R_{ij}$.
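As a concrete sketch, here is how the slowest event rate could be computed for a made-up 4x4 DNA rate matrix (the frequencies and rates below are invented for illustration, not taken from any real model):

```python
# Hypothetical example: slowest event rate for a toy DNA rate matrix.
# pi is the equilibrium distribution; R is the rate matrix (rows sum to zero).
pi = [0.3, 0.2, 0.2, 0.3]                 # made-up frequencies for A, C, G, T
R = [[-0.90, 0.30, 0.40, 0.20],
     [ 0.45, -1.20, 0.25, 0.50],
     [ 0.60, 0.25, -1.15, 0.30],
     [ 0.20, 0.35, 0.20, -0.75]]

# Slowest event rate: minimum over off-diagonal entries of pi_i * R_ij,
# i.e. the equilibrium rate of the rarest single mutation.
slowest = min(pi[i] * R[i][j]
              for i in range(4) for j in range(4) if i != j)
print(slowest)
```

Here the rarest events are C→G and G→C, each occurring at equilibrium rate 0.05 per site.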
Suppose that $T$ is the total branch length of the tree, i.e. the total elapsed evolutionary time per site.
Let $n$ be the number of times you want to observe the slowest event, and let $N$ be the number of sites you'd have to train on to observe the slowest event $n$ times. Then the total amount of evolutionary time represented by your training data is $NT$, and you want $NT\lambda \geq n$, so the number of training sites you need is $N \geq n / (\lambda T)$.
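This bound is easy to turn into a back-of-envelope helper. The sketch below (function name and example numbers are invented for illustration) computes the minimum number of sites needed to expect n observations of the slowest event:

```python
import math

def training_sites_needed(n_events, slowest_rate, total_branch_length):
    """Minimum number of sites N such that N * T * lambda >= n,
    where lambda is the slowest event rate and T the total branch length."""
    return math.ceil(n_events / (slowest_rate * total_branch_length))

# e.g. to see the slowest event ~100 times, with slowest rate 0.05
# per site per unit time and total branch length 2.0:
print(training_sites_needed(100, 0.05, 2.0))  # → 1000
```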
(Note that the definition of a "site" depends on your chain: a site could be a single alignment column for a neutral DNA model, three columns for a codon model, or two for an RNA basepair model.)
How big should $n$ be? Assuming an uninformative prior: if you observe $n$ Poisson-distributed events in time $t$, then the posterior distribution for the underlying event rate is a gamma distribution with mean $n/t$ and variance $n/t^2$. Thus the fractional error, i.e. the ratio of the standard deviation to the mean, is $1/\sqrt{n}$. For a desired fractional error of $\epsilon$ or less, you need $n \geq 1/\epsilon^2$, so you should train on $N \geq 1/(\epsilon^2 \lambda T)$ sites.
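Putting the two steps together, a sketch of the full estimate in terms of the desired fractional error (again with invented example numbers):

```python
import math

def sites_for_fractional_error(eps, slowest_rate, total_branch_length):
    """Sites needed for a posterior fractional error of eps or less on the
    slowest rate: n >= 1/eps^2 observations, hence N >= 1/(eps^2 * lambda * T)."""
    n_events = 1.0 / eps ** 2
    return math.ceil(n_events / (slowest_rate * total_branch_length))

# e.g. a 10% fractional error requires ~100 observations of the slowest
# event; with lambda = 0.05 and T = 2.0 that means 1000 sites:
print(sites_for_fractional_error(0.1, 0.05, 2.0))  # → 1000
```

Note how steeply the requirement grows: halving the fractional error quadruples the number of sites.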
Of course the above is a circular argument: it assumes you know $\lambda$ ahead of time. In practice, while you may have some idea of what the slowest event rate will be (based on previous experience and data), any estimate you might have for $\lambda$ is of order-of-magnitude accuracy at best.
We can extend the above line of reasoning to parametric models (see the xgram format page for info). When evaluating the slowest event rate $\lambda$, we should allow for parametric chains where multiple mutations share the same rate parameter $p$. If an event rate $R_{ij}$ is some function of a rate parameter $p$, then $\pi_i R_{ij}$ gives the effective contribution of the mutation $i \to j$ to the observed event count for $p$. A better definition of $\lambda$ is therefore the minimum, over rate parameters $p$, of the total equilibrium rate of all mutations whose rates depend on $p$.
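One way to read this in code: group the equilibrium contributions $\pi_i R_{ij}$ by the parameter they inform, sum within each group, and take the minimum over groups. The parameter names and numbers below are invented for illustration:

```python
# Hypothetical sketch: effective slowest rate for a parametric chain.
# events maps each rate parameter to the list of (pi_i, R_ij) pairs for
# the mutations whose rates depend on that parameter.
events = {
    "transition":   [(0.30, 0.40), (0.20, 0.50)],   # made-up contributions
    "transversion": [(0.20, 0.25), (0.30, 0.20)],
}

# Effective lambda: minimum over parameters of the summed equilibrium
# rates of all events informing that parameter.
lam = min(sum(pi_i * r_ij for pi_i, r_ij in contribs)
          for contribs in events.values())
print(lam)
```

Since several events pool their counts for a shared parameter, this effective $\lambda$ is typically larger than the naive per-event minimum, so parameter sharing reduces the amount of training data required.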
-- Ian Holmes - 29 Sep 2006