Scaling to large numbers of features by lazily loading chunks of data
Applications: human SNP and EST tracks, reads from short-read sequencing, etc. — any data with large numbers of features.
When I first started thinking about lazy feature loading in JBrowse, I imagined taking a DAS-like approach: doing some kind of range query for features in the region being viewed. But there are two problems with that approach:
Problem 1: dynamic queries are difficult to cache
One of my goals for JBrowse is to make sure the client can cache feature data, so if the user goes back to a region they've seen before, they probably won't have to re-download the data. But if it's a dynamic query, then it's difficult for the server to generate appropriate HTTP caching-related headers, because many databases don't have a built-in mechanism for answering the question, "has this query result changed since <time/version>?".
Problem 2: features that span the region boundaries
Features that span a region boundary have to be transferred in the results for both regions, and the client has to take a code-size and performance hit to "de-duplicate" them. For example, suppose you start with a set of features and separate them into two regions: any feature that spans the region boundary shows up in both regions' results, and those duplicates are annoying to deal with.
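A rough sketch of the kind of client-side de-duplication that implies (the Feature shape and the use of a name as the feature's identity are just illustrative assumptions, not JBrowse's actual data model):

```typescript
// Illustrative only: merging per-region query results on the client.
// Assumes each feature carries some usable identity (here, a name),
// which isn't always true in practice.
interface Feature {
  start: number;
  end: number;
  name?: string;
}

function mergeRegionResults(regions: Feature[][]): Feature[] {
  const seen = new Set<string>();
  const merged: Feature[] = [];
  for (const regionFeatures of regions) {
    for (const f of regionFeatures) {
      const id = f.name ?? `${f.start}..${f.end}`;
      if (!seen.has(id)) {
        seen.add(id);
        merged.push(f);
      }
    }
  }
  return merged;
}
```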
Nested Containment Lists
In JBrowse, the client downloads "nested containment lists" of features. In a nested containment list, features which are "contained" within other features are put into a sublist (subtree) rooted at the containing feature. Then, at every level of the tree, there is no more containment (because the contained features are "hidden" within their containing features). As part of that process, features are sorted by their start position. The lack of containment means that, at each level of the tree, those features are also sorted on their end position. And if the features are sorted on both start and end, then range-overlap queries can be performed with a binary search. If a given feature has a sublist of features within it, the binary search is recursively performed on that sublist.
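Here's a minimal sketch of that structure and the overlap query, assuming half-open [start, end) intervals and reusing the illustrative Feature shape from above; this isn't JBrowse's actual code, just the idea:

```typescript
// Minimal NCList sketch (not JBrowse's actual implementation).
interface NCNode {
  feature: Feature;
  sublist: NCNode[]; // features contained within this feature
}

// Build: sort by start ascending, breaking ties by end descending, so any
// feature contained in an earlier one can be pushed into its sublist.
function buildNCList(features: Feature[]): NCNode[] {
  const sorted = [...features].sort(
    (a, b) => a.start - b.start || b.end - a.end
  );
  const top: NCNode[] = [];
  const stack: NCNode[] = [];
  for (const f of sorted) {
    const node: NCNode = { feature: f, sublist: [] };
    // Pop potential containers that don't actually contain f.
    while (stack.length && stack[stack.length - 1].feature.end < f.end) {
      stack.pop();
    }
    (stack.length ? stack[stack.length - 1].sublist : top).push(node);
    stack.push(node);
  }
  return top;
}

// Query: at each level the features are sorted on both start and end, so
// binary-search for the first feature ending after the query start, then
// scan forward until feature starts pass the query end, recursing into the
// sublists of overlapping features.
function overlapping(nodes: NCNode[], qStart: number, qEnd: number,
                     out: Feature[] = []): Feature[] {
  let lo = 0, hi = nodes.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (nodes[mid].feature.end > qStart) hi = mid; else lo = mid + 1;
  }
  for (let i = lo; i < nodes.length && nodes[i].feature.start < qEnd; i++) {
    out.push(nodes[i].feature);
    overlapping(nodes[i].sublist, qStart, qEnd, out);
  }
  return out;
}
```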
Lazy Nested Containment Lists
Given that tree structure, you could imagine performing lazy loading by splitting out subtrees as separate files. Then you could lazy-load a subtree whenever your binary search hit the containing feature.
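As a sketch of what that might look like on the client, building on the NCList sketch above (the chunk-URL field and JSON layout here are made up for illustration, not JBrowse's actual format):

```typescript
// Sketch of lazy subtree loading (hypothetical chunk format).
// A node can carry a URL instead of an inline sublist; the first query
// that reaches it fetches the subtree and memoizes the promise.
interface LazyNode {
  feature: Feature;
  sublist?: LazyNode[];
  sublistUrl?: string;           // set when the subtree lives in its own file
  pending?: Promise<LazyNode[]>; // memoized fetch
}

async function sublistOf(node: LazyNode): Promise<LazyNode[]> {
  if (!node.sublistUrl) return node.sublist ?? [];
  if (!node.pending) {
    // Plain static files, so the browser's HTTP cache can do its job.
    node.pending = fetch(node.sublistUrl).then(r => r.json());
  }
  return node.pending;
}

// Same binary-search overlap query as above, but it awaits the sublist
// fetch whenever the search hits a containing feature.
async function overlappingLazy(nodes: LazyNode[], qStart: number, qEnd: number,
                               out: Feature[] = []): Promise<Feature[]> {
  let lo = 0, hi = nodes.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (nodes[mid].feature.end > qStart) hi = mid; else lo = mid + 1;
  }
  for (let i = lo; i < nodes.length && nodes[i].feature.start < qEnd; i++) {
    out.push(nodes[i].feature);
    await overlappingLazy(await sublistOf(nodes[i]), qStart, qEnd, out);
  }
  return out;
}
```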
However, in general you can't expect there to be conveniently-placed container features that evenly split up the data. Most feature sets will have relatively little containment, especially if they're already split up by type as they are in JBrowse. Some feature sets won't have any containment at all.
Evenly-split Lazy Nested Containment Lists
But you could introduce fake features, placed so that they split up the set of features into even chunks. Then you could split out the NCLists rooted at those fake features into separate files.
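Roughly, the generation step could look something like this (chunk size, file naming, and JSON layout are all made up for illustration; buildNCList and Feature are from the sketches above):

```typescript
// Illustrative generation-side sketch (made-up file layout, not JBrowse's).
// Sort the features, cut them into fixed-size chunks, wrap each chunk under
// a fake feature spanning the chunk's extent, write each chunk's NCList to
// its own static file, and keep only the small list of fake features in the
// top-level index that the client downloads first.
import { writeFileSync } from "fs";

interface ChunkStub {
  feature: Feature;   // the fake containing feature
  sublistUrl: string; // where the chunk's NCList was written
}

function writeChunks(features: Feature[], chunkSize: number,
                     outDir: string): ChunkStub[] {
  const sorted = [...features].sort(
    (a, b) => a.start - b.start || b.end - a.end
  );
  const stubs: ChunkStub[] = [];
  for (let i = 0; i < sorted.length; i += chunkSize) {
    const chunk = sorted.slice(i, i + chunkSize);
    // The fake feature spans everything in its chunk, so it "contains" it all.
    const fake: Feature = {
      start: chunk[0].start,
      end: Math.max(...chunk.map(f => f.end)),
      name: `chunk-${i / chunkSize}`,
    };
    const sublistUrl = `${outDir}/${fake.name}.json`;
    writeFileSync(sublistUrl, JSON.stringify(buildNCList(chunk)));
    stubs.push({ feature: fake, sublistUrl });
  }
  return stubs;
}
```

Note that a long feature can make its chunk's fake feature extend past where the next chunk starts, so the fake features themselves can overlap or even nest; they can just be built into an NCList of their own at the top level.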
All boundary issues are handled by the regular NCList code, and there's no de-duplication necessary. And the httpd can easily generate appropriate caching headers for those separate files.
Edit 1/29/10: Looking back at the original paper, this might be what they were talking about in section 2.4, or very similar to it. I still don't completely understand what they're talking about in that section, though. -Mitch
- 22 Oct 2009