Home - this site is powered by TWiki(R)
JBrowse > LazyFeatureLoading
TWiki webs: Main | TWiki | Sandbox   Log In or Register

Changes | Index | Search | Go

Scaling to large numbers of features by lazily loading chunks of data

Applications: human SNP and EST tracks, short-read sequencing reads, etc. Any data with large numbers of features.

Range Queries

When I first started thinking about lazy feature loading in JBrowse, I imagined taking a DAS-like approach: doing some kind of range query for features in the region being viewed. But there are two problems with that approach:

Problem 1: dynamic queries are difficult to cache

One of my goals for JBrowse is to make sure the client can cache feature data, so if the user goes back to a region they've seen before, they probably won't have to re-download the data. But if it's a dynamic query, then it's difficult for the server to generate appropriate HTTP caching-related headers, because many databases don't have a built-in mechanism for answering the question, "has this query result changed since <time/version>?".

Problem 2: features that span the region boundaries

Features that span the region boundaries have to be transferred in the results for both regions, and the client has to take a code-size and performance hit to "de-duplicate" them. For example, suppose you start with this set of features:

features.png

And separate them into two regions:

boundaryspan.png

Then the red-colored features that span the region boundary are annoying to deal with.

Nested Containment Lists

In JBrowse, the client downloads "nested containment lists" of features. In a nested containment list, features which are "contained" within other features are put into a sublist (subtree) rooted at the containing feature:

nclist.png

Then, at every level of the tree, there is no more containment (because the contained features are "hidden" within their containing features). As part of that process, features are sorted by their start position. The lack of containment means that (at each level of the tree) those features are also sorted on their end position. And if the features are sorted on both start and end, then range overlap queries can be performed with a binary search. If a given feature has a sublist of features within it, the binary search is recursively performed on that sublist.

Lazy Nested Containment Lists

Given that tree structure, you could imagine performing lazy loading by splitting out subtrees as separate files:

nclist-lazy.png

Then you could lazy-load the subtree whenever your binary search hit the containing feature.

However, in general you can't expect there to be conveniently-placed container features that evenly split up the data. Most feature sets will have relatively little containment, especially if they're already split up by type as they are in JBrowse. Some feature sets won't have any containment at all.

Evenly-split Lazy Nested Containment Lists

But you could introduce fake features:

fakefeatures.png

placed so that they split up the set of features into even chunks. Then you could split out the NCLists rooted at those fake features into separate files:

lazyfakefeatures.png

All boundary issues are handled by the regular NCList code, and there's no de-duplication necessary. And the httpd can easily generate appropriate caching headers for those separate files.

Edit 1/29/10: Looking back at the original paper, this might be what they were talking about in section 2.4, or very similar to it. I still don't completely understand what they're talking about in that section, though. -Mitch

-- MitchSkinner - 22 Oct 2009

Edit | Attach | Print version | History: r19 < r18 < r17 < r16 < r15 | Backlinks | Raw View | Raw edit | More topic actions


Parents: WebHome
This site is powered by the TWiki collaboration platformCopyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback
TWiki Appliance - Powered by TurnKey Linux