Benchmarks of generate-tiles.pl on Drosophila melanogaster chromosome 4 on the SixtyFourBitCluster
It's 11:15PM, and all of the 208 rendering jobs launched yesterday are done, except for one (track 11, coding sequences at 100bp zoom level, about half done).
Today IanHolmes mentioned that he noticed some lag over the NFS on the cluster... is this lag hampering the rendering, since all the PNG tile files are written to lorien over the NFS? If so, then by how much? Lastly, what's causing it?
I'm going to answer the first two questions by benchmarking the re-rendering of track 8 (mRNA - in this case, fairly representative of a typical track), but storing the tiles locally this time and requiring no NFS use, as follows:
- at the 100bp zoom level, one generate-tiles.pl instance per node (i.e. one MySQL database connection to gdtile for storing primitives) - do this on sinclair and ivanova
- at the 500bp zoom level, one instance per node (i.e. one connection to gdtile) - do this on franklin and marcus
- at the 100bp and 500bp zoom level simultaneously, i.e. two instances (two connections to the same gdtile database) per node - do this on delenn and lennier
- at the 100bp and 500bp zoom level simultaneously, i.e. two instances (two connections, but to two different gdtile databases) per node - do this on londo and vir
Note that all of the above are dual-CPU nodes.
LATER NOTE: the two-process benchmarks that pair 100bp with 500bp are thrown out. The 500bp job finishes so much earlier than the 100bp job that the database sees two simultaneous connections for only a small fraction of the run, so the test says nothing useful about contention.
This will give us baseline numbers for how rendering performs when every operation is strictly local to its node. Then, when the last NFS-requiring job from yesterday completes and the NFS is free, we can run the same benchmarks over the NFS (one job at a time, then two, then four, and so on) to see whether performance drops and how any drop scales with the number of nodes (and CPUs) involved.
Another thought... all the jobs above ran with generate-tiles.pl -v 2 (the highest verbosity level), which dumps out a lot of debug info (this is a shakedown run, after all) that gets written over the NFS... could this be hampering the process? We'll control for that later...
Commands to run the four benchmarks above, respectively, are:
$ nohup bash -c 'date > /home/sgeadmin/dmel/local-output/START ; ./generate-tiles.pl -c /home/sgeadmin/dmel/ -o /home/sgeadmin/dmel/local-output/ -l 4 -p 0 -v 2 --print-tile-nums --no-xml -r t8z1r1-12817 &> /home/sgeadmin/dmel/local-output/out.8.1.local ; date > /home/sgeadmin/dmel/local-output/FINISH' &
and
$ nohup bash -c 'date > /home/sgeadmin/dmel/local-output/START ; ./generate-tiles.pl -c /home/sgeadmin/dmel/ -o /home/sgeadmin/dmel/local-output/ -l 4 -p 0 -v 2 --print-tile-nums --no-xml -r t8z2r1-2564 &> /home/sgeadmin/dmel/local-output/out.8.2.local ; date > /home/sgeadmin/dmel/local-output/FINISH' &
and
$ date > /home/sgeadmin/dmel/local-output/START
$ nohup bash -c './generate-tiles.pl -c /home/sgeadmin/dmel/ -o /home/sgeadmin/dmel/local-output/ -l 4 -p 0 -v 2 --print-tile-nums --no-xml -r t8z1r1-12817 &> /home/sgeadmin/dmel/local-output/out.8.1.local ; date > /home/sgeadmin/dmel/local-output/FINISH1' &
$ nohup bash -c './generate-tiles.pl -c /home/sgeadmin/dmel/ -o /home/sgeadmin/dmel/local-output/ -l 4 -p 0 -v 2 --print-tile-nums --no-xml -r t8z2r1-2564 &> /home/sgeadmin/dmel/local-output/out.8.2.local ; date > /home/sgeadmin/dmel/local-output/FINISH2' &
and (run from /home/sgeadmin/dmel/local_output1):
$ date > /home/sgeadmin/dmel/START
$ nohup bash -c './generate-tiles.pl -c /home/sgeadmin/dmel/ -o /home/sgeadmin/dmel/local-output1/ -l 4 -p 0 -v 2 --print-tile-nums --no-xml -r t8z1r1-12817 &> /home/sgeadmin/dmel/local-output1/out.8.1.local ; date > /home/sgeadmin/dmel/FINISH1' &
$ cd ../TiledImage2
$ nohup bash -c './generate-tiles.pl -c /home/sgeadmin/dmel/ -o /home/sgeadmin/dmel/local-output2/ -l 4 -p 0 -v 2 --print-tile-nums --no-xml -r t8z2r1-2564 &> /home/sgeadmin/dmel/local-output2/out.8.2.local ; date > /home/sgeadmin/dmel/FINISH2' &
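A note on the -r argument used in these commands. The values line up with the rest of this page in a suggestive way: t8 matches track 8, z1 and z2 match the 100bp and 500bp zoom levels, and the trailing ranges 1-12817 and 1-2564 match the tile counts for those zoom levels. The sketch below parses the argument under that reading. This interpretation is an assumption inferred from the numbers on this page, not something confirmed from the generate-tiles.pl documentation, and the names RangeSpec and parse_range_spec are made up for illustration.

```python
import re
from typing import NamedTuple


class RangeSpec(NamedTuple):
    """Assumed meaning of a -r value such as 't8z1r1-12817' (hypothetical type)."""
    track: int       # assumed: track number
    zoom: int        # assumed: zoom level index (1 = 100bp, 2 = 500bp on this page)
    first_tile: int  # assumed: first tile in the range
    last_tile: int   # assumed: last tile in the range

    @property
    def tile_count(self) -> int:
        # Inclusive range, so 1-12817 covers 12817 tiles.
        return self.last_tile - self.first_tile + 1


# Assumed layout: t<track>z<zoom>r<first>-<last>
_PATTERN = re.compile(r"t(\d+)z(\d+)r(\d+)-(\d+)")


def parse_range_spec(spec: str) -> RangeSpec:
    """Split a -r value into its four assumed integer fields."""
    match = _PATTERN.fullmatch(spec)
    if match is None:
        raise ValueError(f"unrecognized range spec: {spec!r}")
    return RangeSpec(*(int(group) for group in match.groups()))


for spec in ("t8z1r1-12817", "t8z2r1-2564"):
    parsed = parse_range_spec(spec)
    print(spec, "->", parsed, "tiles:", parsed.tile_count)
```

If the reading is right, the tile counts printed here should agree with the "tiles rendered" figures in the debug traces.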
-- AndrewUzilov - 08 Feb 2006
Some notes on the track 8 (mRNA) rendering benchmarks above for Drosophila melanogaster chromosome 4. These counts are taken from the rendering run over the NFS, but that doesn't matter: the primitive and tile quantities are the same no matter where you run generate-tiles.pl.
- 100bp zoom level debug trace contains:
  - 14,452 "Recorded" statements (i.e. the number of primitives stored in the database)
  - 13,061,860 "Replaying" statements (i.e. the number of primitives executed)
  - 12,817 tiles rendered
- 500bp zoom level debug trace contains:
  - 4,199 "Recorded" statements (primitives stored)
  - 2,613,479 "Replaying" statements (primitives executed)
  - 2,564 tiles rendered
Interesting... slow NFS or no slow NFS, no wonder the tile rendering step is taking so much time: we execute about 900 times as many primitives as are actually stored for the image at 100bp (about 620 times at 500bp), i.e. close to three orders of magnitude more!
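The ratios behind that observation can be checked directly from the trace counts. This is only a small arithmetic sketch; the counts are copied from the notes above, and the helper name replay_stats is made up for illustration.

```python
# Counts copied from the debug-trace notes above.
traces = {
    "100bp": {"recorded": 14_452, "replayed": 13_061_860, "tiles": 12_817},
    "500bp": {"recorded": 4_199, "replayed": 2_613_479, "tiles": 2_564},
}


def replay_stats(counts):
    """Return (replays per stored primitive, replays per rendered tile)."""
    per_primitive = counts["replayed"] / counts["recorded"]
    per_tile = counts["replayed"] / counts["tiles"]
    return per_primitive, per_tile


for zoom, counts in traces.items():
    per_primitive, per_tile = replay_stats(counts)
    print(f"{zoom}: {per_primitive:.0f} replays per stored primitive, "
          f"{per_tile:.0f} replays per tile")
# 100bp: 904 replays per stored primitive, 1019 replays per tile
# 500bp: 622 replays per stored primitive, 1019 replays per tile
```

Both zoom levels come out at about 1,019 replayed primitives per tile, which suggests the per-tile replay cost is roughly constant for this track and the total is driven by the number of tiles.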
-- AndrewUzilov - 09 Feb 2006
At 11:13AM cluster time, started a single NFS benchmark on franklin as follows:
$ date > /mnt/nfs/dmel/benchmark/START ; nohup ./generate-tiles.pl -c /home/sgeadmin/dmel/ -o /mnt/nfs/dmel/benchmark/ -l 4 -p 0 -v 0 --print-tile-nums --no-xml -r t8z1r1-12817 &> /mnt/nfs/dmel/benchmark/out.8.1 &
Note that I turned the verbosity off (-v 0); maybe that was slowing things down. I'll rerun this with the verbosity at full blast (-v 2) once the run above is done and the NFS is free again.
Additionally, I'm launching a local benchmark, one process per node, on gkar and morden (same benchmark, two nodes for comparison) to see how much verbosity slows down things, as follows:
$ date > /home/sgeadmin/dmel/local-output/START ; nohup ./generate-tiles.pl -c /home/sgeadmin/dmel/ -o /home/sgeadmin/dmel/local-output/ -l 4 -p 0 -v 0 --print-tile-nums --no-xml -r t8z1r1-12817 &> /home/sgeadmin/dmel/local-output/out.8.1.local &
(We'll see when it terminates by the "last modified" timestamp on the output file.)
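One way to turn those two timestamps into a run time is to subtract the modification time of the START file from the modification time of the output file. The sketch below does that. It assumes START was written once, immediately before launch, and that the output file was last written when the job ended; the function name elapsed_between is made up for illustration.

```python
import os
import sys
from datetime import timedelta


def elapsed_between(start_marker, output_file):
    """Elapsed wall-clock time from the marker's mtime to the output file's mtime."""
    seconds = os.path.getmtime(output_file) - os.path.getmtime(start_marker)
    return timedelta(seconds=round(seconds))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Example: python elapsed.py START out.8.1.local
    print(elapsed_between(sys.argv[1], sys.argv[2]))
```

If the job writes nothing to its output file near the end of the run, this underestimates the run time, so a FINISH marker written by date (as in the earlier commands) is the more reliable end point.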
-- AndrewUzilov - 09 Feb 2006
IanHolmes has proposed increasing the tile size... let's see how well we do with a tile width of 10,000 pixels instead of 1,000, on marcus:
$ date > /home/sgeadmin/dmel/local-output.large-tile/START ; nohup ./generate-tiles.pl -c /home/sgeadmin/dmel/ -o /home/sgeadmin/dmel/local-output.large-tile/ -l 4 -p 0 -v 2 --print-tile-nums --no-xml -r t8z1r1-12817 &> /home/sgeadmin/dmel/local-output.large-tile/out.8.1.local &
Also, started another benchmark on sinclair (now that the original is done) with MySQL's key buffer size raised to 512MB and its table cache to 1024MB (increased per ChrisMungall's suggestion), -v 2, local, the original 1000 pixel tile size.
And another one using Ian's updated TiledImage.pm on ivanova, but the original memory settings for MySQL, -v 2, local, original 1000 pixel tile size.
LATER NOTE: I screwed up the 10,000-pixel benchmark above and threw it out, but tried different possibilities for larger tiles to see how the rendering time scales... see below.
-- AndrewUzilov - 09 Feb 2006
Benchmark results (benchmarks started Wed, 8 Feb 2006)
(all on Dmel chromosome 4, track 8 - mRNA)
1000-pixel-wide tiles, default MySQL settings, no global_primitive table:
100bp, 1 process, local, verbose=0 (gkar) - 15 hr 39 min
100bp, 1 process, local, verbose=0 (morden) - 15 hr 38 min
100bp, 1 process, local, verbose=2 (sinclair) - 15 hr 42 min
100bp, 1 process, local, verbose=2 (ivanova) - 15 hr 46 min
100bp, 1 process, NFS, verbose=0 (franklin) - 16 hr 16 min
100bp, 1 process, NFS, verbose=2 (franklin) - 16 hr 14 min
100bp and 101bp, 2 processes, local, verbose=2 (londo) - 15 hr 20 min
100bp and 101bp, 2 processes, local, verbose=2 (vir) - 15 hr 19 min
500bp, 1 process, local, verbose=2 (franklin) - 46 min
500bp, 1 process, local, verbose=2 (marcus) - 46 min
1000-pixel-wide tiles, 512MB key buffer size and 1024MB MySQL table cache, no global_primitive table:
100bp, 1 process, local, verbose=2 (sinclair) - 15 hr 46 min
1000-pixel-wide tiles, default MySQL settings, using the new global_primitive table:
100bp, 1 process, local, verbose=2 (ivanova) - 15 hr 43 min
variable-size tiles (all 100bp, 1 process, local, verbose=2):
2000-pixel-wide tiles (marcus) - 8 hr 17 min
4000-pixel-wide tiles (lennier) - 4 hr 34 min
8000-pixel-wide tiles (lennier again) - 2 hr 36 min
16000-pixel-wide tiles (marcus again) - 1 hr 44 min
32000-pixel-wide tiles (marcus again... again) - 1 hr 20 min
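To see how the variable-size tile timings scale, they can be compared against a 1000-pixel run. The sketch below uses the 15 hr 42 min sinclair run (100bp, local, verbose=2, default MySQL settings) as the baseline. That choice of baseline is an assumption; the ivanova run at 15 hr 46 min would give nearly the same numbers.

```python
def minutes(hours, mins):
    """Convert an 'H hr M min' timing to minutes."""
    return hours * 60 + mins


BASELINE_WIDTH = 1000
BASELINE_MINUTES = minutes(15, 42)  # sinclair, 1000-pixel tiles (assumed baseline)

# (tile width in pixels, run time in minutes), copied from the results above.
runs = [
    (2000, minutes(8, 17)),
    (4000, minutes(4, 34)),
    (8000, minutes(2, 36)),
    (16000, minutes(1, 44)),
    (32000, minutes(1, 20)),
]


def speedups(baseline_minutes, timed_runs):
    """Return (width, speedup over the baseline) for each run."""
    return [(width, baseline_minutes / t) for width, t in timed_runs]


for width, speedup in speedups(BASELINE_MINUTES, runs):
    ideal = width / BASELINE_WIDTH  # speedup if time were inversely proportional to width
    print(f"{width:>6} px: {speedup:4.1f}x faster (ideal {ideal:.0f}x)")
#   2000 px:  1.9x faster (ideal 2x)
#   4000 px:  3.4x faster (ideal 4x)
#   8000 px:  6.0x faster (ideal 8x)
#  16000 px:  9.1x faster (ideal 16x)
#  32000 px: 11.8x faster (ideal 32x)
```

The first doubling nearly halves the run time, but each further doubling buys less, so the gain is not proportional to tile width across the whole range.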
Analysis
- Drastically increasing memory allocated to MySQL does nothing.
- Separating frequently used primitives (i.e. those with no bounding box, which will get applied to every tile) into a separate table does nothing.
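- Increasing tile size is the optimization of choice. Doubling the tile width from 1000 to 2000 pixels nearly halves the rendering time (15 hr 42 min down to 8 hr 17 min), but the gains taper off as the tiles grow: at 32000 pixels the run is about 12 times faster than at 1000 pixels, not 32 times. (TODO: I am going to keep pushing the size up to see where the optimum is.)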
- Running two rendering processes connected to the same database (see the 100bp and 101bp trials, which cache roughly the same number of primitives into the same database) does not slow the run down versus one process. Oddly enough, the paired runs finished about 20 minutes sooner than the single-process runs (maybe londo and vir are just slightly faster for some unknown reason).
- Outputting tiles over the NFS is slightly slower than doing it locally, by roughly half an hour on a 16-hour run (about 3 to 4%), which is not significant.
- The extremely verbose output does not significantly affect the rendering time: the verbose=2 runs took only a few minutes longer than the verbose=0 runs.
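The size of the small effects in the list above can be worked out from the results table. This is a sketch of the arithmetic only; averaging the two local runs in each pair is a choice made here for illustration, and percent_slower is a made-up helper name.

```python
def minutes(hours, mins):
    """Convert an 'H hr M min' timing to minutes."""
    return hours * 60 + mins


def percent_slower(candidate, reference):
    """How much longer candidate took than reference, in percent."""
    return 100 * (candidate - reference) / reference


# 100bp, 1 process, 1000-pixel tiles; timings copied from the results above.
local_quiet = (minutes(15, 39) + minutes(15, 38)) / 2    # gkar, morden (verbose=0)
local_verbose = (minutes(15, 42) + minutes(15, 46)) / 2  # sinclair, ivanova (verbose=2)
nfs_quiet = minutes(16, 16)                              # franklin (verbose=0)
nfs_verbose = minutes(16, 14)                            # franklin (verbose=2)

print(f"NFS vs local, verbose=0: {percent_slower(nfs_quiet, local_quiet):.1f}% slower")
print(f"NFS vs local, verbose=2: {percent_slower(nfs_verbose, local_verbose):.1f}% slower")
print(f"verbose=2 vs verbose=0, local: {percent_slower(local_verbose, local_quiet):.1f}% slower")
# NFS vs local, verbose=0: 4.0% slower
# NFS vs local, verbose=2: 3.2% slower
# verbose=2 vs verbose=0, local: 0.6% slower
```

With only one or two runs per configuration there is no way to estimate run-to-run variance, so differences of a few minutes should be read as noise rather than as measured effects.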
-- AndrewUzilov - 15 Feb 2006