How To Run Pipeline

---

How to run a genome screen

CAUTION: As the pipeline is continuously evolving, these instructions apply only to later incarnations, i.e. caf1screen_v12 and beyond. Older data can be hacked into the same form and made compatible, which is an ongoing (albeit lower-priority) process. [AVU 2007/09/17]

TODO: write an intro/high-level overview mentioning the most basic things.

Things to mention/stress:

  • more than one model can be used per screen (well, when I fix the "load hits" rule to have model name in it)

Check out the project

You must have an account on cvs.biowiki.org to do this. In theory it can be done anonymously, but... I don't know how. (TODO: find out)

To check out to the default directory (same as the project name), use:

cvs -d:ext:USERNAME@cvs.biowiki.org:/usr/local/cvs checkout 12fly-analysis

To check out the project to some other directory (e.g. hackathon), use:

cvs -d:ext:USERNAME@cvs.biowiki.org:/usr/local/cvs checkout -d hackathon 12fly-analysis

Update/The New Way

STUB

  • start your own CVS project
    • get makefile and env-*.bat from template in pipeline lib (TODO: add them, write a template with instructions)
  • get .cvsignore file from pipeline template (TODO: add it)
  • symlink to grammars dir, or start your own

Get source data into the database

STUBS:

  • use/add to Makefile.source and Makefile.load_db
  • if necessary, add to sge/SCREEN/links-to-segments rule in Makefile.sge

Write the model

STUBS:

  • goes in grammars/
  • make sure to put correct model name in the grammar file (xrate "name" modifier)
    • maybe safety-check this in the makefile? (a possible check is sketched below)
  • redirect to Rna Models
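
A possible shell-level safety check (hypothetical; this assumes the grammar declares its name as an S-expression like (name <model>), which may not match your grammar's exact syntax):

# hypothetical check: does grammars/${MODEL}.eg declare the expected model name?
grep -q "(name ${MODEL})" grammars/${MODEL}.eg \
  || { echo "ERROR: grammars/${MODEL}.eg does not declare model name ${MODEL}" >&2; exit 1; }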

Set up the environment

You will need to set and export some environment variables that control where the pipeline gets and puts data, what model it uses, and so on.

Use absolute paths when specifying paths.

TIP: To save yourself the effort of defining variables every time you log in, you can put their definitions in a file and source it before running any make commands (which also nicely aggregates a lot of screen-specific stuff in one file).

For example, the file env-caf1screen_v12.bat (included in the project as a sample) contains variable definitions for the screen caf1screen_v12.

Source it like this before running make (executing it as ./env-caf1screen_v12.bat would set the variables only in a subshell, not in your login shell):

. ./env-caf1screen_v12.bat

Required environment variables

If you do not set these, the pipeline will produce an error and refuse to run.

CAUTION: Many variables that don't specify paths are used to create database and table names. So, do not give them values containing characters that are illegal in database and table names. To be on the safe side, use only alphanumerics and underscores.

Of course, path variables (e.g. ORIG_SEGMENTS) can use whatever characters are valid on your system.

  • SCREEN
    • The name of your screen. This is purely for organizational purposes and can be anything you want. It usually controls names of directories and the database where your results will be stored (unless you override it with an optional variable, see below).
    • example values: caf1screen_v12, fungi00
  • SPECIES
    • The reference species. When we map annotations/hits from alignment to genomic coordinates, we will only do the mapping for this species.
    • example values: dmel, dana, coprinus_cinereus_3
  • MODEL
    • The name of the model used for the screen. We will look for the xrate grammar file for this model in grammars/${MODEL}.eg.
    • example values: ncRnaDualStrand, ncRnaDualStrand_v12
  • NULL_MODEL_FWD and NULL_MODEL_REV
    • Names of the forward- and reverse-strand null model states in the model grammar (they can have the same value if the null model is strand-symmetric). xrate dumps the lgInside score of every nonterminal when annotating a hit, and these variables control which nonterminal's score is used as the null model score when calculating the log likelihood ratio lg [P(model)/P(null model)].
    • example values: FWD_INTERGENIC, REV_INTERGENIC, LEFT_INTERGENIC
  • DB_HOST
    • Host running the database where data is stored.
    • example values: sheridan
  • ORIG_SEGMENTS
    • Path to directory containing alignment segments that we want to screen. We will recursively go through everything in this dir, looking for files with suffix .stock and making symlinks from sge/$SCREEN/in/ to them. You have to make sure there are no files with .stock suffix in $ORIG_SEGMENTS that you don't want to screen.
    • example values: /nfs/data/genome/fly/align/12fly/pecan/orig/, /nfs/data/genome/fly/align/12fly/mavid/
  • ALIGN_TYPE
    • What kind of alignments are contained in $ORIG_SEGMENTS. This will control how the rule links-to-segments in Makefile.sge behaves when symlinking.
    • example values: mavid, pecan (currently the only two supported values)
  • MERCATOR_DIR
    • Path to directory containing Mercator map, genomes, and (if any) .agp files, although you can override paths to all these explicitly using optional variables.
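
Putting it together, here is a minimal sketch of what such an env file might contain, using the example values from the list above (the real env-caf1screen_v12.bat may differ; the MERCATOR_DIR value is borrowed from the 12fly examples further down this page):

# sketch of an env-<screen>.bat file; source it, don't execute it
export SCREEN=caf1screen_v12
export SPECIES=dmel
export MODEL=ncRnaDualStrand_v12
export NULL_MODEL_FWD=FWD_INTERGENIC
export NULL_MODEL_REV=REV_INTERGENIC
export DB_HOST=sheridan
export ORIG_SEGMENTS=/nfs/data/genome/fly/align/12fly/pecan/orig/
export ALIGN_TYPE=pecan
export MERCATOR_DIR=/nfs/data/genome/fly/align/12fly/mercator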

Optional environment variables

If you do not explicitly specify these, they will be given sensible defaults (usually based on the values of required environment variables).

The point of having these variables is so that you can override a default by setting it in the environment. For example, if I'm not happy with the default value of $AGP_DIR, I can override it before running make:

export AGP_DIR=/some/other/dir
make ...
  • SOURCE_DB
    • Name of database storing source data that is independent of our screen and will probably not change (e.g. known annotations). It is convenient to have more-or-less static data be in a separate database from the one where you store your results, since there can be multiple results databases (usually one for each screen), and you don't want to load the same source data over and over again into each results database.
  • RESULTS_DB
    • Name of database storing our genome screen results.
  • DUMP_DB
    • Name of database containing tables that rules in Makefile.gff will dump to GFF. This variable is purely for the purpose of controlling which database you want to use to make a GFF file (e.g. to send to another lab) and shouldn't actually have any effect on a genome screen.
  • MERCATOR_MAP_FILE, MERCATOR_GENOMES_FILE, and AGP_DIR
    • Paths to the Mercator "map" and "genomes" files and the .agp file dir. Normally "map" and "genomes" are ${MERCATOR_DIR}/map and ${MERCATOR_DIR}/genomes and ${AGP_DIR} is the same as ${MERCATOR_DIR} (these are actually the defaults), but you can override them if necessary.
    • example values: /nfs/projects/hackathon/source/jason/alignments/{map,genomes} (from the fungal mini-hackathon, Fungal Genomes)
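
Shell-style, the defaulting behaves roughly like this (the Mercator/AGP lines restate the defaults given above; the RESULTS_DB line is a hypothetical illustration, not necessarily the pipeline's actual default):

export MERCATOR_MAP_FILE=${MERCATOR_MAP_FILE:-${MERCATOR_DIR}/map}
export MERCATOR_GENOMES_FILE=${MERCATOR_GENOMES_FILE:-${MERCATOR_DIR}/genomes}
export AGP_DIR=${AGP_DIR:-${MERCATOR_DIR}}
export RESULTS_DB=${RESULTS_DB:-${SCREEN}}   # hypothetical default, for illustration only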

Create and test job(s) for SGE

STUBS:

  • add .xrate rule to Makefile.sge, locally test it

Run the screen

CAUTION: Always run make commands in the same directory as the makefiles are in, since they are heavily dependent on local paths. Do not use make -f ..., it won't work.

So, before starting any screen you should of course

cd PROJECT_DIR_WITH_MAKEFILES

CAUTION: All messages are dumped to stdout/stderr and not saved to log files. It's up to you to redirect them there, e.g.

make screen > ${SCREEN}.out 2>${SCREEN}.err

then watch the progress (in other windows) using

tail -f ${SCREEN}.out

tail -f ${SCREEN}.err

Or, you can use the tee command (though this way you watch stdout only), e.g.

make screen 2>${SCREEN}.err | tee ${SCREEN}.out
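
If you want both streams in one file while still watching the output, merge stderr into stdout first (note that the two streams will be interleaved):

make screen 2>&1 | tee ${SCREEN}.log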

Completely automated

The following make command will automatically figure out which pipeline stages are complete and which need to be re-run. You may be given "yes/no" prompts at several steps (e.g. creating new tables/overwriting old tables):

make screen

12FLY-ONLY NOTE: For dmel, use

make screen-caf1-dmel

instead, since it requires some dmel-specific hacking to work. That rule is defined in the caf1screen CVS project.

Manually do each step

CAUTION: Don't hit Tab to expand/auto-complete paths that use environment variables (e.g. sge/$SCREEN/run). They will usually expand to the absolute path, which the makefiles are unprepared for (the makefiles expect paths relative to their own location).

TIP: Use make -n before running a rule to see what commands the rule would execute without actually running them. Check that they look sensible before executing the rule.
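
For example, to preview what the segment-linking step (next section) would do:

make -n checkpoints/${SCREEN}/segments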

Get alignments into a standard directory structure

TODO: comment, esp. about adding overriding segments rule

make checkpoints/${SCREEN}/segments

Run SGE jobs

TODO: comment

make sge/${SCREEN}/run

Load SGE results into database

TODO: comment

make checkpoints/${SCREEN}/hits-and-windows-loaded-${MODEL}

Map hits to genomic coordinates

You will have to do this for each species you're interested in, changing the value of the SPECIES environment variable before each execution.

mkdir gff/${SCREEN}
make gff/${SCREEN}/hitsGenomic_${MODEL}.${SPECIES}.gff
make checkpoints/${SCREEN}/tables/hitsGenomic_${MODEL}_${SPECIES}
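
For example, a minimal loop over reference species (the species list here is just illustrative):

for SPECIES in dmel dana; do
    export SPECIES
    make gff/${SCREEN}/hitsGenomic_${MODEL}.${SPECIES}.gff
    make checkpoints/${SCREEN}/tables/hitsGenomic_${MODEL}_${SPECIES}
done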

These rules actually dump database contents to GFF, remap them using featurevole.pl (see Mercator Perl), and load the remapped hits back into the database. The reason for this runaround is that it seemed easier to re-use the existing featurevole.pl (which was written to work on GFF files) rather than adapt it to work with a database.

(In retrospect, that assumption was false, but at least I wrote a lot of handy Perl on the way, e.g. dump-to-gff.pl, a generic database-to-GFF dumper. [AVU])

It is OK to lose hits (TODO: put dmel statistics on how much can be OK to lose) when doing the alignment-to-genomic conversion because some alignment intervals will map to all gaps in the genomic sequence and featurevole.pl will throw them out.

12FLY-ONLY NOTE: For dmel, you should use the following rule:

make checkpoints/${SCREEN}/tables/hitsGenomic_${MODEL}_${SPECIES}_r5

which creates two tables of hits:

  • _r4 - release 4 coordinates used by CAF1
  • _r5 - release 5 coordinates, used by the latest FlyBase annotation (which is significantly better than the release 4 annotation)
    • this table may contain fewer hits, since not all release 4 sequence can be remapped to release 5, for reasons I am not yet sure about; however, spot-checks show my code is doing the mapping correctly (TODO: get to the bottom of this [AVU])

TODO: comment on Mavid heterochromatin problem.

Analyze/post-process the results

Rank the hits

make checkpoints/${SCREEN}/tables/hitRank_${MODEL}

Make filter tables

(TODO) describe:

  • what they are for (can probably copy-and-paste from an old email)
  • add make rule syntax for each of the following columns

filterAlign columns

  • id
  • lgOdds
  • rank
  • length
    • length of SS annotation, in alignment columns
  • bp
    • number of base pairs
  • covar
    • number of alignment columns showing covariant mutations (as determined by colorstock.pl)
  • seqs
    • number of sequences with at least 1 nt of sequence (i.e. not 100% gaps) in the hit's multiple alignment

filterGenomic columns

See also: some MySQL tips on filling these columns

  • id
  • length
    • length from genomic coordinates (i.e. gaps removed)
  • ntInKnown
    • number of hit nucleotides overlapping known noncoding RNA(s)
    • 12FLY-ONLY NOTE: for flies, it is FlyBase {nc,micro,sno,sn,t,r}RNA annotations
  • ntInClone
    • number of hit nucleotides overlapping a sequenced/located clone (cDNA, EST, etc.)
    • this does not include microarray data! that's a separate column
  • ntInExon
    • number of hit nucleotides overlapping exon(s)
  • ntInIntron
    • number of hit nucleotides overlapping intron(s)
  • ntInUTR5
    • number of hit nucleotides overlapping 5' UTR(s)
  • ntInUTR3
    • number of hit nucleotides overlapping 3' UTR(s)
  • repeat
    • boolean: is this a repeat?
  • gaps
    • number of gaps in the sequence
  • gc
    •  % GC content
  • ntInAffy (12FLY/dmel ONLY)
    • number of hit nucleotides overlapping Affy transfrag(s)
    • CAUTION:
      • It is very frequent that transfrags from different time points cover the same genomic region. Thus, although a hit may overlap multiple transfrags, it overlaps that genomic region only once, so the nt count will be inflated. Just because a hit overlaps 300 transfrag nt's does not mean it overlaps 300 transcribed genomic nt's!
        • Due to this, you can't exactly tell what fraction of a hit got transcribed, so don't use that as a filter measure! Maybe we should average "fraction transcribed" over timepoints, or just have a binary value and tag only 100% transcribed hits as good.
        • This is not intuitive. We should remove multiple overlaps of genomic nucleotides, but that's much harder.
      • We disregard strand info when computing overlap because Affy's method for this set doesn't assign strands.

CAUTION: The "ntInXXX" columns store the sum total of overlap, e.g. if 10 nt in the 5' part of the hit overlap exon 1 and 20 nt in the 3' part of the hit overlap exon 2, then the column will store 30 nt. You can't say from looking at the value where the overlap came from - you have to look at the corresponding isect table for that. You can, however, easily answer the question of whether there was any overlap at all.

CAUTION: When overlap is calculated, strand information is not considered! (E.g. the intervals [100,199,+] and [150,249,-] overlap.)

Using database tables to answer questions (with SQL examples)

Here are some sample questions you may want to ask about the screen results and the SQL to answer them. We primarily use filter tables, but sometimes isect tables.

For this exercise, we'll use the database caf1screen_v12 on host sheridan, looking at results produced by model ncRnaDualStrand_v12 for reference genome dmel.

Log in as root to circumvent some permission issues we're having:

ssh sheridan
mysql --user=root
# MySQL console opens, there is no password
USE caf1screen_v12;
# enter your SQL query

How many high-ranking hits overlap known ncRNAs?

Let's define "high-ranking" to mean the top 10000 hits by log-odds score, not using any other filtering.

SELECT COUNT(*) FROM filterGenomic_ncRnaDualStrand_v12_dmel WHERE ntInKnown > 0 AND rank <= 10000;

I want to get a list of quality hits and their colorized alignments

Let's define "quality hits" to be hits that are:

  • at least 20 nt long in the reference genome
  • don't overlap a known ncRNA
  • in the top 1% by log-odds score
  • have at least 5 base pairs
  • have at least 2 columns with covariant base pairs
  • appear in at least 3 sequences

The first two items require the filterGenomic table, the others require the filterAlign table. We can query both tables using subqueries.

First, however, let's see how many hits we have:

SELECT COUNT(*) FROM filterAlign_ncRnaDualStrand_v12;
# returns count of 2163360

So, the top 1% would be rank <= 21633.

Now, let's get a list of hit IDs fitting our "quality hit" criteria (warning: this takes about 3-5 minutes):

SELECT id FROM filterAlign_ncRnaDualStrand_v12 WHERE id IN 
 (SELECT id FROM filterGenomic_ncRnaDualStrand_v12_dmel WHERE length >= 20 AND ntInKnown = 0)
AND rank <= 21633 AND bp >= 5 AND covar >= 2 AND seqs >= 3;

To get colorized alignments, we will have to get the coordinates of these hits, dump them to GFF, and run them through get-hit-alignments.pl and colorstock.pl.

First, let's get the alignment coordinates of these hits and put them in a temporary table tmp (warning: this takes about 7-8 minutes):

CREATE TABLE tmp SELECT * FROM hitsAlign_ncRnaDualStrand_v12 WHERE id IN
 (SELECT id FROM filterAlign_ncRnaDualStrand_v12 WHERE id IN
  (SELECT id FROM filterGenomic_ncRnaDualStrand_v12_dmel WHERE length >= 20 AND ntInKnown = 0)
 AND rank <= 21633 AND bp >= 5 AND covar >= 2 AND seqs >= 3);

Now, we dump that table to GFF using dump-to-gff.pl (ignore the strand, score, etc. since we don't need them downstream):

/nfs/projects/pipeline/perl/dump-to-gff.pl -h sheridan -d caf1screen_v12 -t tmp --seqid segment --start start --end end --sql "order by lgOdds desc" > tmp.gff

Lastly, we run the GFF through get-hit-alignments.pl (which extracts the hits from the very windows they were in) and colorstock.pl to make it pretty:

cat tmp.gff | /nfs/projects/pipeline/perl/get-hit-alignments.pl /nfs/projects/caf1screen/sge/caf1screen_v12/out/ ncRnaDualStrand_v12.xrate | /nfs/src/dart/perl/colorstock.pl -inv DroMel

CAUTION: Remember, you are going to get the alignment/sequences of the hit as they appear in the window, not as they appear in the genome(s). So if your #=GC STRAND annotation is on the minus strand, you have to revcomp the alignment to get the sequences as they would appear in the genome. If #=GC STRAND is the plus strand, do nothing.

Remember this if you are ever spot-checking your output by BLASTing against FlyBase!

Voila! Note that the #=GF ID field in the Stockholm output contains the unique ID of the hit, so you can match it against database tables.
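
For example, if the #=GF ID is 2477_324257_324305_- (the hit ID format also used by the dump-hit rule in the next section), a lookup might go like this (a sketch; it assumes the hitsAlign table's id column uses the same format):

# run on sheridan (see the root-login notes above)
mysql --user=root -e "SELECT * FROM hitsAlign_ncRnaDualStrand_v12 WHERE id = '2477_324257_324305_-'" caf1screen_v12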

Protocol for spot-checking final prediction set

It's useful to spot-check your results by eye. Here is one way to do it (example specific to caf1screen_v12, species dmel).

  1. Pick a random hit from your "best hits table," e.g.:
    • SELECT * FROM top_hitsAlign_ncRnaDualStrand_v12 ORDER BY RAND(NOW()) LIMIT 1;
  2. Extract its annotated and colorized alignment from its window using make dump-hit-<hitID>, e.g.:
    • make dump-hit-2477_324257_324305_-
  3. Admire it or cringe in disgust.
  4. Look up its genomic coordinates in hitsGenomic_ncRnaDualStrand_v12_dmel.
    • Use SELECT * FROM caf1screen.map_dmel_scaffold_names to get the conversion chart for chromosome identifiers; you'll need it.
  5. BLAST the hit sequence against dmel r5.3 in FlyBase.
  6. Verify that our genomic coordinates are the same as what FlyBase reports for the hit sequence (to make sure all the ad nauseam mappings went correctly).
  7. Follow the link from the BLAST result page to look up the hit in the genome browser with every track open (fortunately these settings are cached somewhere, so you only need to set them once) and ensure it really doesn't overlap anything known (this is to make sure our filters are really filtering).
    • Disregard tiling BAC, chromosome band, and other such tiling/"global" tracks.
  8. Look up the hit, using genomic coordinates, in the UCSC genome browser, paying particular attention to the conservation track and the RepeatMasker track.

If you want to just kick back and watch random hits scroll by in their colorized glory, picking out your favorites as you go along, I found this to be quite pleasing:

/nfs/projects/pipeline/perl/dump-to-gff.pl -h sheridan -d caf1screen_v12 -t top_hitsAlign_ncRnaDualStrand_v12 --seqid segment --start start --end end --sql "ORDER BY RAND(NOW())" \
 | /nfs/projects/pipeline/perl/get-hit-alignments.pl /nfs/projects/caf1screen/sge/caf1screen_v12/out/ ncRnaDualStrand_v12.xrate 2>/dev/null \
 | /nfs/src/dart/perl/colorstock.pl -cols 160 -less -inv DroMel

Selecting top 100 hits by eye

The details (specific to caf1screen_v12/ncRnaDualStrand_v12) have been moved here: http://biowiki.org/TwelveFlyScreen#Selecting_hits_for_experimental

TODO: the above-linked procedure needs to be made generic and moved to Makefiles and standalone, commented Perl (instead of command-line one-liner Perl).

---

Other stuff - pipeline design/adding stuff to it

Design philosophy

STUBS:

  • recursive make vs. dependencies
  • checkpoints (database and other)
  • use of environment variables (e.g. SPECIES, SCREEN) to circumvent the one-stem-only restriction (that is, without complicated string fiddling of $*, which you can't do in a dependency anyway)
  • local vs absolute paths
    • local used for targets, and otherwise wherever possible
    • absolute used for nonlocal execution (e.g. SGE test rule), libs, etc.
  • makefile hashes
  • what's the point of having alignment segments in sge/SCREEN/in/ instead of source/?
  • screen numbering (increment only when source data changes, to avoid wasting space by duplicating sge/SCREEN/in/)
  • advice:
    • avoid using % like the plague; make often seems unable to figure out dependencies when % is involved (e.g. table-existence dependencies), and using env vars instead works
  • see also:

Including the pipeline as a "library" - how to do it?

TODO: this is just a note-taking rant for now...

It is desirable to use the pipeline as a generic "library" that can be included in any arbitrary project (e.g. a set of screens of a particular dataset, like CAF1).

Desired goals:

  1. project makefiles can override rules in pipeline makefiles
  2. pipeline "library" variables accessible by project makefiles
  3. pipeline "library" safety checks inherited by project makefiles
  4. pipeline "library" can be used for satisfying dependencies in project makefiles

Our options:

  1. Using "include" (must be at the bottom of project's root makefile)
    • Pros
      • Satisfies goal 1 for pattern rules
      • Satisfies goals 2 and 3 completely
      • Satisfies goal 4 (at least in theory; make dependencies seem to be flaky, at least in my hands [AVU])
    • Cons
      • Fails to satisfy goal 1 for explicit rules
        • Produces warnings (but we can live with those)
        • Matches the target in the most recently included file (i.e. a pipeline file), overriding project makefiles (this is BAD; you want project makefiles to override pipeline makefiles, instead)
        • But maybe this is not a common case?

  2. Using a catch-all rule with recursive make (also must be at the bottom of the project's root makefile)
    • Pros
      • Satisfies goal 1 for all rules; recommended by GNU Make manual (Overriding Makefiles)
      • Satisfies goal 4 (even for pattern rules and dependencies containing stems)
    • Cons
      • Fails to satisfy goal 2 (but we can do without pipeline variables, if necessary)
      • Fails to satisfy goal 3 (this is BAD; loss of ifndef checks could lead to much foot-shooting, as many errors occur due to undef variables silently expanding to 0-length strings)
      • You have to be very careful to export all variables defined in project makefiles, since recursive make will not inherit them
      • The GNU Make manual example seems (for reasons I am struggling to identify [AVU]) to cause execution of make MAKEFILE for every included makefile via the catch-all rule. This doesn't seem to affect anything (you get a "nothing to be done" message, since the makefiles are there already and I don't think there are any dependencies in the phantom rule causing this), but it leads to a confusing preamble in the output.
      • VERY BAD: cannot use the project makefile to override rules for making dependencies in the pipeline makefile
        • this alone is sufficient reason not to use catch-alls for including pipeline libs at all; even the 12fly screen depends on alternate ways to satisfy dependencies, so it would fail there

  3. A combination of the two
    • Use include for satisfying goals 2 and 3 (pipeline/Makefile.defs), and catch-all for goals 1 and 4 (pipeline/Makefile.root)

Data version numbers - where to put them?

e.g. dmel release 4 versus 5

We have tables with annotations and hit predictions - where do we store what release they are in?

Possibilities:

  1. table name - probably the best
    • Pros: easy to write dependencies on a particular version number
    • Cons: harder to decompose a table name (e.g. hitsGenomic_MODEL_SPECIES_VERSION) into components using "make" string functions
      • hmm, but if we store the version number in an env var, e.g. $(VERSION), then it becomes very easy: hitsGenomic_MODEL_SPECIES_VERSION becomes hitsGenomic_$(MODEL)_$(SPECIES)_$(VERSION), and the decomposition is done for you already!
  2. table column
    • Pros:
    • Cons: wasteful; you're storing the same data over and over in each row
  3. table comment
    • Pros: the cleanest and tersest
    • Cons: if a makefile rule needs to find out what version the data is in, you have to use trickery (e.g. cat a query to MySQL to get the version number, save it in an env var... oy) that is very annoying

Copied from pipeline/Makefile.sge

This used to be in pipeline/Makefile.sge's header, but I'm moving it here because all documentation should be in one place - HERE.

TODO: I have not read this in a while... it might be outdated.

Jobs are submitted to SGE using something like:

make sge/subdir/somejob.sge-done

The .sge-done rule is a fancy wrapper around a 'qsub' command that submits another makefile target (usually .xrate) that will actually specify what the job must do.

If the job is successfully submitted to the SGE queue, a .sge-jid file is created in 'subdir' storing the job ID. When the job starts running, this file will also contain the name of the host it's running on.

When the job is done, a .sge-done checkpoint file is created in 'subdir' by whatever makefile rule was actually submitted (e.g. at the end of an .xrate target rule).

Output is generally written to local disk, then copied to NFS when the job is done, to (potentially) reduce NFS overhead from many small writes (replacing them with an occasional large write).

If you try to resubmit the job, we will first check if the .sge-done checkpoint exists - if it does, the job must have already been run, and make will respond with "target is up to date" and refuse to run your job. So, delete ALL the .sge-* files for the job if you REALLY want to re-run it.
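
For example, to force a re-run of a single job (file names here follow the somejob example above):

rm -f sge/subdir/somejob.sge-*
make sge/subdir/somejob.sge-done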

If the .sge-done file does not exist, but the .sge-jid file does, we will use the job ID in .sge-jid to see if the job is in the queue - if it is, we will refuse to submit the job, since it's been scheduled already.

Submitting multiple jobs (e.g. all jobs in 'subdir') is accomplished with wrappers like:

make sge/subdir/run

where 'run' is a phony target that is actually a wrapper around a loop doing recursive calls to 'make sge/subdir/somejob.sge-done'.
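
Conceptually, the 'run' wrapper behaves something like this shell loop (a sketch only; the real rule lives in Makefile.sge, and how it discovers job names may differ):

# hypothetical: one .sge-done target per input segment
for seg in sge/subdir/in/*.stock; do
    make "sge/subdir/$(basename "$seg" .stock).sge-done"
done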

Conventions/general notes

  • Left/start coord always less than/equal to right/end coord, regardless of strand.
  • 1-based inclusive coords used everywhere, unless noted otherwise.
  • CLEAN UP: Each SQL file should only contain one "create table" definition. You can have multiple SQL statements that "do stuff" to the table (including hardcoding stuff that's in it, a la the fly scaffold name mapping stuff), but do NOT create more than one table: init-table.pl will behave unpredictably if you do. (TODO: maybe a better solution would be for init-table.pl to check for this and die if non-compliant; a quick shell check along those lines is sketched below.)
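
A quick shell check along those lines (hypothetical; per the TODO above, init-table.pl does not do this yet):

# warn about any SQL file defining more than one table
for f in sql/*.sql; do
    n=$(grep -ci 'create table' "$f")
    [ "$n" -le 1 ] || echo "WARNING: $f contains $n CREATE TABLE statements" >&2
done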

Nomenclature (draft)

How to read my nomenclature shorthand:

  • everything is a literal, except words in <angle brackets>, which denote a variable
    • TODO: explain variable values for all conventions below
  • {} denotes "choose one from this set"
  • stuff in [] is optional

Database tables (TODO: explain each table "type", note that all tables must begin with type name followed by underscore, because SQL file naming conventions depend on this):

  • {goodhits,hits}_<model>[_<species>[_<rev>]]
  • isect_<table1>_VS_<table2>
    • isect tables should be allowed to violate naming convention (because of 64-char maximum on table names, ugh) - make sure this doesn't screw anything else!
  • known_<data>[_<rev>]
  • map_<species>_<old coords>_to_<new coords>
  • Oddball tables: there better not be any when I'm through with this!
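
For instance, names fitting these patterns (illustrative only; known_flybase_r5 is hypothetical, the others echo examples used elsewhere on this page):

hits_ncRnaDualStrand_v12            # hits_<model>
hits_ncRnaDualStrand_v12_dmel_r5    # hits_<model>_<species>_<rev>
known_flybase_r5                    # known_<data>_<rev>
map_dmel_r4_to_r5                   # map_<species>_<old coords>_to_<new coords>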

TODO: note that use of _ should be clearly forbidden in some of these names. Note that model names can't contain . (a dot has special meaning in MySQL).

Schema are in: sql/table-{goodhits,hits,isect,known,map}.sql

---

To do

TODO: these are getting moved to the RT "pipeline" queue. Please look there, also.

General design

Documentation

  • This writeup (expand STUB, rewrite/clean up CLEAN)

Database

  • remove generic database-making SQL in sql/
  • make sure each table containing genomic coordinates clearly and independently states, either in the table name or in a column in the table, what release coords the data is in (i.e. r4 or r5 for dmel, not an issue for other genomes... yet)
  • make "known flybase" tables have just one primary key, which should be the Fly Base ID
    • verify there are no duplicate Fly Base IDs, first (things are never intuitive...)
  • load the latest Fly Base data (r5.2) - has many more known snoRNAs!

Code

  • why does this drop 39 segments:

cd /nfs/projects/12fly-analysis/perl
./map-dmel-r4-to-r5.pl sheridan caf1screen.map_dmel_r4_to_r5 caf1screen.known_mercator_map_r5 caf1screen.known_mercator_map genbank_accnVer start end

Makefiles

  • grep for todo in makefiles, fix them
  • remove/replace/comment out dead/old rules
  • why is target not getting cleaned up in this case? (it's my fault for not having mavid.mfa where it should be, but the target should be removed by make):

[avu@kosh 12fly-analysis]$ make gff/caf1screen_v12/hitsAlign_ncRnaDualStrand_v12.dmel.mercator.gff 
/nfs/src/mercator-perl/featurevole.pl -genomes /nfs/data/genome/fly/align/12fly/mercator/genomes -map /nfs/data/genome/fly/align/12fly/mercator/map -align sge/caf1screen_v12/in -inverse -gff gff/caf1screen_v12/hitsAlign_ncRnaDualStrand_v12.dmel.align.gff -out gff/caf1screen_v12/hitsAlign_ncRnaDualStrand_v12.dmel.mercator.gff DroMel_CAF1
# loaded 4684 lines from Mercator map file '/nfs/data/genome/fly/align/12fly/mercator/map'
# no AGP file found, ignoring (looked for file 'DroMel_CAF1.agp')
Couldn't open alignment file 'sge/caf1screen_v12/in/1/mavid.mfa': No such file or directory at /nfs/src/mercator-perl/featurevole.pl line 453, <GFF> line 1.
make: *** [gff/caf1screen_v12/hitsAlign_ncRnaDualStrand_v12.dmel.mercator.gff] Error 2

Organization/Nomenclature

  • Relax the restrictions on what a "screen" is, etc. You should be able to change the window-making method within the same screen. This necessitates working the model name into the window table name, or having some other way to differentiate windows within a screen.
  • clean up/delete old perl/
  • checkpoints/_ naming conventions should just be filename = table name; it's a lot easier and shorter that way, and it's easier to pass the table NAME from the dependency list to a program without hacking the target/dependency strings.
    • DONE - just need to get older screens into that shape
  • wrap Yuri's project (model training makefiles/etc. in _12fly-ncRNA and Twelve Fly Scan) into this one
  • remove dead/useless Perl code
  • get older data (<v12) into format that would have been produced by the current pipeline version
    • adjust makefile rules accordingly
    • done for sge/ dir, except for caf1screen_v02.unfiltered and caf1screen_v08

Other

  • just out of curiosity, run a longer benchmark on windowlicker with -c option - was it worth the effort?

---

-- Created by: Andrew Uzilov on 04 Aug 2007