Fungal Genomes

Fungal genomes mini-hackathon

New procedure [added 2007/09/22 by AVU]

The entire process has been condensed into something like:

cd /nfs/projects/fungiscreen
./env-fungi01.bat
make screen

See How To Run Pipeline for more info, including some fungi-specific notes.

Data source

  • alignments.tar.bz2: Mercator output with Pecan alignments in output.mfa
  • genomes.tgz: FASTA format
  • gff.tgz: GFF format
  • prank_alignments.tar.bz2: PRANK alignments in prank.mfa
  • treefile: tree, four species, Newick format

Who

Add yourself...

Jason Stajich, Andrew Uzilov, Ian Holmes, Robert Bradley

Makefile walkthrough /home/projects/hackathon

Environment variables

SCREEN=fungi00 # name of the screen; does not include the model name, since a screen can be run with different models
MODEL=ncRnaDualStrand
NULL_MODEL=ncRnaDualStrandNull

Makefile.sge -> Sun Grid Engine Makefile wrappers

  • automatically break things into windows, do the transformations, and load data into the database
  • wrappers you don't need to touch
  • handle dependencies
  • windowlicker breaks things into subdirectories

Add rules that work on hardcoded names for testing; once they work, they can be submitted to SGE.

ADD tree

  • a variable FUNGI-TREE is defined in the main Makefile
fungi00.%.xrate:
 $(ADD-TREE) $(FUNGI-TREE) segment.stock > segment.withtree.stock
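
As a rough illustration of what adding a tree to a Stockholm alignment involves, here is a minimal Python sketch. It assumes the tree is stored as a `#=GF NH` line (the Stockholm convention for a New Hampshire/Newick tree); whether `$(ADD-TREE)` works this way internally is an assumption, and `add_tree` is a hypothetical helper, not part of the pipeline.

```python
def add_tree(stockholm_text: str, newick: str) -> str:
    """Return the alignment with a '#=GF NH' tree line right after the header.

    Assumption: the tree is carried as a '#=GF NH' annotation. This is a
    sketch, not the code behind $(ADD-TREE).
    """
    lines = stockholm_text.splitlines()
    if not lines or not lines[0].startswith("# STOCKHOLM"):
        raise ValueError("input does not start with a Stockholm header")
    # Drop any existing tree annotation so the result carries exactly one tree.
    body = [ln for ln in lines[1:] if not ln.startswith("#=GF NH")]
    return "\n".join([lines[0], "#=GF NH " + newick.strip()] + body) + "\n"


if __name__ == "__main__":
    # Species names A-D are placeholders, not the four species in treefile.
    alignment = "# STOCKHOLM 1.0\nA ACGU\nB AC-U\n//\n"
    print(add_tree(alignment, "((A:0.1,B:0.1):0.2,(C:0.1,D:0.1):0.2);"))
```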

WINDOWLICKER params:

  • w windowsize
  • gr reference sequence name
  • g min percentage of reference sequence that can be
  • r reference sequence name
  • b low complexity filter (bit content or fraction)
  • -- passed to xrate
  • -s make log xrate inside score
  • -l maximum suffix length in DP matrix
  • -g grammars

Andrew looked for the smallest alignment to set a lower bound and run a test:

$ ls -lrS `find . -name output.mfa` | head

2419 was the smallest.
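
The same search for the smallest alignment files can be sketched in Python, which avoids command-line length limits when there are many files. `smallest_files` is a hypothetical helper; it ranks files by size in bytes, as `ls -S` does.

```python
import os


def smallest_files(root, filename="output.mfa", limit=10):
    """Return up to `limit` (size_in_bytes, path) pairs, smallest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            path = os.path.join(dirpath, filename)
            found.append((os.path.getsize(path), path))
    # Tuples sort by size first, then by path to break ties.
    return sorted(found)[:limit]


if __name__ == "__main__":
    for size, path in smallest_files("."):
        print(size, path)
```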

We can use fasta2stockholm.pl to convert multi-FASTA files to Stockholm.

The segments rule in fungiscreen/Makefile takes care of this.
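
For readers unfamiliar with the two formats, the conversion from a multi-FASTA alignment to Stockholm can be sketched as follows. This is a minimal illustration, not the Perl script's code; `fasta_to_stockholm` is a hypothetical name, and the sketch assumes the input is already an alignment (all rows the same length).

```python
def fasta_to_stockholm(fasta_text: str) -> str:
    """Convert a multi-FASTA alignment to a single-block Stockholm alignment."""
    names, chunks = [], {}
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without a sequence name")
            name = fields[0]  # keep only the identifier, drop the description
            if name in chunks:
                raise ValueError("duplicate sequence name: " + name)
            names.append(name)
            chunks[name] = []
        elif not names:
            raise ValueError("sequence data before the first header")
        else:
            chunks[names[-1]].append(line)
    rows = {name: "".join(chunks[name]) for name in names}
    if len({len(row) for row in rows.values()}) > 1:
        raise ValueError("rows differ in length; input is not an alignment")
    width = max((len(name) for name in names), default=0)
    out = ["# STOCKHOLM 1.0"]
    out += [name.ljust(width) + " " + rows[name] for name in names]
    out.append("//")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(fasta_to_stockholm(">sp1\nAC-G\n>sp2\nACGG\n"))
```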

Ready to run

In the end we should be able to run make screen in the /home/projects/hackathon/ directory, because we have set the SCREEN environment variable.
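
Since the run depends on the environment being set up, a quick pre-flight check can save a failed submission. The sketch below assumes that SCREEN, MODEL and NULL_MODEL (the variables listed under "Environment variables") must all be set and non-empty; that requirement, and the helper name `missing_variables`, are assumptions rather than something the Makefile is known to enforce.

```python
import os

# Assumed to be required; taken from the "Environment variables" list.
REQUIRED = ("SCREEN", "MODEL", "NULL_MODEL")


def missing_variables(environ=os.environ, required=REQUIRED):
    """Return the names in `required` that are unset or empty in `environ`."""
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_variables()
    if missing:
        raise SystemExit("set these before running make screen: " + ", ".join(missing))
    print("environment looks ready")
```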

... something is happening

Look in the out directory.

After finishing

Genomic-coordinate annotations will be in gff/${SCREEN}/hitsGenomic_${MODEL}.${SPECIES}.gff
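
To see which files to expect, the path pattern above can be expanded for each species. This is only a sketch: `genomic_gff_paths` is a hypothetical helper, and the species names in the usage example are placeholders, since the names used by the screen are not listed here.

```python
from string import Template

# Same pattern as in the text, using shell-style ${NAME} placeholders.
GFF_PATTERN = Template("gff/${SCREEN}/hitsGenomic_${MODEL}.${SPECIES}.gff")


def genomic_gff_paths(screen, model, species_names):
    """Return one expected GFF path per species name."""
    return [
        GFF_PATTERN.substitute(SCREEN=screen, MODEL=model, SPECIES=species)
        for species in species_names
    ]


if __name__ == "__main__":
    # "speciesA" and "speciesB" are placeholders, not real species names.
    for path in genomic_gff_paths("fungi00", "ncRnaDualStrand", ["speciesA", "speciesB"]):
        print(path)
```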

To get a .gff dump sorted by lgOdds and restricted to hits of <= 130 nt, do

/home/projects/pipeline/perl/dump-to-gff.pl -h sheridan -d fungi01 -t hitsAlign_ncRnaDualStrand_v15 --seqid segment \
--start start --end end --strand strand --score lgOdds source=ncRnaDualStrand_v15 type=ncRNA \
--sql "where end-start+1<=130 order by lgOdds desc" > gff/fungi01/hitsAlign_ncRnaDualStrand_v15.gff

and then call the make rule to convert to genomic coords,

make gff/${SCREEN}/hitsGenomic_${MODEL}.${SPECIES}.gff
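
The `--sql` clause in the dump command above keeps hits whose inclusive length `end-start+1` is at most 130 and orders them by lgOdds, best first. The Python sketch below shows that same selection on in-memory records and formats each hit as a nine-column GFF line. It is an illustration of the filter, not the code of dump-to-gff.pl; the dictionary keys mirror the column names in the command, and the function names are hypothetical.

```python
def select_hits(hits, max_length=130):
    """Keep hits no longer than max_length; order by lgOdds, highest first."""
    short = [h for h in hits if h["end"] - h["start"] + 1 <= max_length]
    return sorted(short, key=lambda h: h["lgOdds"], reverse=True)


def to_gff_line(hit, source, feature_type):
    """Format one hit as a tab-separated GFF line (phase and attributes left as '.')."""
    columns = [
        hit["segment"],      # seqid
        source,
        feature_type,
        str(hit["start"]),
        str(hit["end"]),
        str(hit["lgOdds"]),  # score
        hit["strand"],
        ".",                 # phase
        ".",                 # attributes
    ]
    return "\t".join(columns)


if __name__ == "__main__":
    example = [
        {"segment": "seg1", "start": 1, "end": 130, "strand": "+", "lgOdds": 4.5},
        {"segment": "seg2", "start": 1, "end": 131, "strand": "-", "lgOdds": 9.0},
    ]
    for hit in select_hits(example):
        print(to_gff_line(hit, "ncRnaDualStrand_v15", "ncRNA"))
```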

This is hacky (we really should keep everything wrapped in Makefiles), but it works for now...

TODO: need to get the structures/sequence out, can use alignmonkey for this easily...

---

Older results: