click on the Biowiki logo to go to homepage
Research Teaching | Blog
Main | GBrowse | TWiki
Biowiki > Main > TWiki Users > Jason Stajich > Jason Blogging > Generic Genome Browser > MySQL

Search

Advanced search...

Topics


Links
misc bio links
Open bio projects
Bio wikis, blogs
Meetings, journals
off-site links
Rfam / Pfam
GMOD/GBrowse
UCSC, Ensembl
Bioperl wiki
AAA fly wiki
ISCB; RNA society


journals
Nature, Science
Cell, PLoS Biol
Nature Genetics
Genome Research
Genome Biology
PLoS Comp Bio
NAR, Mol Biol Evol
J Comp Biol
BMC Bioinformatics
Bioinformatics
Phys Rev Lett


bioinformatics blogs
fungalgenomes.org
Jason Stajich
Buried Treasure
Cyrus Harmon
Flags and Lollipops
nodalpoint.org
mutantphenotype
Yi Xing
Yokofakun
Propeller Twist
Notes from the Biomass
bioinformatics.org
Biohackery
Andrew Dalke
Neil Saunders
Personal Genome
iSpecies blog
bioinformaticsblog
Public Rambling
Inforbiomatica
hublog
Oakleys corner


bio/science blogs
Biocurious
Declan Butler
Nascent
Pharyngula
Bioethics.net
Free Association
young female scientist
OpenScience project
genetics+health
evolgen
post-genomics
gene expression
Informaticopia
Biopeer
Tangled Bank
Cavalcade of Mammals
Triplepoint
three-toed sloth
panda's thumb
stranger fruit
Genetic Chaos
In the Pipeline


tech blogs
everybody :-x esr
xkcd
27B/6
geekdad
Bruce Schneier
videogameworkout
Annalee Newitz
Rudy Rucker
Bruce Sterling
Need To Know
The Register
The Scientist
Raph Koster
jay is games
Red Orbit
+ve tech journal
Laughing Meme
Respectful Insolence
Braintickle


collaborators
Alex Bateman
Casey Bergman
Ewan Birney
Rachel Brem
Steven Brenner
Bill Bruno
Sue Celniker
Richard Durbin
Sean Eddy
Mike Eisen
Nick Goldman
Richard Goldstein
Andreas Heger
Jotun Hein
Steve Holbrook
Andrew Holmes
John Huelsenbeck
Marcin Joachimiak
Sam Griffiths Jones
Suzi Lewis
Gerton Lunter
Chris Mungall
Janet Newman
Rasmus Nielsen
Lior Pachter
Tom Peat
Chris Ponting
David Schaffer
Kimmen Sjolander
Guy Slater
Jason Stajich
Lincoln Stein


PageRank Checker

Open Source database engine. See http://www.mysql.com. Some people (Chris Mungall) really prefer Postgre SQL.

-- JasonStajich - 03 Jun 2005



Useful Commands

NOTE: keywords are in caps just for clarity, they are NOT actually case sensitive.

Starting up the interface to MySQL

On the command line:

mysql

To start up as a specific user, use:

mysql --user=USERNAME

If you want to be prompted for a password, use:

mysql -p

Managing the database

Creates a new database:

  CREATE DATABASE <database name>;

Deletes a database (dangerous!):

  DROP DATABASE <database name>;

Grants all possible permissions on all tables in database to specified user (user name can be an empty string - i.e. '' - to denote all users) on the local machine (i.e. machine running mysqld):

  GRANT ALL ON <database name>.* TO '<user name>'@'localhost';

Getting info about database

Show data about tables in the database:

  SHOW TABLE STATUS FROM <database name>;

or, if you loaded the database (see below), simply:

  SHOW TABLE STATUS;

Shows the schema (i.e. columns and their allowed datatypes) for some table:

  USE <database name>;
  DESCRIBE <table name>;

Manipulating database contents

Loads the database for manipulation (you must do this before using any of the commands below, or else specify the database name explicitly):

  USE <database name>;

Clears all entries in that table in your loaded database (dangerous!):

  DELETE FROM <table name>;

Select field(s) from some table:

  SELECT <column name> [, <column name>]* FROM <table name> WHERE <condition>;

Select every row and every column from some table:

  SELECT * FROM <table name> WHERE 1;

Insert a row:

  INSERT INTO <table name> (<column name> [, <column name>]*) VALUES (<value> [, <value>]*);

Delete a row (careful!):

  DELETE FROM <table name> WHERE <condition>;

Managing connections

Shows open connections:

  SHOW PROCESSLIST;

Kills a connection (thread):

  KILL <connection/thread ID>;

Fancy stuff

How to update a row with data from multiple rows (e.g. a sum)

The better way (fast, easy)

For each distinct hit in the isect table, we add up all the nucs that it overlaps and save that value to the ntInKnown column.

Note that one hit can overlap multiple annotations, which means that one hit will appear in multiple rows in isect. The trick is how to sum up a set of overlaps, returning a scalar value to put in a single row in filterGenomic.

# create tmp table storing total overlap for each hit
CREATE TABLE tmp2 SELECT t2_note AS id, SUM(ROUND((percent_t2_in_t1/100)*(t2_end-t2_start+1))) AS nt FROM
isect_known_ncRNA_VS_hitsGenomic_ncRnaDualStrand_v12_dmel GROUP BY t2_note;
# update the filter table
UPDATE filterGenomic_ncRnaDualStrand_v12_dmel AS f, tmp AS i SET f.ntInKnown = i.nt WHERE f.id = i.id;

Actually, you can avoid having the tmp table at all by using a subquery:

UPDATE filterGenomic_ncRnaDualStrand_v12_dmel AS f, 
 (SELECT t2_note AS id, SUM(ROUND((percent_t2_in_t1/100)*(t2_end-t2_start+1))) AS nt
  FROM isect_known_ncRNA_VS_hitsGenomic_ncRnaDualStrand_v12_dmel GROUP BY t2_note) AS i
SET f.ntInKnown = i.nt WHERE f.id = i.id;

The old and stupid way (slow, hard)

The following is from my e-mail to Mitch and Rob that I drafted pleading for help, but never sent because I figured out a solution. I don't want to just throw it out, so I'm dumping it here, although it should be edited down. [AVU]

Here is the goal: set the value of the column ntInKnown in the table caf1screen_v12.filterGenomic_ncRnaDualStrand_v12_dmel to be equal to the number of nucleotides in that hit that overlap a known.

The easy solution would be to just have a boolean: true for any overlap, false for no overlap, NULL for "not yet figured out". But since we want to do filtering by degree of overlap, I wanted to save the raw value of "hit nucs in known annotation" and sidestep having to decide what the degree should be, so y'all can arbitrarily pick one later.

Generally, I update filter tables using syntax like this (here we are adding to the filter table the length of the annotation in genomic coords):

UPDATE filterGenomic_ncRnaDualStrand_v12_dmel AS f, hitsGenomic_ncRnaDualStrand_v12_dmel AS h
 SET f.length = h.end-h.start+1 WHERE f.id = h.id;

Adding total nuc overlap would be this easy if each overlapping hit matched exactly 1 known, but a small handful match multiple knowns. My test case is hit 1175_604501_604592_+ which overlaps 2 miRNAs; take a look:

SELECT * FROM isect_known_ncRNA_VS_hitsGenomic_ncRnaDualStrand_v12_dmel_r5 WHERE t2_note = '1175_604501_604592_+';

I think that we should store in ntInKnown the sum of both overlaps, which for this hit should be 23+22=45 nt.

Since we're storing fraction of overlap (ugh, I should have never done that, will be fixed at some later point), we need to convert it to number of nucs... I do it like this:

ROUND((percent_t2_in_t1/100)*(t2_end-t2_start+1))

"ROUND" has to be used because we get values like "22.9999" instead of "23" due to I guess imprecision of floating point calculation... or something. Rounding to nearest should fix it, they can't be that far off in the wrong direction.

So OK, we can convert fraction of overlap to nucs; the following, for example, gives us nuc overlap for every hit/known pair:

SELECT *,ROUND((percent_t2_in_t1/100)*(t2_end-t2_start+1)) AS nucOverlap FROM isect_known_ncRNA_VS_hitsGenomic_ncRnaDualStrand_v12_dmel_r5;

Apparently you can use functions like COUNT and SUM to convert a set of values (records/rows/whatever) into a single scalar. So, to get the sum of nucs that our test hit overlaps across all knows, we can extend the above to do this:

SELECT SUM(ROUND((percent_t2_in_t1/100)*(t2_end-t2_start+1))) FROM isect_known_ncRNA_VS_hitsGenomic_ncRnaDualStrand_v12_dmel_r5 WHERE t2_note = '1175_604501_604592_+';

OK... but how do we wrap that into an UPDATE and do it for every single row in filterGenomic? I tried the following, but it doesn't work:

# reset the table
UPDATE filterGenomic_ncRnaDualStrand_v12_dmel SET ntInKnown = 0;
# increment each column, hoping it will increment multiple times for multiple hits... but it doesn't
UPDATE filterGenomic_ncRnaDualStrand_v12_dmel AS f, isect_known_ncRNA_VS_hitsGenomic_ncRnaDualStrand_v12_dmel_r5 AS i SET f.ntInKnown = f.ntInKnown + ROUND((i.percent_t2_in_t1/100)*(i.t2_end-i.t2_start+1)) WHERE f.id = i.t2_note;

Our desired hit (1175_604501_604592_+) gets an overlap of 23 instead of 45, which means only the first match is counted. So, this doesn't work.

So what do we do? Apparently you can assign the output of subqueries to field values, so the following is the solution:

UPDATE filterGenomic_ncRnaDualStrand_v12_dmel AS f, isect_known_ncRNA_VS_hitsGenomic_ncRnaDualStrand_v12_dmel_r5 AS i SET f.ntInKnown = (SELECT SUM(ROUND((i.percent_t2_in_t1/100)*(i.t2_end-i.t2_start+1))) FROM isect_known_ncRNA_VS_hitsGenomic_ncRnaDualStrand_v12_dmel_r5 AS i WHERE i.t2_note = f.id) WHERE f.id = i.t2_note;


-- Created by: AndrewUzilov on 02 Mar 2006

Actions: Edit | Attach | New | Ref-By | Printable view | Raw view | Normal view | See diffs | Help | More...