My SQL

From Biowiki
Jump to: navigation, search

Open Source database engine. See http://www.mysql.com. Some people (Chris Mungall) really prefer Postgre SQL.

-- Jason Stajich - 03 Jun 2005

---

---

Useful Commands

NOTE: keywords are in caps just for clarity, they are NOT actually case sensitive.

Starting up the interface to MySQL

On the command line:

mysql

To start up as a specific user, use:

mysql --user=USERNAME

If you want to be prompted for a password, use:

mysql -p

Managing the database

Creates a new database:

  CREATE DATABASE <database name>;

Deletes a database (dangerous!):

  DROP DATABASE <database name>;

Grants all possible permissions on all tables in database to specified user (user name can be an empty string - i.e. - to denote all users) on the local machine (i.e. machine running mysqld):

  GRANT ALL ON <database name>.* TO '<user name>'@'localhost';

Getting info about database

Show data about tables in the database:

  SHOW TABLE STATUS FROM <database name>;

or, if you loaded the database (see below), simply:

  SHOW TABLE STATUS;

Shows the schema (i.e. columns and their allowed datatypes) for some table:

  USE <database name>;
  DESCRIBE <table name>;

Manipulating database contents

Loads the database for manipulation (you must do this before using any of the commands below, or else specify the database name explicitly):

  USE <database name>;

Clears all entries in that table in your loaded database (dangerous!):

  DELETE FROM <table name>;

Select field(s) from some table:

  SELECT <column name> [, <column name>]* FROM <table name> WHERE <condition>;

Select every row and every column from some table:

  SELECT * FROM <table name> WHERE 1;

Insert a row:

  INSERT INTO <table name> (<column name> [, <column name>]*) VALUES (<value> [, <value>]*);

Delete a row (careful!):

  DELETE FROM <table name> WHERE <condition>;

Managing connections

Shows open connections:

  SHOW PROCESSLIST;

Kills a connection (thread):

  KILL <connection/thread ID>;

Fancy stuff

How to update a row with data from multiple rows (e.g. a sum)

The better way (fast, easy)

For each distinct hit in the isect table, we add up all the nucs that it overlaps and save that value to the ntInKnown column.

Note that one hit can overlap multiple annotations, which means that one hit will appear in multiple rows in isect. The trick is how to sum up a set of overlaps, returning a scalar value to put in a single row in filterGenomic.

# create tmp table storing total overlap for each hit
CREATE TABLE tmp2 SELECT t2_note AS id, SUM(ROUND((percent_t2_in_t1/100)*(t2_end-t2_start+1))) AS nt FROM
isect_known_ncRNA_VS_hitsGenomic_ncRnaDualStrand_v12_dmel GROUP BY t2_note;
# update the filter table
UPDATE filterGenomic_ncRnaDualStrand_v12_dmel AS f, tmp AS i SET f.ntInKnown = i.nt WHERE f.id = i.id;

Actually, you can avoid having the tmp table at all by using a subquery:

UPDATE filterGenomic_ncRnaDualStrand_v12_dmel AS f, 
 (SELECT t2_note AS id, SUM(ROUND((percent_t2_in_t1/100)*(t2_end-t2_start+1))) AS nt
  FROM isect_known_ncRNA_VS_hitsGenomic_ncRnaDualStrand_v12_dmel GROUP BY t2_note) AS i
SET f.ntInKnown = i.nt WHERE f.id = i.id;

The old and stupid way (slow, hard)

The following is from my e-mail to Mitch and Rob that I drafted pleading for help, but never sent because I figured out a solution. I don't want to just throw it out, so I'm dumping it here, although it should be edited down. [AVU]

Here is the goal: set the value of the column ntInKnown in the table caf1screen_v12.filterGenomic_ncRnaDualStrand_v12_dmel to be equal to the number of nucleotides in that hit that overlap a known.

The easy solution would be to just have a boolean: true for any overlap, false for no overlap, NULL for "not yet figured out". But since we want to do filtering by degree of overlap, I wanted to save the raw value of "hit nucs in known annotation" and sidestep having to decide what the degree should be, so y'all can arbitrarily pick one later.

Generally, I update filter tables using syntax like this (here we are adding to the filter table the length of the annotation in genomic coords):

UPDATE filterGenomic_ncRnaDualStrand_v12_dmel AS f, hitsGenomic_ncRnaDualStrand_v12_dmel AS h
 SET f.length = h.end-h.start+1 WHERE f.id = h.id;

Adding total nuc overlap would be this easy if each overlapping hit matched exactly 1 known, but a small handful match multiple knowns. My test case is hit 1175_604501_604592_+ which overlaps 2 miRNAs; take a look:

SELECT * FROM isect_known_ncRNA_VS_hitsGenomic_ncRnaDualStrand_v12_dmel_r5 WHERE t2_note = '1175_604501_604592_+';

I think that we should store in ntInKnown the sum of both overlaps, which for this hit should be 23+22=45 nt.

Since we're storing fraction of overlap (ugh, I should have never done that, will be fixed at some later point), we need to convert it to number of nucs... I do it like this:

ROUND((percent_t2_in_t1/100)*(t2_end-t2_start+1))

"ROUND" has to be used because we get values like "22.9999" instead of "23" due to I guess imprecision of floating point calculation... or something. Rounding to nearest should fix it, they can't be that far off in the wrong direction.

So OK, we can convert fraction of overlap to nucs; the following, for example, gives us nuc overlap for every hit/known pair:

SELECT *,ROUND((percent_t2_in_t1/100)*(t2_end-t2_start+1)) AS nucOverlap FROM isect_known_ncRNA_VS_hitsGenomic_ncRnaDualStrand_v12_dmel_r5;

Apparently you can use functions like COUNT and SUM to convert a set of values (records/rows/whatever) into a single scalar. So, to get the sum of nucs that our test hit overlaps across all knows, we can extend the above to do this:

SELECT SUM(ROUND((percent_t2_in_t1/100)*(t2_end-t2_start+1))) FROM isect_known_ncRNA_VS_hitsGenomic_ncRnaDualStrand_v12_dmel_r5 WHERE t2_note = '1175_604501_604592_+';

OK... but how do we wrap that into an UPDATE and do it for every single row in filterGenomic? I tried the following, but it doesn't work:

# reset the table
UPDATE filterGenomic_ncRnaDualStrand_v12_dmel SET ntInKnown = 0;
# increment each column, hoping it will increment multiple times for multiple hits... but it doesn't
UPDATE filterGenomic_ncRnaDualStrand_v12_dmel AS f, isect_known_ncRNA_VS_hitsGenomic_ncRnaDualStrand_v12_dmel_r5 AS i SET f.ntInKnown = f.ntInKnown + ROUND((i.percent_t2_in_t1/100)*(i.t2_end-i.t2_start+1)) WHERE f.id = i.t2_note;

Our desired hit (1175_604501_604592_+) gets an overlap of 23 instead of 45, which means only the first match is counted. So, this doesn't work.

So what do we do? Apparently you can assign the output of subqueries to field values, so the following is the solution:

UPDATE filterGenomic_ncRnaDualStrand_v12_dmel AS f, isect_known_ncRNA_VS_hitsGenomic_ncRnaDualStrand_v12_dmel_r5 AS i SET f.ntInKnown = (SELECT SUM(ROUND((i.percent_t2_in_t1/100)*(i.t2_end-i.t2_start+1))) FROM isect_known_ncRNA_VS_hitsGenomic_ncRnaDualStrand_v12_dmel_r5 AS i WHERE i.t2_note = f.id) WHERE f.id = i.t2_note;

---

-- Created by: Andrew Uzilov on 02 Mar 2006