By the end of this lab, you should know how to:
- write and execute simple Python scripts in the UNIX environment
- use arguments passed in from the command line in your Python scripts
- read data from text files and user input
- use the NCBI website to create FASTA files
0. Before we get started...
- The prompt you get depends on the UNIX shell you're running and other things like preference files. In the following lab, I will use a $ symbol to signify a UNIX prompt. The prompt you see in your terminal window may or may not end in a $, but just remember that you don't actually type in the $, just the stuff after it.
- We will also use a Python shell to execute some Python code directly without having to write a script file, save it, and run it. This will be useful for trying out small independent pieces of code and learning how new functions work. You can always use this to test something out quickly or fiddle with a standalone line of code. The prompt you will get will look like this: >>> and I will use that in this lab to signify a Python shell prompt. As above, remember that you don't type in the >>>, just what comes after it.
- Prof. Holmes, as you know, will be focusing more on higher level programming concepts in lecture and less on the finer details. Here in lab is a good place to learn more of the nuts and bolts of programming. We'll try to go over the basics of important concepts, but I will also provide resources for you to look at on your own to really understand the full extent of how to do various things in Python. You probably will not be able to rely just on the things we use in the lab examples for the exercises/homework assignments. A big part of becoming a good programmer is knowing when and how to look up documentation and help on commands and expressions you may not have used before - so when you're unsure about something, Google to see if you can find some help online or better yet, just try it! For example:
- What is the difference between an int and a float? What will you get if you do 13/2? What do you need to do to ensure that the output will be 6.5?
- What is the difference between = and == (that's one equals sign and a double-equals sign)?
CHANGE FOR PYTHON:
- If you don't have a Perl book, look up some websites with command references and bookmark them. You can refer to these when you're unsure of the syntax of a command or you forgot exactly how to write a
foreach loop or something. Here are a few:
- Rex Swain's HTML Perl guide <-- This is great to have open while you code!
- perl.com. In addition to this documentation, they also have a six part Perl tutorial accessible from the home page. The first three parts are highly relevant to this course. If you find yourself struggling with just the examples we've provided, I suggest you go through the tutorial.
- Learning Perl
ADD IN TUTORIAL/REFERENCE FOR:
* Control structures
* File import
* Regular expressions
* String resources
1. Some quick basics
Before we get started writing a full program, we'll go over a few quick basics to make sure we're on the same page. We'll cover these technical topics at a pretty bare-bones level, and you'll need to seek out additional resources and practice to make sure you have a good handle on them.
(A) USING THE SHELL
Open a Python shell by simply typing python at the UNIX prompt:
$ python
Let's get a couple of basic examples out of the way quickly to get used to the shell. Here are sample commands and the outputs you should get:
>>> print "Hello, Ferdinand!"
Hello, Ferdinand!
[In Python, a double asterisk (**) is the operator that raises a number to an exponent.]
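As a rough sketch of the kind of arithmetic you might try at the prompt (the numbers are my own illustrative picks, not part of the lab), here are the exponent operator and two flavors of division. I've written print() with parentheses and a single argument so the lines behave the same in Python 2 and Python 3:

```python
# Illustrative values only; try your own at the >>> prompt.
squared = 3 ** 2          # double asterisk raises 3 to the power 2
floor_half = 13 // 2      # floor division drops the fractional part
true_half = 13 / 2.0      # a float operand keeps the fractional part

print(squared)       # 9
print(floor_half)    # 6
print(true_half)     # 6.5
```

Typing each expression by itself at the prompt (without print) shows the same values.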
(B) CONTROL STRUCTURES
A quick overview of the primary types of control structures.
"If-else" statements function as in the following example:
if temp > 80:
    print "Boy, it's hot!"
elif temp < 50:
    print "Brrr...it's cold!"
else:
    print "Nice and temperate!"
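To see all three branches fire, here is a minimal sketch that wraps the same test in a function and tries one temperature per branch. The function name describe_temperature and the test temperatures are hypothetical, chosen just for illustration:

```python
def describe_temperature(temp):
    """Return a remark about the temperature (hypothetical helper)."""
    if temp > 80:
        return "Boy, it's hot!"
    elif temp < 50:
        return "Brrr...it's cold!"
    else:
        # Reached only when neither condition above was true.
        return "Nice and temperate!"

for temp in [95, 40, 65]:
    print(describe_temperature(temp))
```

Only the first branch whose condition is true runs, so 80 and 50 themselves both land in the final branch.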
Also known as "for" loops, these take one of the two following forms. To loop through a defined range:
for i in range(0,3):
    print "Counted number",i
And the output would be:
Counted number 0
Counted number 1
Counted number 2
Note that the range function is INCLUSIVE of the FIRST value and EXCLUSIVE of the SECOND value - so in our case, it starts with 0 and counts up to but not including 3.
The second form allows you to iterate through anything indexable:
cowTypes = ["brown","white","mooing","flying"]
for cow in cowTypes:
    print "I see a", cow, "cow!"
I see a brown cow!
I see a white cow!
I see a mooing cow!
I see a flying cow!
Note that the indexing variable (cow) can be whatever you want.
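Here is a small sketch of both loop forms that stores the results so you can inspect them. The loop variable name animal is arbitrary, and I've used print() with one argument so the code runs under Python 2 or 3:

```python
# range(0, 3) yields 0, 1, 2: the start is included, the stop is not.
counted = list(range(0, 3))
print(counted)  # [0, 1, 2]

# Any name works as the loop variable; "animal" here is arbitrary.
cowTypes = ["brown", "white", "mooing", "flying"]
sentences = []
for animal in cowTypes:
    sentences.append("I see a " + animal + " cow!")
print(sentences[0])  # I see a brown cow!
```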
Also known as "while" loops.
bonks = 0
while bonks < 3:
    print "Bonk times",bonks,"!"
    bonks += 1
Bonk times 0 !
Bonk times 1 !
Bonk times 2 !
This might be useful if you don't know when something might end. Beware of infinite loops, however, and realize what can be duplicated with an "if" statement.
(C) STRINGS AND METHODS
You'll be working a lot with strings. Strings are a class of object that are processed in a particular way. Classes can also contain methods, or functions associated with objects of that class, and strings contain many methods that will be very useful. Here's one example, and you should look for references on all of the wonderful methods available for strings. We'll look at the upper() method, which converts a string to upper case. Keep track of what the method does and whether it, on its own, modifies the variable.
>>> name = "Herman"
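A minimal sketch of what to watch for, written as a script rather than shell input (the variable name shouted is my own): calling upper() hands back a new string and leaves the original alone unless you reassign it.

```python
name = "Herman"
shouted = name.upper()   # upper() returns a new string

print(shouted)  # HERMAN
print(name)     # Herman (the original is unchanged)

name = name.upper()      # reassign to keep the upper-case version
print(name)     # HERMAN
```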
Functions are important to have a handle on. Fortunately, they're fundamentally not anything very new. A function is just a sort of sub-program, a set of commands that can be invoked by calling that function. Functions are used to reduce code duplication and increase the modularity of your program. A function can take inputs (though need not) and can be repeatedly invoked. Functions can be designed to return something upon being invoked. For instance, a function can manipulate numbers and return the result of the manipulation, or it can run a comparison and return True or False. Output can in fact happen without "returning," though - a print statement will still print to the screen. Let's look at an example.
>>> def square(num):
...     return num*num
Note that you could also write the function like this:
>>> def square(num):
...     print num*num
and get the exact same output. What's the difference? The difference is what you can do with the function.
The return value is tied to the meaning of the function, while printing will simply display the result. So if you tried this with the second function:
>>> score = square(3)
It would immediately output 9, but if you then typed in score you would get nothing. However, if you tried this with the function that returns the output, typing in score would give you 9.
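Here is a sketch of the two variants side by side. The names square_returning and square_printing are hypothetical, used only so both versions can live in one file:

```python
def square_returning(num):
    return num * num        # hands the result back to the caller

def square_printing(num):
    print(num * num)        # displays the result, returns nothing

score = square_returning(3)
print(score)                # 9

nothing = square_printing(3)   # prints 9 as a side effect
print(nothing)              # None
```

The value stored from the printing version is None, which is why the shell shows nothing when you type the variable name.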
While Python has decent native functionality, much of Python's power comes from external "modules." Modules are simply Python scripts containing functions; these can be collectively imported into your Python program so that you can use those functions in your program. Python comes with many modules built-in and natively available for import; other modules can be downloaded separately and loaded in.
A basic example of a built-in module is the math module. Python's built-in math capabilities are fairly basic; should you want to, for instance, take the logarithm of a number or the sine of a number, you will need the math module. Fortunately, this is rather simple; all you need to do is import the module and begin using the function that you want from that module. Let's try a quick example.
If you try calling log10() on a number at the prompt, you will get an error as Python does not have this function built in. However - let's import the math module! Import it simply by typing:
>>> import math
You will get no feedback, but it has been imported. I will tell you a secret: the function log10() is a function included in the math module. So now try taking the base-10 logarithm the same way.
Oops! Another error! How come? When you use a function that comes from a module, you must call it as a function of that module, by writing it as math.log10().
If you try this, you should get the logarithm instead of an error. We will be importing modules frequently, including at least two more in this lab alone.
[How can you know what functions a module contains? Fortunately, modules have documentation. To access it, first import the module, then enter the command help(modulename) - for instance, help(math). Full documentation will appear.]
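A minimal sketch of the import-then-qualify pattern, assuming an arbitrary argument of 100 (my choice, not the lab's). The bare call fails with a NameError, while the module-qualified call works:

```python
import math

# Calling the bare name fails: log10 lives inside the math module.
try:
    log10(100)
except NameError as error:
    print("Error: " + str(error))

# Qualifying the call with the module name works.
value = math.log10(100)
print(value)
```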
Whew! That's a lot to take in and a lot to get comfortable with. We're done with the introduction and the shell now. To exit the Python shell, run the command exit(). You should be back to the UNIX prompt.
2. Adding numbers
Let's start off with a simple program, just to get familiar with the process of writing and executing an actual Python script on UNIX. We'll start off writing a program that's basically a calculator and actually, not even a very good one. It can only add (i.e. don't throw away your current calculator because you probably won't be using this program to replace your calculator.).
Here are the specifications: When a user calls our program from the command line, s/he will also need to tell us how many numbers we will be adding. Then we will ask the user to type in that many numbers, and we will print out the sum at the end.
- OK, so let's get started. First, fire up your favorite text editor.
Aside: If you're using emacs in the lab, it'll bring up a new window and the terminal in which you called the program will not be usable until you exit emacs. That's pretty annoying, so to prevent it, run emacs "in the background." Essentially, just put a & at the end, which tells UNIX to "fork a new process" for this program.
$ emacs &
Alternatively, if you prefer something simpler like pico, which doesn't pop up in a new window, don't put the & or you won't actually be able to use the pico program you just started. If you accidentally did it, you'll need to kill the process using the kill command at the prompt with the process ID (PID) of the pico process you started. You can look up PIDs using the ps command.
$ ps
$ kill XXXX
- Now that you have a space in which to type your script, it's time to designate this as a Python script. By convention, you save Python scripts with the extension .py, but that's not actually what tells the UNIX system that this is a Python script. There are two ways to get your .py file treated as a Python file. You can:
1) Tell the operating system how to execute the script. Set the very first line of your Python script to a "#!" line that names the Python interpreter. (Note: the character after the pound sign (#) is an exclamation point (!), not the number one (1). If this text is hard to read, increase your font size using your browser's View->Text Size menu option.)
This line must be line 1 of your script (i.e. you can't even put in a blank line before it - it's worth trying to put a blank line before it and seeing the error it gives later on when you try to run the program, just so you can recognize this error if you see it later on).
2) Directly use the python command to run your program. Instead of executing your Python file, you execute the python command, and pass your file as an argument. In a close analogy to the case above, you can type into the console (from the directory your Python file is in):
$ python myPythonFile.py
- One good way to learn programming is to do things step-by-step, i.e. get one small part working before adding onto it. Unless you've done a lot of programming in a particular language, it's usually not a good idea (in my opinion) to try to write the full program in one go, and then figure out everything that is wrong with it. In the spirit of this method, let's just start off with reading an argument from the command line and we'll just print it out. Not too useful of a program but it's a good start.
READING IN ARGUMENTS FROM THE COMMAND LINE
So how do we read in arguments from the command line? There are a couple of ways, but the simplest utilizes a built-in module called sys. In our script, we will begin by importing this module, and all arguments passed to the command line will be stored in the list sys.argv. The first element of this list (sys.argv[0]) will be the name of the file, and the next ones will be the other arguments in order.
import sys
number = sys.argv[1]
print "You typed in", number
This will save the thing the user typed in into the variable called number and then print it out to the screen.
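To make the layout of the argument list concrete, here is a sketch that works on a hand-built list shaped like sys.argv, so it can run without a real command line. The helper name describe_arguments is hypothetical; in a real script you would import sys and pass sys.argv instead:

```python
def describe_arguments(argv):
    """Split a list shaped like sys.argv (hypothetical helper)."""
    script_name = argv[0]       # element 0 is the script's own name
    user_args = argv[1:]        # everything after it came from the user
    return script_name, user_args

# Simulate running "add.py 5" from the command line.
script_name, user_args = describe_arguments(["add.py", "5"])
print("Script: " + script_name)
print("First argument: " + user_args[0])
print(type(user_args[0]) is str)   # True
```

Note that the argument is the string "5", not the number 5, so it needs int() before you can do arithmetic or counting with it.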
- So now, we want to test out this little part of the program. First save our script - say, add.py. (You may want to open up another Terminal window, if you're using pico, so you don't have to keep quitting the text editor every time you want to test your script.) Before we can run the program (assuming we want to execute the file, rather than passing the file to Python), though, we need to change its file permissions because UNIX doesn't know that this is an executable program - currently it just thinks it's a text file that you can read from and write to. We just need to tell UNIX that you can actually execute this file because it's a program. You do this using the chmod command:
$ chmod +x add.py
or
$ chmod 755 add.py
Aside: If you're wondering what those numbers after chmod mean, check out this tutorial on chmod. Basically the three numbers correspond to the three categories of users defined in Unix: user, group, and world. Each one of these categories can have read (r), write (w), or execute (x) permission on a file. If we write 7 in binary, it will be 111. This is equivalent to "rwx", full permissions for the user. The permissions for the group and world are 5=101="r-x", so users in those categories will only have read/execute permissions.
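If the binary arithmetic feels abstract, this sketch decodes a three-digit mode into the familiar letters. The function permissions_from_octal is a hypothetical helper of my own, not a standard tool:

```python
def permissions_from_octal(mode):
    """Turn a three-digit mode string such as "755" into "rwxr-xr-x"."""
    letters = "rwx"
    result = ""
    for digit in mode:
        bits = int(digit)
        # Test the bits for read (4), write (2), and execute (1) in turn.
        for position, value in enumerate([4, 2, 1]):
            if bits & value:
                result += letters[position]
            else:
                result += "-"
    return result

print(permissions_from_octal("755"))  # rwxr-xr-x
```

The first three characters apply to the user, the next three to the group, and the last three to the world.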
OK, so now UNIX knows it's an executable file. Let's execute it!
$ add.py 5
Did it do what you expect?
In some systems, we would have had to type ./add.py 5 instead. Why might we have to put the ./ in front of add.py in some cases? This has to do with something called your PATH variable in UNIX. When you type in a command in UNIX, the system searches all the directories in your PATH to locate this program. On our systems, the current directory is in your PATH. If it weren't, you'd tell UNIX where it is, i.e. ./add.py, "please execute the add.py program that is in the . directory aka my current directory".
- Congrats! You now know how to read arguments from the command line! But now how do we make the actual adder? Well, first, from the command line argument, we know how many numbers the user will type in. Since we know how many times we need to ask the user for a number, a good control structure to use is a for loop. (Remember - when using range in Python, the first number is included and the last number is excluded. Thus, range(0,4) would refer to four numbers: 0,1,2,3.) Keep in mind that command line arguments arrive as strings, so number must be converted with int() before it can be passed to range. Second, to take the numbers, we will use input(), which in Python 2 evaluates what the user types, so a typed number is stored as a number. (raw_input() can take anything (strings, numbers, whatever), but it will store the input as a string - so if you want to use a number accepted from raw_input() in a numerical fashion, you must convert it to an int or a float.) So here we go:
number = int(number)
for i in range(0,number):
    userNum = input("Please enter a number: ")
- Great, we can read in the numbers. But now we need to add them up. As Prof. Holmes mentioned in class, in Python, there are many different ways to do something. Here, for example, we could save each of the numbers the user types in somewhere (maybe a list?) and then once we're done asking the user for numbers, we can just go through each number in the list and add them up. But, why don't we just keep a cumulative running sum of everything the user types in? So let's try that:
sum = 0
for i in range(0,number):
    userNum = input("Please enter a number: ")
    sum += userNum
- Finally, we want to print out our result. So our full program is
import sys
number = int(sys.argv[1]) # read in the number of numbers to add up from the command line
sum = 0
for i in range(0,number):
    userNum = input("Please enter a number: ")
    sum += userNum # keep a running sum
print "The sum is",sum
The text after each # is a comment, which Python ignores when it's reading through this program. Commenting your code is good programming style and helps to explain what you're doing in the code to other programmers and also to you later on, when you come back to this program a year later and may have forgotten why you did things a certain way. Comments are discussed further in the StyleGuidelines.
- Congrats! You've just written an adder! Test it out to make sure it adds numbers like you expect. But... What would happen if someone typed in $ add.py hello from the command line? What if they don't enter numbers when you ask them to? To find out, you can always test out your programs with "weird" input to see what your program does. Remember that not all users are informed about what your add.py is supposed to do. Right now, when something unexpected is entered by the user, you should get some complicated looking message. That's not very nice - or in computer speak, your program is not handling errors "gracefully". What would be more useful?
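One possible answer, sketched under my own assumptions: catch the conversion error and respond with something friendlier than a traceback. The helper names parse_count and add_all are hypothetical, and the inputs are passed in as lists of strings so the sketch can run without a keyboard or command line:

```python
def parse_count(argv):
    """Return the count given as the first argument, or None if unusable."""
    if len(argv) < 2:
        return None                 # no argument was supplied
    try:
        return int(argv[1])
    except ValueError:
        return None                 # e.g. the user typed "hello"

def add_all(entries):
    """Sum the entries that look like numbers, skipping the rest."""
    total = 0
    for entry in entries:
        try:
            total += float(entry)
        except ValueError:
            print("Skipping non-numeric entry: " + entry)
    return total

print(parse_count(["add.py", "hello"]))    # None
print(parse_count(["add.py", "3"]))        # 3
print(add_all(["1", "2.5", "oops"]))       # skip message, then 3.5
```

Whether to skip bad entries, ask again, or quit with a message is a design decision; skipping is just one choice among several.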
- A little more practice ... Extend your program by making a special feature if the user decides to enter 3 numbers. In addition to printing the sum, print out the numbers in descending order.
3. A slightly better calculator
Let's try something slightly different. Let's write a very similar program, but instead of making someone type in the numbers one by one, we'll read the numbers from a text file. The name of the text file will be passed in through a command line argument. And since we're the designers of this program, we can be a little annoying and impose the following rules:
- Each number should be on a separate line.
- We'll give a couple more functions to our calculator, specifically add, subtract, multiply, and divide.
- Each line should start with one of
+, -, *, /, followed by whitespace, followed by a number. This will tell our program to perform that operation on the running total and that number (for now, we're not worrying about order of operations).
- If a line does not start with one of the above symbols, we'll ignore that line
Here we go...
- Make a new file in the text editor and give it a descriptive name, e.g. calculate.py.
- One of the first things we need to figure out how to do is to read from a file. Prof. Holmes talked about this in class. To refresh your memory, we need to first open a file with open(), then loop over its lines to read them, and close the file when we're done reading. Again, in the spirit of starting simple, let's just start with a script that reads lines from a file and prints them on the screen.
import sys
filename = sys.argv[1]
infile = open(filename,"r")
for line in infile.readlines():
    print line
infile.close()
readlines() is a method of the file class that reads every line and returns them, in order, as a list.
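To see exactly what the list of lines looks like, here is a self-contained sketch that writes a throwaway file first and then reads it back. The file contents are made up for illustration, and the temporary file is my own addition so the example does not depend on a file you already have:

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
handle, path = tempfile.mkstemp(suffix=".txt")
os.close(handle)
outfile = open(path, "w")
outfile.write("+ 5\n- 2\nhello\n")
outfile.close()

infile = open(path, "r")
lines = infile.readlines()   # a list with one string per line
infile.close()
os.remove(path)

print(lines)  # ['+ 5\n', '- 2\n', 'hello\n']
```

Each string keeps its trailing newline, which matters as soon as you print the lines or compare them to something.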
- In order to try out this file-reading program, we have to talk about testing. Part of being a programmer is knowing how and when to test out your programs, which includes designing test cases. A good programmer constructs enough test cases to make sure that every part of the code works (i.e. making sure every condition in the program is tested and trying out as many unexpected cases as s/he can think of so even really weird inputs won't crash the program but give useful error messages, etc). Learning how to test code is a skill gained through time and experience, so we don't expect you to be experts right now. But it's good to keep in mind that a programmer's responsibility includes testing.
So how do we test out our script so far? Obviously, we need to have some test files, since our program reads from a file. I won't walk you through the steps in making a test file, but go ahead and make one now and test out your script so far, e.g. if I named my test file test.txt:
$ ./calculate.py test.txt
Does your program do what you expect?
- Ok, so we can read from the file....now, how do we actually do calculations with it? Let's think about the steps involved in English (this is a really good habit to get into and especially important for larger programs. Planning out your program in regular English helps you organize your thoughts before you start worrying about the actual syntax of a programming language. It's also helpful to then write an outline in pseudocode, so that you can see which parts of what you want to do are going to be dependent on one another, and get an idea of the eventual structure of your program.)
Every time I read a line from the file, I want to look at the first character to figure out what mathematical operation to perform on the following number. If I don't see a mathematical operator, I ignore the line. If I do see one, I read in the number and then do the appropriate mathematical operation using that number and the current total. Hey, this seems like a great place for some of that pattern matching we talked about in lecture! To begin working with regular expressions in Python, we first need to import a different module, the regular expression module known as re.
import re
import sys
filename = sys.argv[1]
infile = open(filename,"r")
for line in infile.readlines():
    addMatch = re.match(r"^\+\s([0-9]+)", line) # do you know why this regex can match what we want? What's with the \+?
    if addMatch:
        print "add", addMatch.group(1) # for now, just print out text to check that we're matching stuff right
    # elif ... (fill in the cases for the other math symbols)
    else:
        print "ignored line", line
infile.close()
Can you think of any other ways of organizing the logic? Maybe using [\+\-\*\/] somewhere? Is the output exactly what you expect? Could you fix the newlines if you wanted to (think of the string method rstrip()!)?
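As one way of organizing the matching (a sketch, not the required solution), a single pattern with a character class can capture the operator and the number together. The names LINE_PATTERN and parse_line are hypothetical, and I've assumed one or more whitespace characters between the symbol and the number:

```python
import re

# One pattern for all four operators; the class [+\-*/] matches any of them.
LINE_PATTERN = re.compile(r"^([+\-*/])\s+([0-9]+)")

def parse_line(line):
    """Return (operator, number) for a valid line, or None to ignore it."""
    match = LINE_PATTERN.match(line.rstrip())
    if match is None:
        return None
    return match.group(1), int(match.group(2))

print(parse_line("+ 5\n"))     # ('+', 5)
print(parse_line("* 12\n"))    # ('*', 12)
print(parse_line("hello\n"))   # None
```

This only splits a line into its pieces; deciding what to do with each operator is still up to you.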
- Now that we can read the lines in the file, the only step left is to actually perform the calculations and print out the results. Finish this on your own, and show me when you're finished.
4. Time to do a Python exercise (Homework)!
It's time for you to try out writing a script on your own because as Prof Holmes said in class, the best way to learn programming is by doing it.
Here your task is to reverse complement a FASTA file. Part of the solution has already been described in the Python lecture notes. Here are the requirements for your script:
- Open a file, whose filename is specified by the user as a command-line argument. That is, if the name of your Python script is programname and the name of the file is filename, then the script should be run by typing the following at the Unix command line:
$ programname filename
- Do some basic error handling: verify that the user entered a filename and that the file can be opened. If not, print informative error messages and exit.
- Read the contents of the file, assuming it is a FASTA file of DNA sequences, and as you're doing so, print the name and reverse-complement of every sequence on the standard output, in FASTA format. This means that you have to output no more than 80 characters per line! (You only have to worry about this for the actual sequence, not the description line).
- Enable a command line argument and the program logic to output the complement as an RNA sequence. That is, if you use the command line to type:
$ programname filename rna
the program should output the complement as RNA instead of DNA (U's instead of T's).
- You should add the sequence length L in basepairs to the end of the sequence label line that starts with ">" using a format of ", L bp". If the line was originally "> GFP, mut3", which had a dna length of 450, it should now read "> GFP, mut3, 450 bp".
- You will be graded on correctness (90%) and style (10%). Please see the StyleGuidelines for expectations about your style. You are not expected to use functions for this exercise, though you certainly may, so the "no redundancy" requirement is relaxed.
- Turn in your program by uploading it to your individual wiki page.
- As mentioned in lecture, you may work with 1 other student if you so choose. (Remember, you can turn in at most 3 other assignments with the same student). If you do work with another student, put "I worked with xxxx" in a comment at the top of your .py file. Each person should turn in code, even if it's the same.
- Information about LATE assignments: You lose 20% of the points of the assignment for every day it's late. Contact %GSI% or Professor Holmes at least 48 hours before the due date if you have extenuating circumstances, if reasonably possible.
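Two of the formatting requirements above (at most 80 sequence characters per line, and the ", L bp" suffix on the label) can be sketched in isolation as follows. The helper names wrap_sequence and label_with_length are hypothetical, and this deliberately leaves out the file reading and the reverse complement itself:

```python
def wrap_sequence(sequence, width=80):
    """Split a sequence string into lines of at most `width` characters."""
    lines = []
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return lines

def label_with_length(label, length):
    """Append ", L bp" to a FASTA description line."""
    return label.rstrip() + ", " + str(length) + " bp"

wrapped = wrap_sequence("ACGT" * 50)      # 200 bases
print([len(line) for line in wrapped])    # [80, 80, 40]
print(label_with_length("> GFP, mut3", 450))   # > GFP, mut3, 450 bp
```

Note that the length can only be known once the whole sequence has been read, which affects the order in which you print things.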
Here are some hints/things to think about:
- A general tip for writing programs - always try to write out in English what you want the program to do, and before you start writing a complicated program, write out pseudocode first...that is, a quasi-code-like representation of what your code will need to look like.
- Create your own test files by visiting the NCBI website to find nucleotide sequences. Search for some protein you know of (eg, hemoglobin) to get a long listing of results, then select a couple sequences from the list (avoid the 'whole genome' sequences and stick to the 'mRNA' sequences so you don't end up trying to process ridiculously huge files). Then on the dropdown boxes near the top, you have the option to show the selected sequences in FASTA format and also to save them in a file.
- Keep in mind that a valid FASTA file can contain 1 or more sequences. Test your script first with one sequence and then add more to your test file.
- If a sequence is longer than a line, you cannot just do a line-by-line reverse complement, because the reversed sequence has to start from the end of the last line.
- Don't get frustrated if your program doesn't work the way you want in the beginning. Even with many years of programming, a program rarely works on the first try!
- 23 Aug 2012
(credit to AngiChau
Copyright © 2008-2013 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.