fplot help file
This is the fplot help file. fplot is a Perl 5 program that reads a text file
of sequence features from standard input, and writes to standard output a
PostScript file which is a graph of the sequence features drawn proportionally
along the sequence.
The program assumes Perl 5 is installed at /usr/local/bin/perl (the path on
its #! line).
11/25/97
Written by Jim Lund, jiml@stanford.edu
Contents
1. Fplot usage and options
2. Formatting the feature text file
3. Generating the feature text file
4. Manipulating the feature text file
5. Feature item colors
6. Feature item types
7. Adding new feature types
8. Two example feature text files
1. Fplot usage and options
fplot
This Perl 5 program reads a text file of sequence features from standard
input, and writes either a PostScript file to standard output or a set of files
which make up an HTML page. The output is a graph of sequence features drawn
proportionally along the sequence.
Usage:
fplot [-a# -b# -c# -d#,# -fFONT -hHtml.base.name -i -l -m#,# -n# -s#,# -z#] \
      < feature_text_file > plot.ps
Switches:
-a#               Set the number of points each plotted feature line will take
                  up. Default is 8.
-b#               Set the height of features drawn on each feature line.
                  Default is 4; should be 1/2 or less of 'points per line'
                  (-a#) to keep features on adjacent lines from overlapping.
-c#               Set label space, the room reserved on the plot's right side
                  for the text labels to fit into. If you use less of the
                  space for labels, more is available for the plot. Default
                  is 150 points.
-d#,#             Draw features of only part of the DNA sequence in the text
                  file, the part in the specified range (-dSTART_BP,END_BP).
                  The end base pair can be indicated by number or as 'end',
                  capitalized or not.
-fFONT            Set the font in which text is printed. Default is
                  Times-Roman, a PostScript font on most printers.
-hHtml.base.name  HTML output is generated. A set of three or four files
                  is made, and all file names start with Html.base.name. The
                  image appears in one window, and the legend in a second
                  window. Putting the mouse over matches pops up the relevant
                  feature info in the other frame. Works only on Netscape
                  browsers.
-i                HTML output will be displayed in one browser window with
                  the legend in one frame, and the image in the other.
-l                Print the plot in landscape orientation.
-m#,#             Set page margins (-mX_MARGIN,Y_MARGIN). Default margins are
                  36 points.
-n#               Set the number of base pairs to be plotted per line.
                  Default is the entire sequence on a single line.
-z#               Set font size. Default is 9 point, which looks good with
                  lines spaced
-s#,#             Set page size in points (-sX_SIZE,Y_SIZE). Defaults are
                  612 points wide by 792 points high.
-q                Print out help page.
Typically, options -n, -l, and -d are the most used. The other options are
used more rarely when you want to fiddle with how the image looks. If you
want the other parameters to be different all the time, I recommend changing
the defaults where they are defined, in the global variables section of the
program.
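As a rough illustration of how -n and -d interact, the number of plot rows
can be estimated from the plotted range and the bp-per-line setting. This is
only a sketch: the function name rows_needed is made up for this example, and
it assumes the -d range is inclusive and that each row holds exactly the
number of base pairs given to -n. fplot itself may round differently.

```python
import math

def rows_needed(start_bp, end_bp, bp_per_line):
    """Estimate how many plot rows a range of base pairs will occupy.

    Assumption (not taken from the fplot source): the range is inclusive
    and every row holds exactly bp_per_line base pairs.
    """
    span = end_bp - start_bp + 1          # inclusive number of base pairs
    return math.ceil(span / bp_per_line)  # a partial last row still needs a row

# A 150000 bp range plotted at 10000 bp per line
print(rows_needed(50001, 200000, 10000))
```

An estimate like this can help in choosing a value for -n that keeps the
plot on a single page.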
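Example:
fplot < dna.fplot > dna.ps
dna.fplot is a text file formatted to be read by fplot. The
image is written to the file dna.ps.
fplot -n10000 -l -d50001,200000 < dna.fplot > dna.ps
The image is in landscape orientation, with 10000 bp/line scaling.
The part of the DNA sequence from bp 50001 to bp 200000 is plotted.
fplot -hDNA.summary < dna.fplot
HTML output is generated, in a set of files whose names start with
DNA.summary.
2. Formatting the feature text file
The first token is ">", followed by the name of the sequence, as in a FASTA
file.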
The second token is "Length:" followed by a number, the length of the DNA
sequence.
Next come the sections for each plot feature line, one per feature line. A
section starts with the token "Line:" followed by a description of what's on
the line. The description gets written on the plot to the right of the feature
line.
Optionally, a "Fields:" token can appear, followed by a comma-separated list
of field descriptors. The "Fields:" token is used when HTML is output, to
give titles to the feature info fields in the HTML legend. It can appear
anywhere after the line with the "Line:" token, and more than one "Fields:"
declaration can appear within a plot feature line. A new plot feature line
keeps using the last "Fields:" declaration of the previous one until a new
"Fields:" declaration is made.
Individual feature items come next. A feature item line begins with a
semicolon, ";", followed by 4 numbers and, optionally, comma-separated info
fields. The numbers are feature start, feature stop, color, and feature type.
Feature types 13 and 14 (height bar and height block) take a parameter. This
must follow the color number, and range from 0 to 1.0. Numbers greater than
1.0 will be drawn, but they will extend over the normal feature height. This
is a feature.
Color is a number from 0 to 27. Feature types currently range from 0 to 26.
The 4 numbers can be separated by any non-number character (except f, F, r,
or R, which indicate exon type features), and after the 4 numbers anything can
be written (I usually put a description of the feature). See more about
color and feature types below.
After a section begins (with the "Line:" token), the feature item lines that
follow are drawn on the current plot feature line. When fplot encounters a new
"Line:" token, a new plot feature line begins. Lines that don't begin with a
token are ignored by the program.
A "Line:" token without a description means that no axis will be drawn for
this plotting line.
Features that come later are drawn over earlier ones. Keep this in mind. It
has its uses; in plotting dot plot results, I first list the low stringency
matches in the feature text file in one color, and then the smaller regions
of better matches come afterward and get drawn over part of the weak match.
-------------------Start example feature text file------------------------
>DGS cosmids 103a2 to 24b, bp 40001-80000
Length:1110702
This version has the numerous poor DGS-F matches deleted.
The 1110702 bp query DNA sequence: 103a2
Searched against the database: DGSdb9-26-97
Matches with a HSP score of at least 150 are reported.
Matches separated by 20 or less bps are listed as a single longer match.
Query Database Seq. Percent
Start Stop Start Stop Len. Identity Description
Line:BLAST of 9-26-97 DGS db
Fields: Start bp in DB seq, Stop bp in DB seq, Length of DB seq., Percent Identical bp in match, Description
-------------------------------------------------------------------------------
;4139,4241,2,1, 104 2 1284 100% X91348:H.sapiens predicted non coding cDNA (DGCR5).
;4240,4281,2,1, 94 135 1284 88% X91348:H.sapiens predicted non coding cDNA (DGCR5).
;21340,21509,2,1, 108 277 1284 92% X91348:H.sapiens predicted non cod
...and so on
-------------------End example feature text file--------------------------
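The format rules above can be summarized in a small reader. What follows is
a minimal sketch in Python, not fplot's own parser (fplot is written in
Perl): the names parse_feature_file, ITEM_RE and STRAND_TYPES are invented
for this illustration. It assumes feature items appearing before the first
"Line:" token can be skipped, and it does not handle the extra parameter
taken by feature types 13 and 14.

```python
import re

# ';start<sep>stop<sep>color<sep>type' followed by free text. The type is a
# number or one of F, f, R, r (the exon-style feature letters).
ITEM_RE = re.compile(r";\s*(\d+)\D+?(\d+)\D+?(\d+)\D+?(\d+|[FfRr])(.*)")
STRAND_TYPES = {"F": 1, "f": 1, "R": 2, "r": 2}

def parse_feature_file(text):
    """Parse fplot-style feature text into a dictionary (illustrative only)."""
    parsed = {"name": None, "length": None, "lines": []}
    current = None   # plot feature line currently being filled
    fields = []      # most recent "Fields:" declaration, carried forward
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(">") and parsed["name"] is None:
            parsed["name"] = line[1:].strip()
        elif line.startswith("Length:"):
            parsed["length"] = int(line[len("Length:"):])
        elif line.startswith("Line:"):
            current = {"description": line[len("Line:"):].strip(),
                       "features": []}
            parsed["lines"].append(current)
        elif line.startswith("Fields:"):
            fields = [f.strip() for f in line[len("Fields:"):].split(",")]
        elif line.startswith(";") and current is not None:
            m = ITEM_RE.match(line)
            if m:
                start, stop, color, ftype, info = m.groups()
                current["features"].append({
                    "start": int(start),
                    "stop": int(stop),
                    "color": int(color),
                    "type": STRAND_TYPES.get(ftype) or int(ftype),
                    "info": info.strip(" ,"),
                    "fields": fields,
                })
        # Any other line carries no token and is skipped.
    return parsed
```

A reader like this is handy for checking a feature text file before plotting,
for example to count how many feature items each "Line:" section holds.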
3. Generating the feature text file
This program makes a plot of features in a DNA sequence (or any other
proportional plotting needs you can think up, really). The sequence analysis
is done with other programs, and the output from these programs needs to be
combined and formatted so fplot can parse it. How to do this? The output from
the analysis programs can be copied and concatenated into one file using pretty
much any word processor. The header and section headings can easily be added
by hand. Getting the feature item lines formatted is the only hard part.
This can be done in several ways. The UNIX text editor vi (in line editing
mode) does the job well if you are familiar with regular expressions, but
it is hard to learn to use. I use vi, and do a search and replace on all the
feature item lines from a particular source at once, the replacement making
the sequence position numbers come first, followed by a color and feature type.
Alternatively, you can try using the search and replace capabilities of your
favorite word processor. I don't know of a good way to do this in every case,
because these programs often have trouble recognizing the beginning of lines.
Doing it by hand is always an option, but for large, feature-rich sequences
this can be time consuming, and makes fplot harder to use. The easier it is to
format other programs' output for fplot, the more useful the program is, so
give some thought to optimizing this step if you expect to make a lot of use of
fplot. Learning vi is a *good thing* in any case. :)
Keep an eye on the output analysis programs generate. GRAIL, for example,
indicates reverse strand exons as end_of_exon_bp start_of_exon_bp, and this
order needs to be reversed for fplot to draw the exons.
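For those who would rather script this step than use an editor, here is a
minimal sketch in Python. The function names to_feature_line and
convert_table are made up for this example, and the input layout is an
assumption: a table whose data rows begin with a start position and a stop
position separated by whitespace. Real analysis program output will usually
need its own small adjustments.

```python
def to_feature_line(start, stop, color, ftype, info=""):
    """Build one feature item line of the form ';start,stop,color,type, info'."""
    start, stop = int(start), int(stop)
    if start > stop:
        # Some programs (GRAIL reverse strand exons, for example) list the
        # end position first; put the smaller coordinate first.
        start, stop = stop, start
    line = f";{start},{stop},{color},{ftype}"
    if info:
        line += f", {info}"
    return line

def convert_table(rows_text, color, ftype):
    """Turn rows that start with two integers into feature item lines.

    Rows that do not start with two integers (headers, comments) are passed
    through unchanged; fplot ignores lines that do not begin with a token.
    """
    out = []
    for raw in rows_text.splitlines():
        parts = raw.split(None, 2)
        if len(parts) >= 2 and parts[0].isdigit() and parts[1].isdigit():
            info = parts[2].strip() if len(parts) > 2 else ""
            out.append(to_feature_line(parts[0], parts[1], color, ftype, info))
        else:
            out.append(raw)
    return "\n".join(out)
```

The color and feature type are passed in per table, which mirrors doing one
search and replace per source of feature item lines.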
4. Manipulating the feature text file
Simple changes in the feature text file can be used to refine the plot that
gets drawn. Two plotted feature lines can be combined into one by removing
the "Line:" token on the second one. The different types of information
can be discriminated by using different feature types.
In a section representing database matches, there may be spurious matches that
clutter up the plot. They can be removed from the drawn image by deleting the
semicolon at the beginning of their lines. This is preferable to just
deleting them, as you may want them drawn in when presenting the data for
another purpose or doing other analysis, and deleting the token is less
permanent.
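Removing semicolons from many lines at once can also be scripted. The sketch
below is in Python and is only an illustration: the names disable_features
and below_identity are invented here, and the identity filter assumes each
match line contains a percentage written like "74%", as in the BLAST tables
used as examples in this file.

```python
import re

def disable_features(text, should_disable):
    """Drop the leading ';' from feature item lines selected by should_disable.

    The rest of each line is kept, so a feature can be restored later by
    putting the semicolon back.
    """
    out = []
    for line in text.splitlines():
        if line.startswith(";") and should_disable(line):
            line = line[1:]
        out.append(line)
    return "\n".join(out)

def below_identity(threshold):
    """Return a test that selects lines whose first 'NN%' value is below threshold."""
    def check(line):
        m = re.search(r"(\d+)%", line)
        return m is not None and int(m.group(1)) < threshold
    return check
```

For example, disable_features(text, below_identity(80)) would leave a 100%
match untouched and strip the semicolon from a 74% match.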
5. Feature item colors
There are 28 colors recognized by the program. Color is indicated in the
feature text file by a number from 0 to 27. Colors 2-7 are the primary colors.
0 White
1 Black
2 Red
3 Yellow
4 Green
5 Light Blue
6 Blue
7 Fuchsia
8 Maroon
9 Forest green
10 Olive
11 Orange
12 Spring green
13 Navy
14 Royal purple
15 Hot pink
16 Gray-blue
17 Gray
18 Peach
19 Sea green
20 Pale green
21 Pale yellow
22 Purple
23 Teal blue
24 Gray purple
25 Pink
26 Baby blue
27 Black
6. Feature item types
I use a few conventions when planning what sequence feature gets paired
with what symbol. Generally, features that are drawn using a symbol
centered on the line are strand neutral, or strand independent, items, such
as repetitive DNA sequence. Items drawn above the line are forward
strand items; items hanging below the line are reverse strand items.
Feature
number  Description                        Used for:
-----------------------------------------------------------------------------
0       Strand neutral box                 Strand neutral feature
1,F,f   Forward strand box                 Exon, forward strand
2,R,r   Reverse strand box                 Exon, reverse strand
3       Strand neutral 1/2 height box
4       Forward strand 1/2 height box
5       Reverse strand 1/2 height box
6       Forward strand caret               GRAIL poly A site, forward strand
7       Reverse strand caret               GRAIL poly A site, reverse strand
8       Triangle forward strand            GRAIL polII promoter, forward strand
9       Triangle reverse strand            GRAIL polII promoter, reverse strand
10      Arc                                GRAIL CpG island
11      Tick mark forward strand           Restriction enzyme site, ??
12      Tick mark reverse strand
13      Height bar                         Lineplot match, percent repetitive
14      Height block                       Lineplot match, percent repetitive
15      Strand neutral dotted line type 1
16      Forward strand dotted line type 1  Connect 'exons' in BLAST results
17      Reverse strand dotted line type 1  Connect 'exons' in BLAST results
18      Strand neutral dotted line type 2
19      Forward strand dotted line type 2
20      Reverse strand dotted line type 2
21      Arrow type 1                       Thick arrow on the line pointing right
22      Arrow type 2                       Thick arrow on the line pointing left
23      Arrow type 3                       Arrow on the line pointing right
24      Arrow type 4                       Arrow on the line pointing left
25      Arrow type 5                       Forward strand
26      Arrow type 6                       Reverse strand
27      Small text centered
28      Small text left justified
29      Small text right justified
30      Large text centered
31      Large text left justified
32      Large text right justified
33      Giant text centered
34      Giant text left justified
35      Giant text right justified
7. Adding new feature types
If you know a little Perl programming and a little PostScript programming,
you can add new feature types to the program. Here are directions for doing
so:
1. In the feature subroutine, in the line "if (($shape > 26) || ($shape < 0))",
increase 26 by one to allow your new feature type to be recognized.
2. Copy the elsif section for an existing feature type, and insert it after
the last feature. For example, copy:
elsif ($shape == 6)
  {printf("gsave %f %f %f setrgbcolor %d %d %d %d Tri grestore\n",
          $color_r,$color_g,$color_b,
          $x2+1,$y_pos,$x1-1,(($y_pos+($rect_height/2))+1));
   last FEATURE_SW;
  }
and insert it after the last feature block (feature 26 right now).
3. Give it a new number; for example, the next feature would be 27. Change
"$shape == 6" to "$shape == 27".
4. Each feature block writes PostScript code to standard output to draw
one feature. Use the variables $color_r, $color_g, and $color_b to set the
color. $x1 is the position of the starting bp, $x2 is the position of the end
bp, and $y_pos is the position of the feature plot line. Your feature should
stay between $y_pos+($rect_height/2) and $y_pos-($rect_height/2) to keep from
bumping into the neighboring feature lines. Remember, a feature sticking
up from the line has y coordinate values less than $y_pos.
5. You can use the existing PostScript functions I've written in the program:
"Tri" draws a solid right triangle given the coordinates of the hypotenuse.
The first point is the one on the feature line, the second point is the one
that sticks up (or down). It is called by "x1 y1 x2 y2 Tri".
"Rec" draws a solid rectangle given the coordinates of two corners. It is
called by "x1 y1 x2 y2 Rec".
"Tic" draws a line centered on the feature line given the center point and
the number of points it extends in each direction. It is called by
"x1 y1 y2 Tic". x1, y1 is the center; y2 is added to and subtracted from y1
to give the extension.
6. Put any new variables in the variable section of the FEATURE subroutine.
Put new PostScript variables or functions in the PostScript header section.
7. That's it. If you add something, please email it to me so I can see!
Things I've thought of adding but haven't: open rectangle, open triangle,
striped box, horizontal arrows (open or closed), vertical arrows,
thinner or thicker boxes, or tick markers.
8. Two example feature text files
In the first example file:
Note the FASTA sequence name, "DGS cosmids 103a2 to 24b, bp 40001-80000", the
"Length:" token, and the first feature plot line, whose description is
"BLAST of 9-26-97 DGS db". The first feature item line starts
";4139,4241,2,1, 104 2 1284 100% X91...", and will draw a red box on top of
the line from bp 4139 to 4241 to indicate an exon of DGCR5. Note that the
BLAST match "31578,31645,4,1, 1442 1509 2309 75% L77571:Homo sapiens DGS-A
mRNA, 3' end." will not appear in the plot because it doesn't have a ";" token
at the beginning of the line. For what it's worth, this BLAST match table
was generated from the BLAST output by another short Perl program, parse. Then
the semicolons and the color and feature type numbers were added using a vi
search and replace command.
-----------------Start example feature file-------------------------------
>DGS cosmids 103a2 to 24b, bp 40001-80000
Length:1110702
This version has the numerous poor DGS-F matches deleted.
The 1110702 bp query DNA sequence: 103a2
Searched against the database: DGSdb9-26-97
Matches with a HSP score of at least 150 are reported.
Matches separated by 20 or less bps are listed as a single longer match.
Query Database Seq. Percent
Start Stop Start Stop Len. Identity Description
Line:BLAST of 9-26-97 DGS db
-------------------------------------------------------------------------------
;4139,4241,2,1, 104 2 1284 100% X91348:H.sapiens predicted non codin
g cDNA (DGCR5).
;4240,4281,2,1, 94 135 1284 88% X91348:H.sapiens predicted non codin
g cDNA (DGCR5).
;21340,21509,2,1, 108 277 1284 92% X91348:H.sapiens predicted non cod
ing cDNA (DGCR5).
;22244,22320,2,1, 184 108 1284 74% X91348:H.sapiens predicted non cod
ing cDNA (DGCR5).
;24234,24441,2,1, 276 483 1284 100% X91348:H.sapiens predicted non cod
ing cDNA (DGCR5).
;25328,25534,2,1, 477 683 1284 99% X91348:H.sapiens predicted non cod
ing cDNA (DGCR5).
;26096,26338,3,2, 245 2 245 94% U84528:Human velo-cardio-facial sy
ndrome 22q11 region mRNA sequence.
;30729,31540,4,1, 595 1407 2309 88% L77571:Homo sapiens DGS-A mRNA, 3'
end.
31578,31645,4,1, 1442 1509 2309 75% L77571:Homo sapiens DGS-A mRNA, 3'
end.
;31830,31875,4,1, 1655 1700 2309 89% L77571:Homo sapiens DGS-A mRNA, 3'
end.
;47869,48025,2,1, 682 838 1284 99% X91348:H.sapiens predicted non cod
ing cDNA (DGCR5).
;51454,53762,4,1, 1 2309 2309 99% L77571:Homo sapiens DGS-A mRNA, 3'
end.
;56104,56550,5,1, 1 447 447 99% L77559:Homo sapiens DGS-B partial
mRNA.
;64092,64544,2,1, 833 1284 1284 100% X91348:H.sapiens predicted non cod
ing cDNA (DGCR5).
;69599,70081,6,2 4398 3916 4398 97% X84076:H.sapiens mRNA for DGCR2.
;69616,69752,6,2 3987 3836 3999 72% D78641:Mouse mRNA for Membrane Glyc
-----------------End example feature file---------------------------------
-----------------Start example feature file #2-------------------------------
>BAC bD3-6 sequence analysis
Length:127359
Line: Genes, ESTs, and features from Genbank search
;13541,14040,1,0,- matches MHC II IE intron 1 at 89%
;21888,22150,16,F GRAIL 2 excellent exon, Kelch exon 1, T1
;29207,29372,16,F GRAIL 2 excellent exon, Kelch exon 2
;33905,34693,16,F GRAIL 2 excellent exon, Kelch exon 3 +ESTs
;35955,36318,3,F Mouse EST MUSF076A, T2
;41708,41900,16,F GRAIL 2 excellent exon, Kelch exon 4 +EST
;46644,46927,16,F GRAIL 2 excellent exon, Kelch exon 5 +ESTs
;49987,50246,16,F GRAIL 2 excellent exon, Kelch exon 6 +ESTs
;54985,55377,11,R 2 mouse ESTS, T3
;57032,57490,26,F Mouse EST W53987, T4
;60011,60112,22,F GRAIL 2 excellent exon, KIAA0149 match 1:
60013-111
;60271,60786,22,F KIAA0149 match 2
;60877,61095,22,F GRAIL 2 excellent exon, KIAA0149 match 3: 60878-1096
;61227,61358,22,F KIAA0149 match 4
;61801,61906,22,F KIAA0149 match 5
;62304,62419,22,F GRAIL 2 excellent exon, T5 exon 1 bp 62338-62419 T5 matche
s 5 ESTs
;63846,63998,22,F GRAIL 2 excellent exon, T5 exon 2 KIAA0149 match 6 & 7: 63
845-901, 63961-991
;64245,64677,22,F T5 exon 3
;65228,65795,7,R 11 mouse ESTs, T6
;72742,73329,1,0,- ZNF74-1 ZN finger protein homology 76-82%
;97847,100269,8,R, DGCR2 exon 10, T7
;101552,101460,8,R, GRAIL 2 excellent exon, DGCR2 exon 9 bp 101456 101699
;102623,102471,8,R GRAIL 2 excellent exon, DGCR2 exon 8 bp 102471 102625
;107356,107080,8,R GRAIL 2 excellent exon, DGCR2 exon 7 bp 107097 107284
;114237,114404,8,R DGCR2 exon 6
;114826,114750,8,R GRAIL 2 excellent exon, DGCR2 exon 5 bp 114749 114826
;116028,115857,8,R GRAIL 2 excellent exon, DGCR2 exon 4 bp 115856 116067
;117023,116901,8,R GRAIL 2 excellent exon, DGCR2 exon 3 bp 116900 117026
Line:Dotplot vs. Human seq. using a 100bp window
Dotplot vs. Human cosmid 103a2 to U30597
67% is yellow, 77% is cyan, 87%+ is purple
The first DNA seq. is 127359 bp, in file /disk2/people/jiml/s/bD3-6.mask2.a:
-- >mouse bac bD3-6 00061
Comparing the DNA seqs. using a 100 bp window, returning regions of
67% or greater homology sustained over at least 100 bp.
DNA seq. 1 DNA seq. 2
Line Line Line Line Match
Start End Start End Length
---------------------------------------------
;91051,91203,3,0, 62835 62987 153
;91106,91205,3,0, 62890 62989 100
;91160,91260,3,0, 62945 63045 101
;91163,91314,3,0, 62948 63099 152
;91217,91317,3,0, 63002 63102 101
;91265,91403,3,0, 63045 63183 139
;91405,91577,3,0, 63182 63354 173
;91548,91650,3,0, 63331 63433 103
;91553,91779,3,0, 63336 63562 227
;92928,93047,3,0, 65300 65419 120
;93102,93232,3,0, 65493 65623 131
;94029,94132,3,0, 66162 66265 104
;94040,94144,3,0, 66173 66277 105
;94052,94152,3,0, 66185 66285 101
;94266,94365,3,0, 66429 66528 100
;94271,94372,3,0, 66434 66535 102
;94744,94953,3,0, 67063 67272 210
;98944,99065,3,0, 70940 71061 122
;99974,100304,3,0, 72146 72476 331
;101424,101765,3,0, 74337 74678 342
;102390,102663,3,0, 75041 75314 274
;105612,105776,3,0, 80024 80188 165
;107070,107323,3,0, 81740 81993 254
;114202,114361,3,0, 90254 90413 160
;114296,114446,3,0, 90347 90497 151
;114709,114870,3,0, 96465 96626 162
;115808,116146,3,0, 98103 98441 339
;116851,117034,3,0, 101352 101535 184
Regions of 72% or greater homology:
---------------------------------------------
;91067,91176,3,0, 62851 62960 110
;91083,91182,3,0, 62867 62966 100
;91085,91188,3,0, 62869 62972 104
;91178,91307,3,0, 62963 63092 130
;91278,91397,3,0, 63058 63177 120
;91411,91570,3,0, 63188 63347 160
;91563,91662,3,0, 63346 63445 100
;91573,91672,3,0, 63356 63455 100
;91610,91710,3,0, 63393 63493 101
;91620,91722,3,0, 63403 63505 103
;91625,91732,3,0, 63408 63515 108
;91638,91740,3,0, 63421 63523 103
;91643,91761,3,0, 63426 63544 119
;92933,93039,3,0, 65305 65411 107
;93118,93220,3,0, 65509 65611 103
;94753,94942,3,0, 67072 67261 190
;94845,94944,3,0, 67164 67263 100
;94847,94946,3,0, 67166 67265 100
;99980,100296,3,0, 72152 72468 317
;101430,101757,3,0, 74343 74670 328
;102399,102657,3,0, 75050 75308 259
;105628,105732,3,0, 80040 80144 105
;107076,107315,3,0, 81746 81985 240
;114208,114354,3,0, 90260 90406 147
;114302,114402,3,0, 90353 90453 101
;114309,114434,3,0, 90360 90485 126
;114717,114860,3,0, 96473 96616 144
;114764,114863,3,0, 96520 96619 100
;115815,116141,3,0, 98110 98436 327
;116859,117026,3,0, 101360 101527 168
Regions of 77% or greater homology:
---------------------------------------------
;91419,91551,5,0 63196 63328 133
;94777,94876,5,0 67096 67195 100
;94779,94883,5,0 67098 67202 105
;94802,94902,5,0 67121 67221 101
;94813,94912,5,0 67132 67231 100
;94821,94936,5,0 67140 67255 116
;99994,100094,5,0 72166 72266 101
;99998,100193,5,0 72170 72365 196
;100096,100219,5,0 72268 72391 124
;100129,100282,5,0 72301 72454 154
;100187,100286,5,0 72359 72458 100
;101442,101732,5,0 74355 74645 291
;102408,102650,5,0 75059 75301 243
;107083,107309,5,0 81753 81979 227
;114218,114347,5,0 90270 90399 130
;114725,114852,5,0 96481 96608 128
;115822,116134,5,0 98117 98429 313
;116867,117017,5,0 101368 101518 151
Regions of 82% or greater homology:
---------------------------------------------
;91425,91538,5,0 63202 63315 114
;91442,91541,5,0 63219 63318 100
;100033,100154,5,0 72205 72326 122
;100158,100271,5,0 72330 72443 114
;101450,101556,5,0 74363 74469 107
;101486,101724,5,0 74399 74637 239
;102416,102644,5,0 75067 75295 229
;107092,107303,5,0 81762 81973 212
;114226,114339,5,0 90278 90391 114
;114732,114833,5,0 96488 96589 102
;114741,114840,5,0 96497 96596 100
;114745,114844,5,0 96501 96600 100
;115831,116124,5,0 98126 98419 294
;116873,117007,5,0 101374 101508 135
Regions of 87% or greater homology:
---------------------------------------------
;101516,101616,14,0 74429 74529 101
;101519,101619,14,0 74432 74532 101
;101522,101704,14,0 74435 74617 183
;102435,102534,14,0 75086 75185 100
;102438,102632,14,0 75089 75283 195
;102536,102636,14,0 75187 75287 101
;107097,107198,14,0 81767 81868 102
;107101,107206,14,0 81771 81876 106
;107110,107225,14,0 81780 81895 116
;107137,107279,14,0 81807 81949 143
;107190,107290,14,0 81860 81960 101
;115837,116002,14,0 98132 98297 166
;115917,116029,14,0 98212 98324 113
;115939,116038,14,0 98234 98333 100
;115947,116056,14,0 98242 98351 110
;115965,116085,14,0 98260 98380 121
;116001,116116,14,0 98296 98411 116
;116878,116988,14,0 101379 101489 111
Regions of 92% or greater homology:
---------------------------------------------
;102446,102612,14,0 75097 75263 167
;115846,115945,14,0 98141 98240 100
Line:ORFs 100-150aa yellow,150-200 green,200+ blue
;519,824,3,R, Frame -3
;2041,2376,3,R, Frame -1
;5575,5955,3,R, Frame -1
;9600,10148,3,F, Frame 3
;9605,10228,3,R, Frame -2
;11227,11547,3,F, Frame 1
;11664,12029,3,F, Frame 3
;12284,12730,3,F, Frame 2
;14103,14519,3,F, Frame 3
-----------------End example feature file #2---------------------------------
End of help file.
Updated 2/99
Written by Jim Lund in the lab of Roger Reeves, Johns Hopkins University