A Brief Tutorial on Navigating Galaxy and kmer-SVM
Galaxy is a web framework in which computational tools for biological data analysis can be made easily available to experimental biologists while requiring a minimum amount of computer-specific knowledge on their part. kmer-SVM is a webserver built on the Galaxy framework that enables the mining of sequence data for transcription factor binding sites. In this tutorial we will highlight the essential features of the Galaxy interface and demonstrate a sample analysis using kmer-SVM.
Introduction to Galaxy
We begin by describing the Galaxy interface. The kmer-SVM homepage (pictured below) has 4 important features: (1) the top bar, (2) the Tools menu, (3) the center panel, and (4) the History bar.
- The top bar contains the following links:
- Galaxy/Beer Lab - Returns to the home page. Available from all pages on this site.
- Analyze Data - Where analyses are performed. This is the homepage.
- Workflow - Galaxy allows the creation of workflows, which make it easy to repeat the same kind of analysis. The image on the kmer-SVM homepage shows a Galaxy workflow for kmer-SVM.
- Shared Data - Data and other items publicly available to all users of kmer-SVM are found here. kmer-SVM offers datasets for learning to use the tool under the menu item 'Public Libraries'.
- Visualization - For visualizing data in browsers such as UCSC Genome Browser.
- Help - Links to more extensive documentation for the Galaxy framework. The Galaxy Wiki and the Screencasts are recommended for users wanting more extensive instruction in the use of the Galaxy framework.
- User - For registering with and logging into kmer-SVM. Not required, although some features such as workflows do require a login.
- Status bar - Lists the percentage of storage space used by the user's data.
- The Tools menu lists the tools available to users of kmer-SVM. Some of these tools will be discussed in more detail below.
- The center pane is where data is actually analyzed. Menus from specific tools are loaded here and results of some analyses can be visualized here as well. In this screenshot it shows text included in the homepage.
- The history bar is where components of an analysis can be accessed. Uploaded data appears in the history bar, as do the results of analyses.
A Sample kmer-SVM Workflow
Getting Data from a Sample Library
Here we will walk through the workflow briefly described on the kmer-SVM homepage. Begin by clicking on 'Shared Data' and selecting 'Public Libraries' from the dropdown menu. You will see the following selection of libraries:
Click on 'Test 5: Esrrb'. You will be taken to the library page:
We are going to import this data so we can use it for the purposes of this tutorial. Click the checkbox next to "Name" in the gold bar in the middle of the page. All items in the library should now be selected. Towards the bottom of the page is a dropdown menu which says "Import to current history". Click the 'Go' button next to it. You should see the following confirmation message:
If you click on the 'Galaxy/Beer Lab' logo in the upper left of the page, you will return to the homescreen:
The imported data is now in the History bar. Note that there are 2 types of data: FASTA files and BED files. We are going to work with the BED files. Also note that if you click on the name of an item in your History, you will see a menu like the following:
There are 3 very useful icons. The 'X' in the upper right is to delete an item from your history, the floppy disk is to download an item to your computer, and the eye-shaped icon is to view your dataset in the center pane.
The Get Data Tool
While our tutorial uses datasets already available from Galaxy, you can upload your own data using the 'Get Data' tool. The tool can upload data from a variety of locations. To get data from your own computer, select the 'Upload File' link, then click on 'Choose File' to browse your computer and select the file you wish to work with. To upload, click the blue 'Execute' button.
Prepping Data and Training an SVM
kmer-SVM trains on FASTA files. Because the output of an experiment is often a BED file, we will first show how to get the DNA sequences referenced by a given BED file. We will start with the positive sequences, that is, the sequences we want to mine for their sequence content.
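For readers curious about what "getting the DNA sequences referenced by a BED file" involves, the following is a minimal sketch and not the code the server runs. It assumes the genome is held in a hypothetical in-memory dictionary (`genome`) and uses a made-up FASTA header format (`>chrom:start-end`); the headers produced by the actual tool may differ. The one firm fact it relies on is that BED coordinates are 0-based and half-open.

```python
def bed_to_fasta(bed_lines, genome):
    """Return FASTA text for the intervals listed in bed_lines.

    genome is a hypothetical dict mapping chromosome name -> sequence string.
    BED intervals are 0-based and half-open, i.e. [start, end), which maps
    directly onto Python slicing.
    """
    records = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and the optional BED header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        sequence = genome[chrom][start:end]
        # Header format is illustrative only.
        records.append(f">{chrom}:{start}-{end}\n{sequence}")
    return "\n".join(records) + "\n"


# Toy example: a 16 bp "genome" and two BED intervals.
toy_genome = {"chr1": "ACGTACGTAACCGGTT"}
toy_bed = ["chr1\t0\t4", "chr1\t8\t12"]
print(bed_to_fasta(toy_bed, toy_genome), end="")
```

On the server, the stored genome assemblies play the role of `toy_genome`, and the result is the FASTA dataset that appears in the History bar.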
In the Tools menu, click on 'Fetch Sequences' and then 'Extract Genomic DNA'. At the top of the center pane, select '1: ESRRB_mm8.bed' and leave all other settings alone. You should see the following webpage:
Click on the blue 'Execute' button. The server will extract DNA sequences from a stored genome (currently only supported for mm8, mm9, hg18 and hg19). When all sequences have been extracted, a new dataset should appear in the History bar:
We will now do the same for the negative dataset, that is, the dataset against which the positives are compared.
- Click on 'Extract Genomic DNA' in the Tools bar again, and this time make sure the top dropdown box says '3: ESRRB_mm8_neg10x.bed'.
- Click the blue 'Execute' button again. When the server has finished, we should have two new entries in the History bar.
- Click on 'kmer-SVM' in the Tools bar, then 'Train SVM'. Make sure that the dropdown menu labeled 'Positives' says '5: Extract Genomic DNA on data 1'.
We are not going to change any settings for this tutorial. Click on the blue 'Execute' button. kmer-SVM will now train an SVM on the input datasets, learning the difference between the positives and negatives. This should take approximately 20 minutes.
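To give a rough idea of the kind of information an SVM can learn from, the sketch below turns a DNA sequence into kmer counts. It is an illustration only, under the assumption that a kmer and its reverse complement are counted together; the function names are hypothetical and the features computed by the server itself may differ in detail.

```python
from collections import Counter

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA string."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Pick one representative for a kmer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def kmer_counts(sequence, k=6):
    """Count canonical kmers of length k in a sequence.

    Windows containing characters other than A, C, G, T (such as N)
    are skipped.
    """
    sequence = sequence.upper()
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[canonical(kmer)] += 1
    return counts


# Toy example with k=2 so the output stays readable.
print(kmer_counts("ACGTAC", k=2))
```

Each positive and negative sequence would yield one such table of counts, and the classifier learns which kmers are over-represented in the positives relative to the negatives.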
The Output of a Trained SVM
When kmer-SVM is finished training, the History bar will show an additional 2 items: one labeled "Predictions" and one labeled "Weights".
The "Weights" file is a text file containing the weights for all unique kmers of length k. A kmer and its reverse complement are counted as one, so for k=6 there are 2080 kmers rather than 4^6 = 4096. Click on the eye icon to view the weights file. The kmers whose weights have the largest magnitude are the most informative to the trained SVM. Users are referred to the accompanying publication for a discussion on interpreting these weights. The values with '#' in front of them are used in the prediction of sequences not in the training data set.
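The figure of 2080 for k=6 can be checked with a few lines of code. The sketch below assumes that "unique" means a kmer and its reverse complement are treated as the same kmer, which is the counting convention that yields 2080; the function name is hypothetical.

```python
from itertools import product


def count_unique_kmers(k):
    """Count kmers of length k, treating a kmer and its reverse
    complement as the same kmer."""
    complement = str.maketrans("ACGT", "TGCA")
    seen = set()
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        revcomp = kmer.translate(complement)[::-1]
        # Keep the alphabetically smaller of the pair as its representative.
        seen.add(min(kmer, revcomp))
    return len(seen)


print(count_unique_kmers(6))  # 2080
```

For even k the same number follows from (4^k + 4^(k/2)) / 2: the 4^(k/2) kmers that equal their own reverse complement are counted once each, and all remaining kmers pair up.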
The "Predictions" file is the output of cross-validation, and is used to assess the accuracy of the trained SVM. Specifically, the predictions file is used in conjunction with the tools "Plot ROC Curve" and "Plot PR Curve". Users are directed to the tool pages and the publication accompanying this webserver for explanations of the ROC and PR curves. The summary statistic to be aware of in each case is the area under the curve (AUC-ROC and AUC-PR). The AUC-ROC in particular can be interpreted as the probability that the trained classifier scores a randomly chosen positive data point higher than a randomly chosen negative one. We will look at the ROC curve, as the operations for creating a PR curve are similar.
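The pairwise reading of the area under the ROC curve can be made concrete with a small sketch. It assumes the predictions have already been loaded into two parallel lists, one of classifier scores and one of labels (1 for positive, anything else for negative); the layout of the actual "Predictions" file is not described here, so parsing is left out and the function name is hypothetical.

```python
def auc_roc(scores, labels):
    """Area under the ROC curve computed from its pairwise definition:
    the fraction of (positive, negative) pairs in which the positive
    receives the higher score, with ties counting as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y != 1]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 3 of the 4 positive/negative pairs are ranked correctly.
print(auc_roc([0.9, 0.8, 0.3, 0.1], [1, -1, 1, -1]))  # 0.75
```

A value of 1.0 means every positive outranks every negative, while 0.5 is what random scores would give. The double loop is quadratic in the number of sequences, which is fine for illustration but slow for large datasets.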
Click on "Plot ROC Curve" within kmer-SVM in the Tools menu. You should see the "Predictions" file has been automatically selected for you. Click on the blue "Execute" button.
You should see a new item in your History. Click on the eye icon to view the resulting ROC Curve.
You have now completed a tutorial of the Galaxy interface and the core kmer-SVM workflow. You are now ready to analyse data using kmer-SVM.