VAP Documentation

General description

VAP was designed for biologists and bioinformaticians who analyze genomic datasets to generate aggregate or individual profiles of groups of genes / annotations / regions of interest (generalized under the reference features term throughout the documentation and interface) from their genome of interest. A graphical representation of the results is automatically generated, and subgrouping can easily be done based on the orientation of the flanking annotations (particularly useful for compact genomes). Because VAP is highly efficient and can limit the memory footprint of big datasets, it was designed to run on a laptop computer, but it can also be compiled (C++) and run on a server.

How VAP works

In a usual ChIP-Seq experiment, the localization of a protein interacting with DNA is determined in different cells or conditions. It is often useful to visualize the trend of a genomic dataset over groups of reference features to capture and compare how the aggregate signal evolves in space over an average feature in different conditions (also called "composite profile" or "metagene"). Sometimes only one reference point is required (e.g. the Transcription Start Site (TSS) or the middle of small regions such as a Transcription Factor Binding Site (TFBS)), but most of the time both sides of a gene or a region are interesting. In such cases, and to avoid contamination of the signal caused by the different lengths of the annotations and inter-annotation regions, it is required to set both boundaries of a gene or region of interest using multiple reference_points. For instance, in compact genomes such as that of the yeast Saccharomyces cerevisiae, the average gene length is ~1.5 kb and the average intergenic region is ~0.5 kb. A researcher interested in the aggregate profile of an epigenetic mark enriched all over the gene, and who uses only one reference point (e.g. TSS), will obtain an aggregate profile contaminated after ~100 bp on both sides of the reference point by a mix of signal coming from genes and intergenic regions (all having different lengths). VAP can process more than one round of dataset(s) and reference group(s); all the datasets (and reference groups) sharing the same alias are processed together.

VAP therefore proposes to use up to six reference_points to delimit the regions of interest and avoid contamination by adjacent signal. There is always one more block than the number of reference points. Using two reference_points, three blocks are generated to isolate the reference feature (often a gene) in the middle block, with the upstream and downstream blocks containing signal from flanking intergenic regions (and potentially from flanking genes, depending on the length of the regions and the blocks). To completely delimit the flanking intergenic regions, four reference_points are required (five blocks); in such a case the upstream and downstream regions contain (but are not delimited by) signal from the flanking genes. To completely delimit these flanking genes in a block, six reference_points must be used (for a total of seven blocks). Note that, in the annotation analysis_mode, it is possible to choose either the transcription (tx) or coding (cds) coordinate columns (start and end) through the annotation_coordinates_type parameter. Blocks are subdivided in independent numbers of windows using the windows_per_block parameter.
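The relation between reference points and blocks can be checked with a short sketch. This is only an illustration in Python, not part of VAP: the function name parse_windows_per_block is hypothetical, and the semi-colon separated form of windows_per_block is taken from the examples used elsewhere in this documentation (e.g. 1;1;1).

```python
def parse_windows_per_block(expression, reference_points):
    """Split a windows_per_block string such as "10;20;60;20;10" and check
    that it holds one count per block (reference_points + 1 blocks).

    Hypothetical helper for illustration; VAP's own validation may differ.
    """
    if not 1 <= reference_points <= 6:
        raise ValueError("between one and six reference points are expected")
    counts = [int(field) for field in expression.split(";")]
    expected_blocks = reference_points + 1  # always one more block than points
    if len(counts) != expected_blocks:
        raise ValueError(
            f"{reference_points} reference points give {expected_blocks} blocks, "
            f"but {len(counts)} window counts were provided"
        )
    return counts


if __name__ == "__main__":
    # Four reference points delimit five blocks.
    print(parse_windows_per_block("10;20;60;20;10", 4))
```

A mismatch such as three counts for four reference points raises a ValueError, which is the kind of inconsistency worth catching before writing a parameters file.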

VAP can also be used to generate the profiles over exons (through the analysis_mode parameter) using six reference_points: the seven blocks correspond to 1) the upstream region, 2) the first exons, 3) the first introns, 4) all the middle exons, 5) the last introns, 6) the last exons, and 7) the downstream region.

A user interested in regions that are not part of a genome annotation file can also directly provide the coordinates of the reference points through the Coordinates reference groups toggle button linked with the analysis_mode parameter. Note that the coordinates of the blocks are processed in the same way as the corresponding blocks in annotation analysis_mode, meaning that the provided coordinates are inclusive for blocks usually corresponding to genes (since gene coordinates are usually inclusive).

Another important aspect to consider is that the transcriptional processes can sense the absolute distance from a certain point (such as the TSS) based on the suite of post-translational modifications involved at the different steps of transcription, which always follow the same order; the combination of marks at a certain point can therefore represent a "ruler" of the absolute distance. That is why we designed VAP to use the Absolute representation mode, but we nonetheless decided to incorporate the Relative mode (from the analysis_method parameter), allowing users to compare the results since this latter mode is still sometimes used in the literature. In Absolute mode, each feature is divided in windows of constant window_size, while in Relative mode each feature is divided in a constant number of windows (the windows therefore have different lengths for two features of different length). The Relative mode implies that a signal appearing at a certain absolute distance from a point of reference (e.g. the H3K36me3 histone mark appearing a few hundred base pairs from the TSS of each gene) is not represented in the same window for short vs long features (e.g. a signal 600 bp downstream of the TSS is contained in the 6th window of a 1 kb gene divided in 10 windows, but in the 2nd window of a 3 kb gene also divided in 10 windows).
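The 600 bp example above can be reproduced with a few lines of arithmetic. The sketch below is illustrative Python, not VAP code. The function names are hypothetical, and it assumes that distances are counted as 1-based positions from the reference point, which is the convention that matches the window numbers quoted in the example.

```python
def absolute_window(distance_bp, window_size):
    """1-based index of the window holding a position in Absolute mode,
    where every window spans window_size bp (integer ceiling division)."""
    return (distance_bp + window_size - 1) // window_size


def relative_window(distance_bp, feature_length, n_windows):
    """1-based index of the window holding a position in Relative mode,
    where the feature is cut into n_windows windows of equal length."""
    return (distance_bp * n_windows + feature_length - 1) // feature_length


if __name__ == "__main__":
    for gene_length in (1000, 3000):
        print(
            gene_length,
            absolute_window(600, 100),              # same window for both genes
            relative_window(600, gene_length, 10),  # depends on the gene length
        )
```

With a fixed window size the position falls in the same window whatever the gene length, while with a fixed number of windows the index shifts from 6 to 2 as the gene grows from 1 kb to 3 kb.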

In Absolute mode, the content of each block can be totally aligned to the left or to the right, or split to align separately both the starts and the ends of the features (through the block_alignment parameter). A gap is introduced at the split point inside a block, or between consecutive blocks representing regions in the genome that are not necessarily contiguous (e.g. if the first two blocks are aligned to the Right, the genomic coordinate of the last window of the 1st block is not necessarily contiguous with the genomic coordinate of the first window of the 2nd block when the region contained in this 2nd block is longer (e.g. 1 kb) than the capacity of the block (e.g. 10 windows of 50 bp)). The split point can be expressed as an absolute or proportional distance from a boundary using the block_split_type, block_split_value, and block_split_alignment parameters. Note that left and right block_alignment are internally represented as split 100% left or right respectively.

VAP can be:
1) run from the Java Interface (requiring Java 7)
2) run from the binary (simply doing "./VAP_exec -p param_file" from the command line)
3) recompiled from the sources distributed under the GPL v3 licence, and either used from the interface through the "vap_native" strategy or run directly

By pushing the "Run" button on the interface, the appropriate platform-dependent binary file for Windows, Mac OSX or Mint (32 and 64 bits) is copied from the jar into the selected output_directory and executed from there. Alternatively, if a file named "vap_native" is present in the output_directory, the interface executes this file instead, allowing users to compile the code on their own server and use the interface through an X-windowing system. Any relative path must therefore take the output_directory as the working directory when using the interface. The binary generates the requested output files (according to the parameters file), then graphs containing the requested aggregate profiles are automatically created. It is also possible to (re)create graphs from the appropriate tab of the interface, or by calling the appropriate function from the jar file (see below). Alternatively, macros included in Open Office and Microsoft Office spreadsheets are also {{{http://...}available}}. All the output files are centralized in the output_directory selected by the user.

Interface

The Java interface proposes two main tabs: Complete VAP process to run a complete analysis, and Only create graphs from existing VAP output files. A few sections of the parameters are considered "advanced" and can be accessed by clicking on the "+" sign beside the information button. The interface supports "drag and drop" operations. All the visible parameters are required (the default selection is not always what best fits your needs). Since VAP is efficient, you can easily test many sets of parameters (and use prefix_filename to keep the various results).

Complete VAP process

A "VAP_parameters.txt" file is automatically created by the interface in the output_directory and can be imported through "Open parameters" from the "File" menu. Note that this file is in APT format (Almost Plain Text) and is directly used to generate the parameters documentation in html. The "File" menu also proposes a selection of parameters often used to generate aggregate profiles for mammalian or yeast genomes (the latter being much smaller and more compact). The best way to fill in the parameters is to follow the order on the interface and to read the documentation where needed (accessible through the information buttons or from the "Help" menu).

Only create graphs

It is now possible to (re)create graphs either from the interface (see above) or from the command line by calling the appropriate function from the jar file. This function requires at least one file, either a parameters file or a map file (both created by VAP). Using a parameters file allows all the elements to be added to the created graphs, while with a map file the program does not have access to some information (including the number of reference points and the number of windows per block), so the annotations under the X axis are missing in the created graphs. The command is java -jar vap.jar create_graphs ParamOrMapFile.txt [options] where the two options are --display_dispersion_values value (or in short -ddv value) and --y_axis_scale expression (-yas expression), where value is a boolean (0 or 1) and where the expression is composed of up to two values (from and to) separated by a semi-colon (;) (an absence of value is interpreted as an automatic selection for each graph independently) (e.g. java -jar vap.jar create_graphs VAP_parameters_file.txt -ddv 1 -yas -1;2). Note that most Unix shells treat an unquoted semi-colon as a command separator, so the expression may need to be quoted there (e.g. -yas "-1;2").
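When the create_graphs command is launched from a script, passing the arguments as a list avoids any shell interpretation of the semi-colon. The Python sketch below only assembles that argument list. The function name build_create_graphs_command is hypothetical, the options are the two documented above, and the handling of a single missing bound (e.g. ";2") is an assumption based on the "absence of value" rule rather than a verified behaviour.

```python
def build_create_graphs_command(param_or_map_file, display_dispersion=None,
                                y_from=None, y_to=None, jar="vap.jar"):
    """Assemble the argument list for 'java -jar vap.jar create_graphs ...'.

    Hypothetical helper: it only builds the list and does not run Java.
    """
    command = ["java", "-jar", jar, "create_graphs", param_or_map_file]
    if display_dispersion is not None:
        # The documented value is a boolean written as 0 or 1.
        command += ["--display_dispersion_values", "1" if display_dispersion else "0"]
    if y_from is not None or y_to is not None:
        # "from;to"; an empty side is assumed to mean automatic selection.
        low = "" if y_from is None else format(y_from, "g")
        high = "" if y_to is None else format(y_to, "g")
        command += ["--y_axis_scale", f"{low};{high}"]
    return command


if __name__ == "__main__":
    print(build_create_graphs_command("VAP_parameters_file.txt", True, -1, 2))
```

The returned list can be handed to subprocess.run as is; since no shell is involved, the "-1;2" expression reaches the program as a single argument.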

Usage examples

Aggregate profile

For compact genomes such as yeast, it is recommended to use four reference_points, a window_size of 50 bp and a number of windows_per_block calculated to completely cover (and even double) the average region/gene length in the exploratory data analysis phase (e.g. 10;20;60;20;10 windows of 50 bp per block completely covering regions of 500;1000;3000;1000;500 bp respectively). The aggregate output files contain, for each window, the aggregate and dispersion values, as well as the proportion of reference features contributing to this window. A gene shorter than the block capacity is completely included in the aggregate profile, meaning that some windows of the block are empty for this reference feature (e.g. 10 windows of 50 bp around the split point are not filled for a 2500 bp gene in a block with a capacity of 3 kb). Conversely, the portion around the split point of a gene longer than the capacity of the block is not represented in the aggregate profile (e.g. the middle 1 kb of a 4 kb gene is not represented in a block with a capacity of 3 kb).
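The capacities quoted above follow from multiplying each window count by the window size. The following Python sketch reproduces these numbers. It is a simplified illustration, not VAP's implementation: the function names are hypothetical, and partially filled windows on either side of the split point are counted as one filled window each time.

```python
def block_capacities(windows_per_block, window_size):
    """Capacity in bp of each block: number of windows times window size."""
    return [n_windows * window_size for n_windows in windows_per_block]


def block_coverage(feature_length, n_windows, window_size):
    """Return (empty_windows, unrepresented_bp) for one feature in one block.

    A feature shorter than the capacity leaves windows empty around the split
    point; a longer feature loses the portion around the split point.
    """
    capacity = n_windows * window_size
    if feature_length <= capacity:
        filled = -(-feature_length // window_size)  # ceiling division
        return n_windows - filled, 0
    return 0, feature_length - capacity


if __name__ == "__main__":
    print(block_capacities([10, 20, 60, 20, 10], 50))
    print(block_coverage(2500, 60, 50))  # gene shorter than the 3 kb block
    print(block_coverage(4000, 60, 50))  # gene longer than the 3 kb block
```

Running such a check over the gene lengths of a reference group gives a quick idea of how many features will leave empty windows, which is what the proportion column of the aggregate files later reports.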

Individual profile

The groups of reference features are often derived from the ChIP-Seq of a protein of interest (e.g. a component of the RNA polymerase II complex), or from a transcriptomic experiment (RNA-Seq or arrays). In the former case, VAP can be used to calculate the average signal over each individual reference feature. To do this, you simply need to create and select one reference group containing all the gene names (i.e. the first column of your genome annotations_file), select your dataset of interest, enter two reference_points, a window_size bigger than the longest gene, only one window per block (windows_per_block set to 1;1;1), and select to Output the data of each window for each reference feature (through the write_individual_references parameter) (no need to select any of the aggregate output files). The {{{Individual}individual}} output file contains three columns, the second containing the average signal over each gene (one per line). The user can then sort the genes based on this value to create a few reference groups of similarly occupied/transcribed genes, and use these new groups to generate aggregate profiles. By choosing two windows in the second block and by using some of the split alignment parameters, it is also possible to calculate the average signal over certain portions of the genes (such as the first/last half/quarter, or the first/last X bp of the genes). Moreover, in a "normal" setup using multiple smaller windows per block, this output file could be used for heatmap and/or clustering representations of the data, complementing the aggregate profiles.

Input files

Depending on the type of reference the user wants to analyze (Annotations, Exons or Coordinates), the input files to provide are not exactly the same, but in all cases the list of "Reference group and data" files is mandatory.

* Data: The supported formats for data files are {{{https://genome.ucsc.edu/goldenPath/help/bedgraph.html}BedGraph}} and {{{https://genome.ucsc.edu/goldenPath/help/wiggle.html}variableStep and fixedStep WIG}}. When the "track" line of a dataset file contains the field "name", this information is used as dataset_name, otherwise the file name is used as dataset_name in the output files.

* Reference group: The content of this file varies depending on the type of reference you want to analyze. When the first line of a reference group file contains a tag "name=", this string is used as group_name, otherwise the file name is used as group_name in the output files. If a group contains the same reference feature more than once, all instances are kept (and linked to the same genome annotation when applicable).

For annotation and exon analysis_mode, the format of the reference group file is simply a list of one annotation name per line coming from the first column of the genome annotations file (annotations_path parameter), which is used to extract the coordinates of the reference features. If the genome contains the same annotation more than once, all instances of the same reference feature are linked with the first instance.

For coordinates analysis_mode, the reference group file must directly contain the coordinates in a special format. The first line of such a file should start with a "#" and contain at least the tag "type=" followed by one of the 6 "coordX" values, where X is the number of reference_points contained in this file. The full description of this line is: #name=["]file_alias["] type=["]file_type["] desc=["]file_description["] where the three tags can appear in any order. Any other line starting with "#" is considered a comment. Because the file could contain up to 6 reference points, the first column (tab-delimited) of the other lines must contain the chromosome, the second column the strand ('+' or '-' (any other character is interpreted as '+')), followed by X columns containing the coordinates (e.g. to analyze regions of interest identified by start and end coordinates: "chr1 - 1200 1500 region1_score=14.6"). An additional column can optionally contain the name of the region, which will be reported in the {{{write_individual_references}"ind_" output file}}. Exceptionally, the BED format file with 3 to 6 columns is supported when there are exactly 2 reference points.

* Genome annotation: For annotation or exon analysis_mode, a genome annotation file (annotations_path parameter) must be provided to extract the coordinates of the reference features of the reference groups. This file must be in GenePred tab-delimited format (each exon coordinate must be followed by a comma (e.g. "200,500,")) and contain at least 10 columns. When an 11th column is present, it is considered to be the alias and this information is used as the second column in the {{{write_individual_references}"ind_" output file}}, while other columns (12th and beyond) are copied at the end of the lines.

* Selection or Exclusion filters: For annotation or exon analysis_mode, it is possible to select (positive filter) or exclude (negative filter) some annotations from the reference groups through the selection_path and exclusion_path parameters, respectively. For instance, the reference groups could collectively contain all the genes based on their level in condition X, while the user would like to restrict a given analysis to those longer than 1 kb, and/or to exclude the annotations overlapping another annotation. Only the first column of these files is used (others are ignored). A one-to-one match is done between the annotations in the filters and each reference group (if a group contains the same annotation more than once, the filter should contain this annotation the same number of times to select/exclude all the instances).
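To make the coordinates reference group format described above more concrete, here is a minimal reading sketch in Python. It is not VAP's parser: the function names are hypothetical, the handling of quotes relies on Python's shlex module, and details such as the treatment of malformed lines are assumptions.

```python
import shlex


def parse_coord_header(line):
    """Parse '#name="alias" type=coordX desc="text"' (tags in any order).

    Returns the tags as a dict plus the number of reference points X.
    """
    tags = {}
    for token in shlex.split(line.lstrip("#").strip()):
        key, _, value = token.partition("=")
        tags[key] = value
    file_type = tags.get("type", "")
    if not (file_type.startswith("coord") and file_type[5:] in "123456"
            and len(file_type) == 6):
        raise ValueError("the header needs a tag type=coord1 ... type=coord6")
    return tags, int(file_type[5:])


def parse_coord_line(line, reference_points):
    """Parse one tab-delimited data line: chromosome, strand, X coordinates,
    and an optional region name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2 + reference_points:
        raise ValueError("not enough columns for the declared reference points")
    chromosome = fields[0]
    strand = "-" if fields[1] == "-" else "+"  # any other character means '+'
    coordinates = [int(value) for value in fields[2:2 + reference_points]]
    name = fields[2 + reference_points] if len(fields) > 2 + reference_points else None
    return chromosome, strand, coordinates, name


if __name__ == "__main__":
    tags, n_points = parse_coord_header('#name="my regions" type=coord2 desc="demo"')
    print(tags, n_points)
    print(parse_coord_line("chr1\t-\t1200\t1500\tregion1_score=14.6", n_points))
```

Reading the number of reference points from the header before parsing the data lines keeps the two in agreement, which is the main constraint of this format.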

 

Output files

The names of all the files created/copied in the output_directory can begin with a prefix if requested by the user through the prefix_filename parameter; otherwise they follow the nomenclature below:

* Aggregate data: The names of these files begin with "agg_" followed by the dataset_name, and they can contain the group_name and the orientation, depending on the one_graph_per_group and one_graph_per_orientation parameters. This file mainly contains one line per window and 4 columns per reference group: the relative coordinates (where 0 corresponds to the start of the reference feature and the step corresponds to the window_size parameter), the aggregate value (mean, median, max, or min through the aggregate_data_type parameter) after the smoothing (through the smoothing_windows parameter), the dispersion value (SEM or SD through the mean_dispersion_value parameter), and the proportion (of the reference features of this group contributing to the aggregate value of this window). In a case where a {{{refgroup_path}reference group}} contains 500 reference features, the aggregate value of a particular window is calculated on these (up to) 500 values; in the absolute mode, if 50 reference features don't have any data point in a particular window (e.g. shorter features than the others), the aggregate value is calculated on the 450 remaining values and the column "proportion" contains 450/500 (i.e. the proportion of reference features of a group participating in an aggregate value). Note that the proportion information is really useful to identify the regions of the aggregate profiles where the proportion of reference features contributing is too low to be reliable (e.g. if the capacity of a block is 3000 bp while the average gene length is 1500 bp and the features are split at 50%, the aggregate profile around the middle of the block will be noisy and the proportion very low). Also note that the dispersion of the data values inside an individual reference feature window is not computed (e.g. considering a genomic window of 50 bp overlapping 3 data values, only the average of the 3 values is kept for this window (the dispersion of the 3 data values is not considered); the SEM/SD are calculated using the average values across corresponding windows of the reference features of a given group in the process of calculating the {{{aggregate_data_type}aggregate}} value).

* Graph: The names of these files begin with "graph_", and they can contain the {{{dataset_name}dataset name}}, the reference group name and the orientation, depending on the one_graph_per_dataset, one_graph_per_group and one_graph_per_orientation parameters. These png files are created by the Java interface (and are therefore absent when VAP is run through the command line) and contain two plots: the first contains the aggregate and dispersion data (if requested through the display_dispersion_values parameter) while the second contains the information on the proportion of the reference features contributing to the aggregate values. Note that it is possible to either manually set the Y_axis_scale for all the graphs, or choose an automatic setting for each graph independently. In the process of creating the aggregate profiles, all the reference features of a group are virtually aligned in the same orientation (Positive strand) in order to align all the 5' boundaries to the left (based on the strand information) and all the 3' boundaries to the right. Consequently, the region containing a reference feature on the positive strand (P) is used as is, while the region containing a reference feature on the negative strand (N) and its upstream and downstream annotations is inverted (e.g. the genes A-B-C (ordered by increasing genomic coordinates), where genes A and B are on the negative strand and gene C is on the positive strand, have the native orientation NNP; in a case where gene B is included in a group of reference features, the whole region is virtually inverted such that this region is used as if the genes were ordered C-B-A and their orientations were NPP).

* Map: The name of this file is "map_graphs_datafiles.txt" and it contains the instructions for the Java interface and the external Open Office and Microsoft Office macros (available at the {{{http://...}VAP website}}) on how to create the graphs. Briefly, the first column of this file contains the name of the graph, and the second column contains the name of the aggregate data file to include in this graph. Depending on the values of the one_graph_per_group, one_graph_per_dataset and one_graph_per_orientation parameters, multiple aggregate data files must sometimes be appended in a graph. Note that even if the user chooses to combine all datasets in one graph, VAP processes one dataset at a time; the combination is done through the map file, which contains the instructions on how to create the combined graphs.

* Individual data: The names of these files begin with "ind_" followed by the {{{dataset_name}dataset name}} and the {{{group_name}reference group name}}. This file mainly contains one line per reference feature of a group, and N columns containing the average of the data value(s) of a given window, where N corresponds to the sum of all the windows requested by the user in the windows_per_block parameter, in addition to a few other columns including the reference feature name and coordinates. At the bottom of the file, the aggregate values of the requested orientation(s) (orientation_subgroups parameter) are also provided. This file is created only when the write_individual_references parameter is activated. Note that the values included in this file are not affected by the smoothing_windows parameter. This file can be useful for a heatmap representation and clustering of all reference features of a group to complement the aggregate profile. Another use of this file is to select only 1 very large window for the block in order to get the average value of each reference feature (e.g. if the reference features are genes, using 2 {{{reference_points}reference points}} with 1 {{{windows_per_block}window per block}} and a {{{window_size}window size}} of 20 kb (where the longest gene of the genome is 18 kb), this output file would contain (in the second column) the average value of each gene).

* Parameters: The name of this file is "VAP_parameters.txt" when created by the interface, or any other name when created by the user. The content of this file is thoroughly described in the "Help" menu of the interface and on the {{{http://...}VAP website}}.

* Reference group and data: The name of this file is "refGroupData.txt" when created by the interface, or any other name when created by the user. The content of this file is described in the {{{Input_files}Input_files}} section.

* Log: The name of this file is "VAP_logfile.log" and it contains information on all the VAP jobs run in this folder (including warning and error messages).
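As a complement to the description of the aggregate data files above, the following Python sketch shows how the mean, the SEM and the proportion of one window relate to each other. It is an illustration only: the function name is hypothetical, a missing value is represented by None, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the exact formula used by VAP is not stated here.

```python
import math
import statistics


def aggregate_window(values):
    """Aggregate one window across the reference features of a group.

    values holds one per-feature window average, or None when the feature has
    no data point in this window. Returns (mean, sem, proportion).
    """
    present = [value for value in values if value is not None]
    proportion = len(present) / len(values)
    if not present:
        return None, None, proportion
    mean = statistics.fmean(present)
    # Assumption: sample standard deviation; 0.0 when only one value remains.
    sd = statistics.stdev(present) if len(present) > 1 else 0.0
    sem = sd / math.sqrt(len(present))
    return mean, sem, proportion


if __name__ == "__main__":
    # 500 reference features, 50 of them too short to reach this window.
    window_values = [1.0] * 450 + [None] * 50
    print(aggregate_window(window_values))
```

In the 500-feature example the proportion is 450/500, and only the 450 available per-feature averages enter the mean and the dispersion, in line with the note that the dispersion inside a single feature window is not considered.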

 

Technical description

vap_core

vap_core_doc_rc1.tar.gz

vap_interface