How to Use TFpredict - The Manual

INPUT:

The input of TFpredict is a FASTA file that contains the protein identifiers and sequences (see the Format Specification section). An example (eukaryotic) file, test_seq.fasta, is located in the project’s main folder.

OUTPUT:

By default, TFpredict writes its results to the console.
With the -output option, the results are written to the specified output file instead of the console.
With the -sabineOutfile option, a SABINE input file can be generated as output. It contains all information required for post-processing the results with SABINE (see Format Specification at https://github.com/draeger-lab/SABINE/). The output filename can be specified by the user. The -species argument must also be given whenever an output file for SABINE is created.

USAGE:

java -jar TFpredict_1.4.jar <input_filename> [OPTIONS]

Options

-prokaryote                              Run TFpredict for prokaryotic data.
-output <output_filename>                Write the output to a file instead of the console.
-sabineOutfile <sabine_output_filename>  Output file for post-processing of the results with SABINE.
-species <organism_name>                 Organism name (e.g., Homo sapiens). See list of supported organisms.
-tfClassifier <classifier_name>          Classifier used for TF/non-TF classification. Possible values: SVM_linear, NaiveBayes, KNN.
-superClassifier <classifier_name>       Classifier used for superclass prediction. Possible values: SVM_linear, NaiveBayes, KNN.
-blastPath <path_to_blast>               Path to “bin” directory containing BLAST executables (e.g., /opt/blast/latest). Only needed if the environment variable BLAST_PATH is not set.
-ignoreCharacteristicDomains             Do not classify based on predefined InterPro domains.
--help                                   Display the usage of the program and an overview of the command line options.
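
When TFpredict is called from a script, it can help to assemble the argument list in one place and catch invalid combinations before Java starts. The following Python sketch does this for the options listed above. The function build_command and its parameter names are hypothetical and not part of TFpredict; only the flags and classifier names come from this manual.

```python
from typing import List, Optional

# Classifier names as listed in the options overview of this manual.
CLASSIFIERS = {"SVM_linear", "NaiveBayes", "KNN"}


def build_command(
    input_fasta: str,
    jar: str = "TFpredict_1.4.jar",
    prokaryote: bool = False,
    output: Optional[str] = None,
    sabine_outfile: Optional[str] = None,
    species: Optional[str] = None,
    tf_classifier: Optional[str] = None,
    super_classifier: Optional[str] = None,
    blast_path: Optional[str] = None,
) -> List[str]:
    """Return the argument list for one TFpredict call (hypothetical helper)."""
    for name in (tf_classifier, super_classifier):
        if name is not None and name not in CLASSIFIERS:
            raise ValueError(f"unknown classifier: {name}")
    # The manual states that -species is required when a SABINE file is written.
    if sabine_outfile is not None and species is None:
        raise ValueError("-sabineOutfile requires -species")

    cmd = ["java", "-jar", jar, input_fasta]
    if prokaryote:
        cmd.append("-prokaryote")
    for flag, value in (
        ("-output", output),
        ("-sabineOutfile", sabine_outfile),
        ("-species", species),
        ("-tfClassifier", tf_classifier),
        ("-superClassifier", super_classifier),
        ("-blastPath", blast_path),
    ):
        if value is not None:
            cmd += [flag, value]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True). Passing a list rather than a single string avoids quoting problems with organism names that contain spaces, such as Homo sapiens.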

How To Proceed

First, you need to generate an input file in FASTA format (see Format Specification section below) or use the example input file: test_seq.fasta. The input file should contain the following information about the protein under study:

Name or identifier
Organism (see list of supported organisms)
Protein sequence
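
If the sequences are already available in a program, the input file can be written with a few lines of Python. This is a minimal sketch, not part of TFpredict: the function format_fasta is hypothetical, and the line width of 60 residues is only chosen to match the example in the Format Specification section. The FASTA example in this manual carries only an identifier in the header line, so the sketch assumes the organism is supplied separately through the -species option.

```python
from typing import Iterable, Tuple


def format_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Render (identifier, sequence) pairs as FASTA text (hypothetical helper)."""
    lines = []
    for identifier, sequence in records:
        lines.append(f">{identifier}")
        # Wrap the sequence into lines of at most `width` residues.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"
```

The resulting string can be written to a file, for example with open("my_input.fasta", "w"), and that file name passed to TFpredict as the input file.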

To run the newest version of TFpredict on the example input file, use the command:

java -jar TFpredict_1.4.jar test_seq.fasta

To post-process the results generated by TFpredict with SABINE, which predicts DNA motifs for the transcription factors identified among the input protein sequences, you have to pass two additional arguments to the program: the destination to which the output file is written (-sabineOutfile) and the correct species (-species). Please ensure that SABINE supports the given species (see list of supported organisms).

The following example call enables post-processing of the results with SABINE:

java -jar TFpredict_1.4.jar example_input.fasta -sabineOutfile example_output.txt -species "Homo sapiens"

Given a suitable example_input.fasta file (one can be found in the SABINE GitHub repository), TFpredict writes an output file that contains the results of the performed prediction steps in the SABINE input file format.

Format Specification

To analyze a given protein with TFpredict, the tool needs the corresponding amino acid sequence and organism. This information has to be formatted as specified in the TFpredict input file format description.

The results of TFpredict are returned to the user via the standard console output or an output file if the -output option is used. Optionally, an output file can be generated which can be processed using SABINE to predict the DNA-binding specificity of transcription factors identified among the protein sequences analyzed by TFpredict. See SABINE for a detailed description of the file format.

The input file format description specifies the input data for an individual TF. You can pack multiple TFs in one input file to sequentially process more extensive datasets with SABINE. In addition to the general description of the file formats, example input and output files for SABINE are provided on its page.

FASTA file:

>Sequence_1
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGP
DEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAK
SVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHE
RCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNS
SCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELP
PGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPG
GSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD

>Sequence_2
...

>Sequence_3
...
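
Before a larger file is passed to TFpredict, it can be useful to check that it follows the layout shown above. The Python sketch below reads FASTA text into a dictionary and rejects two common mistakes. It is an illustration under simple assumptions (the text after ">" is treated as the identifier, blank lines are ignored), not a description of how TFpredict itself parses the file; the function name read_fasta is hypothetical.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each identifier to its sequence (hypothetical helper)."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines between records are skipped
        if line.startswith(">"):
            current = line[1:].strip()
            if current in records:
                raise ValueError(f"duplicate identifier: {current}")
            records[current] = ""
        elif current is None:
            raise ValueError("sequence data before the first '>' header")
        else:
            # Sequence lines of one record are joined into a single string.
            records[current] += line
    return records
```

A call such as read_fasta(open("test_seq.fasta")) returns one entry per protein, so the number of sequences and their lengths can be inspected before the prediction is started.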

SABINE input file:

NA  Identifier
XX
SP  Organism
XX
CL  Classification (decimal classification no. as in TRANSFAC)
XX
S1  Amino acid sequence
XX
FT  DNA-binding domain (InterPro ID   start position   end position)
XX
//
XX
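
To inspect a file written with -sabineOutfile, the records can be read back with a short Python sketch. It relies only on what the template above shows: a two-letter tag at the start of each line, XX separator lines, and // closing a record. How long sequences or multiple domains are spread over lines is not specified here, so the sketch collects every value of a tag in a list. The function parse_sabine is hypothetical; see the SABINE documentation for the authoritative format.

```python
from typing import Dict, Iterable, List


def parse_sabine(lines: Iterable[str]) -> List[Dict[str, List[str]]]:
    """Read SABINE-style records into tag -> list of values (hypothetical helper)."""
    records: List[Dict[str, List[str]]] = []
    current: Dict[str, List[str]] = {}
    for raw in lines:
        line = raw.strip()
        if line == "//":
            # End of one record; start collecting the next one.
            if current:
                records.append(current)
            current = {}
        elif line in ("", "XX"):
            continue  # separator lines carry no data
        else:
            tag, _, value = line.partition(" ")
            current.setdefault(tag, []).append(value.strip())
    if current:
        records.append(current)
    return records
```

Each returned dictionary corresponds to one transcription factor, for example record["NA"] for the identifier and record["FT"] for the DNA-binding domain lines.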