ProteinParser

'STRAP:multiple sequence alignments '

charite.christo.protein
Interface ProteinParser

All Known Implementing Classes: DSSP_Parser, NumberedSequence_Parser, PDB_Parser, SBD_Parser, SingleFastaParser, StupidParser, XML_SequenceParser

public interface ProteinParser

HELP Proteins are stored in plain text files. A common mistake is to save protein files as MS Word documents; STRAP does not recognize the proprietary Word format. STRAP recognizes file formats by trial and error, regardless of the file suffix.

First, the DSSP format is assumed, which in addition to the amino acid sequence also contains the C-alpha coordinates and the secondary structure definition. INCLUDE_DOC:DSSP_Parser

If the protein file does not comply with the DSSP format, the PDB format (WIKI:Protein_Data_Bank_(file_format)) is tried. INCLUDE_DOC:PDB_Parser

If the file is not in PDB format, the FASTA format is tested. INCLUDE_DOC:SingleFastaParser The first series of non-blank characters should consist exclusively of digits, followed by white space of any length and an amino acid sequence. EMBL, WIKI:Genbank and WIKI:Swissprot files follow this scheme and should be parsed correctly. The header is largely ignored: only the name of the compound and the organism are read, to create some information texts.

WIKI:Genbank files usually contain nucleotide sequences rather than amino acids, in which case nucleotides will appear in the protein alignment. Genbank files can be interpreted with the dialog ITEM:charite.christo.strap.DialogGenbank. For other nucleotide sequences, the reading frame and the translated regions can be set manually (ITEM:charite.christo.strap.EditDna). Three nucleotide bases yield one amino acid.
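To illustrate the statement that three nucleotide bases yield one amino acid, here is a minimal sketch of codon translation for a given reading frame. It is not STRAP code: the class name `CodonTranslationSketch` and the method `translate` are hypothetical, and the sketch assumes the standard genetic code, a forward strand, and a single contiguous translated region.

```java
final class CodonTranslationSketch {
    // Standard genetic code in the conventional T, C, A, G base order.
    private static final String BASES = "TCAG";
    private static final String AMINO =
        "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    /**
     * Translates nucleotides into one-letter amino acid codes.
     * frame is the offset (0, 1 or 2) of the first codon.
     * Stop codons become '*', codons with unknown bases become 'X'.
     * Trailing bases that do not fill a whole codon are dropped.
     */
    static String translate(String dna, int frame) {
        StringBuilder protein = new StringBuilder();
        for (int i = frame; i + 3 <= dna.length(); i += 3) {
            int index = 0;
            boolean known = true;
            for (int k = 0; k < 3; k++) {
                char c = Character.toUpperCase(dna.charAt(i + k));
                int b = BASES.indexOf(c == 'U' ? 'T' : c); // accept RNA input as well
                if (b < 0) known = false;
                index = index * 4 + b;
            }
            protein.append(known ? AMINO.charAt(index) : 'X');
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("ATGGCC", 0));
    }
}
```

Shifting `frame` by one base changes every codon, which is why the reading frame has to be chosen before a nucleotide sequence can be shown as a protein.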

Finally, if no specific format is recognized, all letters in the file are interpreted as one-letter amino acid codes.
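The trial-and-error recognition described above can be sketched as a chain of parsers that are tried in order, ending with a fallback that accepts anything. This is only an illustration under stated assumptions: `FormatParser`, `FASTA`, `FALLBACK` and `detectAndParse` are hypothetical names, the FASTA check is a toy, and the real DSSP and PDB parsers are omitted.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

final class ParserChainSketch {
    /** Hypothetical stand-in for a parser: returns the sequence, or null if the format does not match. */
    interface FormatParser {
        String tryParse(byte[] text);
    }

    // Toy FASTA check: a header line starting with '>' followed by sequence lines.
    static final FormatParser FASTA = text -> {
        String s = new String(text, StandardCharsets.US_ASCII);
        if (!s.startsWith(">")) return null;
        int eol = s.indexOf('\n');
        if (eol < 0) return null;
        return lettersOnly(s.substring(eol + 1));
    };

    // Fallback: every letter in the file counts as a residue; never rejects.
    static final FormatParser FALLBACK =
        text -> lettersOnly(new String(text, StandardCharsets.US_ASCII));

    static String lettersOnly(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isLetter(c)) sb.append(c);
        }
        return sb.toString();
    }

    /** Tries each parser in order and returns the first result; the file suffix plays no role. */
    static String detectAndParse(byte[] text, List<FormatParser> chain) {
        for (FormatParser parser : chain) {
            String sequence = parser.tryParse(text);
            if (sequence != null) return sequence;
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] file = ">p1 example\nMKV\nLA\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detectAndParse(file, List.of(FASTA, FALLBACK)));
    }
}
```

The order of the chain matters: stricter formats come first, because the fallback would accept any file, including the header text of a structured format.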

File compression: Files ending with .gz, .bz2, .Z or .zip will be decompressed automatically.
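As an illustration of suffix-based decompression, the following sketch unpacks .gz and .zip content using only the JDK. It is not STRAP's implementation: `DecompressSketch` and `readPossiblyCompressed` are hypothetical names, and .bz2 and .Z are only reported as unsupported here because the JDK has no built-in decoder for them.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.ZipInputStream;

final class DecompressSketch {
    /** Returns the decompressed bytes if the file name has a known suffix, else the bytes unchanged. */
    static byte[] readPossiblyCompressed(String fileName, byte[] raw) throws IOException {
        if (fileName.endsWith(".gz")) {
            try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(raw))) {
                return in.readAllBytes();
            }
        }
        if (fileName.endsWith(".zip")) {
            try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(raw))) {
                // Assumption: the archive holds one protein file; only the first entry is read.
                if (in.getNextEntry() == null) throw new IOException("empty zip archive");
                return in.readAllBytes();
            }
        }
        if (fileName.endsWith(".bz2") || fileName.endsWith(".Z")) {
            // No JDK decoder for bzip2 or Unix compress; a third-party library would be needed.
            throw new UnsupportedOperationException("no built-in decoder for " + fileName);
        }
        return raw; // not compressed
    }
}
```

Decompressing first means that the format recognition afterwards always sees the plain text content.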

Problems:

pdb1a6y.ent.Z: last residue in SEQRES is a MET and is not in ATOMS
pdb1acb: SEQRES 1 I 70 THR GLU PHE GLY where is it ?

SEE_CLASS:PDB_Parser SEE_CLASS:DSSP_Parser SEE_CLASS:NumberedSequence_Parser SEE_CLASS:ProteinParser SEE_CLASS:SingleFastaParser SEE_CLASS:StupidParser SEE_CLASS:XML_SequenceParser SEE_CLASS:SwissHeaderParser

Field Summary
`static long`	`IGNORE_SEQRES`
`static long`	`SEQUENCE_FEATURES`
`static long`	`SIDE_CHAIN_ATOMS`

Method Summary
`boolean`	`parse(Protein p, long options, BA text)`

Field Detail

IGNORE_SEQRES

static final long IGNORE_SEQRES

See Also: Constant Field Values

SIDE_CHAIN_ATOMS

static final long SIDE_CHAIN_ATOMS

See Also: Constant Field Values

SEQUENCE_FEATURES

static final long SEQUENCE_FEATURES

See Also: Constant Field Values

Method Detail

parse

boolean parse(Protein p,
              long options,
              BA text)

Parameters: text - the entire file contents. It is a byte array rather than a String object for performance reasons.
Returns: true on success, false if the file format is inappropriate
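The contract of parse can be illustrated with a toy parser that scans the raw bytes directly and returns false when the format does not match. Everything here is an assumption made for illustration: a `StringBuilder` stands in for `Protein`, a `byte[]` stands in for `BA`, the constant values are invented (the real ones are listed under Constant Field Values), and treating `options` as a bit mask is a guess based on the `long` type.

```java
final class ParseContractSketch {
    // Hypothetical bit values; the actual values are not given on this page.
    static final long IGNORE_SEQRES = 1L << 0;
    static final long SIDE_CHAIN_ATOMS = 1L << 1;
    static final long SEQUENCE_FEATURES = 1L << 2;

    /** Assumed bit-mask test for an option flag. */
    static boolean isSet(long options, long flag) {
        return (options & flag) != 0;
    }

    /**
     * Toy FASTA-like parser. Works on the bytes without building a String.
     * Returns false if the text does not start with '>' (inappropriate format).
     * The options are not used by this toy format.
     */
    static boolean parse(StringBuilder protein, long options, byte[] text) {
        if (text.length == 0 || text[0] != '>') return false;
        int i = 0;
        while (i < text.length && text[i] != '\n') i++; // skip the header line
        for (; i < text.length; i++) {
            byte b = text[i];
            if ((b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z')) protein.append((char) b);
        }
        return true;
    }
}
```

A caller would combine flags with bitwise OR, for example `IGNORE_SEQRES | SIDE_CHAIN_ATOMS`, and move on to the next parser whenever parse returns false.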