charite.christo.protein
Interface ProteinParser
- All Known Implementing Classes:
- DSSP_Parser, NumberedSequence_Parser, PDB_Parser, SBD_Parser, SingleFastaParser, StupidParser, XML_SequenceParser
public interface ProteinParser
HELP
Proteins are stored in plain text files. A common mistake is to save a protein file as an MS Word document:
STRAP cannot read the proprietary Word format.
STRAP recognizes file formats by trial and error, regardless of the file suffix.
The DSSP format is tried first. In addition to the
amino acid sequence, it contains the C-alpha coordinates and the
secondary structure definition.
INCLUDE_DOC:DSSP_Parser
If the protein file does not comply with the DSSP format, the PDB format (WIKI:Protein_Data_Bank_(file_format)) is tried.
INCLUDE_DOC:PDB_Parser
If the file is not in PDB format, the FASTA format is tried.
INCLUDE_DOC:SingleFastaParser
The first series of non-blank characters should consist exclusively of digits, followed by white space of any length and an amino acid sequence.
EMBL, WIKI:Genbank and WIKI:Swissprot files follow this scheme and should be parsed correctly.
The header is largely ignored: only the name of the compound and the organism are extracted, to create some informational text.
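The digits-then-sequence rule above can be illustrated with a minimal sketch. It assumes one possible reading of the rule, namely that it applies line by line and that lines without a leading number are header lines to be skipped. The class and method names are hypothetical and are not part of STRAP.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the NumberedSequence_Parser implementation.
final class NumberedLineSketch {
    // Optional indentation, a run of digits, white space of any length, then the rest of the line.
    private static final Pattern NUMBERED = Pattern.compile("^\\s*\\d+\\s+(.*)$");

    /** Collects the letters that follow a leading number on each line. */
    static String extract(String content) {
        StringBuilder seq = new StringBuilder();
        for (String line : content.split("\\R")) {
            Matcher m = NUMBERED.matcher(line);
            if (!m.matches()) {
                continue; // assumed to be a header or other non-sequence line
            }
            for (char c : m.group(1).toCharArray()) {
                if (Character.isLetter(c)) {
                    seq.append(Character.toUpperCase(c));
                }
            }
        }
        return seq.toString();
    }

    public static void main(String[] args) {
        String demo = "ORIGIN\n     1 mktayiakqr qislvkshfs\n    21 rqleerlgli\n";
        System.out.println(extract(demo));
    }
}
```

Blanks inside the sequence part are dropped, so grouped blocks of ten residues are joined into one continuous sequence.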
WIKI:Genbank files usually contain nucleotide sequences rather than amino
acid sequences, so nucleotides will appear in the protein alignment.
Genbank files can be interpreted with the dialog ITEM:charite.christo.strap.DialogGenbank.
For other nucleotide sequences the reading frame and the translated regions can be set manually
(ITEM:charite.christo.strap.EditDna).
Three nucleotide bases yield one amino acid.
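As an illustration of how a reading frame turns nucleotides into amino acids, here is a minimal sketch. It is not the code behind EditDna: the class name is hypothetical, and the codon table is deliberately partial (a few entries of the standard genetic code), with every unlisted codon rendered as 'X'.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the codon table is intentionally incomplete.
final class CodonSketch {
    private static final Map<String, Character> CODONS = new HashMap<>();
    static {
        CODONS.put("ATG", 'M');
        CODONS.put("AAA", 'K');
        CODONS.put("AAG", 'K');
        CODONS.put("GGT", 'G');
        CODONS.put("GGC", 'G');
        CODONS.put("GGA", 'G');
        CODONS.put("GGG", 'G');
        CODONS.put("TGG", 'W');
        CODONS.put("TAA", '*'); // stop codons
        CODONS.put("TAG", '*');
        CODONS.put("TGA", '*');
    }

    /** Translates dna starting at offset frame (0, 1 or 2), three bases per residue. */
    static String translate(String dna, int frame) {
        if (frame < 0 || frame > 2) {
            throw new IllegalArgumentException("frame must be 0, 1 or 2");
        }
        String s = dna.toUpperCase();
        StringBuilder protein = new StringBuilder();
        // A trailing partial codon (one or two bases) is ignored.
        for (int i = frame; i + 3 <= s.length(); i += 3) {
            protein.append(CODONS.getOrDefault(s.substring(i, i + 3), 'X'));
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        System.out.println(translate("ATGAAAGGTTGGTAA", 0));
        System.out.println(translate("CATGAAATGG", 1));
    }
}
```

Shifting the frame by one base changes every codon, which is why the reading frame has to be set explicitly for raw nucleotide sequences.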
Finally, if no specific format is recognized, all letters in the file are interpreted as one-letter amino acid codes.
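The trial-and-error recognition described above can be sketched as a chain of readers that are tried in order until one accepts the input. This is only an illustration under stated assumptions: the FormatReader interface and all method names are hypothetical, the chain contains just a crude FASTA check and the final "all letters" fallback, and the DSSP and PDB steps are omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of a try-each-format chain; not the STRAP implementation.
final class FormatChainSketch {

    /** Returns the sequence, or empty if the content is not in this reader's format. */
    interface FormatReader {
        Optional<String> read(String content);
    }

    /** Crude FASTA check: a '>' header line followed by sequence lines. */
    static Optional<String> readFasta(String content) {
        if (!content.startsWith(">")) {
            return Optional.empty();
        }
        int eol = content.indexOf('\n');
        if (eol < 0) {
            return Optional.empty();
        }
        return Optional.of(lettersOnly(content.substring(eol + 1)));
    }

    /** Fallback: every letter in the file is taken as a one-letter code. */
    static Optional<String> readAnyLetters(String content) {
        return Optional.of(lettersOnly(content));
    }

    static String lettersOnly(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isLetter(c)) {
                sb.append(Character.toUpperCase(c));
            }
        }
        return sb.toString();
    }

    /** Tries the readers in order; the file suffix plays no role. */
    static String parse(String content, List<FormatReader> readers) {
        for (FormatReader r : readers) {
            Optional<String> result = r.read(content);
            if (result.isPresent()) {
                return result.get();
            }
        }
        throw new IllegalArgumentException("no reader accepted the input");
    }

    static List<FormatReader> defaultChain() {
        List<FormatReader> chain = new ArrayList<>();
        chain.add(FormatChainSketch::readFasta);
        chain.add(FormatChainSketch::readAnyLetters); // must stay last: it accepts anything
        return chain;
    }

    public static void main(String[] args) {
        System.out.println(parse(">demo\nMKT\nAYI\n", defaultChain()));
        System.out.println(parse("mk t\nayi", defaultChain()));
    }
}
```

The order of the chain matters: a permissive reader placed early would claim files that a stricter reader further down could parse with more detail.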
File compression:
Files ending with .gz, .bz2, .Z or .zip will be decompressed automatically.
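Suffix-based decompression can be sketched with the JDK alone for two of the four suffixes. The example below is an assumption-laden illustration, not STRAP's code: the class and method names are hypothetical, only .gz and .zip are handled (the JDK ships no decoder for .bz2 or .Z, so those are rejected here), and for .zip only the first archive entry is read.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipInputStream;

// Hypothetical sketch covering only the formats supported by java.util.zip.
final class DecompressSketch {

    /** Returns the uncompressed bytes, choosing the decoder from the file suffix. */
    static byte[] decompress(String fileName, byte[] raw) {
        try (InputStream in = open(fileName, new ByteArrayInputStream(raw))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static InputStream open(String fileName, InputStream raw) throws IOException {
        if (fileName.endsWith(".gz")) {
            return new GZIPInputStream(raw);
        }
        if (fileName.endsWith(".zip")) {
            ZipInputStream zip = new ZipInputStream(raw);
            if (zip.getNextEntry() == null) {
                throw new IOException("empty zip archive: " + fileName);
            }
            return zip; // positioned at the first entry
        }
        if (fileName.endsWith(".bz2") || fileName.endsWith(".Z")) {
            throw new IOException("no JDK decoder for " + fileName);
        }
        return raw; // not compressed
    }

    /** Helper for the demo: gzip-compresses a byte array. */
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] packed = gzip(">demo\nMKTAYI\n".getBytes());
        System.out.print(new String(decompress("demo.fasta.gz", packed)));
    }
}
```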
Problems:
- pdb1a6y.ent.Z: the last residue in SEQRES is a MET that does not appear in the ATOM records
- pdb1acb: SEQRES 1 I 70 THR GLU PHE GLY: where is it?
SEE_CLASS:PDB_Parser
SEE_CLASS:DSSP_Parser
SEE_CLASS:NumberedSequence_Parser
SEE_CLASS:ProteinParser
SEE_CLASS:SingleFastaParser
SEE_CLASS:StupidParser
SEE_CLASS:XML_SequenceParser
SEE_CLASS:SwissHeaderParser
IGNORE_SEQRES
static final long IGNORE_SEQRES
- See Also:
- Constant Field Values
SIDE_CHAIN_ATOMS
static final long SIDE_CHAIN_ATOMS
- See Also:
- Constant Field Values
SEQUENCE_FEATURES
static final long SEQUENCE_FEATURES
- See Also:
- Constant Field Values
parse
boolean parse(Protein p,
long options,
BA text)
- Parameters:
text
- the entire file contents
For performance reasons it is passed as a byte array and not as a String object.
- Returns:
- true on success; false if the file is not in a format this parser accepts
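Because the options parameter is a long and the constants above are long values, one plausible reading is that they are bit flags combined with bitwise OR. The sketch below illustrates that reading only. The numeric values are placeholders (the real ones are listed under Constant Field Values), and the class and the isSet helper are hypothetical.

```java
// Hypothetical sketch; constant values are placeholders, not the real ones.
final class OptionFlagsSketch {
    static final long IGNORE_SEQRES = 1L << 0;
    static final long SIDE_CHAIN_ATOMS = 1L << 1;
    static final long SEQUENCE_FEATURES = 1L << 2;

    /** True if the given flag bit is present in options. */
    static boolean isSet(long options, long flag) {
        return (options & flag) != 0;
    }

    public static void main(String[] args) {
        long options = IGNORE_SEQRES | SEQUENCE_FEATURES;
        System.out.println(isSet(options, IGNORE_SEQRES));
        System.out.println(isSet(options, SIDE_CHAIN_ATOMS));
    }
}
```

Under this assumption, a caller would pass a value such as IGNORE_SEQRES | SEQUENCE_FEATURES as the options argument of parse.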