The Format of UniProt's Subcellular Location Annotation

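The subcellular location annotation in UniProt follows a fixed format, one that can be described with a regular expression. This format was introduced in the 2007-10-23 release.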

The original announcement reads as follows (link: http://www.uniprot.org/help/2007/10/23/release):

    Syntax modification of the 'Subcellular location' subtopic
    We have structured the 'Subcellular location' subtopic (CC SUBCELLULAR LOCATION lines in the flat file) in order to improve the consistency of annotation and to allow to parse its content.
    The new format of SUBCELLULAR LOCATION in the flat file is:

    CC -!- SUBCELLULAR LOCATION:(( Molecule:)?( Location\.)+)?( Note=Free_text( Flag)?\.)?

    Where:
        Molecule: Isoform, chain or peptide name
        Location = Subcellular_location( Flag)?(; Topology( Flag)?)?(; Orientation( Flag)?)?
            Subcellular_location: SL-line of subcell.txt ID-record
            Topology: SL-line of subcell.txt IT-record
            Orientation: SL-line of subcell.txt IO-record
        Flag = \(By similarity|Probable|Potential\)

    Note: Perl-style multipliers indicate whether a pattern (as delimited by parentheses) is optional (?) or may occur 1 or more times (+). Alternative values are separated by a pipe symbol (|).
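
The grammar quoted above can be turned into a small parser. The following is only a minimal sketch in Python, not UniProt's own code: the function name `parse_subcellular_location` and the returned dictionary layout are hypothetical, and it assumes the wrapped CC lines of the flat file have already been joined into a single line. Because a location with two parts could be either "location; topology" or "location; orientation", the sketch returns the parts in order and leaves that distinction to a lookup in subcell.txt.

```python
import re

# Optional evidence flag, as defined by "Flag" in the 2007 grammar.
FLAG = r"(?: \((?P<flag>By similarity|Probable|Potential)\))?"

# One part of a Location: a term followed by an optional flag.
PART_RE = re.compile(r"^(?P<term>[^;.]+?)" + FLAG + r"$")

# Whole annotation: optional molecule, zero or more locations, optional note.
LINE_RE = re.compile(
    r"^(?:CC\s+-!-\s+)?SUBCELLULAR LOCATION:"
    r"(?: (?!Note=)(?P<molecule>[^:.;]+):)?"   # ( Molecule:)?
    r"(?P<locations>(?: [^.]+\.)*?)"           # ( Location\.)*
    r"(?: Note=(?P<note>.+))?$"                # ( Note=Free_text\.)?
)


def parse_subcellular_location(line):
    """Parse one joined SUBCELLULAR LOCATION line; return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    locations = []
    for location in re.findall(r" ([^.]+)\.", match.group("locations")):
        parts = []
        for part in location.split("; "):
            part_match = PART_RE.match(part)
            if part_match:
                parts.append((part_match.group("term"), part_match.group("flag")))
            else:
                parts.append((part, None))
        locations.append(parts)
    return {
        "molecule": match.group("molecule"),
        "locations": locations,
        "note": match.group("note"),
    }


if __name__ == "__main__":
    example = (
        "CC   -!- SUBCELLULAR LOCATION: Isoform 1: Cell membrane; "
        "Single-pass type I membrane protein (By similarity). "
        "Nucleus (Probable). Note=Shuttles between compartments."
    )
    print(parse_subcellular_location(example))
```

The example line is made up for illustration. The note is kept as raw text, including its final period and any trailing flag, since free text may itself contain periods.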

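In addition, since 2014-10-01 UniProt has used ECO codes to indicate how an annotation was established, and the original Flag part has largely disappeared. ECO stands for the Evidence Codes Ontology (now named the Evidence and Conclusion Ontology), defined at: http://www.evidenceontology.org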

Newly generated datasets therefore need to take ECO codes into account, and the data-filtering step has to change accordingly.
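
As a rough illustration of what such a filtering step might look like, the sketch below strips evidence tags from annotation terms and keeps only terms backed by an accepted ECO code. It rests on assumptions not stated in this post: that evidence appears in the flat file as a brace-delimited tag such as `{ECO:0000269|PubMed:12345678}` after the term, and that `ECO:0000269` (experimental evidence used in manual assertion) is the code you want to keep. The function names `split_evidence` and `keep_supported` are hypothetical, and the sample terms and PubMed identifier are invented.

```python
import re

# Assumed tag shape: "{ECO:0000269|PubMed:12345678, ECO:0000250}" after a term.
EVIDENCE_RE = re.compile(r"\s*\{([^{}]*)\}")
ECO_RE = re.compile(r"ECO:\d{7}")


def split_evidence(term):
    """Return (bare_term, [ECO ids]) for one annotated term."""
    codes = []
    for block in EVIDENCE_RE.findall(term):
        codes.extend(ECO_RE.findall(block))
    return EVIDENCE_RE.sub("", term).strip(), codes


def keep_supported(terms, accepted=frozenset({"ECO:0000269"})):
    """Keep the bare names of terms carrying at least one accepted ECO code."""
    kept = []
    for term in terms:
        name, codes = split_evidence(term)
        if any(code in accepted for code in codes):
            kept.append(name)
    return kept


if __name__ == "__main__":
    terms = [
        "Cytoplasm {ECO:0000269|PubMed:12345678}",  # invented example
        "Nucleus {ECO:0000250}",                    # invented example
    ]
    print(keep_supported(terms))
```

Passing a different `accepted` set changes the stringency of the filter, for example to also admit similarity-based annotations.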

Separately, UniProt long ago adopted a controlled vocabulary for each field of the subcellular location annotation. It is defined in subcell.txt.
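
To check parsed terms against that vocabulary, the SL values can be collected per record type. The sketch below is hedged: it assumes subcell.txt uses two-letter line codes followed by three spaces, that each record starts with an ID, IT or IO line and ends with `//`, and that SL values end with a period. Only the ID/IT/IO record types and the SL line come from the release note quoted earlier; the rest of the layout and the name `load_vocabulary` are assumptions, and the sample lines are illustrative rather than copied from the real file.

```python
def load_vocabulary(lines):
    """Collect SL values grouped by record type (ID, IT or IO)."""
    vocab = {"ID": set(), "IT": set(), "IO": set()}
    kind = None  # record type of the entry currently being read
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):
            kind = None          # end of the current record
            continue
        if line[2:5] != "   ":
            continue             # not a "XX   value" line (header, blank, ...)
        code, value = line[:2], line[5:].strip()
        if code in vocab:
            kind = code
        elif code == "SL" and kind is not None:
            vocab[kind].add(value.rstrip("."))
    return vocab


if __name__ == "__main__":
    sample = [
        "ID   Cell membrane.",
        "DE   Illustrative definition text.",
        "SL   Cell membrane.",
        "//",
        "IT   Single-pass membrane protein.",
        "SL   Single-pass membrane protein.",
        "//",
        "IO   Cytoplasmic side.",
        "SL   Cytoplasmic side.",
        "//",
    ]
    print(load_vocabulary(sample))
```

With the three sets in hand, a term from a parsed Location can be classified as a subcellular location, a topology or an orientation by simple set membership.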
