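The subcellular location annotation in UniProt follows a fixed format, one that can be described with a regular expression. This format dates back to the 2007-10-23 release.
The original announcement is quoted below (link: http://www.uniprot.org/help/2007/10/23/release):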
Syntax modification of the 'Subcellular location' subtopic

We have structured the 'Subcellular location' subtopic (CC SUBCELLULAR LOCATION lines in the flat file) in order to improve the consistency of annotation and to allow to parse its content. The new format of SUBCELLULAR LOCATION in the flat file is:

    CC   -!- SUBCELLULAR LOCATION:(( Molecule:)?( Location\.)+)?( Note=Free_text( Flag)?\.)?

Where:

    Molecule: Isoform, chain or peptide name
    Location = Subcellular_location( Flag)?(; Topology( Flag)?)?(; Orientation( Flag)?)?
    Subcellular_location: SL-line of subcell.txt ID-record
    Topology: SL-line of subcell.txt IT-record
    Orientation: SL-line of subcell.txt IO-record
    Flag = \(By similarity|Probable|Potential\)

Note: Perl-style multipliers indicate whether a pattern (as delimited by parentheses) is optional (?) or may occur 1 or more times (+). Alternative values are separated by a pipe symbol (|).
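To make the grammar above concrete, here is a minimal Python sketch that splits one pre-2014 SUBCELLULAR LOCATION comment into molecule, locations, and note. It is not an official UniProt parser. It assumes the CC lines have already been joined into a single string, that term names contain no periods, semicolons, or colons, and that the note text does not need further splitting. The function names and the sample comment are made up for illustration. Because the second term of a location can be either a topology or an orientation, the sketch only keeps the terms in order; telling them apart would require looking each one up in subcell.txt.

```python
import re

FLAG = r"(?:By similarity|Probable|Potential)"
# One term, optionally followed by a flag in parentheses.
TERM_RE = re.compile(rf"^(?P<term>.+?)(?:\s+\((?P<flag>{FLAG})\))?$")
# The line prefix; the "CC -!-" part is optional so a bare comment also works.
PREFIX_RE = re.compile(r"^(?:CC\s+-!-\s+)?SUBCELLULAR LOCATION:\s*")
# Optional leading "Molecule:" (isoform, chain or peptide name).
MOLECULE_RE = re.compile(r"^(?P<molecule>[^.;:]+):\s*")


def parse_term(text):
    """Split 'Name (Flag)' into its name and optional flag."""
    m = TERM_RE.match(text.strip())
    return {"term": m.group("term"), "flag": m.group("flag")}


def parse_subcellular_location(comment):
    """Parse one already-joined SUBCELLULAR LOCATION comment (old Flag syntax)."""
    body = PREFIX_RE.sub("", " ".join(comment.split()))
    result = {"molecule": None, "locations": [], "note": None}

    # Split off the optional trailing note first, so that a colon inside
    # the free text cannot be mistaken for a molecule name.
    head, sep, note = body.partition("Note=")
    if sep:
        result["note"] = note.strip()
    head = head.strip()

    m = MOLECULE_RE.match(head)
    if m:
        result["molecule"] = m.group("molecule").strip()
        head = head[m.end():]

    # Each location ends with a period; its parts are separated by semicolons
    # (location, then optional topology and/or orientation).
    for location in head.split("."):
        parts = [p for p in location.split(";") if p.strip()]
        if parts:
            result["locations"].append([parse_term(p) for p in parts])
    return result


if __name__ == "__main__":
    sample = ("CC   -!- SUBCELLULAR LOCATION: Isoform 1: Cell membrane; "
              "Single-pass membrane protein (By similarity). "
              "Nucleus (Potential). Note=Illustrative free text.")
    print(parse_subcellular_location(sample))
```

Splitting step by step rather than using one large regular expression keeps each rule of the grammar visible and makes it easier to adjust when real entries deviate from the published format.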
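In addition, starting with the 2014-10-01 release, UniProt began using ECO codes to indicate the evidence behind an annotation, and the old Flag part has largely disappeared. ECO stands for Evidence Codes Ontology (now named the Evidence and Conclusion Ontology), defined at http://www.evidenceontology.org
So datasets generated now need to take ECO codes into account, and the data filtering step has to change accordingly.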
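As a rough illustration of how the filtering step might change, the Python sketch below keeps only the terms that carry an accepted ECO code. It rests on assumptions that are not stated in the note above: that evidence appears in curly braces after each term in the flat file (for example `{ECO:0000269|PubMed:...}`), and that ECO:0000269 marks experimental evidence while ECO:0000250 roughly corresponds to the old "By similarity" flag. Check these against the current UniProt documentation before relying on them. The function names, the sample terms, and the PubMed identifiers are hypothetical.

```python
import re

# Evidence block attached to a term, e.g. " {ECO:0000269|PubMed:12345678}".
EVIDENCE_RE = re.compile(r"\s*\{(?P<evidence>[^}]*)\}")
ECO_RE = re.compile(r"ECO:\d{7}")

# Assumption: ECO:0000269 = experimental evidence used in manual assertion.
EXPERIMENTAL = {"ECO:0000269"}


def split_evidence(term):
    """Return (name, set of ECO codes) for one annotated term."""
    codes = set()
    for m in EVIDENCE_RE.finditer(term):
        codes.update(ECO_RE.findall(m.group("evidence")))
    name = EVIDENCE_RE.sub("", term).strip()
    return name, codes


def keep_supported(terms, accepted=EXPERIMENTAL):
    """Keep the names of terms that have at least one accepted ECO code."""
    kept = []
    for term in terms:
        name, codes = split_evidence(term)
        if codes & accepted:
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Made-up terms for illustration only.
    terms = [
        "Cell membrane {ECO:0000269|PubMed:12345678}",
        "Nucleus {ECO:0000250}",
        "Cytoplasm {ECO:0000250|UniProtKB:P12345, ECO:0000269|PubMed:12345678}",
    ]
    print(keep_supported(terms))
```

Passing a different `accepted` set makes the same function usable for looser filters, for instance one that also admits similarity-based annotations.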
Also, UniProt long ago introduced a controlled vocabulary for each field of the subcellular location annotation; it is defined in subcell.txt.