Biopython学习笔记[pdb篇]

由于主要自己做蛋白计算方面，所以主要关注的为PDB方面内容 Biopython是python用于做生物计算比较有名的底层包，这里总结自己的一些笔记，因为自己也是刚刚学，若有错误的地方请斧正。

1.PDB模块学习

1.1 读取pdb文件

我们以2b3p为例，我的文件结构如下

file -readpdb.py

dir -pdb

file –2b3p.pdb

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


#-*-coding:utf-8-*-

from Bio.PDB.PDBParser import PDBParser

parser = PDBParser(PERMISSIVE=1)

structure_id = "2b3p"
filename = "pdb/2b3p.pdb"

structure = parser.get_structure(structure_id, filename)

#获取头部信息
print(structure.header["resolution"])
print(structure.header["keywords"])

头部信息一般可以获取如下:

name, head, deposition_date, release_date, structure_method, resolution, structure_reference(参考文献)，journal_reference, author, compound(复合物)，has_missing_residues, missing_residues, 和astral

1.2 数据结构表示

如官方给的图所示:

Structure ,Model, Chain 和Residue 均为Entity 的基础类的子类，Atom 部分继承了Entity类，因为Atom没有children类

杂乱的 atoms 和residues分别属于 DisorderedAtom 和 DisorderedResidue 类，其属于DisorderedEntityWrapper 子类

获取模型:

1

print(structure.get_list())

输出

1

[<Model id=0>]

得到模型并获取链和残基:

1
2
3
4
5


model = structure[0]
print(chain.get_list())

chain = model['A']
print(chain.get_list())

输出

1
2


[<Chain id=A>]
[<Residue SER het=  resseq=2 icode= >, <Residue LYS het=  resseq=3 icode= >

获取当前位置的层级信息:

1

print(chain.get_full_id())

输出:

1

('2b3p', 0, 'A')

若需要获得这个实例的id可以使用

1

chain.get_id()

残基有许多有用的方法:

1
2
3
4


>>> residue.get_resname()       # returns the residue name, e.g. "ASN"
>>> residue.is_disordered()     # returns 1 if the residue has disordered atoms
>>> residue.get_segid()         # returns the SEGID, e.g. "CHN1"
>>> residue.has_id(name)        # test if a residue has a certain atom

原子的方法就更多了:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


>>> a.get_name()       # atom name (spaces stripped, e.g. "CA")
>>> a.get_id()         # id (equals atom name)
>>> a.get_coord()      # atomic coordinates
>>> a.get_vector()     # atomic coordinates as Vector object
>>> a.get_bfactor()    # isotropic B factor
>>> a.get_occupancy()  # occupancy
>>> a.get_altloc()     # alternative location specifier
>>> a.get_sigatm()     # 原子参数标准偏差
>>> a.get_siguij()     # standard deviation of anisotropic B factor
>>> a.get_anisou()     # anisotropic B factor
>>> a.get_fullname()   # atom name (with spaces, e.g. ".CA.")

例子: 旋转原子

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14


from Bio.PDB.vectors import rotaxis
import math
n = residue["N"].get_vector()
c = residue["C"].get_vector()
ca = residue["CA"].get_vector()
# 转化为中心
n = n - ca
c = c - ca
# 旋转角，第一个参数为theta，第二个为vector
rot = rotaxis(-math.pi * 120.0 / 180.0, c)
# 转化为向量
cb_at_origin = n.left_multiply(rot)
cb = cb_at_origin + ca
print(cb)

输出的结果为:

1

<Vector 39.83, 16.82, -14.24>

结构object“巡航”

基础操作

查找结构的所有原子:

1
2
3
4
5


for model in structure:
    for chain in model:
        for residue in chain:
            for atom in residue:
                print(atom.get_name())

同样其有一个get_atoms()方法:

1
2


for atom in structure.get_atoms():
    print(atom)

当然chain也有这个方法，这里不展开了

获取residue列表:

1
2


for residue in structure.get_residues():
    print(residue)

获取所有的hetero残基:

1
2
3
4
5


for residue in chain.get_list():
    residue_id = residue.get_id()
    hetfield = residue_id[0]
    if hetfield[0] == "H":
        print(residue_id)

输出结果为:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18


('H_CRO', 66, ' ')
('H_ CD', 801, ' ')
('H_ CD', 802, ' ')
('H_ CD', 803, ' ')
('H_ CD', 804, ' ')
('H_ CD', 805, ' ')
('H_ CD', 806, ' ')
('H_ CD', 807, ' ')
('H_ CD', 808, ' ')
('H_ CD', 809, ' ')
('H_ACY', 701, ' ')
('H_ACY', 703, ' ')
('H_ACY', 704, ' ')
('H_ACY', 705, ' ')
('H_ACY', 706, ' ')
('H_ACY', 707, ' ')

Process finished with exit code 0

H的原因为hetero残基在残基的第一个框框里面为“H”开头

打印B factor大于10的所有原子坐标

1
2
3
4
5
6
7


for model in structure.get_list():
    for chain in model.get_list():
        for residue in chain.get_list():
            if residue.has_id("CA"):
                ca = residue["CA"]
                if ca.get_bfactor() > 10.0:
                    print(ca.get_coord())

提炼多肽

通过从结构内提炼多肽，提炼序列

1
2
3
4
5


from Bio.PDB.Polypeptide import PPBuilder

ppb = PPBuilder()
for pp in ppb.build_peptides(structure):
    print(pp.get_sequence())

输出结果为:

1
2


SKGEELFTGVVPILVELDGDVNGHKFSVRGEGEGDATNGKLTLKFICTTGKLPVPWPTLVTTL
VQCFSRYPDHMKRHDFFKSAMPEGYVQERTISFKDDGTYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNFNSHNVYITADKQKNGIKANFKIRHNVEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSVLSKDPNEKRDHMVLLEFVTAAGITHG

3. 分析结果

3.1 平均距离

1
2
3
4
5
6
7
8
9


residue1 = chain[2]
residue2 = chain[3]

ca1 = residue1["CA"]
ca2 = residue2["CA"]

# - 符号已经重载
distance = ca1 - ca2
print(distance)

输出的结果为:

1

3.799378

3.2 计算角度和二面角

 1
 2
 3
 4
 5
 6
 7
 8
 9
10


import Bio.PDB.vectors as vec

n = residue1["N"].get_vector()
ca = residue1["CA"].get_vector()
c = residue1["C"].get_vector()
o = residue1["O"].get_vector()
# 计算角度
print(vec.calc_angle(n, ca, c))
# 计算二面角
print(vec.calc_dihedral(n, ca, c, o))

3.3 获取标准残基的内部坐标

TODO：待研究

1
2
3
4
5
6


model.atom_to_internal_coordinates()
for r in model.get_residues():
    if r.internal_coord:
        print(r,
              r.internal_coord.get_angle("psi")
              )

TODO:可以模仿制作拉试图

3.4 检测原子原子之间的联系

即临近搜索

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


from Bio.PDB.NeighborSearch import NeighborSearch
from Bio.PDB.Selection import unfold_entities

atoms = unfold_entities(structure,"A")
print(atoms)
# print(structure.get_atoms())
mysearch = NeighborSearch(atoms,10)
#
allinfo = mysearch.search_all(10.0,"R")
for all in allinfo:
    print(all)

可以搜索半径10埃范围的氨基酸残基。