THE RECEPTOR

INTRODUCTION

Very recently, a new peptide signaling system was discovered in B. subtilis phages that allows the phages to decide whether to lyse their host or to integrate into its genome as a prophage [1]. This signaling system, called arbitrium, was the first of its kind and consists of a signaling peptide, AimP, that is sensed by an intracellular receptor, AimR. The AimR tested in [1] was derived from phage phi3T and was shown to activate a gene called AimX in the absence of the AimP peptide. Upon peptide binding, AimR from phi3T (phAimR) no longer binds to the AimX promoter and, consequently, expression of AimX is reduced [1]. This AimR-AimP system was shown to be very specific and orthogonal to other AimR-AimP pairs, since AimP peptides from other phages did not bind to phAimR [1], which makes the system potentially very attractive for engineering synthetic cell-cell communication. Additionally, >100 AimR-AimP homologs were found in other B. subtilis bacteriophages, yet only 2 were characterized [1]. When this project started, nothing was known about the structural and mechanistic details of how the AimR-AimP system regulates gene expression, except that AimR has a tetratricopeptide repeat (TPR) domain and is structurally related to the RNPP family of quorum sensing systems of Gram-positive bacteria [1]. Over the course of this project, two papers appeared presenting detailed structural models of AimR derived from phi3T (phAimR) and AimR derived from SPbeta (spAimR) [2, 3]. Interestingly, the papers showed that the two AimR molecules differ strongly both in the structural details of how they regulate gene expression and in the infection dynamics they control [2, 3]. These results suggest that, in order to fully utilize the potential of this discovery, it is imperative to understand (i) how the AimR homologs regulate gene expression and (ii) which AimR homologs can be repurposed into synthetic biology tools.

Here we present an integrated approach to predict the 3D structure of any AimR homolog, use these structures to gain insights into how the AimR protein regulates downstream gene expression in the natural context and then use this knowledge to repurpose the AimR protein as a synthetic communication system in silico and in vivo inside a new context. This modeling approach will prove extremely valuable in helping to decide which of the >100 AimR homologs have interesting properties for synthetic biology and how to fully exploit their potential.

GENERATING STRUCTURAL MODELS FOR AimR

The first step in generating deep insights into the structure of the AimR homologs is to predict their 3D structure. To this end, we predicted the structure of phAimR with a range of structure prediction tools and compared each prediction with the experimentally determined structure of phAimR. The results are shown in Table 1.

**Table 1**: Comparison of different *in silico* protein structure prediction tools: SWISS-MODEL [4], CPHmodels [5], I-TASSER [6] and Phyre2 [7]. Protein structures were visualized with UCSF Chimera [8].

From these results, we concluded that I-TASSER gave the prediction closest to the real AimR structure. However, I-TASSER's own quality measure (C-score) indicated that this model still only described the global topology of the protein. Since I-TASSER is known not to function optimally on multidomain proteins [9], we decided to split AimR into its two domains (the N-terminal DNA binding domain and the C-terminal peptide binding domain) and to rerun I-TASSER on each domain separately. The domain boundaries were predicted using the ThreaDomEx server [10]. The results of this analysis are shown in Figure 1.

**Figure 1**: ThreaDomEx prediction of the domain boundaries of spAimR. (X-axis) Position of the amino acid in the AimR sequence. (Y-axis) Domain conservation score.

Although the automatic annotation system suggested that AimR consists of 3 domains, visual inspection of the graph suggests (i) an N-terminal region of low domain conservation, probably a loop, (ii) a major N-terminal DNA binding domain, (iii) a poorly conserved connecting region, probably again a loop, and (iv) a major conserved C-terminal domain. This is in line with the experimental structure [2]. Based on these results, phAimR was split into 2 parts: residues 1-191 and residues 192-378. These domains were rerun using I-TASSER and compared to the experimentally determined protein structure. The results are shown in Figure 2.

**Figure 2**: Overlay of experimental phAimR and predicted structure by running the two domains separately. We show near perfect overlay between the predicted and the experimental structure.
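The splitting step described above can be sketched in a few lines of Python. This is only an illustrative sketch: the helper names `split_domains` and `to_fasta` are hypothetical, and the sequence used here is a placeholder of the right length, not the real phAimR sequence. The boundary (191) is the one reported in the text.

```python
def split_domains(sequence, boundary):
    """Split a protein sequence after residue `boundary` (1-based, inclusive)."""
    if not 0 < boundary < len(sequence):
        raise ValueError("boundary must fall strictly inside the sequence")
    return sequence[:boundary], sequence[boundary:]


def to_fasta(name, sequence, width=60):
    """Format one sequence as a FASTA record with wrapped lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Placeholder of 378 residues; NOT the real phAimR sequence (assumption).
placeholder = "M" + "A" * 377
n_term, c_term = split_domains(placeholder, 191)

# Each record can then be submitted to the prediction server separately.
print(to_fasta("phAimR_1-191", n_term))
print(to_fasta("phAimR_192-378", c_term))
```

Replacing the placeholder with the real sequence of an AimR homolog, and the boundary with the one suggested by the domain prediction, gives the two inputs for the separate prediction runs.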

Visually, splitting AimR into its two major domains vastly improved the prediction quality, giving a near perfect overlap. Next, the overlap between the AimR domains predicted by I-TASSER and the experimental structure was quantified by calculating the root-mean-square deviation (RMSD), which is smaller the better two structures agree. The results for I-TASSER and CPHmodels are shown in Figure 3. They show that I-TASSER produces a model that is very close to the experimental structure, confirming the visual observation above.

**Figure 3**: Comparison of the structures calculated by I-TASSER and CPHmodels with the experimental structure (PDB ID 5zvv), measured as the RMSD between corresponding α-carbons of the structures. (Top) N-terminal domain of AimR. (Bottom) C-terminal domain of AimR.
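For readers who want to reproduce this kind of comparison outside a visualization tool, the RMSD between matched α-carbons can be computed as sketched below. This is a minimal sketch under stated assumptions: the two structures are assumed to be already reduced to equally long, residue-matched coordinate arrays, the optimal superposition uses the standard Kabsch algorithm (not necessarily what was used for Figure 3), and the coordinates in the example are toy values, not taken from any AimR structure.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    if P.shape != Q.shape or P.ndim != 2 or P.shape[1] != 3:
        raise ValueError("P and Q must both have shape (N, 3)")
    # Remove translation by centring both sets on their centroids.
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    # Optimal rotation from the SVD of the covariance matrix (Kabsch).
    H = Pc.T @ Qc
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (a reflection).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = Pc @ R.T
    return float(np.sqrt(np.mean(np.sum((P_rot - Qc) ** 2, axis=1))))


# Toy coordinates standing in for matched alpha-carbon positions (assumption).
experimental = np.array(
    [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0], [0.0, 1.5, 1.0]]
)
predicted = experimental + np.array([0.1, -0.2, 0.05])  # pure translation
print(kabsch_rmsd(predicted, experimental))
```

Because the superposition removes rigid-body translation and rotation, only genuine differences in shape between the predicted and experimental domains contribute to the value.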

In summary, we find that I-TASSER, while computationally expensive, is the best of the tested structure prediction tools for new AimR homologs. Our results also show that prediction accuracy improves drastically when the DNA binding and peptide binding domains of the AimR homolog are predicted separately.

EXPLORING STRUCTURAL AND MECHANISTIC DETAILS OF phAimR AND spAimR

Analyzing peptide recognition and structural transitions

Once the structures of the AimR homologs are obtained, they can be interrogated to elucidate the molecular details of their function. This has been done in great detail in the very recently published articles (17 September and 15 October 2018) describing the experimental structures of spAimR (discussed here) and phAimR [2, 3]. We therefore discuss the strategies used in the spAimR paper to elucidate the molecular details of gene regulation by AimR, and propose software tools that can fill the gaps when no experimental structure is available.

The first step is to open the protein structures in a molecular visualization tool for inspection. We used Chimera for all of our work, but other tools exist [8]. Once the structures are opened, they can be interrogated in various ways. Firstly, since DNA is negatively charged, transcription factors tend to have positively charged patches that mediate interactions with the DNA. The electrostatic surface potential of a protein can be visualized in most molecular visualization tools and gives a clue as to where the DNA binding site is located. Using this method, the authors identified the amino acids involved in DNA binding, which were afterwards confirmed experimentally [3]. Another powerful approach is simple visual inspection of the structure. In this way, the authors saw that the interaction between the AimR monomers is mediated by a C-terminal capping helix that contacts its counterpart on the other monomer, mainly through van der Waals interactions [3].

Next, the structure without the peptide (apo-spAimR) was compared with the structure containing the peptide. The authors simply cocrystallized AimR with its peptide ligand, but such a structure is not available for predicted AimR homologs. Instead, the predicted AimR homolog can be docked with its cognate peptide using a variety of peptide docking tools (for a review, see [12]). Once the peptide is docked, visual inspection can help the modeller decide which amino acids are involved in peptide binding and what the basis of this interaction is. The authors of the paper used this to show that the peptide is bound by an extensive network of hydrogen bonds and hydrophobic contacts with a number of AimR amino acids, which were experimentally confirmed to be necessary for peptide binding. Incidentally, the most important amino acids were confirmed computationally using a multiple sequence alignment of 8 different AimR homologs.

A simple but powerful way to investigate the effect of peptide binding on the AimR conformation is to overlay the peptide-bound and apo forms of the AimR homolog. This was used very effectively in the paper to show that the effect of peptide binding on the AimR structure is very different for spAimR and phAimR. Whereas peptide binding only causes a slight opening of the DNA binding domain in spAimR, it causes a bigger structural change and even dissociation into monomers for phAimR [2]. The paper also demonstrated that this difference in mechanism has an impact on infection dynamics and thus on regulation, underlining the need for a deep structural understanding of the AimR homologs prior to their use in a synthetic context.
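The idea of checking important residues against a multiple sequence alignment can be illustrated with a simple per-column conservation score. The sketch below is an assumption-laden illustration and not the analysis used in the cited paper: the function name `column_conservation` is hypothetical, the score is simply the fraction of sequences sharing the most common residue, and the three aligned sequences are made-up toy strings rather than AimR homologs.

```python
from collections import Counter


def column_conservation(alignment):
    """Per-column fraction of sequences that carry the most common residue.

    Gap characters ('-') are never counted as a residue, but they stay in the
    denominator, so gappy columns score low. An all-gap column scores 0.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [aa for aa in column if aa != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


# Toy alignment (hypothetical sequences, not real AimR homologs).
toy_alignment = ["MKLNE", "MKINE", "MR-NE"]
for position, score in enumerate(column_conservation(toy_alignment), start=1):
    print(position, round(score, 2))
```

Residues predicted by docking to contact the peptide would be expected to sit in high-scoring columns if they are functionally important across homologs.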

In summary, our work on peptide recognition by phAimR and the conformational changes it induces was overshadowed by the very recent publication of the experimental structures of phAimR and spAimR. Nevertheless, we have discussed the strategies used in these papers and how they can be applied to obtain detailed information on any AimR homolog, and we expect that this will be useful for the iGEM community.

Analyzing DNA binding modes of AimR by docking

In order to expand on the results obtained by [2, 3], we wanted to see if we could computationally predict the exact binding sites of phAimR and spAimR, the latter of which was delineated in some detail in [3]. Knowing the exact binding site would greatly facilitate the use of AimR as a synthetic tool, but it is very cumbersome to obtain experimentally, partially because the low number of known binding sites precludes statistical analysis of ChIP data. As a consequence, laborious DNase footprinting assays or equivalent methods have to be used. We also wanted to assess the binding strength and visualize the interactions between the protein and its cognate DNA to see if AimR binds its site strongly enough to be repurposed into a repressor (see 4.1). The cognate DNA used for docking was pAimX(full) and pAimX(short), as used by the experimental team. The pAimX promoter sequences were first converted into 3D structures using the 3D-DART server [13]. These 3D structures were subsequently docked with phAimR and spAimR using the Chimera docking tools [8]. Both apo and ligand-bound versions were docked. Figures 4 and 5 show the interactions of spAimR with the pAimX promoters and Figure 6 shows the interaction of phAimR with the pAimX(full) promoter.

**Figure 4**: Different forms of phage SPbeta AimR binding to the AimX promoter. The apo form of AimR is shown in light brown and the peptide-bound form in orange (peptides shown as a green surface). Chain A of both forms is bound to the AimX promoter (pink and blue). Side (4.1) and front (4.2) views of the complex are shown.

**Figure 5**: Different forms of phage SPbeta AimR binding to the AimX(full) promoter. The apo form of AimR is shown in light brown, the peptide-bound form in blue (chain A in dark blue, chain B in light blue) and the peptides as a green surface. The full AimX promoter sequence is shown in yellow; the short sequence it contains, shown in red, is bound to both forms of AimR.

**Figure 6**: Interaction of different forms of phage phi3T AimR with the AimX promoter. The apo form of AimR is shown in light brown, the peptide-bound form in blue and the peptide as a green surface. The full AimX promoter sequence is shown in yellow, with the short sequence it contains in red.

These results show that the docking results do not support the experimentally observed binding dynamics for either spAimR or phAimR [2, 3]. The absence of binding by apo-phAimR may be explained by its reportedly more closed conformation, which seals off the DNA binding site for the docking tool, but not in vivo [3]. Next, we attempted to establish the predicted binding sequences (i.e. the DNA found inside the binding cleft) in all these conditions. As an extra control, we docked spAimR with its experimentally determined binding region, removed the predicted binding site and reran the docking algorithm to see if it would predict a new binding site. These results are found in Figure 7. They show that the predicted binding sites are different in every case. Hence, we concluded that the protein-DNA docking tools in our hands were sufficiently sensitive to dock the correct DNA binding domain of AimR to the DNA, but not sufficiently sensitive to discriminate the specific AimR binding site.

Protein	DNA	Predicted binding sequence
spAimR	pAimX(full+short)	AGTTCCAGAAA
Ligand-phAimR	pAimX(full+short)	CTAATTT
spAimR	Experimental binding site	TTAGGTTTTAA
spAimR	Experimental binding site - predicted	ATAACATCTAGT

**Figure 7**: Summary of predicted binding sequences for spAimR and phAimR for binding to pAimX and the experimentally determined spAimR site.

In summary, we have attempted to expand the results from literature by studying the interactions between spAimR and phAimR and their cognate binding sites. Our results showed that our docking tools were insufficient to do this job and we recommend trying a different approach.

USING STRUCTURAL KNOWLEDGE TO GUIDE EXPERIMENTAL DESIGN

AimR as an activator in E. coli: Predicted interactions with E. coli rpoA

Using the phAimR structure, we wanted to know whether phAimR would still function as an activator in E. coli. A major mechanism for transcriptional activation at bacterial promoters is the interaction of the transcriptional activator with the α subunit of RNA polymerase [11]. This is especially the case for transcription factors that bind outside the -35 region of their cognate promoter [11]. In order to test whether AimR stimulates gene expression in this way, we performed protein-protein docking between rpoA from B. subtilis and phAimR using Chimera. The rpoA models used here were generated using I-TASSER, since we could not find any rpoA structures that were not in complex with a transcription factor or another protein. The rpoA protein sequences needed for this were obtained from BioCyc [14, 15]. The results of this docking are found in Figure 8.

**Figure 8**: Interactions between *B. subtilis* rpoA and phAimR. (Top) Interaction with apo-phAimR. (Bottom) Interaction with ligand-bound phAimR.

These results suggest that the CTD of the α subunit of B. subtilis RNA polymerase can readily interact with both ligand bound and apo phAimR. However, an extra interaction is observed for the apo phAimR with the B. subtilis rpoA that is not present for the ligand-bound version. This extra interaction might be necessary for stably fixing RNA polymerase to its cognate promoter. Next, we wanted to see how phAimR binds to the α subunit of E. coli RNA polymerase. Hence, rpoA of E. coli was docked with phAimR. The docking results are found in Figure 9.

**Figure 9**: Interactions between *E. coli* rpoA and phAimR. (Top) Interaction with apo-phAimR. (Bottom) Interaction with ligand-bound phAimR.

These results show that AimR can interact with E. coli rpoA in its ligand bound and apo form but only with the CTD tail. Neither structure gives the extra interaction with rpoA observed for apo phAimR. Hence, we concluded that the ability of AimR to interact with RNA polymerase and stimulate gene expression in E. coli might be impaired.

In summary, we have docked phAimR with both B. subtilis and E. coli rpoA to see if AimR interacts with rpoA as part of its regulatory mechanism and if this interaction is maintained in E. coli. We observed an interaction between apo-phAimR and rpoA of B. subtilis that is not found in the ligand-bound case or in the interaction with rpoA of E. coli. We concluded that AimR activation may not be maintained when using AimR in E. coli and recommended that the experimental team redesign their genetic construct as a repressor, so that this loss of interaction would not cause any problems. Unfortunately, our attempt to computationally estimate phAimR-DNA interactions yielded inconclusive results that prevented us from supporting this choice further.

Predicted AimR toxicity: spurious binding sites in the E. coli genome

An important issue when trying to express a heterologous gene inside a new host is the risk of toxicity caused by overexpression of the protein. We did not find a computational service that would help us predict the risk of toxicity when overexpressing AimR in E. coli, but we did find a publication that determined the factors that are predictive of overexpression toxicity [16]. The authors found that transcription factors tend to be toxic when overexpressed and suggested that their binding to non-native DNA sites and saturation of the transcription machinery could cause toxicity. Hence, AimR is at risk of causing toxicity in E. coli. An alternative explanation that we propose is that spurious binding of the transcription factor to the genome can interfere with the expression of native genes. In order to get an idea of how many spurious AimR binding sites are available in the E. coli genome, we attempted to dock AimR to its cognate DNA, but without success. We did determine that the binding site of one AimR monomer is roughly 11 nucleotides long. Hence, we determined how many times one of these binding sites occurs in the genome to get an idea of the risk of AimR binding to the E. coli genome. To this end, we took the binding site predicted by docking spAimR to its experimentally determined sequence and searched for it in the E. coli genome. The results are found in Table 2.

**Table 2**: Estimated spurious binding of AimR and its potential effect on host survival. (Left to right) Coordinates of the spurious binding site, associated gene, binding location relative to this gene and essentiality of the associated gene for survival. The essentiality information was taken from EcoCyc [14].

Predicted binding region (Left)	Predicted binding region (Right)	Associated gene	Location (CDS/intergenic)	Essential?
Binding sequence: TTAGGTTTTAA
113468	113478	guaC	CDS	No
149902	149912	yadC	CDS	No
827585	827595	ybhF	CDS	No
4062739	4062749	yihN	CDS	No

These results show that sequences of the same size as the real AimR binding site may occur readily in the E. coli genome, including inside genes. This analysis can thus reveal whether important processes in the host cell may be perturbed by overexpression of AimR.
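A genome scan of this kind can be sketched as an exact string search that reports 1-based, inclusive coordinates like those in Table 2. The sketch below makes several assumptions that are not stated in the text: only exact matches are counted, the reverse strand is searched as well, and the helper names `reverse_complement` and `find_sites` are hypothetical. The short "genome" in the example is a toy string built around the predicted site, not E. coli sequence.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence (A, C, G, T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))


def find_sites(genome, motif):
    """Return sorted (left, right, strand) tuples for every exact match.

    Coordinates are 1-based, inclusive and refer to the forward strand.
    Overlapping matches are reported.
    """
    genome = genome.upper()
    motif = motif.upper()
    targets = [("+", motif)]
    rc = reverse_complement(motif)
    if rc != motif:  # avoid double-counting palindromic motifs
        targets.append(("-", rc))
    hits = []
    for strand, pattern in targets:
        start = genome.find(pattern)
        while start != -1:
            hits.append((start + 1, start + len(pattern), strand))
            start = genome.find(pattern, start + 1)
    return sorted(hits)


site = "TTAGGTTTTAA"  # predicted binding sequence from Table 2
# Toy genome (assumption): one forward and one reverse-strand copy of the site.
toy_genome = "CC" + site + "CCGG" + reverse_complement(site) + "GG"
for left, right, strand in find_sites(toy_genome, site):
    print(left, right, strand)
```

With a real genome sequence loaded as a single string, the reported coordinates can be looked up in a genome database to find the associated gene and its essentiality.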

In summary, despite not having the exact binding site of AimR, we estimated that a sequence of the size of an AimR binding site may readily occur in the E. coli genome and, more importantly, inside native genes. We have developed a protocol to qualitatively assess where the protein can bind and what processes it may perturb. This, combined with the observation by [16] that transcription factors tend to be toxic when overexpressed, led us to recommend that the experimental team use low-copy plasmids or genome integrations of AimR when using it in E. coli.

REFERENCES

[1] Erez Z, Steinberger-Levy I, Shamir M, Doron S, Stokar-Avihail A, Peleg Y, Melamed S, Leavitt A, Savidor A, Albeck S, Amitai G, Sorek R. Communication between viruses guides lysis-lysogeny decisions. Nature (2017) 541, 488-493.

[2] Dou C, Xiong J, Gu Y, Yin K, Wang J, Hu Y, Zhou D, Fu X, Qi S, Zhu X, Yao S, Xu H, Nie C, Liang Z, Yang S, Wei Y, Cheng W. Structural and functional insights into the regulation of the lysis-lysogeny decision in viral communities. Nat Microbiol (2018) 3, 1285–1294.

[3] Wang Q, Guan Z, Pei K, Wang J, Liu Z, Yin P, Peng D, Zou T. Structural basis of the arbitrium peptide-AimR communication system in the phage lysis-lysogeny decision. Nat Microbiol (2018) 3, 1266–1273.

[4] Waterhouse A, Bertoni M, Bienert S, Studer G, Tauriello G, Gumienny R, Heer FT, de Beer TAP, Rempfer C, Bordoli L, Lepore R, Schwede T. SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Res (2018) 46, W296-W303.

[5] Nielsen M, Lundegaard C, Lund O, Petersen TN. CPHmodels-3.0--remote homology modeling using structure-guided sequence profiles. Nucleic Acids Res (2010) 38(Web Server issue):W576-W581.

[6] Yang J, Yan R, Roy A, Xu D, Poisson J, Zhang Y. The I-TASSER Suite: protein structure and function prediction. Nat Methods (2015) 12, 7-8.

[7] Kelley LA, Mezulis S, Yates CM, Wass MN, Sternberg MJ. The Phyre2 web portal for protein modeling, prediction and analysis. Nat Protoc (2015) 10, 845-858.

[8] Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE. UCSF Chimera--a visualization system for exploratory research and analysis. J Comput Chem (2004) 25, 1605-1612.

[9] Roy A, Kucukural A, Zhang Y. I-TASSER: a unified platform for automated protein structure and function prediction. Nat Protoc (2010) 5, 725-738.

[10] Wang Y, Wang J, Li R, Shi Q, Xue Z, Zhang Y. ThreaDomEx: a unified platform for predicting continuous and discontinuous protein domains by multiple-threading and segment assembly. Nucleic Acids Res (2017) 45, W400-W407.

[11] Browning DF, Busby SJ. The regulation of bacterial transcription initiation. Nat Rev Microbiol (2004) 2, 57-65.

[12] Ciemny M, Kurcinski M, Kamel K, Kolinski A, Alam N, Schueler-Furman O, Kmiecik S. Protein-peptide docking: opportunities and challenges. Drug Discov Today (2018) 23, 1530-1537.

[13] van Dijk M, Bonvin AM. 3D-DART: a DNA structure modelling server. Nucleic Acids Res (2009) 37, W235-W239.

[14] Keseler IM, Mackie A, Santos-Zavaleta A, Billington R, Bonavides-Martínez C, Caspi R, Fulcher C, Gama-Castro S, Kothari A, Krummenacker M, Latendresse M, Muñiz-Rascado L, Ong Q, Paley S, Peralta-Gil M, Subhraveti P, Velázquez-Ramírez DA, Weaver D, Collado-Vides J, Paulsen I, Karp PD. The EcoCyc database: reflecting new knowledge about Escherichia coli K-12. Nucleic Acids Res (2017) 45, D543-D550.

[15] Caspi R, Altman T, Billington R, Dreher K, Foerster H, Fulcher CA, Holland TA, Keseler IM, Kothari A, Kubo A, Krummenacker M, Latendresse M, Mueller LA, Ong Q, Paley S, Subhraveti P, Weaver DS, Weerasinghe D, Zhang P, Karp PD. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of Pathway/Genome Databases. Nucleic Acids Res (2014) 42, D459-D471.

[16] Singh GP, Dash D. Electrostatic mis-interactions cause overexpression toxicity of proteins in E. coli. PLoS One (2013) 8, e64893.


Server	Prediction	Evaluation
SWISS-MODEL: fully automated protein structure homology-modeling server
CPHmodels: a protein homology-modeling server
I-TASSER: provides a hierarchical approach to protein structure and function prediction
Phyre2: a protein homology recognition server

Team:Evry Paris-Saclay/Model