What is a protein?

I was recently involved in a discussion between two scientists who disagreed on the relative proportions of symmetrical heteromeric proteins versus symmetrical homomeric ones. Both sides used the PDB as a source of information about known protein structures. After a little while, it became clear that the disagreement rested less on scientific grounds than on semantics. In particular, they used the word “heteromer” to describe two very different types of proteins. One of them used heteromer for a protein made of different, although homologous, subunits. In that case, the different monomers of the complex display the same structure. Examples would be hemoglobin, ligand-gated ion channels, etc. The other scientist called those proteins homomers, considering the differences in sequence of little relevance to the formation of complexes (which made sense in the context of the conversation). She reserved heteromer for a protein complex made of polypeptides adopting different conformations. Examples would be ATP synthase, aspartate transcarbamoylase, etc.

This disagreement is one facet of another, larger debate, namely: what is a protein? This debate, like the similar one over “what is a species”, might look rather artificial and good only for endless, unproductive arguments. However, as the disagreement described above shows, fuzzy definitions of that sort actually affect the results of research in the genomic era.

The Wikipedia page on proteins starts by telling us that “proteins are large biological molecules consisting of one or more chains of amino acids”. This is about the only definition that is not controversial. Anything more detailed runs into counter-examples and contradictions. However, although the statement above is true, not all “large biological molecules consisting of one or more chains of amino acids” are commonly considered proteins. In this post, I would like to discuss problematic examples of “proteins” and test various definitions.

Is a protein a set of covalently bound amino acids? Such a definition would cover all polypeptides, which are sets of amino acids linked by peptide bonds, and would also cover polypeptides linked by other types of covalent bonds, such as insulin, which is made of two polypeptides linked by disulphide bonds (the two polypeptides originate from the cleavage of the insulin gene product). However, this definition is too narrow to encompass many protein complexes currently considered bona fide proteins. Examples would be hemoglobin, already mentioned above, formed of four protein subunits associated through non-covalent bonding, or nicotinic receptors, formed of five protein subunits. Those complexes are considered protein quaternary structures. (Interestingly, nicotinic receptors of the Torpedo electric organ are sometimes dimerised through disulphide bonds. However, those dimers are not considered bona fide proteins.)

Structure of hemoglobin. By Richard Wheeler 2007.

The examples above are made of subunits having the same overall 3D (or tertiary) structure. In these cases, the shared structure is thought to come from a common ancestry of the genes coding for the different subunits. What about defining a protein as a structure formed by a set (whose cardinality could be 1) of polypeptides encoded by homologous genes? This expanded definition would cover nicotinic receptors, but not the complex between nicotinic receptors and rapsyn, which associate non-covalently but are considered different proteins. Unfortunately, this new definition cannot cover proteins such as aspartate transcarbamoylase, which is formed of catalytic and regulatory subunits encoded by non-homologous genes. Conversely, microtubules are polymers of tubulin. While tubulin and tubulin dimers are commonly called proteins, microtubules generally are not.

Structure of Aspartate transcarbamoylase. Adapted from Kantrowitz 2012

But hold on … All the examples of proteins given above seem to exist only in the polymeric form, while the counter-examples are labile. Could stability be the criterion then? A protein would be a stable structure formed by a set (whose cardinality could be 1) of polypeptides. Sadly not. Many protein complexes generally accepted as valid proteins are labile. For instance, the regulatory and catalytic subunits of PKA are not always bound to each other. Their separation is the key to the activation of the enzyme by cAMP.
Conversely, the subunits of the proteasome are generally, if not always, bound together. However, the proteasome is not generally considered one protein. Other examples are the polypeptides forming the capsids of viruses.

Are we therefore doomed to face an arbitrary definition, changing based on the protein and the preferences of the scientists involved in the relevant research? What would be the alternatives?

In a genomic era, dominated by nucleic acid sequencing and a gene-centric view of biology, a practical approach may be the “UniProt” one, that is, defining a protein as a gene product. If several polypeptides of different sequences are encoded by a gene, they are isoforms. (I will not discuss what a gene is here. The topic is too complex and controversial, and I am not knowledgeable enough. However, it is of course very relevant. For instance, UniProt offers different entries for the polypeptides encoded by a single polycistronic messenger RNA, because they are encoded by different “genes”, although producing a single messenger.) Effectively, a protein would become a polypeptide encoded in the genome. Several gene products would mean several proteins. Hemoglobin would become a protein complex made of two proteins (alpha and beta globin). Although operationally appealing, and very useful for automated processing, this definition would sometimes be at odds with usage. For instance, proopiomelanocortin is encoded by one gene, but is cleaved to produce 11 different polypeptides. Those polypeptides have different physiological roles and are considered different proteins. Similarly, it sounds odd to consider a nicotinic receptor as made up of several proteins.
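The gene-product definition is simple enough to write down as a counting rule. The sketch below is only an illustration: `Polypeptide` and `count_proteins` are hypothetical names, not part of UniProt or of any library, and the gene assignments are simplified (a single alpha-globin gene, and only three of the proopiomelanocortin cleavage products).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Polypeptide:
    """A polypeptide chain and the gene that encodes it (illustrative only)."""
    name: str
    gene: str


def count_proteins(chains: Iterable[Polypeptide]) -> int:
    """Gene-product definition: one gene counts as one protein."""
    return len({chain.gene for chain in chains})


# Simplified assumption: one alpha-globin gene and one beta-globin gene.
hemoglobin = [
    Polypeptide("alpha globin", "HBA1"),
    Polypeptide("beta globin", "HBB"),
]

# Three of the peptides cleaved from the single POMC gene product.
pomc_products = [
    Polypeptide("ACTH", "POMC"),
    Polypeptide("beta-endorphin", "POMC"),
    Polypeptide("alpha-MSH", "POMC"),
]

print(count_proteins(hemoglobin))     # 2
print(count_proteins(pomc_products))  # 1
```

Under this rule, hemoglobin counts as two proteins while the proopiomelanocortin peptides collapse into one, which is exactly where the definition parts ways with common usage.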


Structure of proopiomelanocortin.

Another approach would be to admit that the definition of a protein is context-dependent. What a protein actually is evolves in time, and depends on the functional context. We know what PKA is, because we name it. And what we name PKA is the catalytic subunit, sometimes in complexes also containing regulatory subunits, sometimes on its own. This position is very appealing to me as a conceptualist. If my arm is cut off, people will still consider me the same individual. And they will do the same for my child self and my elderly self, despite the absence of any atom in common. However, in the present case this is a non-definition. It does not resolve any of the inconsistent cases described above. And it does not help when it comes to systematically characterising proteins.

If one wants to keep most of the existing nomenclature, a third possible definition may be useful. A complex of polypeptides would be a protein if it does not survive the deletion of one of the polypeptides without replacement. If one knocks out the gene of alpha-globin, the result is not the formation of dimers of beta-globin. Therefore hemoglobin is a protein. If one knocks out the gene encoding a subunit of the nicotinic receptor, the result is not the formation of nicotinic receptors with gaps where this subunit would lie (it will most likely be replaced by another subunit, or the pentamers will not be produced at all). Therefore the nicotinic receptor is a protein. If one knocks out the gene encoding rapsyn, the pentameric nicotinic receptor will still form. Therefore nAChR+rapsyn is not a protein, but a protein complex made up of two proteins. Interestingly, with that definition the microtubule becomes a protein, as does the proteasome. PKA is not a protein anymore though. The catalytic subunit is, as is the regulatory subunit, and 2C2R is a protein complex made up of two proteins.
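The knockout criterion above can be sketched as a small decision rule. This is a minimal illustration under stated assumptions: the function names are hypothetical, and the table simply encodes the outcomes asserted in this post (it is not experimental data). The PKA entry reflects my reading that the catalytic subunit persists on its own when the regulatory subunit is absent.

```python
from typing import Dict

# Outcomes as asserted in the text, not measured data.
# For each assembly: {deleted polypeptide: does an assembly of the
# remaining polypeptides still form, without any replacement?}
DELETION_OUTCOMES: Dict[str, Dict[str, bool]] = {
    "hemoglobin": {"alpha-globin": False},
    "nicotinic receptor": {"any one subunit": False},
    "nAChR+rapsyn": {"rapsyn": True},
    "PKA (2C2R)": {"regulatory subunit": True},
}


def is_protein(outcomes: Dict[str, bool]) -> bool:
    """A complex is a protein if it survives no single deletion."""
    return not any(outcomes.values())


def classify(name: str) -> str:
    """Label an assembly from the table as a protein or a protein complex."""
    if is_protein(DELETION_OUTCOMES[name]):
        return "protein"
    return "protein complex"


for name in DELETION_OUTCOMES:
    print(f"{name}: {classify(name)}")
```

The rule itself is trivial; all the difficulty sits in filling the table, since each entry stands for a knockout experiment and its interpretation.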

What do you think? What is your definition of a protein?