What is a protein?

I was recently involved in a discussion between two scientists who disagreed on the relative proportions of symmetrical heteromeric proteins versus symmetrical homomeric ones. Both sides used the PDB as a source of information about known protein structures. After a little while, it became clear that the disagreement was based less on scientific grounds than on semantics. In particular, they used the word “heteromer” to describe two very different types of proteins. One of them called a heteromer any protein made of different, although homologous, subunits. Thus, the different monomers of the complex displayed the same structure. Examples would be hemoglobin, ligand-gated ion channels, etc. The other scientist called those proteins homomers, considering the differences in sequence of little relevance for the formation of complexes (which made sense in the context of the conversation). She called a heteromer a protein complex made of polypeptides adopting different conformations. Examples would be ATP synthase, aspartate transcarbamoylase, etc.

This disagreement is one facet of another, larger debate: what is a protein? This debate, like the similar one on “what is a species”, might look rather artificial and good only for endless, unproductive arguments. However, as the disagreement described above shows, fuzzy definitions of that sort actually affect the results of research in the genomic era.

The Wikipedia page on proteins starts by telling us that “proteins are large biological molecules consisting of one or more chains of amino acids”. This is about the only definition that is not controversial. Anything more detailed runs into counter-examples and contradictions. However, although the statement above is true, not all “large biological molecules consisting of one or more chains of amino acids” are commonly considered proteins. In this post, I would like to discuss problematic examples of “proteins” and test various definitions.

Is a protein a set of covalently bound amino acids? Such a definition would cover all polypeptides, which are sets of amino acids linked by peptide bonds, and would also cover polypeptides linked by other types of covalent bonds, such as insulin, which is made of two polypeptides linked by disulphide bonds (the two polypeptides originate from the cleavage of the insulin gene product). However, this definition is too narrow to encompass many protein complexes currently considered bona fide proteins. Examples would be hemoglobin, already mentioned above, formed of four protein subunits associated through non-covalent bonding, or nicotinic receptors, formed of five protein subunits. Those complexes are considered protein quaternary structures. (Interestingly, nicotinic receptors of the Torpedo electric organ are sometimes dimerised through disulphide bonds. However, those dimers are not considered bona fide proteins.)

Structure of hemoglobin. By Richard Wheeler 2007.

The examples above are made of subunits having the same overall 3D (or tertiary) structure. In these cases, this is thought to come from a common ancestry of the genes coding for the different subunits. What about defining a protein as a structure formed by a set (whose cardinality could be 1) of polypeptides encoded by homologous genes? This expanded definition would cover nicotinic receptors, but not the complex between nicotinic receptors and rapsyn, which associate non-covalently but are considered different proteins. Unfortunately, this new definition cannot cover proteins such as aspartate transcarbamoylase, which is formed of catalytic and regulatory subunits encoded by non-homologous genes. Conversely, microtubules are polymers of tubulin. While tubulin and tubulin dimers are commonly called proteins, microtubules are generally not.

Structure of Aspartate transcarbamoylase. Adapted from Kantrowitz 2012

But hold on … All the examples of proteins given above seem to exist only in the polymeric form, while the counter-examples are labile. Could stability be the criterion then? A protein would be a stable structure formed by a set (whose cardinality could be 1) of polypeptides. Sadly not. Many protein complexes generally accepted as valid proteins are labile. For instance, the regulatory and catalytic subunits of PKA are not always bound to each other. Their separation is the key to the activation of the enzyme by cAMP.
Conversely, the subunits of the proteasome are generally, if not always, bound together. However, the proteasome is not generally considered one protein. Other examples are the polypeptides forming the capsids of viruses.

Are we therefore doomed to face an arbitrary definition, changing based on the protein and the preferences of the scientists involved in the relevant research? What would be the alternatives?

In a genomic era, dominated by nucleic acid sequencing and a gene-centric view of biology, a practical approach may be the “UniProt” one, that is, defining a protein as a gene product. If several polypeptides of different sequences are encoded by a gene, they are isoforms. (I will not discuss what a gene is here. The topic is too complex and controversial, and I am not knowledgeable enough. However, it is of course very relevant. For instance, UniProt offers different entries for the polypeptides encoded by a single polycistronic messenger RNA, because they are encoded by different “genes”, although producing a single messenger.) Effectively, a protein would become a polypeptide encoded in the genome. Several gene products would mean several proteins. Hemoglobin would become a protein complex made of two proteins (alpha and beta globin). Although operationally appealing, and very useful for automated processing, this definition would sometimes be at odds with usage. For instance, proopiomelanocortin is encoded by one gene, but is cleaved to produce 11 different polypeptides. Those polypeptides have different physiological roles and are considered different proteins. Likewise, it sounds odd to consider a nicotinic receptor as made up of several proteins.


Structure of proopiomelanocortin.

Another approach would be to admit that the definition of a protein is context dependent. What a protein actually is evolves in time, and depends on the functional context. We know what PKA is, because we name it. And what we name PKA is the catalytic subunit, sometimes in complexes also containing regulatory subunits, sometimes on its own. This position is very appealing to me as a conceptualist. If my arm is cut off, people will still consider me the same individual. They will do the same for my child self and my elderly self, despite the absence of any atom in common. However, in the present case this is a non-definition. It does not solve any of the incoherent cases described above. And it does not help when it comes to systematically characterising proteins.

If one wants to keep most of the existing nomenclature, a third possible definition may be useful. A complex of polypeptides would be a protein if it does not survive the deletion of one of the polypeptides without replacement. If one knocks out the gene of alpha-globin, the result is not the formation of dimers of beta-globin. Therefore hemoglobin is a protein. If one knocks out the gene encoding a subunit of the nicotinic receptor, the result is not the formation of nicotinic receptors with gaps where this subunit would lie (it will most likely be replaced by another subunit, or the pentamers will not be produced at all). Therefore the nicotinic receptor is a protein. If one knocks out the gene encoding rapsyn, the pentameric nicotinic receptor will still form. Therefore nAChR+rapsyn is not a protein, but a protein complex made up of two proteins. Interestingly, with that definition the microtubule becomes a protein, as does the proteasome. PKA is not a protein anymore though. The catalytic subunit is, as is the regulatory subunit, and 2C2R is a protein complex made up of two proteins.
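The knockout criterion can be written down as a small decision rule. What follows is only a sketch: the names `KnockoutOutcome`, `is_protein` and `EXAMPLES` are made up for illustration, and the outcomes are transcribed from the examples in this post rather than taken from any database or experiment. The PKA entry is an inference from the fact that the catalytic subunit exists on its own.

```python
from enum import Enum


class KnockoutOutcome(Enum):
    """What happens to a complex when one subunit gene is knocked out."""
    REMAINDER_ASSEMBLES = "the rest of the complex still forms, minus the subunit"
    SUBUNIT_REPLACED = "another subunit takes the vacant position"
    NO_ASSEMBLY = "the complex is not produced at all"


def is_protein(knockout_outcomes):
    """Apply the knockout criterion to one complex.

    knockout_outcomes maps a subunit name to the KnockoutOutcome observed
    when that subunit alone is deleted. The complex counts as a protein
    only if no single deletion leaves the remainder standing.
    """
    if not knockout_outcomes:
        raise ValueError("at least one knockout outcome is required")
    return all(
        outcome is not KnockoutOutcome.REMAINDER_ASSEMBLES
        for outcome in knockout_outcomes.values()
    )


# Outcomes transcribed from the examples in this post, not from a database.
# Only the knockouts discussed in the text are listed.
EXAMPLES = {
    "hemoglobin": {"alpha-globin": KnockoutOutcome.NO_ASSEMBLY},
    "nicotinic receptor": {"one subunit": KnockoutOutcome.SUBUNIT_REPLACED},
    "nAChR+rapsyn": {"rapsyn": KnockoutOutcome.REMAINDER_ASSEMBLES},
    # Assumption: inferred from the catalytic subunit existing on its own.
    "PKA (2C2R)": {"regulatory subunit": KnockoutOutcome.REMAINDER_ASSEMBLES},
}

for name, outcomes in EXAMPLES.items():
    verdict = "protein" if is_protein(outcomes) else "protein complex"
    print(f"{name}: {verdict}")
```

Writing the rule this way makes its main weakness visible: the verdict depends entirely on which knockouts have been observed and on how one chooses to label their outcome.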

What do you think? What is your definition(s) of a protein?


One thought on “What is a protein?”

  1. The initial hetero/homodimer issue isn’t, to me, really about the definition of ‘a protein’, but about the criteria for when we might consider two proteins with different primary sequence to be ‘the same’ protein. On one level, it’s quite obvious that two proteins with distinct primary sequence are not ‘the same’ protein, but they are both proteins. However, if one chooses to use ‘protein’ to imply a specific functional capacity, and not a specific chemical construction, we’re going beyond “What is a protein?” and into “What do we use the word ‘protein’ to mean?”. My opinion in this case is that it would always be strictly correct to consider a multimer where components differed in primary sequence to be a heteromer. However, where there is no evidence of a functional (or, more specifically, structural) distinction between components, I would accept its description as a homomer (say, if introducing an ‘orthologue’ from a closely-related organism). That would be especially the case if in, say, a complex of four such proteins the orientation, affinity constants, downstream functions, etc. of the multimer were unaffected by the primary sequence differences that are observed. But while it’s a grey area for me, I habitually use ‘homomer’ only to mean a protein interacting with another copy of itself (at the amino acid level; I don’t always consider chemical/post-translational state).

    That does also raise the question about whether a protein that can adopt two distinct folds is the same protein in each case. Of course it *is* at one level – the primary sequence – but if the secondary or tertiary structure differ then it’s potentially as different, in practice and chemistry, as allotropes of carbon or sulphur. Carbon is carbon whether it’s graphite or diamond, but graphite is not diamond. The same question can apply if we consider protonation states, or post-translational modifications that affect its ‘global’ interactions.

    This, and much of the discussion, slides from the idea of defining protein as “a set of covalently bound amino-acids”, which is nice, generic, and mostly clear (so long as we’re specific about the type of covalent bond ;)), to “how do we define ‘a protein’ as a single functional unit?”. It is, essentially, a linguistic issue, and my current position is that we’re hampered by the shifting and flexible use of (any) language, and our incomplete – and sometimes wildly differing – ideas and understanding of what we mean when we talk about a ‘functional unit’ in biology.

    FWIW, I like the conceptualist viewpoint, and attempt myself to see ‘enzymatic action X’ as a function that a gene product happens to perform (placing it in a specific biochemical context potentially at a specific point in time), rather than as an absolute label of “what it is”. But we all use shorthand (“that’s a secreted protein; that one’s a membrane protein” etc. – I’m often very sloppy about it), and this approach suffers in exactly the way you name: it doesn’t describe *things*. We face the same problem when defining protein motifs: a motif may confer a function Y, or be essential for function Z (all under specific conditions), but is it a motif *for* Y or Z?

    I don’t have any pat answers, I’m afraid. I do like the question, though 😉
