2007-01-16

Tagging Biology

(Disclaimer: I don't remember if I've seen what I'm writing about elsewhere, although there's a strong chance I have, so I apologize if others have said the same thing before, and I'm failing to acknowledge that.)

A few months back I wrote about Clay Shirky's article "Ontology is Overrated". His main point is that sometimes it's just not appropriate to try to organize information in a hierarchical category structure (i.e. an ontology).

Perhaps biology is one of those cases. Species classification is one of the most famous ontologies in existence. We all learn about it in school. At the top you have your Kingdoms, then Phyla, Classes, Orders, Families, Genera, and finally Species.

To the best of my knowledge, biology today has moved away from thinking about this sort of thing, and has focused far more on goings-on at the cellular level. However, organizing and keeping track of the different species is still critically important, since our understanding of genetics is inextricably tied to the study of the end product (the phenotype - i.e. the animal/plant/whatever).

The problem with this classification of species is everything Shirky mentions in his article: the classification system is artificial, and sometimes species don't fit nicely into the elaborate tree humanity has constructed. He gives the example of deciding whether "Books" belong under "Art" or "Entertainment" - an artificial question. Books are books - they don't intrinsically fit under either category. One book may be art, and another may be entertainment, and another may be a bit of both, and yet another may be neither (a textbook, for example). I don't have any specific biological examples, unfortunately, but it is certainly reasonable to expect that once you get down to the nitty-gritty details, and are classifying based on subtleties in bone structure, you're going to run into species that belong in more than one spot in the classification tree, or in none.

In my opinion, a tagging approach would be a much more effective way to organize the different species. The same properties that are used to determine where a species "fits" in the classification tree would be used as tags. For example, some tags might be "warm-blooded", "invertebrate", "eukaryotic", and "egg-laying". A scientist analyzing a recently discovered species would simply list all the various attributes she notices and associate them with the species. A species database might then list all the species with similar tags, and knowledge of those species could be applied to the new one. This sort of system would highlight, immediately, what species have in common with each other, and what they don't. The task of figuring out where the new species belongs in some convoluted system doesn't appear here. The scientist merely documents attributes in a systematic fashion, which is something she's doing anyway.
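To make the lookup concrete, here is a minimal sketch in Python of what such a species database might do. Everything in it is an assumption for illustration: the trait data is made up, and the names `SPECIES_TAGS`, `build_index`, and `species_sharing_tags` are hypothetical, not taken from any real system.

```python
from collections import Counter, defaultdict

# Hypothetical trait data, invented for illustration only.
SPECIES_TAGS = {
    "housecat": {"warm-blooded", "vertebrate", "eukaryotic", "live-bearing"},
    "tiger": {"warm-blooded", "vertebrate", "eukaryotic", "live-bearing"},
    "chicken": {"warm-blooded", "vertebrate", "eukaryotic", "egg-laying"},
    "octopus": {"invertebrate", "eukaryotic", "egg-laying"},
}


def build_index(species_tags):
    """Map each tag to the set of species that carry it."""
    index = defaultdict(set)
    for species, tags in species_tags.items():
        for tag in tags:
            index[tag].add(species)
    return index


def species_sharing_tags(new_tags, index):
    """Return (species, shared tag count) pairs, most shared tags first."""
    shared = Counter()
    for tag in new_tags:
        for species in index.get(tag, ()):
            shared[species] += 1
    # Sort by count (descending), then by name so ties come out in a stable order.
    return sorted(shared.items(), key=lambda item: (-item[1], item[0]))


index = build_index(SPECIES_TAGS)
# Tags a scientist might record for a newly observed animal.
matches = species_sharing_tags({"warm-blooded", "vertebrate", "egg-laying"}, index)
for species, count in matches:
    print(species, count)
```

The scientist only supplies the tags; the ranking falls out of the index, with no decision about where the animal sits in a tree.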

Of course, all this is going on anyway - in the scientist's head. The scientist obviously knows to compare a newly discovered bacterium with other bacteria (as opposed to a reptile). And she also knows to compare the new bacterium with bacteria that share many attributes with the new one, and not to compare it with ones that have less in common. But, under a classification system, the scientist is forced to figure out where in the species tree the new bacterium belongs (which may be a cause for some debate), and compare the new bacterium with its determined "relatives". Why bother with that first step? Just enter what you know about the bacterium as a list of tags, and the software will spit back matches. The most tedious and pointless step in the process has been removed, and real work can proceed.

You may be concerned that we are losing something by disposing of the highly familiar species tree. We're not*. The various relationships between the species will be preserved in the tags (one way to look at the tree now is to say that the more tags two species have in common, the closer together they are on the tree). And that's really all you need. The tree doesn't add anything else except constraints and constructs like "relatives". With today's technology, the tree is an inferior way of storing information about the species.
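One way to put a number on "the more tags in common, the closer together" is the Jaccard index: shared tags divided by all tags either species has. The sketch below is only one possible measure, chosen here for illustration. The tag sets are made up, `jaccard_similarity` is a hypothetical helper name, and the measure weighs every tag equally, which is a simplification.

```python
def jaccard_similarity(tags_a, tags_b):
    """Shared tags divided by total distinct tags: 1.0 is identical, 0.0 is disjoint."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        # Assumption: two species with no tags at all are treated as unrelated.
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical tag sets, invented for illustration only.
cat = {"warm-blooded", "vertebrate", "live-bearing", "purrs"}
tiger = {"warm-blooded", "vertebrate", "live-bearing", "roars"}
lizard = {"cold-blooded", "vertebrate", "egg-laying"}

print(jaccard_similarity(cat, tiger))   # 3 shared tags out of 5 distinct
print(jaccard_similarity(cat, lizard))  # 1 shared tag out of 6 distinct
```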

Last but not least, there is the issue of tag management. The number of nuances and subtleties that exist as differences between species is staggering. Listing all the differences between a housecat and a tiger is quite a job, and those species are very close, relatively. The complete list of tags in the system might become unwieldy (one downside to tagging is its sensitivity to error - "warm-blooded" and "warm blooded" may be seen as different). So that would be one obstacle. But with "auto-complete" and similar technologies, it shouldn't be a big one.
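As a rough sketch of how the spelling problem could be kept in check, tags can be normalized before they are stored, and existing tags can be suggested as the scientist types. The normalization rule here (lowercase, with spaces, underscores, and hyphens collapsed to a single hyphen) is an assumption, and `normalize_tag` and `suggest_tags` are hypothetical names.

```python
import re


def normalize_tag(raw):
    """Lowercase, trim, and collapse runs of spaces, underscores, or hyphens into one hyphen."""
    cleaned = raw.strip().lower()
    return re.sub(r"[\s_-]+", "-", cleaned)


def suggest_tags(prefix, known_tags):
    """Return the known tags that start with the normalized prefix (a naive auto-complete)."""
    wanted = normalize_tag(prefix)
    return sorted(tag for tag in known_tags if tag.startswith(wanted))


# Hypothetical tag list, already stored in normalized form.
known = {"warm-blooded", "warm-water", "egg-laying"}

print(normalize_tag("Warm blooded") == normalize_tag("warm-blooded"))
print(suggest_tags("Warm", known))
```

With this rule "warm-blooded" and "warm blooded" end up as the same tag, though it does nothing about true synonyms or misspellings.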

--YY

* Well, to be fair, terms like "mammals" and "reptiles" come from names of branches of the tree. We can simply replace each definition with a set of tags (e.g. "An animal is a reptile if it has all of the following tags: 'lays eggs', 'cold-blooded', etc.").
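A definition of this kind is just a subset test: a species belongs to a named group if its tags include every tag the group requires. The sketch below assumes deliberately simplified, made-up definitions (they are not real biological criteria), and `GROUP_DEFINITIONS` and `groups_for` are hypothetical names.

```python
# Hypothetical, oversimplified group definitions, for illustration only.
GROUP_DEFINITIONS = {
    "reptile": {"lays eggs", "cold-blooded", "scaly skin"},
    "bird": {"lays eggs", "warm-blooded", "feathers"},
}


def groups_for(species_tags, definitions=GROUP_DEFINITIONS):
    """Return every group whose required tags are all present on the species."""
    tags = set(species_tags)
    return sorted(name for name, required in definitions.items() if required <= tags)


print(groups_for({"lays eggs", "cold-blooded", "scaly skin", "four legs"}))
print(groups_for({"lays eggs", "warm-blooded"}))
```

Nothing stops a species from matching several groups or none, which is the flexibility the tree does not offer.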

p.s. If there are any biology people reading this, I'm curious if I'm being completely off the mark or blindingly obvious and obtuse.

3 comments:

yitz said...

i always thought the biological classification tree was mapping out the evolution of species and their deviations over time..

(of course it didn't begin that way, but since darwin it should have been rearranged that way)

if its anything other than that, it's silly. imho.

keep in mind though, that the classification also helps in referring to it, from the proper biological name alone, you know a lot about the organism in question.

just some thoughts

Anonymous said...

yitz is right; biological taxonomy is meant to reflect evolutionary relationships, which are overwhelmingly hierarchical.

Tags too are artificial labels; [warm-blooded] describes a collection of traits (homeothermy, endothermy, and tachymetabolism) that are correlated but not obligatorily linked. We say bats and birds and bees have [wings], but bats' big webbed hands are nothing like insects' exoskeletal projections. Tags aren't readily useful for variables, like protein sequences or coloration. Counting tags assumes that all similarity is equally significant. In short, they don't easily describe a hierarchy.

The exact boundaries of taxa can be fuzzy, and sometimes we get it wrong, but all models lie. In this case an ontology really is what we want.

YodaYid said...

Yitz and RLB - thank you for your informative and clarifying comments. An evolutionary tree does make a ton of sense, of course. And the evolutionary relationships would be lost in a tagged structure (which only stores similarities, not relationships). So I was definitely wrong about that.

Does the ontology serve any other purpose, though? Do zoologists studying different kinds of beetles refer to the tree at all? That's the sort of situation I had in mind. A catalog of the species that would serve as a reference to biologists/zoologists.

In writing the original post, I envisioned a web page or something which lists the traits for each species, and gives the "10 most similar species", based on the trait list. This list would correspond to the closest relatives in the tree (although it would be interesting to see if there were discrepancies). I don't know if that would be of any use to biologists, but I think this method would be faster and more precise than an ontology in uncovering relationships.

Just one note to real live: the tags I came up with in my examples would not be the ones that would actually be used - they would be the terms that most accurately describe the traits. Essentially each tag would correspond to a single trait. In the example you gave, [homeothermy], [endothermy], and [tachymetabolism] would be the tags, not [warm-blooded], which I've now learned is not even a trait ;-) Same thing with [winged] - it's a layperson's trait - biologists would pick tags that reflect the state of the art. Which means that as our understanding of traits is refined, the tags would be refined as well, and the different relationships between species would automatically become more subtle and accurate along with the tags themselves.

Last but not least - Yitz - yes, the classification name encodes information about the species, but how? By referring to other species and triggering a recall in our minds. For example, if you discover some exotic species of feline, then, by virtue of classifying it as a feline, you're evoking all sorts of traits that people are already familiar with. A database of traits is simply a more precise (albeit longhand) way of doing the same thing.