[UPDATE 28th March 2010: I have now added a books submodule to the python abouttag module, and updated the examples below to reflect the syntax for that.
Simple usage is:
from abouttag.books import book
print book(u'Fugitive Pieces', u'Anne Michaels')
book:fugitive pieces (anne michaels)
Add extra authors as extra arguments (see examples below).]
I want to tag my favourite book. I want to proclaim my enduring love of Fugitive Pieces, by Anne Michaels, so that no man, woman or child can ever have any doubt that this, for me, is the finest novel ever written. I want to give this book a rating of a perfect 10.
njr/rating = 10
So where do I put it?
I previously suggested (fool that I am) that a good place to put it might be on an object whose about tag is “isbn:0 7475 3282 6”, for that is the ISBN number of the well-thumbed copy in front of me, and what better way of identifying a book could there possibly be than an International Standard Book Number?
Well, several, it transpires. The problem is actually apparent as soon as you look for the ISBN number. My copy actually says:
ISBN 0 7475 2939 6 (hardback)
ISBN 0 7475 3282 6 (paperback)
How fantastic is that? Not one unique International Standard Book Number, but two.
It gets worse.
If I go to http://amazon.co.uk I find
ISBN-10: 0747534969
ISBN-13: 978-0747534969
on one edition,
ISBN-10: 0747529396
ISBN-13: 978-0747529392
ISBN-10: 0747599254
ISBN-13: 978-0747599258
on another, and
ISBN-10: 0747590095
ISBN-13: 978-0747590095
on another. If I go to http://amazon.com, I find
ISBN-10: 0679776591
ISBN-13: 978-0679776598
What about http://amazon.ca? After all, Anne Michaels is Canadian.
ISBN-10: 0771058829
ISBN-13: 978-0771058820
This illustrates a number of interesting points.
ISBN numbers are (it transpires) at the wrong level of the book hierarchy. There has been some excellent work under the moniker of FRBR (Functional Requirements for Bibliographic Records). Among other things, this distinguishes between four levels in the ‘book’ hierarchy:
- The work — In this case, Fugitive Pieces by Anne Michaels. (This is the level we probably want in most cases for FluidDB.)
- An expression — Something like a rendering or translation, or concrete form of a work. In the case of a book, typically a text, in a particular language.
- A manifestation — an edition, by a publisher; it transpires that ISBN numbers identify manifestations (editions), not works or expressions.
- An item — the 294-odd pages between soft covers sitting on the desk in front of me is an item — a physical book; an instance of a manifestation, in geek-speak.
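The four levels can be pictured as nested containers. The following is only a rough Python 3 sketch of that idea: the class and field names are invented for illustration and come neither from FRBR itself nor from FluidDB. It shows the two ISBNs printed in my copy hanging off a single work.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """A single physical copy."""
    description: str


@dataclass
class Manifestation:
    """An edition by a publisher; this is the level an ISBN identifies."""
    isbn: str
    binding: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    """A concrete form of the work, e.g. the text in a particular language."""
    language: str
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    """The abstract work; the level we probably want to tag."""
    title: str
    author: str
    expressions: List[Expression] = field(default_factory=list)

    def isbns(self):
        """Collect every ISBN found anywhere beneath this work."""
        return [m.isbn for e in self.expressions for m in e.manifestations]


my_copy = Item('the well-thumbed paperback on my desk')
hardback = Manifestation('0 7475 2939 6', 'hardback')
paperback = Manifestation('0 7475 3282 6', 'paperback', [my_copy])
english_text = Expression('English', [hardback, paperback])
fugitive_pieces = Work('Fugitive Pieces', 'Anne Michaels', [english_text])

# One work, but more than one ISBN beneath it.
print(fugitive_pieces.isbns())
```

Tagging at the manifestation level would scatter ratings across the entries returned by `isbns()`; tagging at the work level keeps them in one place.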
Even if we were interested in tagging particular editions (manifestations) of a book, there would still be some normalization questions.
Inside my book (‘item’), the ISBN number for the paperback edition (which mine is) is listed as
ISBN 0 7475 3282 6
On the back cover, above the bar code, it is listed as
On amazon.co.uk, this edition would presumably be listed as:
ISBN-10: 0747532826
ISBN-13: 978-0747532828
if it were there at all.
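The 10- and 13-digit forms are mechanically related: the 13-digit form drops the old check digit, prepends 978, and appends a check digit recomputed with alternating weights of 1 and 3. A minimal Python 3 sketch follows; the function name `isbn10_to_isbn13` is invented for illustration and is not part of the abouttag module.

```python
def isbn10_to_isbn13(isbn10):
    """Convert a bare ISBN-10 (spaces and hyphens allowed) to 13 digits."""
    # Spaces and hyphens are only formatting; keep digits and a possible X.
    chars = [c for c in isbn10.upper() if c in '0123456789X']
    if len(chars) != 10:
        raise ValueError('expected 10 ISBN characters, got %d' % len(chars))
    # Drop the old check character and prepend the 978 prefix.
    body = '978' + ''.join(chars[:9])
    if 'X' in body:
        raise ValueError('X is only allowed as the final check character')
    # ISBN-13 check digit: weights alternate 1, 3, 1, 3, ... over 12 digits.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(body))
    return body + str((10 - total % 10) % 10)


# The first amazon.co.uk pair quoted earlier:
print(isbn10_to_isbn13('0747534969'))  # 9780747534969
```

Note that the final digit is recalculated, so it only sometimes coincides with the last digit of the 10-digit form.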
So even here, if people want to tag the same object, some normalization is required. It wouldn’t really matter which form we adopted, as long as we chose one. For the sake of argument,
ISBN 0-7475-3282-6
So if we are constructing a convention for about tags for some category of items, the following might be desirable:
- The tag should sit at the right level of relevant hierarchies; ideally, there should be a one-to-one correspondence between the different items in the category and about tags.
- Trivial formatting differences should be removed by normalization: for example, in the case of ISBN numbers, adopting a convention such as ISBN (in capitals) followed by SPACE followed by SINGLE DIGIT followed by HYPHEN followed by FOUR DIGITS followed by HYPHEN followed by FOUR DIGITS followed by HYPHEN followed by SINGLE DIGIT.
- It should be easy to determine the relevant about tag for a given object. This matters equally for tagging (information creation) and finding (information retrieval): in both cases, if we know what it is we’re talking about, it should be easy to figure out which FluidDB GET will retrieve it, or which PUT will tag it.
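As an illustration of the second point, here is a minimal Python 3 sketch that reduces the various ISBN-10 spellings seen above to the single convention just described. The function name `normalize_isbn10` is hypothetical, and the fixed 1-4-4-1 grouping simply follows that convention rather than any official hyphenation rule.

```python
import re


def normalize_isbn10(text):
    """Return text in the form 'ISBN d-dddd-dddd-d', or raise ValueError."""
    # Strip an optional leading "ISBN" or "ISBN-10:" label.
    body = re.sub(r'(?i)^\s*isbn(?:-10)?\s*:?', '', text)
    # Spaces and hyphens carry no information.
    chars = re.sub(r'[\s-]', '', body).upper()
    if not re.fullmatch(r'[0-9]{9}[0-9X]', chars):
        raise ValueError('not a 10-character ISBN: %r' % text)
    # ISBN-10 check: weights 10 down to 1, with X standing for 10;
    # the weighted sum must be divisible by 11.
    values = [10 if c == 'X' else int(c) for c in chars]
    if sum((10 - i) * v for i, v in enumerate(values)) % 11 != 0:
        raise ValueError('bad ISBN-10 check digit: %r' % text)
    return 'ISBN %s-%s-%s-%s' % (chars[0], chars[1:5], chars[5:9], chars[9])


for spelling in ('ISBN 0 7475 3282 6', 'ISBN-10: 0747532826', '0-7475-3282-6'):
    print(normalize_isbn10(spelling))  # ISBN 0-7475-3282-6 each time
```

Checking the check digit is not strictly part of normalization, but it catches mistyped numbers before they become about tags.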
If the ISBN number is not really a suitable basis for an about tag for books, what might be? After quite a lot of reading, it’s not obvious to me that anything exists today that really comes close to satisfying the requirements above.
However, if we narrow our scope a little, say to western-language books, we can start to think about the following:
- A book, at the top (work) level of the hierarchy, seems to me to be identified by a title and an author. So perhaps basing the about tag on those two directly would make sense.
- It seems certain that a degree of normalization will be helpful, as a minimum reducing ambiguity of punctuation, different kinds of dashes etc. And (at the risk of trampling all over cultural sensitivities) there are clearly some pros and cons associated with more extreme normalization, such as adopting a standard case, possibly removing accents (in a well defined way) etc. (Note, this is purely for the purpose of defining a normalized about tag; there is no suggestion here that either author or title should be stored in anything but their full Unicode glory.)
- Clearly, there are various ways in which clashes and ambiguities could occur with such a scheme (How are subtitles handled? What about editions? What about books with the same name and author? Or even the same name and author after normalization? But we have to start somewhere.)
Starting from these two simple ideas, we might construct a conceptual about tag of the general form:
book:title (author)
If we initially assume no normalization and that the author will appear as on the cover of the book, Fugitive Pieces would then become
book:Fugitive Pieces (Anne Michaels)
Sadden me though it does, on some reflection I think that the benefits of fairly severe normalization outweigh the disadvantages. From my research to date, it appears that the nearest there is to a standard for normalizing bibliographic information is work done by NACO, the Name Authority Cooperative Program of the Program for Cooperative Cataloging. There appears to be a relatively well-defined process for ‘NACO normalizing’ a bibliographic record. This is described in some detail in a paper by Thomas Hickey et al. There is also sample code available in both Python and Java, though it appears to be subject to somewhat restrictive license terms.
For details of how NACO normalization works, the reader is referred to Hickey et al.’s paper, but the core ideas are:
1. all text is converted to lower case;
2. most punctuation is removed;
3. all diacritics (accents) are removed, together with the character they modify;
4. multiple leading and trailing whitespace is removed.
To me, 1, 2 and 4 all seem, while ugly, to confer some obvious benefits. I can also see a reasonably strong case for removing diacritics (accents), in terms of increasing the likelihood of correct matches. I am, however, amazed and horrified at the idea that not only should diacritics be removed, but the letter(s) they modify should be discarded also. This seems to me to be not so much of questionable benefit as positively harmful. I should perhaps have added to the embryonic requirements for the perfect about tag:
- Comprehensibility: as well as being easy to work out what the appropriate about tag for a given object is, it should be fairly easy to work out the identity of the object to which a given about tag corresponds.
I therefore propose a NACO-like normalization that attempts to follow the same rules as NACO, except that it retains wherever possible the letters modified by accents. So é, è and ê all become e, æ becomes ae, ø becomes o and so forth. I realise this will be painful to those whose languages make more extensive use of diacritics than does English, and for whom an ø is no more reducible to an o than is a ≠ to an =, and can only plead base pragmatism in my defence. (I have an implementation of this which I will add to the abouttag package previously mentioned when I start publishing to FluidDB with it; but part of the point of this post is to see whether anyone has different ideas; so I’m not pushing it yet.)
With this, our normalized form for Fugitive Pieces becomes
book:fugitive pieces (anne michaels)
(Note that the title and author are normalized, but the entire tag is not, so the colon and parentheses survive.)
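The NACO-like normalization proposed above, and the way the tag is assembled around it, can be sketched in Python 3 roughly as follows. This is not the abouttag implementation: the names `normalize`, `book_about_tag` and `SPECIAL_LETTERS` are invented for illustration, the replacement table is deliberately incomplete, and the treatment of punctuation (apostrophes deleted, everything else turned into a space) is an assumption rather than the NACO rule.

```python
import unicodedata
import re

# Letters that Unicode does not decompose into "base letter + accent"
# need an explicit replacement. Illustrative only, not exhaustive.
SPECIAL_LETTERS = {
    'æ': 'ae', 'Æ': 'AE', 'œ': 'oe', 'Œ': 'OE',
    'ø': 'o', 'Ø': 'O', 'ß': 'ss', 'ł': 'l', 'Ł': 'L',
}


def normalize(text):
    """NACO-like normalization that keeps the letters under the accents."""
    text = ''.join(SPECIAL_LETTERS.get(ch, ch) for ch in text)
    # Split accented letters into base letter + combining mark, drop the marks.
    decomposed = unicodedata.normalize('NFKD', text)
    text = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    text = text.lower()
    # Assumption: apostrophes vanish; other punctuation becomes a space.
    text = re.sub("['\u2019]", '', text)
    text = re.sub(r'[^\w\s]', ' ', text)
    # Collapse runs of whitespace and trim both ends.
    return ' '.join(text.split())


def book_about_tag(title, author):
    """Normalize title and author separately; the tag's own punctuation stays."""
    return 'book:%s (%s)' % (normalize(title), normalize(author))


print(book_about_tag('Fugitive Pieces', 'Anne Michaels'))
# book:fugitive pieces (anne michaels)
print(normalize('Søren Kierkegaard'))
# soren kierkegaard
```

Because the colon and parentheses are added after normalization, they are never stripped, which is exactly the behaviour noted above.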
To list a few others, we get: