To try to answer the question “What’s in Fluidinfo?”, I’ve analysed the about tags present as of 5th April 2011.
The following analysis is approximate and incomplete in at least the ways I will try to outline below. But let’s start with a first approximation of the truth.
Over the last six months or so I’ve built up a way of classifying about tags in Fluidinfo. This is what that (very crude) analysis shows.
Description | Class | Count | % |
---|---|---|---|
Twitter user [e.g. twitter.com:uid:6961642] | twitter-uid | 812,976 | 63.31% |
URL [e.g. http://google.com] | URL | 201,133 | 15.66% |
Genetic algorithm ID | fit-id | 61,801 | 4.81% |
Domain name [e.g. google.com] | domain | 43,148 | 3.36% |
Geonet location [e.g. GEOnet-2592642_-3566799] | location | 31,716 | 2.47% |
Bible-verse [e.g. Genesis:1:1] | bible-verse | 31,098 | 2.42% |
Fluidinfo tag [e.g. Object for the attribute njr/foo] | fi-tag | 25,355 | 1.97% |
[So-far unclassified] | other | 22,381 | 1.74% |
namespace [e.g. Object for the namespace timbray] | fi-ns | 11,863 | 0.92% |
openstreetmapnode [e.g. openstreetmap_node_35497544] | osm-node | 11,427 | 0.89% |
nyse-ticker [e.g. $appl] | nyse | 6,663 | 0.52% |
Fluidinfo user [e.g. Object for the user named ntoll] | fi-user | 5,096 | 0.40% |
uk-gov metadata ID | uk-gov | 4,007 | 0.31% |
us-gov metadata ID | us-gov | 3,747 | 0.29% |
book* [e.g. book:animal farm (george orwell)] | book | ≫3,551 | 0.28% |
32 hex digits [e.g. b851561b61654a1d997437b3ba8266fe] | uuid | 2,117 | 0.16% |
author* [e.g. author:elizabeth vitt] | author | ≫2,069 | 0.16% |
Net::FluidDB about tag | net-fluiddb | 2,019 | 0.16% |
Bach (J.S.) BWV number [e.g. BWV1027] | bach-work | 1,201 | 0.09% |
Someone wayner knows [e.g. D1091061014-91] | d109 | 320 | 0.02% |
Chemical element [e.g. element:Mercury] | element | 121 | 0.01% |
Data field [e.g. field:surname in table:books] | field | 100 | 0.01% |
MAC address [e.g. mac 00:00:00:00:00:03] | mac-address | 63 | 0.00% |
GA fitness [e.g. fitness:id:fit::b-cec-y2] | fitness | 62 | 0.00% |
State [e.g. state:usa:west-virginia] | state | 49 | 0.00% |
Something from fluidrb:guillermo | guillermo | 26 | 0.00% |
Planet [e.g. planet:Mercury] | planet | 23 | 0.00% |
Data table [e.g. table:elements] | table | 9 | 0.00% |
Film (movie) [e.g. film:star wards episode i the phantom menace (1999)] | film | 7 | 0.00% |
TOTAL | (total) | 1,284,141 | 100% |
Points to note particularly:
- There are 22,381 about tags (c. 1.74%) that I haven’t classified (of which many are books and authors; see below).
- There are many objects in Fluidinfo (some hundreds of thousands) with no about tag; those are excluded from this analysis.
- The analysis is all based on pattern matching on the about tag and as such is approximate and makes assumptions.
- This analysis is only concerned with objects in Fluidinfo that have about tags; most of the data in Fluidinfo is stored as tags and tag values within the system, and some people prefer anonymous objects.
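To make the pattern-matching idea concrete, here is a minimal Python sketch of a classifier of this kind. It is not the code used for the analysis above: the regular expressions are guesses reconstructed from the example about tags in the table, the list covers only some of the classes, and the function names `classify` and `tally` are purely illustrative.

```python
import re
from collections import Counter

# Ordered (class, pattern) pairs; the first match wins.
# ASSUMPTION: every pattern is inferred from the single example about tag
# shown for that class in the table, so real data may need looser rules.
PATTERNS = [
    ("twitter-uid", re.compile(r"^twitter\.com:uid:\d+$")),
    ("URL", re.compile(r"^https?://\S+$")),
    ("osm-node", re.compile(r"^openstreetmap_node_\d+$")),
    ("location", re.compile(r"^GEOnet-\d+_-?\d+$")),
    ("uuid", re.compile(r"^[0-9a-f]{32}$")),
    ("bach-work", re.compile(r"^BWV\d+$")),
    ("book", re.compile(r"^book:.+ \(.+\)$")),
    ("author", re.compile(r"^author:.+$")),
    ("element", re.compile(r"^element:\w+$")),
    ("planet", re.compile(r"^planet:\w+$")),
    # Deliberately late in the list: a bare domain is the loosest pattern.
    ("domain", re.compile(r"^([a-z0-9-]+\.)+[a-z]{2,}$")),
]


def classify(about):
    """Return the class of the first matching pattern, or 'other'."""
    for cls, pattern in PATTERNS:
        if pattern.match(about):
            return cls
    return "other"


def tally(abouts):
    """Count how many about tags fall into each class."""
    return Counter(classify(about) for about in abouts)


if __name__ == "__main__":
    sample = [
        "twitter.com:uid:6961642",
        "http://google.com",
        "google.com",
        "BWV1027",
        "how to cook fish",
    ]
    for cls, count in tally(sample).most_common():
        print(f"{cls:12} {count}")
```

Anything that matches none of the patterns, such as a plain book title, falls through to `other`, which is exactly why the unclassified group needs a different approach.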
We can view this visually too. Although pie charts are generally not my favourite way of showing data, in this case I think it works quite well.
It’s interesting also to look at just the objects that have been added in the last 3 months or so (about 350,000 objects). These are as follows:
Although there will be a small number of misclassifications in the figures above, clearly the goal of any exercise such as this is to classify as much as possible correctly, and while 1.74% unclassified doesn’t sound like a lot, it’s still over 20,000 objects. The thing that makes them hard, at least using the methodology of analysing about tags, is that (as far as I can see) there’s little in the structure of the remaining about tags to allow them to be classified.
Looking at them by eye, however, it’s immediately obvious that a lot of them (perhaps even most of them) are concerned with books in one form or another.
A lot of them are clearly book titles. Fluidinfo user librivox.org has tagged 4,301 books with (among others) a librivox.org/book/title tag. An example is the object with about tag how to cook fish.
```
fdb count -q 'has librivox.org/book/title'
4301 objects matched
Total: 4301 objects
```
If we have a look at the tags on this object, we see this:
```
[10]> fdb tags -a 'how to cook fish'
Object with about="how to cook fish" (id 16e181f4-1966-4307-9a9b-e06679d56e76):
/fluiddb/about="how to cook fish"
/librivox.org/book/category="Non-fiction"
/librivox.org/book/author="Green, Olive"
/librivox.org/languages/english
/librivox.org/genres/cookery
/librivox.org/book/completion-timestamp=1181174400.0
/librivox.org/book/description="<p>Olive Green is the pseudonym for the prolific late 19th Century/early 20th Ce"
/librivox.org/book/related-authors={Non-primative type}
/librivox.org/book/author2="Reed, Myrtle"
/librivox.org/book/id=530
/librivox.org/book/copyrightyear=1908
/librivox.org/book/etext="http://www.gutenberg.org/etext/18542"
/librivox.org/book/completion-month=6
/librivox.org/book/completed="Thu, 07 Jun 2007 00:00:00 -0700"
/librivox.org/book/language="English"
/librivox.org/book/title="How to Cook Fish"
/librivox.org/book/zipfile="http://www.archive.org/download/how_cook_fish_librivox/how_cook_fish_librivox_64"
/librivox.org/book/completion-year=2007
/librivox.org/book/genre={Non-primative type}
/librivox.org/genres/advice
/librivox.org/genres/instruction
/librivox.org/book/rssurl="http://librivox.org/bookfeeds/how-to-cook-fish-by-olive-green.xml"
/librivox.org/book/completion-day=7
```
So clearly librivox.org is simply using the title of the book (in lower case) as the about tag. Librivox also has an object for each author. So, for example, we see:
```
[11]> fdb tags -a 'olive green'
Object with about="olive green" (id c9e66af9-e25b-4c7e-853e-d49f2a9305c7):
/librivox.org/author/genre={Non-primative type}
/fluiddb/about="olive green"
/librivox.org/genres/instruction
/librivox.org/author/related-collaborators={Non-primative type}
/librivox.org/author/name="Green, Olive"
/librivox.org/author/languages={Non-primative type}
/librivox.org/author/related-titles={Non-primative type}
/librivox.org/languages/english
/librivox.org/genres/cookery
/librivox.org/genres/advice
/librivox.org/author/collaborators={Non-primative type}
/librivox.org/author/titles={Non-primative type}
```
Again, we can count these:
```
fdb count -q 'has librivox.org/author/name'
1692 objects matched
Total: 1692 objects
```
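The two librivox.org conventions seen above can be sketched in a few lines of Python. This is only my reading of the two examples shown (the title "How to Cook Fish" and the author name "Green, Olive"), not librivox.org’s actual import code; the function names are hypothetical, and the "Surname, Forename" rule in particular is inferred from a single author.

```python
def title_to_about(title):
    """Lower-case a book title, as the librivox.org book objects appear to do."""
    return title.strip().lower()


def author_to_about(name):
    """Turn 'Surname, Forename' into 'forename surname', lower-cased.

    ASSUMPTION: the comma-separated form is inferred from one example;
    names without a comma are simply lower-cased.
    """
    surname, comma, forename = name.partition(",")
    if not comma:
        return name.strip().lower()
    return f"{forename.strip()} {surname.strip()}".lower()


if __name__ == "__main__":
    print(title_to_about("How to Cook Fish"))
    print(author_to_about("Green, Olive"))
```

Because the resulting about tags are just lower-case free text, nothing in their structure distinguishes a book from an author, which is why counting by the librivox.org tags is more reliable than pattern matching here.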
So of our 22,381, that leaves 22,381 – 4,301 – 1,692 = 16,388.
There is also a set of books using an about tag convention of the form "Title//Author" (e.g. Foundation//Isaac Asimov). I’d forgotten, but I created those before coming up with the book-1 and book-u conventions. But there are only 60 of those.
More interestingly, despite the fact that oreilly.com has used the book-u conventions, there are also objects for many or all of the O’Reilly books that just use the plain titles. Let’s have a look at one:
```
fdb tags -a "head first python"
Object with about="head first python" (id 5b726d3a-4a65-4ea2-bb0a-eb17a0883672):
/fluiddb/about="head first python"
/oreilly.com/related-objects="book:head first python (paul barry)"
/njr/index/about
```
So this appears to be a kind of redirect: the key tag is oreilly.com/related-objects, whose value is the about tag of the book-u object for the book itself. If we count those:
```
fdb count -q 'has oreilly.com/related-objects'
6802 objects matched
Total: 6802 objects
```
we find 6,802 redirects, leaving 9,586 unclassified objects, or about 0.74%.
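As an illustration of how such a redirect might be followed, here is a small Python sketch that works on an in-memory dictionary rather than on Fluidinfo itself. The `objects` mapping and the `resolve` function are hypothetical, and I am assuming the oreilly.com/related-objects value is a single about tag, as it is in the one object shown above.

```python
REDIRECT_TAG = "oreilly.com/related-objects"


def resolve(about, objects):
    """Return the about tag that `about` redirects to, or `about` itself.

    `objects` maps an about tag to a dict of {tag path: value}.
    ASSUMPTION: the redirect value is a single about tag (a string).
    """
    tags = objects.get(about, {})
    return tags.get(REDIRECT_TAG, about)


if __name__ == "__main__":
    # A toy stand-in for the tags returned by `fdb tags -a ...`.
    objects = {
        "head first python": {
            REDIRECT_TAG: "book:head first python (paul barry)",
        },
        "book:head first python (paul barry)": {},
    }
    print(resolve("head first python", objects))
    print(resolve("how to cook fish", objects))
```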
We could go further, but I’m going to leave it here for now. The grand total for the number of book objects is now the 3,551 starting book:, the 4,301 from librivox.org and the 60 // books, giving a total of 7,912. Similarly, for authors, there are the 2,069 tagged author: plus the 1,692 “naked” authors from librivox.org, giving a total of 3,761.
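The bookkeeping in this section is easy to check mechanically. The short Python sketch below uses only the counts quoted in the text; the variable names are mine. Like the text, it does not subtract the 60 Title//Author books from the unclassified remainder.

```python
# Counts quoted in the text above.
unclassified = 22_381        # "other" in the first table
librivox_books = 4_301       # has librivox.org/book/title
librivox_authors = 1_692     # has librivox.org/author/name
oreilly_redirects = 6_802    # has oreilly.com/related-objects
prefixed_books = 3_551       # about tags starting "book:"
slash_books = 60             # "Title//Author" about tags
prefixed_authors = 2_069     # about tags starting "author:"

# Remove the librivox.org objects, then the O'Reilly redirects.
after_librivox = unclassified - librivox_books - librivox_authors
remaining = after_librivox - oreilly_redirects

# Revised totals for the book and author classes.
books = prefixed_books + librivox_books + slash_books
authors = prefixed_authors + librivox_authors

print(f"after librivox.org: {after_librivox:,}")
print(f"still unclassified: {remaining:,}")
print(f"books: {books:,}  authors: {authors:,}")
```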
I’ll leave with a summary of my best assessment of what’s in Fluidinfo right now, taking into account the extra digging described above.
Description | Class | Count | % |
---|---|---|---|
Twitter user [e.g. twitter.com:uid:6961642] | twitter-uid | 812,976 | 63.31% |
URL [e.g. http://google.com] | URL | 201,133 | 15.66% |
Genetic algorithm ID | fit-id | 61,801 | 4.81% |
Domain name [e.g. google.com] | domain | 43,148 | 3.36% |
Geonet location [e.g. GEOnet-2592642_-3566799] | location | 31,716 | 2.47% |
Bible-verse [e.g. Genesis:1:1] | bible-verse | 31,098 | 2.42% |
Fluidinfo tag [e.g. Object for the attribute njr/foo] | fi-tag | 25,355 | 1.97% |
namespace [e.g. Object for the namespace timbray] | fi-ns | 11,863 | 0.92% |
openstreetmapnode [e.g. openstreetmap_node_35497544] | osm-node | 11,427 | 0.89% |
[So-far unclassified] | other | 9,586 | 0.74% |
book [e.g. book:animal farm (george orwell); animal farm] | book | 7,912 | 0.61% |
nyse-ticker [e.g. $appl] | nyse | 6,663 | 0.52% |
Fluidinfo user [e.g. Object for the user named ntoll] | fi-user | 5,096 | 0.40% |
uk-gov metadata ID | uk-gov | 4,007 | 0.31% |
author [e.g. author:elizabeth vitt; olive green] | author | 3,761 | 0.29% |
us-gov metadata ID | us-gov | 3,747 | 0.29% |
32 hex digits [e.g. b851561b61654a1d997437b3ba8266fe] | uuid | 2,117 | 0.16% |
Net::FluidDB about tag | net-fluiddb | 2,019 | 0.16% |
Bach (J.S.) BWV number [e.g. BWV1027] | bach-work | 1,201 | 0.09% |
Someone wayner knows [e.g. D1091061014-91] | d109 | 320 | 0.02% |
Chemical element [e.g. element:Mercury] | element | 121 | 0.01% |
Data field [e.g. field:surname in table:books] | field | 100 | 0.01% |
MAC address [e.g. mac 00:00:00:00:00:03] | mac-address | 63 | 0.00% |
GA fitness [e.g. fitness:id:fit::b-cec-y2] | fitness | 62 | 0.00% |
State [e.g. state:usa:west-virginia] | state | 49 | 0.00% |
Something from fluidrb:guillermo | guillermo | 26 | 0.00% |
Planet [e.g. planet:Mercury] | planet | 23 | 0.00% |
Data table [e.g. table:elements] | table | 9 | 0.00% |
Film (movie) [e.g. film:star wards episode i the phantom menace (1999)] | film | 7 | 0.00% |
TOTAL | (total) | 1,284,141 | 100% |