Executive Thinking

Dependency Hell

There is a major problem I have with every single Linux distribution I've ever used: before very long, I've managed to get myself into what is known as dependency hell.

It always starts innocently enough. I have some project in mind that requires the latest feature of some piece of open-source software. By far the easiest way to install software in Linux is to use one of the many package management tools such as apt-get, yum, yast, urpmi or emerge. These automate the process of installing pre-compiled files and save the user the difficulty of knowing how best to compile a given piece of software for their particular distribution.

The trouble with these tools is that they often insist that a given piece of software can only be installed in an environment that has a particular set of libraries available. This is not really a problem for Linux itself, which can happily host multiple versions of a given library in a seamless way. Theoretically, if you need a library you don't have, you can simply install it and you're good to go.

In practice it's much harder than that. None of the package managers I've ever used has a way of saying that you want to install a second (or third, or fourth) version of a library. They always want to remove the old version before installing the new one. This, I suppose, reduces the problem of having a dozen different (and presumably unused) versions of a single library on a machine.

In practice though, it leads to a far worse problem. You often discover that your current versions of a dozen different tools rely on a currently installed library. If you want to upgrade the library, you have to upgrade all of these tools as well. Of course, these new tools may themselves depend on other new libraries that need to be installed and so on and so on.

I once installed a (I thought) simple upgrade to a program because it had a new feature I desperately needed. By the time all of the dependencies were resolved, my entire operating system had jumped two versions. I had started out running Fedora Core 2 and was now running Fedora Core 4. Or rather, a horrible mongrelized version of Fedora Core 2 and Fedora Core 4 which happened to think it was Fedora Core 4.

This is the equivalent in the Microsoft world of trying to install a new program in Windows 95, only to discover yourself running Windows 2000 when you're done.

There are many ways one might try to combat the problem of dependency hell. The approach most distributions currently take is a system of repositories. These hold versions of the software that are all mutually consistent and can be safely installed together. Of course, that means that if a new piece of software requires a new library feature, one that isn't supported by the base libraries of the repository, then that version of the software doesn't get into the repository.

It can be very frustrating to face a problem that is fixed by the latest version of some piece of software when that version isn't in the repository. The temptation is to link to one of the experimental or 'unstable' repositories to grab the version you need. Three times in four, it works: you install the program and a couple of supplemental libraries and you've happily solved your problem. The fourth time, you end up back in dependency hell.

There are other obvious ways to deal with aspects of the problem, from having an explicit language to describe how library APIs change, so that library dependencies can be derived rather than being set by the software author, to having a far more strongly-typed and object-oriented operating system that could more easily deal with these issues. Both of these are difficult and manpower-intensive changes to implement, and what you'd have when you were done would arguably not be Linux any more.

A far simpler solution, and one that would go a long way toward solving the problem, would be to add a pair of features to most package managers. The first would be a flag to specify that you wish to install a parallel version of a package, be it a software tool or just a library. Thus, I would finally have a simple way of telling my distro that I need three different versions of GCC installed for development and testing purposes.

The second feature would be a way of installing just a library. Often you'll install a package and discover that it needs Library X. Library X is part of Program X, and you have to install that program (which requires you to satisfy its dependencies first), even if you never want to use it. Sometimes you have the option of installing Development Kit X instead, which also contains Library X and is designed to let you write programs that use Library X. The thing is, you don't want to write programs that use Library X; you already have one that you simply want to install.
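
To make these two features concrete, here is a toy sketch (in Python, purely for illustration) of a package database that allows parallel versions and library-only installs. The package names, versions and dependencies below are invented, and a real package manager would of course track far more than this.

    installed = {}   # package name -> set of installed versions (parallel installs allowed)

    repository = {
        # package name -> {version: list of (dependency name, required version)}
        "gcc": {"3.4": [("libexample", "1.0")], "4.0": [("libexample", "2.0")]},
        "libexample": {"1.0": [], "2.0": []},
    }

    def install(name, version):
        """Install one version of a package alongside any versions already present."""
        for dep_name, dep_version in repository[name][version]:
            if dep_version not in installed.get(dep_name, set()):
                install(dep_name, dep_version)        # pull in just the library version needed
        installed.setdefault(name, set()).add(version)    # never removes an old version

    # Feature one: several versions of the same tool, side by side, for testing.
    install("gcc", "3.4")
    install("gcc", "4.0")

    # Feature two: a bare library can be installed without its parent program.
    install("libexample", "1.0")

    for name, versions in sorted(installed.items()):
        print(name, sorted(versions))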

By paring down the number of files you need to install to satisfy a given set of dependencies, and by having tools that are willing to explicitly manage multiple versions of the same installed library, we could go a long way to eliminating dependency hell.
Executive Thinking

File Indexing Service

There is an online media database called freedb that performs a very useful service. Almost anyone who has ever converted an audio CD to MP3 format has used freedb, even if they weren't aware of it at the time.

The service stores metadata about the songs that can be found on audio CDs. Since the CDs themselves only store the raw song data, information about the names of the tracks, the musicians, even the name of the CD itself has to be sought elsewhere. Rather than forcing everyone who has ever 'ripped' a CD to enter this data manually, the freedb service lets you enter the information into a central repository.

That way, 99.9% of the time, when you put a CD into a ripping program, it will find an already-existing entry in the database for you to work with. Only in cases where the CD is so new, or so obscure, that no one has seen it before will you have to enter the data by hand. Once you've done so, you can then upload it to the central repository so that the next person won't have to do the same.

Now, CDs are not the only digital objects that could benefit from an online database of meta-information. Anyone who has ever used a P2P client has come across the problem of files that are misnamed, misidentified, corrupt, or just plain not what you thought they were from the name. A similar service that all P2P clients could hook into to store and retrieve metadata on arbitrary data files could be a boon to P2P users everywhere.
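
One plausible way such a service could key its records, sketched minimally below in Python: hash the file's contents, so that anyone holding the same bytes looks up (or contributes to) the same entry. The in-memory dictionary stands in for the real distributed database, and the example fields are invented.

    import hashlib

    metadata_store = {}   # content hash -> metadata dictionary

    def file_key(path):
        """Hash the file's contents; identical files map to identical keys."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def submit(path, **fields):
        metadata_store.setdefault(file_key(path), {}).update(fields)

    def lookup(path):
        return metadata_store.get(file_key(path), {})

    with open("example.bin", "wb") as f:
        f.write(b"some widely shared file contents")

    submit("example.bin", title="Example Track", status="freely redistributable")
    print(lookup("example.bin"))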

And I'm not just talking about a boon for pirates and illegal downloaders. If the P2P system is ever going to be seen as a legitimate method of distributing software and music, then it will be necessary to have a way of distinguishing between content that is free for everyone and content that one needs to have purchased.

As a slight aside, this doesn't necessarily mean downloading copyrighted works should be forbidden. I have a large number of books that I have purchased. My understanding of my rights to that content is that I am entitled to an electronic copy for backup purposes. I could, of course, use an OCR system to scan in the book, but it's usually just easier to download it off the net. Besides, it's slow and time-consuming to search a physical book for a particular fact you remember it having, while searching a text file is usually much faster and easier. Ultimately, the only person who can actually know whether they have the right to download a particular file is the person doing the downloading, and I would like to make it easier for them to know where they stand. For the rest of this article I'll simply assume that it's a good idea to let people be better informed about the details of files they encounter on the Internet, including their copyright status, as a thorough debate of this issue could fill several books (and has).

Considering the acrimony of the debate on issues surrounding P2P distribution and the confrontational nature of the various parties involved, anyone who provided this service would face a number of technological hurdles. To deal with high traffic loads, denial-of-service attacks, database poisoning attacks, and a number of other dirty tricks, the system would need to be carefully designed.

It should be distributed across multiple servers in different geographical locations and use a collaborative filtering system to help maintain a high degree of data accuracy. At the same time, it needs to be able to accept information updates from large numbers of users, who may well have legitimate reasons to do so anonymously due to the regulations of their home countries. These problems are all solvable, although they would require some hard work on the developers' part.

The result would be a very useful service, and the owner of that service would find themselves in possession of a valuable database. The predecessor to freedb sold all rights to its database and made a fortune, albeit at the cost of taking the service offline. The real measure of success for a P2P file metadata database would be when it becomes so useful that copyright owners are willing to pay to have authoritative data inserted into it.
Executive Thinking

Audio Games

Here is an idea that has been kicking around in my mind since not long after I first saw a simple 3D maze game for the Commodore PET computer back in 1980. The idea was a simple one: create a video game without the graphics. I thought it would be interesting to see how well someone could navigate a maze if their only clues were provided by a pair of stereo headphones. The maze would have to have some characteristic sources of noise that could be used as clues to whether you were getting closer to or farther from a given point, and what direction you were virtually facing.
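
As a rough illustration of the kind of cue calculation involved, here is a minimal Python sketch, assuming a flat 2D maze: loudness falls off with distance, and the left/right balance follows the angle between the listener's facing and the sound source. The positions and the rolloff constant are invented.

    import math

    def stereo_gains(listener_pos, facing_deg, source_pos, rolloff=0.2):
        dx = source_pos[0] - listener_pos[0]
        dy = source_pos[1] - listener_pos[1]
        distance = math.hypot(dx, dy)
        loudness = 1.0 / (1.0 + rolloff * distance)        # closer means louder
        bearing = math.degrees(math.atan2(dy, dx)) - facing_deg
        pan = math.sin(math.radians(bearing))              # +1 hard left, -1 hard right
        left = loudness * (1 + pan) / 2.0
        right = loudness * (1 - pan) / 2.0
        return round(left, 3), round(right, 3)

    # A sound source due left of the listener lands almost entirely in the left ear;
    # one straight ahead is centred, but quieter because it is farther away.
    print(stereo_gains((0, 0), 0, (0, 5)))     # source five units to the left
    print(stereo_gains((0, 0), 0, (10, 0)))    # source ten units straight ahead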

With time, this idea evolved and grew into a very Doom-like scenario in which you hunt monsters through a pitch-black series of caverns and tunnels, and some of them hunt you. Careful attention would be given to the sounds echoing off different surface types, and to the clue noises that would let you know another creature was in the area.

Of course, by the time you are working on something this sophisticated, you will need the dedication of a full-blown sound engineer to get the modelling right, as well as the usual complement of level designers and game coders.

It would also help to have a catchy name. At one point I was calling the game 'Alone in the Dark', but that title has long-since been used elsewhere for a computer game, so I suppose a new name would have to be devised if anyone were to actually produce this.

The big issue that has prevented this from being created until now has always been the cost of production. I had always envisioned it as something the size of a Walkman with high-quality stereo output. The trouble is that high-quality sonic modelling of an area is a fairly hefty computing task. Far more difficult than I ever realized back in 1980.

On the other hand, programmable Walkman-sized devices with quality stereo output have recently become commonplace, as exemplified by the Apple iPod. Most of these devices can have additional software installed on them, so the prospect of writing audio-only games for them has become less improbable.

Now an obvious question might be: why not implement this on a desktop PC? In fact, until this moment the option hadn't occurred to me. When I first came up with this idea there were no stereo outputs available on PCs, nor were there any for many years after. I got used to thinking of the program as requiring some sort of dedicated hardware. Nowadays, though, most computers have sound chips and stereo output that could probably handle the computing load easily.

Of course, there's a big difference in user experience between sitting in front of a computer and wearing a dedicated device, but it would certainly make sense to write and test a prototype on a desktop machine. That way you could demo it to an MP3 player maker as a possible item to bundle into their latest offering so as to make it stand out from the competition.

How well it would sell is another question indeed. For most of my suggestions in these essays, I have a good sense of how well the product would sell or be adopted. For this game, I have no idea. I only know that I would want to play it.
Executive Thinking

Quack

One of the oldest still-playable computer games is the venerable Nethack (and its various relations like Slash'EM, the only variant still under active development). What it has going for it is enormous replayability, due to a huge set of monsters, items and spells, complex semi-randomizable interactions, and auto-generated levels with carefully balanced creature strengths, power progression, and treasure probabilities.

What it sorely lacks is any sort of graphical interface. Nethack is a text-only game that has only relatively recently even allowed for colour. There have been many attempts to add isometric views, tiled maps and the like, but they have all failed due to the clunkiness of the resulting interface, and the fact that in some pretty deep parts of the code, the game really thinks that humanoids look like the letter @.

Nethack has long been an open-source project, although any spelunking through the actual code base will rapidly reveal that multiple generations of coders over the last 30 years have rendered its internal structure well-nigh incomprehensible.

Now, a game with the opposite set of problems is Quake. There was a game with wonderfully interactive 3D graphics and a fully immersive fantasy experience, but fixed levels and almost zero replayability. Still, when it came out in 1996, it was a sensation. The Quake game engine has long since been released under the GPL and has been updated and tweaked in various ways by a huge variety of open source projects. There are now a fair number of other open source game engines one could use instead, such as Cube, Sauerbraten and Crystal Space, but since Quake was the engine of choice when I first came up with this idea years ago, I'll continue to use it as the suggested basis. In practice I'd spend a fair amount of time assessing the various alternatives before making a choice.

At some point it occurred to me that it would be an interesting experience to marry Quake and Hack together into a sort of 'Quack' game. Steal the dungeon generation as well as the vast sets of items, monsters and interactions from Nethack, and use the real-time play and immersive 3D effects of Quake to interact with the results.
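
Just to give a flavour of the most basic translation step, here is a toy Python sketch that turns a Nethack-style ASCII map (the little map itself is made up) into a list of axis-aligned wall blocks that a Quake-style renderer could extrude into 3D. A real conversion would of course carry far more: doors, stairs, items, monsters, lighting.

    ascii_map = [
        "#########",
        "#.......#",
        "#...#...#",
        "#########",
    ]

    CELL = 64          # world units per map cell (an arbitrary scale)
    WALL_HEIGHT = 128  # how tall to extrude each wall cell

    def map_to_blocks(rows):
        blocks = []
        for y, row in enumerate(rows):
            for x, ch in enumerate(row):
                if ch == "#":   # each wall cell becomes one box in world space
                    blocks.append({
                        "min": (x * CELL, y * CELL, 0),
                        "max": ((x + 1) * CELL, (y + 1) * CELL, WALL_HEIGHT),
                    })
        return blocks

    print(len(map_to_blocks(ascii_map)), "wall blocks generated")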

Now, this wouldn't be a trivial operation. Nethack has been tweaked so much that one would be better off using one of the various data-dumper programs that have been written for it to extract all the relevant datasets and throwing most of the code away, rather than trying to port it directly. Still, there is a treasure trove of game design there that can be salvaged.

There would also have to be some careful user interface design. 3D worlds practically cry out for real-time play, but Nethack was always more of a thinking game than a reaction game. One would have to make it practical to scout out areas without alerting monsters, pause the game when necessary, and generally make it possible to think one's way out of trouble rather than try to hack and slash through it.

So, the work would not be trivial, but I think it would be very rewarding and since most of the design work would already be done, the job would be a "simple matter of coding". If done right I predict the results would be very popular, and quite lucrative to produce due to possible derivatives, even if one ended up giving it away (after all, it would be derived from open source products).
Executive Thinking

Online Form Design

This is an idea I had a number of years ago, but that I shelved because the then-current state of the art in interactive websites wouldn't easily support the idea. Now that AJAX has become the next big thing, it would seem to be time to dust it off and take another look at it.

The general concept is to produce a website that allows one to easily generate complex forms. By 'complex' I mean things that are beyond the abilities of plain HTML, and that require a constraint engine. With a constraint engine you can say things like "All columns are to be sized so that they are wider than their contents" and have it handle all the picky details, including having intelligent defaults when the constraint turns out to be impossible to satisfy.
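
To show what I mean by a constraint with an intelligent fallback, here is a tiny Python sketch: each column must be at least as wide as its widest content, and when the page cannot fit that, the widths are scaled down proportionally rather than failing outright. The table contents and page widths are invented.

    def solve_column_widths(rows, page_width, padding=2):
        columns = list(zip(*rows))
        # The constraint: every column wider than its contents (plus some padding).
        widths = [max(len(str(cell)) for cell in col) + padding for col in columns]
        if sum(widths) > page_width:
            # The constraint cannot be satisfied; fall back to a sensible default
            # by shrinking all columns proportionally.
            scale = page_width / float(sum(widths))
            widths = [max(1, int(w * scale)) for w in widths]
        return widths

    rows = [
        ("Task", "Owner", "Due"),
        ("Draft proposal", "Alice", "Friday"),
        ("Print forms", "Bob", "Monday"),
    ]
    print(solve_column_widths(rows, page_width=80))   # fits: [16, 7, 8]
    print(solve_column_widths(rows, page_width=20))   # impossible: scaled to [10, 4, 5]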

I had originally thought that OpenAmulet would be the perfect constraint engine to use for this task, but that project seems to be moribund. Nevertheless, there are other alternatives.

In any case, one would go to the website and construct forms and pages for all sorts of endeavours: graph paper (polar, log/log, hierarchical hex, whatever), personal planner pages (such as Day Runner and Day Timer produce), various sorts of calendars, project planning schedules, role-playing character sheets, etc.

These forms could have a certain amount of intelligence behind them so that a Monthly Calendar form could configure itself for any given month, or instead, a form could be parameterized by what particular customizations you might want to add on a case-by-case basis (such as a company logo).

These forms could be downloaded either in their native language (which would only really be useful for archival purposes) or as PostScript or PDF files. The latter formats would allow one to print the forms locally whenever needed.

Registered users would be able to store the forms they created on the website, for later re-use or for sharing. Users would be able to generate lists of forms they find useful and rate the value of different forms produced. They could also search for existing public forms rather than making their own.

The most popular forms would get showcased on the main page, as would their designers.

The site would make money by advertising a number of paper- and stationery-related services and supplies, and would also have an agreement with a printing company to produce and ship short-run sets of forms on demand. Thus, although you could design and download a bizarre hex-based polar-log chart for free, if you needed several thousand copies, or cardboard-backed pads of the forms, you could buy that through the website.
Executive Thinking

Walkabout

I am, quite naturally, very happy with Google's Maps service. It has many things to recommend it, including ease of use and an open API. One thing it doesn't currently have is any acknowledgement that folks get around by means other than by car. I have heard that there are plans in the works for Google Maps to start including things like bus routes, subways and commuter rail lines on its maps, and that's definitely a step in the right direction.

Still, it does me very little good when what I want to know is how to walk somewhere. Out here in the wilds of Montreal's West Island, there are many streets without sidewalks, or with sidewalks on only one side of the street. There are also numerous parks with paths through them (and this matters in winter, when the paths are the only plowed -- and therefore navigable -- way through the park), and sidewalks that take shortcuts between streets.

One simple example of this is the fact that Google Maps shows a trio of dead-end roads a few blocks north of my house. What they don't show is that these roads are all connected by bicycle and walking paths, and form a convenient shortcut when walking to the store.

In many ways things are worse in the city centre. There are numerous alleys between buildings that are never shown on the urban maps, but which one can walk or ride down, often saving a block from one's trip. There is also our famous underground city with its huge numbers of paths, tunnels and connecting buildings, none of which Google maps.

All of this would be useful information to have, but I can't really blame Google for not providing it. After all, Google isn't a cartographer. They buy their geographic information from numerous suppliers, and no one bothers to collect walking information to resell.

That doesn't mean it can't be collected though. What is too expensive to do when one is a cartographic information company is not necessarily too expensive when one can harness thousands of enthusiastic volunteers from the internet.

So, today's idea is to build a website that lets folks edit Google Maps overlays in a wiki-like manner. The site would allow for the upload of data from GPS units, or let people draw directly on the maps. There would be ways to add annotations to, for example, distinguish between a hiking trail, a sidewalk and a back alley.
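
A user-contributed overlay could be stored as something as simple as the record sketched below in Python (the coordinates, field names and path types are all invented for illustration); anything in this vein could be drawn over the map and handed back out through an API.

    import json

    def track_to_overlay(points, path_type, name):
        """points: a list of (latitude, longitude) pairs from a GPS unit or drawn on the map."""
        return {
            "name": name,
            "path_type": path_type,        # e.g. hiking trail, sidewalk, back alley
            "points": [{"lat": lat, "lon": lon} for lat, lon in points],
        }

    shortcut = track_to_overlay(
        [(45.4765, -73.8240), (45.4771, -73.8252), (45.4778, -73.8261)],
        path_type="walking_path",
        name="Path connecting the three dead-end streets",
    )
    print(json.dumps(shortcut, indent=2))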

The resulting user-created maps could then be displayed with a Google interface, and could even export a similar API so that folks could build upon the data presented.

If one ended up being quite lucky, then Google (or a cartographic service) might buy out the web page. Even if this did not happen (and having a business plan that requires being "discovered" by someone with deep pockets is seldom a good idea), I suspect that there are probably as many uses for user-created mapping services as there are for user-created text services, and the huge number of extant wikis shows just how popular the latter are.
Executive Thinking

Next Generation Online Social Network

Recently I wrote a proposal for the creation of the next generation of online social networks. Seeing as how I've not received any response to that proposal, and I think the ideas therein are good, I'm going to talk about it here.

To start with, the current generation of social networks is narrowly focused on particular groups. Teenagers tend to gather on MySpace. Business professionals use sites like Ryze and LinkedIn. Writers tend to like LiveJournal. Each of these groups is attracted to a system that caters to its preferred modes of social interaction, and as a result the look-and-feel of these networks is quite different from one to the next.

In addition, in existing networks there is inevitably an element that refuses to follow (or is oblivious to) the social contract that has been set in place. This gives rise to problems with such disruptive users as spammers, trolls, stalkers, and serial complainers.

To tackle the first of these issues, look and feel can be easily separated from the basic functioning of the social network software, so that one can provide one system that is almost everything to almost everybody. Each participating member would see a different user interface depending on which group they interact with. There need not even be any overt indication that such social networks as (for example) www.ravers.com and www.christianconservatives.com are hosted on the same network and handled by the same software.

In effect, there would be a network of networks; each networked group free to build its own culture in its own online space. The interconnected nature of these networks would only become apparent when someone joined multiple groups and found that their one account let them switch seamlessly between their different groups.

To aid in this seamless interoperation, the software would allow a user to carefully partition what information was public and which was private to particular individuals, groups, and the network as a whole. A businessman might simultaneously be a member of groups customized for marketing directors, golf enthusiasts and homosexuals, but might well wish to present themselves very differently to these three groups. They might (for example) have a staid business profile for the first group, a flashy and deliberately tacky profile for the second, and a completely anonymous one for the third.
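
A bare-bones Python sketch of that kind of partitioning might look like the following: one account holds several profile 'facets', and each group is only ever shown the facet (and the fields) the user assigned to it. The names and groups are invented.

    class Account:
        def __init__(self, username):
            self.username = username
            self.facets = {}           # group name -> dictionary of profile fields

        def set_facet(self, group, **fields):
            self.facets.setdefault(group, {}).update(fields)

        def profile_for(self, group):
            # Only the facet created for this group is ever exposed to it.
            return self.facets.get(group, {})

    user = Account("user42")
    user.set_facet("marketing-directors", display_name="J. Smith", title="Marketing Director")
    user.set_facet("golf-enthusiasts", display_name="Jimbo", handicap=18)
    user.set_facet("private-group", display_name="anonymous")

    print(user.profile_for("golf-enthusiasts"))
    print(user.profile_for("marketing-directors"))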

Once such a basic network-of-networks, with its security and privacy models, is defined, there is a huge range of additional features that can be provided on top of it. Just a few among these are:

  • Flavoured Contacts (see the sketch below). Most social networks allow one to have links to 'friends', but this would be too general a connection for a truly universal social network. We would allow links to be distinguished by the relationship they implied. One could have 'friend' links, 'business contact' links, 'sports buddy' links and so on, as the user desired. One could also associate an intensity with each link, to indicate how much of a friend someone was, how sexy they appeared, how trustworthy they were as a business partner, and so on. These could be kept private or aggregated to provide ranking predictions for new contacts (which could, in turn, be used to inform an automated introduction service).
  • Reputation systems and ranking. As much of how people interact socially is governed by considerations of status and social ranking, the system could explicitly keep track of each user and provide Google-like page ranks relative to each of their various interest groups. By careful consideration of how these ranks are calculated, the system could be made largely self-policing by automatically discouraging disruptive or destructive behaviour.
  • 'Private' interests. In some cases it might be desired to hide particular interests in a profile from everyone who does not have that same interest in their own profiles.
  • Full multimedia support. Current social interaction sites are moving to support the hosting of images, but I think it should go much further. The system should support the hosting of all kinds of digital content, including photos, videos, sound, and software. Where appropriate it should allow one to easily post a 'clip' from the middle of a sound or video file on one's account.
  • Full Internet interactivity. As an aid to gaining widespread adoption, the system should cater to a wide variety of different Internet protocols and interaction types, including SMS messages, voice messages, web cams, email gateways, RSS gateways, Usenet gateways, news services, and even other competing social networks. The more inclusive the system, the lower the barrier to migrating from existing systems that people may already be members of.
These last two items may sound like huge development investments, but the adoption of an architecturally open software model in which technically sophisticated users can contribute plugins can greatly reduce the need for local development efforts.
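
As promised above, here is a small Python sketch of flavoured contacts: each link carries a type and an intensity, is private by default, and the non-private ratings can be aggregated into a ranking prediction for an introduction service. All names and numbers are invented.

    links = []   # (from_user, to_user, flavour, intensity 0-10, private flag)

    def add_link(src, dst, flavour, intensity, private=True):
        links.append((src, dst, flavour, intensity, private))

    def predicted_rank(user, flavour):
        """Average the non-private intensities that others have assigned to this user."""
        ratings = [i for (_, dst, f, i, private) in links
                   if dst == user and f == flavour and not private]
        return sum(ratings) / len(ratings) if ratings else None

    add_link("alice", "bob", "business contact", 8, private=False)
    add_link("carol", "bob", "business contact", 6, private=False)
    add_link("dave",  "bob", "business contact", 2)   # private: stays out of the aggregate

    print(predicted_rank("bob", "business contact"))  # 7.0, a hint for an introduction service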

I could probably go on for page after page detailing the features that should exist in a good online social network, and that are missing in all current offerings, but this should give a good idea of what features the next generation should provide.
Executive Thinking

Virtual Corporation Portal

The term "Web Portal" has changed in meaning since it was introduced. Wikipedia's definition currently talks about the integration of various offerings from different vendors, personalized services for users, and multiple-platform independence. None of this was part of the core idea behind portals.

They were originally intended to serve as one-stop shopping for a particular class of internet user. That is, they were narrowly focused on providing, in one location, all of the services needed to conduct some particular bit of business on the web. The idea was that if there were one site that, for instance, told you everything you wanted to know about buying, caring for and raising exotic fish, then it would quickly become the place that exotic fish aficionados would go first when looking for something. As a result you would have a captive audience for advertising purposes, and would be able to act as a middleman (and take a cut) by connecting consumers with vendors.

The idea was (and is) basically a good one, but it turns out that the look and feel of the portal is crucial. Someone visiting for the first time, even if they know almost nothing about the industry, must be able to spot at once what the portal is for, and how to find out what they want. At the same time a seasoned user of the service must not have to jump through hoops to get to the particular item that interests them today.

During the 1990s, many companies created portals, and most of them failed miserably. Mainly this was because they were hard to navigate, did a poor job of being a one-stop shop for their users, and were often very difficult for the uninitiated to understand. As a result, most portals deployed today are part of a company intranet, used to coordinate the work of all of the members of that corporation, where the audience has little or no choice in the use of the service.

You've probably guessed the reason that I'm providing this background: I would like to propose that someone create a new web portal. In particular, I would like to see one dedicated to the setting up and running of virtual corporations.

The idea behind a virtual corporation is quite simple. If most of the work is outsourced and the few actual employees work from home offices, the costs of running the company can be drastically reduced. One success story is Topsy Tail Co., which has revenues of $100 million per year, but only three employees.

Once everyone is working from home offices, the cost of corporate office space can be almost eliminated. I say 'almost' because some corporate workspace is often useful. Sometimes one of the team members visits from out of town and needs a temporary office for the duration. Sometimes you need to get everyone together for an actual physical meeting. Sometimes it's necessary to sit a client down in a business setting with members of your team.

In all of these cases it's useful to have a single office or meeting room that one can use at need, although since such rooms are used quite rarely, it's often more economical to rent them on an as-needed basis from a virtual office supplier.

Such suppliers provide just one of the many services that a virtual corporation needs to contract for. In fact, it turns out that a virtual corporation has many specialized needs, including such items as:
  • Dedicated mail and web services
  • Coordination and collaboration software
  • Telecommuting employees
  • One-click legal and accounting services (such as incorporation in Vanuatu)
  • Phone answering and call-forwarding services
  • Receptionist and physical mail services
And the list can continue in that vein for some time. All of the above services are available on the web somewhere, but most are extremely hard to find, even with Google's help. A web portal that brought it all together and provided comprehensive one-stop-shopping for these services would be a boon for the modern entrepreneur who is trying to start a new business as inexpensively as possible.

Such a service might prove quite lucrative. Besides, the company that runs the portal could have rather low expenses, as it too could be virtual.
Executive Thinking

RSS Aggregators

There are a large number of companies out there who are trying to get in on the RSS aggregation bandwagon. I'm going to assume for the sake of this essay that you already know about RSS and aggregation and the many uses to which it is currently being put (podcasting, vlogging, etc). If not, I suggest you check out the link above before reading further.

Now, the RSS system has a number of flaws, the most serious of which is that it doesn't scale. The server load from having several hundred thousand subscribers poll a website every few minutes can bring even major web services like MSN to their knees. That, however, is not what this essay is about. Instead of proposing a replacement for RSS (which I may well post about in the future), I'm going to propose a replacement for the current crop of aggregators.

There are many aggregators, and I've looked at a fair number of them, but by no means all, so for all I know what I am about to propose may already exist out there. If it does, though, I was unable to find it.

What I want is a service that doesn't just merge multiple RSS streams into a single stream. That's useful, but it's far from enough. I would very much like to see a service that can create RSS streams out of non-web material like newsgroups and mailing lists and syndicate them. The resulting output from the aggregator should default to RSS (of course), but it would also be very useful if it could be output in the form of a mailing list or newsgroup gateway.

Then, I want to be able to merge streams and perform operations upon the results. I would like to collapse together all articles with identical texts. Too many of the news streams I subscribe to carry the exact same story and I would like to see it only once. It would also be good if the aggregator could tell when multiple articles were all about the same current hot topic and group them together, perhaps by merging the articles into a single meta-article.

Another useful feature would be the ability to split a stream (either a regular stream or one of these merged streams) so that I could filter technical discussions into a different feed than political discussions about technology, both of which currently show up in some of the technology streams I follow.
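
Here is a minimal Python sketch of the merge, de-duplicate and split operations described above, with plain dictionaries standing in for RSS items. The feed contents and the keyword-based splitter are invented; a real service would parse actual RSS or Atom feeds and use much smarter grouping.

    import hashlib

    def merge(*feeds):
        """Combine several feeds, dropping articles whose text is identical."""
        seen, merged = set(), []
        for feed in feeds:
            for item in feed:
                key = hashlib.sha256(item["text"].encode("utf-8")).hexdigest()
                if key not in seen:
                    seen.add(key)
                    merged.append(item)
        return merged

    def split(feed, keywords):
        """Divide one stream in two: items matching any keyword, and the rest."""
        matched = [i for i in feed if any(k in i["text"].lower() for k in keywords)]
        rest = [i for i in feed if i not in matched]
        return matched, rest

    feed_a = [{"title": "New kernel released", "text": "Linux 2.6.x is out."}]
    feed_b = [{"title": "Kernel news", "text": "Linux 2.6.x is out."},
              {"title": "Tech policy", "text": "Parliament debates software patents."}]

    combined = merge(feed_a, feed_b)
    politics, tech = split(combined, ["parliament", "patent", "election"])
    print(len(combined), "unique articles;", len(politics), "political,", len(tech), "technical")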

Once all that is in place, then I would like to see some collaborative filtering layered on top of it all. Something that allows the subscribers to the resulting feeds to give feedback, not only in the form of comments, but as ratings and subject tags so that the readers can refine the initial stream filtering done by the RSS engine. Think of it as a Digg for RSS with the ability to add and vote on topic categories.

Now THAT is a service I would like to see!
Executive Thinking

File Signature Database

During the dot-com bubble, I was working with some friends to build an internet service based on the many uses of digital signature technology. Since we went under when the bubble burst, and no one else has yet provided a similar service, I thought I would talk about what sorts of things we were doing, as I still think it's a viable business. I mention in the FAQ for this journal that some businesses fail for reasons unrelated to their technical merit. This was one of those.

Now, to start with, when I say 'digital signature', I don't just mean the output of cryptographic hash systems used in many authentication systems, but the output from the general class of hash functions as applied to digital objects. I make this distinction because we used far more than just one type of hashing. In our project we made use of at least three different sorts of hashes:
  • Cryptographic Hashes
  • Structural Hashes
  • Analytic Hashes
The first type is the sort that I imagine comes to mind first for most technologists these days. Cryptographic hashes are commonly used in many internet protocols. They are designed so that even a one-bit change in the input data will change roughly half of the bits in the hash, effectively at random. They provide assurance that a file hasn't been tampered with, can act as unique IDs for a data object, are very difficult to forge, and can be used to verify the origins of a digital object.

Structural hashes are a different sort of fish. They provide information about the structure of the data, including such things as its type and format. They can be used to characterize the sorts of operations that can be safely performed on a file one has never seen before, and what sorts of content-specific signatures could be applied. Thus, if one knows that a given binary stream is a JPEG-encoded image in a JFIF wrapper, then it makes sense to attempt to display it visually. The treatment for a tarred, zipped C source file is very different. The standard Unix command 'file' provides a simplified form of structural hash.

Finally, analytic hashes are designed to change their bits in characteristic ways when the data undergoes certain transformations. We were able to develop analytic hash systems that allowed us to tell what editor or compiler had been used to construct a data object. We could tell from its analytic signature whether it was likely to have been infected with a virus. Best of all, we could tell whether a file's content was substantially the same even though its format had been altered. Admittedly, we had achieved this last only for text documents, but we were researching ways to do the same thing with images and audio data when the company shut down. This same technique allowed us to tell that two documents were textually related -- possibly even different articles on the same topic.
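
To make the distinction concrete, here is a deliberately simplified Python sketch of the three kinds of signature. The cryptographic hash uses a standard library algorithm; the 'structural' and 'analytic' hashes below are crude stand-ins (a magic-number check and a word-frequency fingerprint) for what were far more elaborate systems.

    import hashlib
    import re
    from collections import Counter

    def cryptographic_hash(data):
        # A one-bit change in the input flips roughly half the output bits.
        return hashlib.sha256(data).hexdigest()

    MAGIC_NUMBERS = {b"\xff\xd8\xff": "JPEG image", b"%PDF": "PDF document",
                     b"PK\x03\x04": "ZIP archive", b"\x1f\x8b": "gzip data"}

    def structural_hash(data):
        """Identify the format, much as the Unix 'file' command does."""
        for magic, kind in MAGIC_NUMBERS.items():
            if data.startswith(magic):
                return kind
        printable = all(32 <= b < 127 or b in (9, 10, 13) for b in data)
        return "plain text" if printable else "unknown binary"

    def analytic_hash(data):
        """A crude content fingerprint: the five most common words, format ignored."""
        words = re.findall(r"[a-z]+", data.decode("latin-1").lower())
        return tuple(sorted(w for w, _ in Counter(words).most_common(5)))

    doc_a = b"The quick brown fox jumps over the lazy dog. The fox is quick."
    doc_b = b"THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.\nThe fox is quick.\n"

    print(cryptographic_hash(doc_a) == cryptographic_hash(doc_b))   # False: the bytes differ
    print(structural_hash(doc_a), "/", structural_hash(doc_b))      # both plain text
    print(analytic_hash(doc_a) == analytic_hash(doc_b))             # True: same content fingerprint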

Given this basic system of hashes and a robust distributed database to store them in, there were several different businesses we could engage in. The most obvious was to store as much metadata as we could on every data object on the internet, and associate it with the signature for the object. We could then answer specific queries about any object we had registered. These included such things as:
  • What is this object?
  • Where did it come from?
  • Is it part of some specific collection?
  • What rights do I have with respect to this object?
  • What other versions of this object are there?
  • Are there functional replacements for this object?
  • Are there security concerns with this object?
  • Do I need a license to own/use/have this object?
  • Where can I get support for this object?
And many similar bits of information. The plan was that queries would be free but we would charge companies to store useful information about their products in our database. The cost to store the data would be based on the cost of the software, so that (for example) free software could be registered for free, while companies making large amounts of revenue from their software would have to pay more.

The reason that a company would want to register its software, images, music files and so on with us is that it would cut down on piracy. Right now, most major corporations do not know what software is running on their systems, or whether they have a legal right to what is there. Our system could scan a huge network overnight and report that it had found 762 copies of a particular piece of commercial software. If the company had only bought a 500-unit license, then it would know that it was in violation and would need to take some action to remedy the situation.

This could be extended in several ways. We could allow promotional information and service information to be registered. A company could also register local, private information about proprietary data objects and track their use within their organization. Version histories, comments, and distribution information could all be centrally collected and managed by a system built on top of the database.

There are many other uses for such a system as well. Just one example: by keeping statistics on the sorts of queries we were receiving, we could estimate the installed base for any registered software product and sell the accumulated statistics. Another idea we had was to build a monitor for all incoming and outgoing data traffic on a site, to ensure that no proprietary data was being accidentally leaked.

The longer we worked with the system, the more potentially profitable uses we found for it, and I think the business model is as valid today as it was in 2000.