% fortune -ae paul murphy

National ID - finding bad guys

As I said yesterday, the data and processes needed for a token-based replacement for a national ID card system are in place. What's needed is the token technology itself and the political will to implement it - both of which should evolve over the next few years.

Meanwhile billions will get spent on ineffective traditional cards and technologies as "the system" works out a compromise between individual freedoms and bureaucracy that hurts both sides. The costs go well beyond dollars to include destroyed lives, deaths due to policing failures ultimately traceable to information systems failures, and bad public policy decisions entered into as compromises between otherwise irreconcilable forces.

So is there anything that could be done to head this off?

I don't believe we can change the minds or agendas of key players - they know they're right, the momentum is in place, and they're going to spend the money, make the compromises, and disown the negative consequences in the usual ways.

What we might be able to do, however, is build awareness of other options while also offering a few short-term ideas that use existing technologies more effectively to find the most dangerous people out there - thereby accelerating the leakage of political support from the national ID card proponents and ultimately limiting the damage they do.

Yesterday's blog, like Monday's, was part of a first attempt at part one of this - building awareness for better options on the theory that a snowball in a hot place starts a cooling trend.

But what about part two? What technologies are available now, specifically to help police find terrorists embedded in the population?

Surprisingly, at least to me, the database and analysis technologies needed for this have been both widely known and widely available since the mid to late nineties.

Set aside ethical considerations for a moment and imagine that you've got the widest possible access to individual communication and related location information - everything recorded by the digital communications system, including voice and packet-switched data for every end-to-end communication touching any member of the population in which your targets are hiding.

Traditionally you would analyze this data by mapping the clusters - seeing who talks to whom and then relating that to lists of known or suspected criminals to see if new names pop up.
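To make that concrete, here's a minimal Python sketch of the traditional approach, assuming the records have already been reduced to (name, name) pairs and with `watchlist` standing in, purely hypothetically, for the list of known or suspected names:

```python
from collections import defaultdict

def contact_map(records):
    """Build an undirected who-talks-to-whom map from (a, b) call pairs."""
    contacts = defaultdict(set)
    for a, b in records:
        contacts[a].add(b)
        contacts[b].add(a)
    return contacts

def new_names(contacts, watchlist):
    """Names in direct contact with a listed suspect but not themselves listed."""
    found = set()
    for suspect in watchlist:
        found |= contacts.get(suspect, set())
    return found - set(watchlist)
```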

Unfortunately, however, this approach is startlingly inefficient in terms of finding new linkages: first because it doesn't do well at finding connections that go through third parties, and second because it has a significant tendency to report accidental three-way and higher connections.

To see how to improve on this, consider two mid-nineties fads among IT researchers. The first of these involved mapping the web - first by finding the connections between document repositories by tracing the URLs referred to within those documents, and later augmenting this by tracing the addresses associated with the IPs used by people downloading those documents to browsers.

Document linkage mapping was widely expected to make document search more reliable, but it didn't work out that way. Instead it made search results more useful by allowing companies like Google to rank search hits in decreasing order of "cluster connectedness". Note the distinction: rankings don't affect what the search finds, only which search results you look at.

The other fad was a web-research-based variant on a much older intellectual parlor game: guessing how many intermediaries there are between two people chosen either randomly or for specific characteristics.

For example, there are probably only about five intermediaries between the average American web user and President Bush - meaning that if you're an American web user, someone you know knows someone, who knows someone, who knows someone, who knows someone, who knows the President.

Think of this as A knows B, who knows C, who knows D, etc., and you can immediately see that the cluster approach should generally fail to identify weakly linked subclusters: say A,B and D,E,F in the absence of either an A,C or a B,C subcluster. It's not, in other words, very good at identifying groups using communications cutouts - unfortunately exactly the high value targets you're looking for.
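A toy example, reusing `contact_map` from the sketch above, shows the failure. The names and links here are invented purely for illustration:

```python
# Hypothetical cutout structure: cells {A, B} and {D, E, F}
# communicate only through an intermediary C.
records = [("A", "B"), ("B", "C"), ("C", "D"),
           ("D", "E"), ("D", "F"), ("E", "F")]
contacts = contact_map(records)

# No member of one cell ever talks directly to a member of the other,
# so who-talks-to-whom clustering reports two unrelated groups:
cell1, cell2 = {"A", "B"}, {"D", "E", "F"}
print(any(b in contacts[a] for a in cell1 for b in cell2))  # False
```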

So how to improve? Well, instead of thinking about a table listing the times, names, initiator, physical locations, connection type, and duration of each communication, think only about the two identifier columns. These are of the form A-->B, and can be re-ordered as one vector for each name: A-->B, F, G, K, Z... Then construct all possible common substrings of length greater than two without considering either order or placement: A-->B,F; A-->B,G; etc. Finally, group substrings by like elements.
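In Python terms the construction might look like this - a sketch, assuming the `contacts` map built earlier and treating "common substrings without order or placement" as shared contact subsets:

```python
from collections import defaultdict
from itertools import combinations

def shared_contact_groups(contacts, k=2):
    """Group names by the contact subsets of size k they have in common,
    ignoring order and placement."""
    groups = defaultdict(set)
    for name, who in contacts.items():
        for subset in combinations(sorted(who), k):
            groups[subset].add(name)
    # A subset only matters if more than one name shares it
    return {s: names for s, names in groups.items() if len(names) > 1}
```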

The overall problem space looks combinatorially large, but we have standard software codes for this from DNA research and, of course, the overwhelming majority of communications data can be counted once and then thrown away as repetitive - meaning that the key size determinant isn't the time period covered by the data but the number of names you start with.
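The counting step is the easy part - again a sketch over the same hypothetical (name, name) records:

```python
from collections import Counter

def dedupe(records):
    """Collapse repeated communications between the same pair into one
    edge plus a count, so data size scales with the number of names
    rather than with the time period covered."""
    counts = Counter(frozenset(pair) for pair in records)
    edges = [tuple(sorted(pair)) for pair in counts]
    return edges, counts
```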

Ultimately, of course, this is still just a clustering approach, but n-way groupings with and without cutouts show up clearly. Select only those groups featuring at least one known or suspected bad guy and you're most of the way to having a powerful intelligence tool.
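That final selection is a one-line filter over the groups built above - again a sketch, with `watchlist` standing in for the known or suspected names:

```python
def flagged_groups(groups, watchlist):
    """Keep only shared-contact groups touching a listed name,
    whether as a group member or as one of the shared contacts."""
    bad = set(watchlist)
    return {s: names for s, names in groups.items()
            if bad & (names | set(s))}
```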

The downside, other than the invasion of privacy implied by how you got the data, is still false positives, but it's a different kind of false positive than you get with a URL cluster style ranking of the kind Google uses. Look for networks associated with a known or suspected bad guy in our hypothetical connections database and you may find that your target is two removes from a well known lobbyist with deep political connections - but that report isn't wrong: it's the conclusion you draw from it that's wrong. The connection is real; the false positive is the implication that the lobbyist is knowingly working with or for the bad guy, not the communications relationship between the two.

Once you've got a bunch of possible relationships you can rank them in the order in which they merit human attention by looking at modifier data. Suppose, for example, that you know that two fairly thin (in terms of communications frequency and duration) groups of five or six names share links to B and C respectively, with both of the latter weakly linked to A. So what? By itself that wouldn't tell you anything - it's only when you look at the modifier information and discover that several frequently called numbers for members of both target groups are associated with flight training centers that you might want to give the people involved some more direct attention.
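As a sketch of that ranking step, assuming a hypothetical `modifier` table mapping each name to any flagged attributes (the flight-school association in the example above):

```python
def rank_groups(groups, modifier):
    """Order groups for human attention by how many people involved -
    members or shared contacts - carry a flagged modifier attribute."""
    def score(subset, names):
        return sum(1 for n in names | set(subset) if modifier.get(n))
    return sorted(groups.items(),
                  key=lambda kv: score(kv[0], kv[1]),
                  reverse=True)
```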

It sounds simple, and in theory it is. In practice there are organizational and data access problems - but my point isn't that we should all run out and create personal anti-terrorist or counter-intelligence analysis tools; my point is that there are real national urgencies at work here and real interim solutions exist whose use we should all be supporting, because their success would take political support away from the people driving the coming boondoggles in national ID card implementation - and thus limit the damage they do.


Paul Murphy wrote and published The Unix Guide to Defenestration. Murphy is a 25-year veteran of the I.T. consulting industry, specializing in Unix and Unix-related management issues.