Showing posts with label knowledge. Show all posts

Tuesday, September 29, 2015

No Social Constructs in My Little Town

In Paul Simon’s song “My Little Town”, there is the lyric “everything’s the same in my little town”.  This can be contrasted with "the big city" that is cosmopolitan, multicultural, not all the same.  It is only when you have the experience of more than one culture that it becomes natural to see that there can be more than one notion of how things are.  In Philosophy, when there is one “correct” definition of something, due to it existing in nature, independent of man, it is called a “natural kind”.  However, when something only exists because people have agreed to think that it does, that is called a “social construction”.  Computer programmers need to be aware that, of the things they have to model in their databases, user interfaces & business models, most are social constructions, and hence, there are many ways to skin a cat, but not arbitrary ways.  Two case studies are given that show how programmers can err on either side of the spectrum.

Social Constructs versus Natural Kinds

It is common to consider Natural Kinds to be “discovered”, and Social Constructions to be “invented”.  An example of something that only exists because we say it does is money.  A Bitcoin (or an ounce of gold, or a piece of paper with $100 printed on it) is worth whatever we say it is, and how many Big Macs can be bought with each is whatever we agree upon.  And via the computerized marketplace, we can change our collective mind every microsecond.

Social Constructions are also inherently “relative” to some culture, which means that there can be more than one version of it floating around, and each can be equally valid.  A traditional example is the definition of what it means to be a woman.  While there are the aspects of womanhood that are controlled by DNA and biology, many aspects are defined by society, and there are many different definitions existing simultaneously.

On the other hand, we believe that atoms exist in nature.  We may discover better definitions and understandings of them over time, but those would be mere changes in our knowledge of them, rather than making them an invention.  When Europeans thought that swans only came in white, and then found black ones in Western Australia, their definition of swan changed, but we still think that swans are in fact a natural species, independent of whether you’re European.

Natural Kinds that Aren’t

There are times that something is thought to be a natural kind, but later realized to not be, for example, planets.  We thought that planets existed objectively and independent of our latest definition.  We now realize that those orbiting chunks of rock and clouds of gas may exist objectively, but our classifications of “planet” versus “dwarf planet” versus “failed star” do not.  I.E. they are arbitrary enough that aliens landing here will likely have different ways of classifying orbiting stuff.

Saving the Phenomena

So, if planets aren’t “real”, where did they come from?  An early view of the universe was that everything literally revolved around the earth in a perfect circle, except for a handful of “wanderers”, the original meaning of planet.  As we gained more knowledge, we would update our definitions. But we always (if only unconsciously) wanted to “save the phenomena”; in other words, make sure that the new definitions didn’t drop any planets and didn’t add any, otherwise, we would be defining something that did not match our intuitive notion of planets.

Of course, recently that became impossible because we realized that we were either going to have to add hundreds of new “planets”, or, drop Pluto, to be consistent.  The more that people tried to keep the original collection, the more it became clear that the collection was based on culture and history rather than an objective category of things in space.  Some even say that Jupiter is not really a planet, but a failed star.

Case Study: World Headquarters in My Little Town

The world headquarters of Coca-Cola is in Atlanta, and while a world headquarters would be expected to be pretty cosmopolitan, it is in The South, which was traditionally very monocultural, conservative, and religious (which I can say because I grew up there).  I was there, on a Y2K project, redesigning data files which were using just 2 digits to represent years (even though the file formats had been specified only two years previously, by the way).  During discussions with the developers (all Atlanta locals), it was assumed that there was only one obvious “correct” way to represent dates in a string: mmddyyyy.  Having lived overseas, I knew to point out that most of the world doesn’t do it that way, instead using ddmmyyyy, or yyyymmdd (and we did not even get into other calendar systems).

The point is that it was assumed that dates were a natural kind when they are actually very socially constructed. In my little town everything is the same and therefore looks like the one and only way God intended.
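To make the ambiguity concrete, here is a minimal Python sketch that reads the same eight digits under each of the three layouts mentioned above. The function name and the `LAYOUTS` table are mine, made up for illustration; they are not from the Y2K project.

```python
from datetime import date

# Each layout says where year, month and day sit in an 8-digit string.
# Only the three conventions named in the post are covered.
LAYOUTS = {
    "mmddyyyy": lambda s: (int(s[4:8]), int(s[0:2]), int(s[2:4])),
    "ddmmyyyy": lambda s: (int(s[4:8]), int(s[2:4]), int(s[0:2])),
    "yyyymmdd": lambda s: (int(s[0:4]), int(s[4:6]), int(s[6:8])),
}

def interpretations(digits: str) -> dict[str, str]:
    """Return every valid reading of an 8-digit date string, keyed by layout."""
    if len(digits) != 8 or not digits.isdigit():
        raise ValueError(f"expected exactly 8 digits, got {digits!r}")
    readings = {}
    for name, split in LAYOUTS.items():
        try:
            year, month, day = split(digits)
            readings[name] = date(year, month, day).isoformat()
        except ValueError:
            pass  # not a real calendar date under this layout
    return readings

print(interpretations("03042015"))
# {'mmddyyyy': '2015-03-04', 'ddmmyyyy': '2015-04-03'}
```

The string alone cannot tell you whether it means March 4th or April 3rd. The convention has to be stored or agreed upon somewhere outside the data.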

Case Study: Bizarrely Arbitrary User Interface at domain.com

While the concept of “social construction” says that there can be several equally-valid ways of defining some things, do not forget about the “social” part! I.E. you should not create an arbitrary definition that no one actually uses and therefore no one will understand.

On the website of domain.com (a domain name registry provider), there is the domain name registration form which includes a mandatory phone number field.  The required format is so bizarre, though, that it took a chat with customer support to figure out what it was.  It turned out to require the phone number to be entered as the fractional portion of a floating point number…let that sink in…floating point notation, with a mandatory leading plus sign and mandatory integer of 1.  So, phone number “(123) 456-7890” had to be entered as +1.1234567890  ...AND, to make matters worse, the error message received when it was not entered that way only said that a legal phone number was required, without explaining what the non-obvious required format was.

When I pressed the support chat operator for an answer to my question, WTF?!, I was told (after some time on hold) that “that was the format that the developers chose”. There was no answer to my question: Of all the phone number formats on the planet, who has ever used that?   Apparently it was a product of the culture at domain.com: off-shored contract developers, and no managers who were engineers enough to review the design.
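Whatever format the back end insists on, the form could have accepted the spellings people actually use and converted them itself. The following is a minimal sketch, assuming North American numbers only; `to_registry_format` is a hypothetical name, not anything from domain.com.

```python
import re

def to_registry_format(raw: str) -> str:
    """Convert a North American number, in any common spelling, to '+1.NNNNNNNNNN'."""
    digits = re.sub(r"\D", "", raw)      # keep only the digits
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]              # drop a leading country code of 1
    if len(digits) != 10:
        # Say what was expected instead of a bare "legal phone number required".
        raise ValueError(
            f"Could not read {raw!r} as a 10-digit phone number. "
            "Examples that work: (123) 456-7890, 123-456-7890, +1 123 456 7890"
        )
    return f"+1.{digits}"

print(to_registry_format("(123) 456-7890"))   # +1.1234567890
```

The storage format stays whatever the developers chose. The user just never has to guess it.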

Saturday, February 15, 2014

It's about Time, It's about Space

A 1960s TV series theme song began, "It's about time, it's about space...". Some, from Physics to Philosophy, say it's about both, claiming they are each aspects of a single space-time. Computer systems developers need to consider this as they build GIS applications.

Ontology, being the branch of philosophy concerned with describing "what exists", tackles the topics of Space and Time since they are often used to describe things. An Introduction to Ontology[1] devotes a chapter to each. As usual, things are more complicated than our initial intuition expects, and debate continues about different viewpoints. In a nutshell, the following are discussed:
  • Space is usually defined in terms of "regions"
  • Space is either absolute or relative.
  • Space is either something things "are in", or it is synonymous with the thing itself
    i.e. regions only have properties like size and location, versus, a region itself having the property blue if the stuff in it is blue.
  • Space is either Euclidean or not (i.e. flat or curved)
  • Space is either separate from Time, or parts of the same thing: space-time.
Most programmers today, in the age of Map apps, Geographic Information Systems, and geocoding, take the view that an entity such as a business or address is located at some location. The location ideally could be defined as a collection of regions defined by GPS coordinates. Often, the location is (over)simplified to a single point on a map.

While it is recognized that many problems exist with actual databases of geocoded entities, it is usually assumed that they are in the realm of epistemology rather than ontology. In other words, it is assumed the problem is with "our knowledge" due to inaccuracies in the set of GPS coordinates; not that locations don't actually have a definite set of coordinates.

However, not every entity that takes up space has a well-defined and unchanging mapping to a set of GPS coordinates.  ZIP codes, for example, are not defined in terms of geography but rather as collections of delivery routes. Another example, as shown in the title insurance case study below, is in real estate legal descriptions. In addition to a knowledge problem caused by ambiguous language used in these descriptions, they can also refer to ephemeral landmarks.

While a naive assumption that space is separate from time is often made in data model design, entities like ZIP codes and legal descriptions require a time dimension to be completely accurate. It turns out that the mapping of ZIP codes to postal routes changes several times a year.  And landmarks, referred to in property descriptions, can change location and shape over time.
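One way to give such an entity a time dimension is to store each mapping with the period during which it was valid, and to always ask "where, as of when?". Below is a minimal Python sketch of that idea. The class, the function, the route names and the dates are all invented for illustration; they are not real postal data.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class RouteAssignment:
    zip_code: str
    routes: frozenset[str]
    valid_from: date            # inclusive
    valid_to: Optional[date]    # exclusive; None means "still current"

def routes_on(history: list[RouteAssignment], zip_code: str, when: date) -> frozenset[str]:
    """Return the delivery routes that made up a ZIP code on a given day."""
    for row in history:
        started = row.valid_from <= when
        not_ended = row.valid_to is None or when < row.valid_to
        if row.zip_code == zip_code and started and not_ended:
            return row.routes
    return frozenset()  # no assignment is known for that day

# Made-up history: the same ZIP code covers different routes over time.
history = [
    RouteAssignment("30301", frozenset({"R001", "R002"}), date(2013, 1, 1), date(2013, 7, 1)),
    RouteAssignment("30301", frozenset({"R001", "R007"}), date(2013, 7, 1), None),
]

print(sorted(routes_on(history, "30301", date(2013, 3, 15))))   # ['R001', 'R002']
print(sorted(routes_on(history, "30301", date(2014, 2, 15))))   # ['R001', 'R007']
```

A query without a date has no single right answer here, which is the point.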

Case Study: TICOR Title Insurance System
OMEX was a startup that was an early pioneer in creating optical disk technology for data storage. It took on a contract to produce a computer system to support TICOR, the largest title insurance company in the U.S.  TICOR itself had the contract to keep backup copies of all the real estate transactions filed with Los Angeles county.  As a part of archiving copies of the documents, it was free to use the information in them, and hence support its business of providing title insurance.

The computer system was to replace using microfilm photos of the documents with optical disk storage of the images.  It would link these images with a structured database of information related to each property. One of the goals of the database was to enable answering basic questions about property locations.

The programmers, having a naive notion of how property boundaries were defined, were surprised to see that a common method is “metes and bounds”, which uses plain English descriptions based on landmarks. E.g. "beginning with a corner at the intersection of two stone walls near an apple tree on the north side of Muddy Creek road one mile above the junction of Muddy and Indian Creeks, north for 150 rods to the end of the stone wall bordering the road, then northwest along a line to a large standing rock on the corner of the property now or formerly belonging to John Smith, thence west 150 rods to the corner of a barn near a large oak tree, thence south to Muddy Creek road, thence down the side of the creek road to the starting point."

As can be seen, it would be difficult to translate this into a collection of GPS coordinates. But even if you did, you would not be done with the problem.  Like ZIP codes that change over time, the location and shape of creeks, rivers, etc. change over time. Lest you think this is a merely theoretical problem, for centuries, States have sued each other over land ownership due to border rivers migrating over time.[2]

Ultimately, the computer system wound up just using unstructured text fields to contain the legal description rather than the more ambitious GIS database they had originally promised.

[1] An Introduction to Ontology, Nikk Effingham, Polity Press, 2013
[2] A River Runs Thru It, How the States Got Their Shapes, History Channel, 2011

Tuesday, August 11, 2009

Faces with je ne sais quoi

A couple of years ago, I mused on the (possibly fatal) limitations of having Rationalism as the fundamental basis of the Semantic Web. Recently in the Communications of the ACM, an article added another example to the list of things that can be "known" but can't reliably be put into words. In Face Recognition Breakthrough[1], a new technique was presented that performed significantly faster and with more accuracy than traditional face-recognition techniques. The basis of the technique was so surprising and counter-intuitive that, as recently as 2007, papers about it were rejected by mainstream computer vision conferences.

In very simplistic terms, the new technique finds the most compact way to represent the pixels of a picture (of say a face) by throwing them at a random set of numbers and seeing what sticks. That compressed data is compared directly with compressed versions of other face pictures to find the closest match. (If you really want the gory math details, see this video lecture.[2]) What it DOESN'T do is all the traditional figuring out of where eyes and mouth and nose and ears are, and calculating the relationships between their locations, distances, etc. In other words, it doesn't work by analyzing a face into words/concepts (eye, nose, mouth, etc) and specifying relationships between them. It DOES do weird math using random numbers that is irrational in the literal sense of the word. And apparently this weird math not only works, it works better!

The 2007 rejection of the papers as presenting outlandish claims was based on the same bias that rationalism has: if you can't put something into words, much less rational arguments, it's not true, and it's not knowledge. Just as neural-nets do, the mechanics of sparse representation and compressed sensing encode "knowledge" in a form that is completely unintelligible to us humans when we look at the "raw data". And while techniques that DO use more human-reason-friendly ideas are available, they often don't work as well.
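For a feel of "compress with random numbers, then compare", here is a deliberately simplified Python sketch. It is NOT the sparse-representation method from the article; it swaps in a plain random projection followed by a nearest-neighbour search, and the "faces" are just lists of made-up pixel values. All names are my own.

```python
import math
import random

def random_projection(n_in: int, n_out: int, seed: int = 0) -> list[list[float]]:
    """Build an n_out x n_in matrix of Gaussian random numbers."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

def compress(matrix: list[list[float]], pixels: list[float]) -> list[float]:
    """Project a pixel vector onto each random row, giving a much shorter vector."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in matrix]

def closest(matrix: list[list[float]], gallery: dict, probe: list[float]) -> str:
    """Name of the gallery image whose compressed form is nearest the compressed probe."""
    target = compress(matrix, probe)
    return min(gallery, key=lambda name: math.dist(compress(matrix, gallery[name]), target))

# Toy data: three 64-pixel "faces" and a slightly noisy new photo of one of them.
rng = random.Random(1)
gallery = {name: [rng.random() for _ in range(64)] for name in ("alice", "bob", "carol")}
probe = [p + rng.uniform(-0.01, 0.01) for p in gallery["bob"]]

matrix = random_projection(64, 8)   # 64 pixels squeezed into 8 numbers
print(closest(matrix, gallery, probe))
```

Nothing in those 8 numbers corresponds to an eye, a nose or a mouth, yet the comparison still finds the right picture.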

[1] http://portal.acm.org/citation.cfm?id=1536616.1536623
[2] http://content.digitalwell.washington.edu/msr/external_release_talks_12_05_2005/15994/lecture.htm
Note: use IE browser; also you can skip forward past 46 minutes of theory to go directly to applications.




Wednesday, April 30, 2008

Your Pipe Inventory Record is not a Pipe

In this blog, I recently posted: Reality is the System of Record. Unfortunately, I just found a lost reminder to myself about a really good introductory example to use. Even though it didn't make it into the original post, it seems worth mentioning here anyway...

La trahison des images is a famous painting by Magritte which seems to pose a riddle.  It contains nothing but a pipe and the phrase, "This is not a pipe". However, it only seems enigmatic because the solution is too obvious.

As Art critic Robert Hughes explains in The Shock of the New, "This, indeed, is not a pipe. It is a painting; a work of art; a sign that denotes an object and triggers memory". As Magritte himself once remarked, "Of course it's not a pipe. Just try to fill it with tobacco".

In other words, it is a representation of a pipe, not to be mistaken for an actual one. And as obvious as that seems, computer system developers make the same mistake all the time. They do so when they forget that their Customer data table & business domain objects are not actual customers, but only representations, i.e. memories, i.e. copies of information.  So, as with all cached copies of external data, it is the duty of your so-called "system of record" to keep in sync with the real customer, in the real world, because Reality is the System of Record.

Thursday, April 17, 2008

Reality is the System of Record

"The system of record is the place where there is a definitive value for some unit of data... If you have no system of record for your bank account or if you have multiple systems of record for the same account, something is fundamentally wrong."
Bill Inmon, father of the data warehouse.[1]
Over my years of consulting in large enterprise environments, I've heard arguments over whose system is the "system of record" for some particular piece of data.  I have seen programmers officially acknowledge another system as the SOR, all the while building their own system as if it were.  I've also heard corporate developers say that something is a customer if and only if it has a record in the SOR (never mind what the customer thinks, nor if the record literally has the name "Donald Duck").  I've seen that same customer information system built with so little regard for mirroring the real world that it defined Frankenstein customers, some of whom were composed of parts from multiple actual people.  Owning an SOR seems to breed a certain lack of humility which (IMHO) could benefit from learning a little Philosophy.  It will teach you that you are never the system of record, and instead, merely a "faint copy" of one of many idiosyncratic conceptions of the world.

Representationalism

In modern Cognitive Science, there is the assumption that "the mind has mental representations analogous to computer data structures". The idea in Philosophy that "the mind perceives only mental images (representations) of material objects, not the objects themselves"[2] is called Representationalism (and more generally indirect realism), and it can trace its roots across 2400 years of Philosophy of Mind.  Socrates says (in Plato's Parable of the Cave) that most people only see shadows of puppets instead of reality. Aristotle said thoughts are likenesses of things, and words refer to things indirectly through thoughts.  Rene Descartes proposed that all sensory information is transmitted by the nerves to a central "theatre", where the soul makes contact with the physical body and watches it.  John Locke said, "The mind represents the external world, but does not duplicate it."  David Hume thought that ideas were "faint copies" of physical sensations.

Skepticism (i.e. the notion that we don't know what we think we know) is one of the oldest ideas in Philosophy, and the raison d'être for the entire branch called Epistemology (which asks how can we be sure we know what we think we know?).  They were born from the ancient realization that our formation of ideas, concepts, and representations out of our sensory input (hidden behind a "veil of perception") is a very inexact process, complete with optical illusions, dreams, hallucinations, color-blindness, double vision, etc, etc.  In Immanuel Kant's "Critique of Pure Reason", he argues that because our minds are hardwired to perceive the world in certain ways, they actively shape experience rather than passively record perceptions.  His claim was that our sense perception is effectively a pair of tinted glasses that we can't take off.  Because we've never seen the world without them, it takes effort to see the world as it really is.  One of the very reasons to practice Philosophy is to understand the true nature of things and avoid the "naive realism" of "common sense". Guides to this understanding are found in the branches of Metaphysics and Ontology which explore how to understand and model the world respectively.  One of the tactics of Epistemology to aid in this effort is the doctrine of Verificationism which says that a statement has no meaning if there is not a way to verify its truth.

For programmers, the big epiphany here should be that their business database is really just one particular representation of reality, not reality itself.  Reality is the System of Record. A developer should be humble, and realize that it is difficult to accurately model the world such that it integrates with the data models of other systems.  They also should be skeptical that their actual data is both accurate and up-to-date, developing ongoing mechanisms to actively verify each. 

Syncing the Systems of Record

When any system keeps copies of data for which it is not the system of record, that system is actually just a cached copy of that external data.  And like any cache, it has the responsibility to keep track of whether its copy is "stale", and to implement mechanisms to ensure "cache coherence". Once one realizes that reality is the system of record, it becomes clear that it is not enough to keep databases in sync with each other; they ALL must chase after the ever-changing state of the world.
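Treating a customer table as a cache of reality suggests at least recording when each row was last checked against the real world, and re-checking the oldest ones first. Here is a minimal Python sketch of that bookkeeping. The record fields, the 180-day threshold and the sample rows are assumptions of mine, not taken from any system described here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class CachedCustomer:
    customer_id: str
    address: str
    verified_at: datetime   # when this copy was last checked against the real customer

def is_stale(record: CachedCustomer, now: datetime, max_age: timedelta) -> bool:
    """A copy is stale once it has gone unverified for longer than max_age."""
    return now - record.verified_at > max_age

def due_for_reverification(records, now, max_age=timedelta(days=180)):
    """Stale records, longest-unverified first."""
    stale = [r for r in records if is_stale(r, now, max_age)]
    return sorted(stale, key=lambda r: r.verified_at)

now = datetime(2008, 4, 17)
records = [
    CachedCustomer("C1", "12 Main St", datetime(2008, 3, 1)),
    CachedCustomer("C2", "9 Elm Ave", datetime(2006, 1, 15)),
    CachedCustomer("C3", "4 Oak Rd", datetime(2007, 6, 30)),
]
print([r.customer_id for r in due_for_reverification(records, now)])   # ['C2', 'C3']
```

How the re-check is done (a letter, a phone call, a third-party feed) is a business decision. The sketch only shows that the database has to admit how old its knowledge is.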

In a large organization, each database is just one of many competing representations. In the same way that different people have different mental representations of their shared reality, different databases will each have their own slightly (or largely) incompatible data models, and sets of data values, for the same domain entities.  A side-effect of this is that many systems keep copies of external data that they've transformed in some way for their own use.  When systems don't realize that their SOR data represents the same real-world entities as other SORs do (e.g. patient DB versus employee DB), the eyesight of the entire enterprise goes out of focus as each data-set drifts apart.

CASE STUDY: ChoicePoint

ChoicePoint was an independent company prior to being bought out by Reed Elsevier in 2008.  It collected and combined data about businesses and individuals from a wide variety of sources, selling access to both private and public (i.e. government) clients.  While consulting there, I saw first hand how data was effectively only verified if someone phoned in to complain about inaccuracies. I also was informed by management that no access control mechanisms were to be included in the design of a new system, despite Congress exploring adding requirements for such.  They said that ChoicePoint wanted to be able to protest at the cost of adding the controls after the fact in the hope that it would defeat the requirements in the first place. There have been a whole series of lawsuits and government actions against ChoicePoint for out-of-date data, inaccurate data, and selling data to unauthorized buyers which cost it so much money that it had to be sold.

[1] The System of Record in the Global Data Warehouse, Bill Inmon, Information Management Magazine, May 2003
http://www.information-management.com/issues/20030501/6645-1.html

[2] Representationalism, Encyclopædia Britannica
http://www.britannica.com/EBchecked/topic/498476/representationism


Monday, May 14, 2007

Rationalism of the Semantic Web a Flaw?

In thinking about how knowledge is represented in the real world, it occurs to me that, in addition to the traditional rational representation via human and formal languages, there are at least two other biggies: DNA and neural nets.

Many, if not all, of the things we learn are represented in our brains as complex networks of (mathematical) functions that are simulated in computers via the technique of neural networks. And, if you have ever looked at the data associated with a neural net that has been trained to, say, recognize a handwritten letter "A", you've seen how there isn't any way to put that knowledge into words. In fact, it is not clear by simply looking at a net's matrix of fuzzy weights that it knows anything at all, much less what it knows!
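Even the smallest possible net shows the effect. The Python sketch below trains a single artificial neuron (a classic perceptron, chosen by me as the simplest stand-in for a neural net) to compute logical AND, then prints what it "knows".

```python
def predict(weights, bias, inputs):
    """Fire (1) if the weighted sum of the inputs plus the bias is positive."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train(samples, epochs=10):
    """Perceptron learning rule: nudge weights and bias by the error on each sample."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + error * x for w, x in zip(weights, inputs)]
            bias += error
    return weights, bias

# Truth table for AND: the only "teaching" the neuron gets.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train(AND_SAMPLES)
print(weights, bias)   # everything the trained neuron has learned
print([predict(weights, bias, inputs) for inputs, _ in AND_SAMPLES])
```

What comes out is three numbers. With a single neuron you can still work out by hand why they behave like AND, but nowhere is there a statement such as "both inputs must be true", and with thousands of weights the hand analysis stops being possible.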

In a similar vein, DNA obviously encodes tremendous amounts of knowledge, like how to make a beating heart, that are also very hard to translate into explicit "facts". In fact, DNA, like self-modifying source code in Lisp, is hard to verbalize because what it "does" can be very far removed from what it "says".

Now because the whole theory and strategy of representing knowledge via "facts", whether in a traditional database, or a semantic network, assumes that everything can be put into a fact format, there are huge amounts of knowledge in the world that can't be used. Yes, one can simplistically use a BLOB to stuff anything reducible to a number string into a database tuple, but that doesn't really put the knowledge into a form that can be "reasoned" about via inferencing rules. And it is that sort of inferencing that is the rationale behind the Semantic Web and why we should bother to create it.
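The contrast can be sketched in a few lines of Python. Below, knowledge in "fact format" is a set of (subject, predicate, object) triples, and two hand-written inference rules derive new facts from them. The predicate names, the rules and the sample facts are my own illustration and do not follow any particular Semantic Web standard.

```python
Fact = tuple[str, str, str]

def infer(facts: set[Fact]) -> set[Fact]:
    """Apply two rules repeatedly until nothing new appears:
       (a subclass_of b) and (b subclass_of c)  =>  (a subclass_of c)
       (x is_a a)        and (a subclass_of b)  =>  (x is_a b)
    """
    known = set(facts)
    while True:
        new = set()
        for (s, p, o) in known:
            if p not in ("subclass_of", "is_a"):
                continue  # the rules have nothing to say about other predicates
            for (s2, p2, o2) in known:
                if p2 == "subclass_of" and s2 == o:
                    new.add((s, p, o2))
        if new <= known:
            return known
        known |= new

facts = {
    ("swan", "subclass_of", "bird"),
    ("bird", "subclass_of", "animal"),
    ("Odette", "is_a", "swan"),
    # A trained net stuffed into the database as an opaque value (made-up token).
    ("letter_net", "has_weights", "blob:9f3c"),
}

for fact in sorted(infer(facts) - facts):
    print(fact)
# ('Odette', 'is_a', 'animal')
# ('Odette', 'is_a', 'bird')
# ('swan', 'subclass_of', 'animal')
```

The rules happily conclude that Odette is an animal. About the blob they conclude nothing, even though it may hold more hard-won knowledge than the other three facts put together.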

Is this a (fatal) flaw in the foundation of the Semantic Web? Many philosophers have claimed it is a fatal flaw in Rationalism which is the philosophical equivalent to semantic networks. Phenomenologists insisted that many other flavors of knowledge had to be handled beyond the ones covered in rationalism and logic. How will the semantic web handle these? Is it even possible philosophically?