Definition By Discussion
Organization X starts up their business, gets a big server and a database program, and runs their transactions through it. Their database is tuned to keep track of things like customer contacts and double-entry accounting.
After their database gets really full, they want to ask very hard questions of it. For example: "Based on our demographics, what is the best route for our salespeople to take through a given state, scheduling the most lucrative contacts on the days when their previous buying patterns show that they are most likely to be at home?"
Organization X now connects with Organization Y, who goes into the business of "data mining". They claim they can find new ways to read very large swaths of legacy data, to extract more value from it. Along the way, they may find themselves trying to access information in ways the database was not tuned for. They also may encounter requests to access information in ways no known database will allow.
Hours of fun!
Another view by a data mining purist:
The specific problem above asks for a solution to a variant of the travelling salesman problem (TSP) on a large database. TSP is NP-complete (very hard), even without the practical aspect of legacy systems getting in the way. The legacy system is not what makes it data mining, nor is the solving of NP-complete problems. What makes it data mining is trying to determine from a set of data, designed originally to answer one question (Who gets paid how much? -- a billing system), other information that you wish you had collected at the time (Which customers are home when?).
It is also data mining when you try to arrive at a judgement of who the "most lucrative contacts" are. Bear in mind that "most lucrative contacts" includes small-volume customers who buy every month and chew up no salesman time with dumb questions, provided they are home when you arrive. "Most lucrative contacts" also includes delivering the ordered case of beer/wine/pizza once a year to the holiday house of the CEO of your most valuable corporate beer/wine/pizza customer. You could just send the truck, but sending a salesman wins every time. Now, data mining is not going to do this for you, but when building the huge bureaucratic automated system, at least part of the trick is still allowing the people on the ground to add the special tweaks.
Data mining often occurs when patterns must be extracted from noisy, loosely correlated data. Solving problems, even NP-complete ones, on noise- and error-free data is, by comparison, easy.
Imagine trying to solve a variant of the standard travelling salesman problem where, instead of exact coordinates for the towns, you were given inaccurate estimates supplied by random citizens surveyed in the streets. To get the best answer, you cannot simply apply the standard algorithm; you must also form a judgement of the accuracy of each data point.
This is hours (years!) of fun!
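A minimal sketch of that judgement step, in Python, assuming (purely for illustration) that each town's location arrives as several noisy survey estimates with a guessed reliability: weight the estimates before handing the cleaned-up coordinates to an ordinary tour heuristic (here a plain nearest-neighbour pass, not an exact TSP solver). The town names and numbers are invented.

 import math
 from collections import defaultdict

 # Hypothetical survey data: (town, estimated_x, estimated_y, reliability 0..1)
 surveys = [
     ("Springfield", 10.2,  4.9, 0.9),
     ("Springfield",  9.1,  6.0, 0.3),
     ("Shelbyville", 22.5, 14.8, 0.8),
     ("Shelbyville", 25.0, 15.5, 0.5),
     ("Ogdenville",   3.3, 19.7, 0.7),
 ]

 # Judge each data point: here, a reliability-weighted mean per town.
 sums = defaultdict(lambda: [0.0, 0.0, 0.0])
 for town, x, y, w in surveys:
     sums[town][0] += w * x
     sums[town][1] += w * y
     sums[town][2] += w
 towns = {t: (sx / sw, sy / sw) for t, (sx, sy, sw) in sums.items()}

 def dist(a, b):
     return math.hypot(a[0] - b[0], a[1] - b[1])

 # Ordinary nearest-neighbour tour over the cleaned coordinates.
 def tour(towns, start):
     unvisited = set(towns) - {start}
     route, here = [start], start
     while unvisited:
         here = min(unvisited, key=lambda t: dist(towns[here], towns[t]))
         route.append(here)
         unvisited.remove(here)
     return route

 print(tour(towns, "Springfield"))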
The data mining in which I have participated ran along somewhat more abstract lines. In one place, we had a large amount of billing data accumulated over years. My task was to see if there was anything about the data that was "unusual" or "interesting" or which would render their business assumptions invalid. I found a number of correlations and statistical profiles that challenged their business assumptions (as well as some that proved to be eye-openers for new opportunities).
In another environment, we took 3 dissimilar business data sets (car rental, hospitality, and real estate finance) and also went looking for the unexpected. Yes, we also did the usual demographic slicing, but one of the goals was to recognize new patterns that would be of use to the business enterprise.
To understand data mining, it can be helpful to think about its impedance mismatch with the scientific and statistical cultures.
Much of data mining is about leveraging existing data to make useful predictions. It is different from conventional statistical prediction because data miners don't care much about parsimony, so they don't care about covariance. Where two or more variables have similar predictive qualities (i.e. contain equivalent information), a statistician will endeavour to choose the better of the two covariant parameters for their model. A data miner will happily bung them both in if together they improve the model's predictive power even slightly. This is because statisticians think good predictive models are ones that both predict well and hint at the underlying causalities. A data miner doesn't care about anything except good prediction (for example, what's going to be the most profitable or least risky way to spend my limited resources). That's why data miners are happy with complicated but effective neural nets, which can creep statisticians out. The universe might be mathematical, but Mother Nature is a data miner, not a statistician.
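A small illustration of that difference, as a hedged Python sketch with numpy and invented numbers: x1 and x2 are two noisy measurements of the same underlying quantity, so a parsimonious statistician might keep only one of them; bunging both into a least-squares fit still nudges the held-out error down a little, which is all the data miner asks of it.

 import numpy as np

 rng = np.random.default_rng(0)
 n = 2000
 latent = rng.normal(size=n)                    # the real driver of the outcome
 x1 = latent + rng.normal(scale=0.5, size=n)    # two covariant measurements
 x2 = latent + rng.normal(scale=0.5, size=n)    # of the same thing
 y = 3.0 * latent + rng.normal(scale=0.5, size=n)

 def holdout_rmse(X, y, split=1000):
     Xtr, Xte, ytr, yte = X[:split], X[split:], y[:split], y[split:]
     coef, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
     resid = yte - Xte @ coef
     return float(np.sqrt(np.mean(resid ** 2)))

 ones = np.ones((n, 1))
 parsimonious = np.column_stack([ones, x1])        # statistician: one of the pair
 kitchen_sink = np.column_stack([ones, x1, x2])    # data miner: bung both in

 print("x1 only   ", holdout_rmse(parsimonious, y))
 print("x1 and x2 ", holdout_rmse(kitchen_sink, y))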
In fact, prediction is not all that a data miner cares about. Much data mining involves extracting, structuring, and restructuring data. A data miner cares greatly about how to extract the data and structure it - sometimes the data is so well structured that little prediction is required; other times several predictions are required in the analysis of the data by a statistics expert. Questions such as "what use is the data to us once we extract it, and how will we make it useful to customers?" should matter most to a data miner. Prediction is just one piece of the puzzle.
Restructuring data tends to be a necessary precursor step to data mining, but I expect most people (including me) wouldn't consider it to be 'Data Mining' itself any more than we'd consider driving to people's houses and obtaining signatures to be the work of a plumber.
Bringing Mother Nature into a conversation about data mining is about as useful as bringing "God" up in a conversation about databases. Who and what is "Mother Nature", and what makes you think Mother Nature is not the universe itself? What do God and Mother Nature have to do with data mining, and likewise, what do Jesus and the Sun or the Clouds have to do with Unix? And who might care?
I'd infer the author is alluding (by analogy) to evolution and the fact that genetic fitness - that is, the degree to which the probability of surviving and breeding is determined by genetics - isn't about any single 'best' predictor but rather 'bungs' everything together to form one. What does this have to do with Unix? Well, the survival of Unix is certainly determined by a very wide number of parameters - there is no single metric by which to determine the quality or market-fitness of an operating system.
As discussed above, data miners are also sent on fishing expeditions to discover interesting things: descriptive, as opposed to predictive, models. "There's gold in them thar hills," they say, then don their alchemical robes and disappear. They do analysis to discover theories, which can creep scientists out (who are more comfortable doing analysis to evaluate theories they thought of themselves, clever monkeys). The scientific fetish for parsimony (an artefact of reductionism) can lead scientists to be flummoxed by complexity. When dealing with complex patterns in data, a data miner might use algorithms that deliver complicated but useful models. However, leveraging these models can involve articulating them to people with reductionist proclivities, so it's not uncommon to "model the model" with something that is easy to explain - for example, make a decision tree that can be graphically presented on one side of an envelope instead of pages of code (even if the code outperforms the tree).
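A hedged sketch of "modelling the model", in Python with scikit-learn and invented data: fit whatever complicated model predicts best, then fit a shallow decision tree to that model's own outputs so there is something that fits on the back of an envelope. The feature names and data are made up for the example.

 import numpy as np
 from sklearn.ensemble import RandomForestRegressor
 from sklearn.tree import DecisionTreeRegressor, export_text

 rng = np.random.default_rng(0)
 X = rng.uniform(size=(1000, 3))      # invented customer features
 y = np.sin(3 * X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=1000)

 # The complicated-but-effective model the data miner actually uses.
 complex_model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

 # The surrogate: a depth-2 tree trained to mimic the complex model's predictions.
 surrogate = DecisionTreeRegressor(max_depth=2, random_state=0)
 surrogate.fit(X, complex_model.predict(X))

 # Something a reductionist can read on one side of an envelope.
 print(export_text(surrogate, feature_names=["recency", "frequency", "spend"]))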
Data mining is an applied science; the machine learning branch of computer science underpins it.
I do data mining for a living and I find the above description of data mining offensive, hallucinogenic, erroneous, and ludicrous. Data mining just involves studying and republishing data that was already there. It can involve complicated techniques to extract data, since the data is not usually in a very structured form. Scientists tend to want to study new (or existing) theories about the universe, while data miners tend to mine data that came from humans. For example, a data miner does not go around mining information from camels' and elephants' mouths - instead, data miners tend to focus on a branch of human information statistics. Examples are mining data from streams of text, phone calls, banking transactions, internet websites, encyclopedias, books (optical character recognition), etc. Therefore some data mining is simply a form of analysis (sometimes very statistical) of information - and is not black magic. No statistics are foolproof; all contain estimations and possible errors - data mining is no exception. Some data mining involves republishing data, combined with statistical analysis. For example, a search engine mines data and then republishes a small summary of the original data (existing copyrighted content). None of this is creepy - data mining is just humans analyzing and republishing data that was already there, in a different form.
I can only surmise you deeply misunderstood the above author's statements. Data miners are essentially in the business of statistics-based compression of data into 'theories' (i.e. from the dataset, P has a 95% chance of predicting Q). They tend to verify such theories by splitting data sets randomly - e.g. half for producing rules, half for testing them (sketched below). These 'theories' are, themselves, the products of a data mining exercise. Scientists often take a similar approach (e.g. viewing a graph of data before constructing a hypothesis), so the above author's characterization of that aspect is a bit unfair - but, still, it is correct that scientists favor parsimony (Occam's Razor): they attempt to invent a simple theory via an intermediate model based on existing observations, then test it against future observations. A scientist's model will tend to be well factored, e.g. drop observations into the model, and out pops a predicted value. By comparison, the 'theories' that come out of a data mining exercise can be complicated or poorly factored. Suppose the data matches (precisely) a simple 'model' a^2 + b^2 = c^2 that a scientist or statistician, or even a kid with a trigonometry background, would discover immediately. Many techniques for automated mining of data are far from perfect and will produce, from this data, many if not dozens of formulas dealing with various ranges of a and b to predict various ranges of c. And training a neural network on this data will be complex 'black magic' from a scientist's perspective - you can't get a useful theory back out of a neural network. The advantage of data mining is that these very same tools and techniques can make useful predictions even where scientists are unable to provide useful theories or models - i.e. the job gets done even if a scientist finds the solution ugly, hackish, and abhorrent under Occam's Razor.
And while the above author said nothing about elephants and camels, there is no reason the same tools of data mining couldn't be used to relate dung readings of camels and elephants to the foods they have eaten recently (it may even be useful and economical - dung is a valuable commodity), or to predict their migration patterns, both natural and in the presence of humans. Data mining isn't about human statistical analysis - that's simply where it is most popularly applied.
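A minimal sketch of the split-half verification mentioned above, in pure Python with made-up transaction records: mine the confidence of the rule "P predicts Q" on one random half of the records, then check whether it holds up on the other half.

 import random

 # Invented records: does buying P predict buying Q?
 random.seed(0)
 records = [{"P": random.random() < 0.5} for _ in range(2000)]
 for r in records:
     # Q mostly follows P, with some noise -- the pattern we hope to mine.
     r["Q"] = r["P"] if random.random() < 0.9 else not r["P"]

 random.shuffle(records)
 half = len(records) // 2
 train, test = records[:half], records[half:]

 def confidence(rows):
     with_p = [r for r in rows if r["P"]]
     return sum(r["Q"] for r in with_p) / len(with_p)

 # 'Theory' produced on one half, verified on the other.
 print("confidence of P => Q on training half:", round(confidence(train), 3))
 print("confidence of P => Q on testing half: ", round(confidence(test), 3))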
What I’m about to describe has probably been done. If not, it should be. Imagine a computer program that scours scholarly books and periodicals for references. Suppose I refer to essays A, B, and C in my published essay. Each essay, including mine, would be represented by a dot. There would be a line from my essay to A, to B, and to C. Each reference in essay A would have a line from it to A, and so on. Suppose the dots varied in size depending on how many lines they had connected to them. Obviously, scholarly space would be large and messy, but you would be able to see which essays are most often referred to, and therefore most influential. (Influence in this context doesn’t mean agreement with; it means affected by.) I suspect that works such as John Rawls’s A Theory of Justice (1971) and Robert Nozick’s Anarchy, State, and Utopia (1974) would appear as planets in scholarly space, since they have been referred to many times. Most essays would be visible, if at all, as specks of dust. If anyone is aware of a website that contains this sort of thing, please let me know.
Whilst it doesn't use a graphical visualisation like you've described, there is citeseerx.ist.psu.edu.
I expect the clustering algorithm - that tries to place essays near one another in the visual model - would be non-trivial.
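A rough sketch of the bookkeeping behind such a picture, in pure Python with invented essay names: record who cites whom, then size each dot by its in-degree. Laying the dots out so that related essays cluster together is, as noted, the genuinely hard part and is not attempted here.

 from collections import defaultdict

 # Invented citation data: essay -> essays it refers to.
 cites = {
     "MyEssay": ["A", "B", "C"],
     "A":       ["RawlsTheoryOfJustice"],
     "B":       ["RawlsTheoryOfJustice", "NozickAnarchyStateUtopia"],
     "C":       ["RawlsTheoryOfJustice", "A"],
 }

 # In-degree: how many lines point at each dot.
 in_degree = defaultdict(int)
 for essay, refs in cites.items():
     in_degree.setdefault(essay, 0)
     for ref in refs:
         in_degree[ref] += 1

 # The planets of scholarly space rise to the top; the specks of dust sink.
 for essay, n in sorted(in_degree.items(), key=lambda kv: -kv[1]):
     print(f"{essay:28s} {'*' * (n + 1)}")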
Curiously, the term Data Mining is all about extracting Real Information, whereas the term Information Technology is all about handling data. --Jon Grover
See original on c2.com