Random sampling to save money
I was stunned when a client told me that a database query I had asked them to run would cost the company $100,000 per year. I had framed my question in the most natural way, not realizing that at the company's scale it would be worth spending some time thinking about the query.
Things have somewhat come full circle. When computers were new, computer time was expensive and programmer time was dirt cheap by comparison. Someone might be scolded for using computer time to do what a programmer could do. Now of course programmer time is expensive and computer time is cheap. Usually.
But when dealing with gargantuan data sets, the cost of computer time might matter a great deal, especially when renting services from a cloud provider. Maybe you can find a more clever way to run an enormous query. Or even better, maybe you can find a way to avoid running an enormous query.
One way to save time and money is to base decisions on random samples rather than exhaustive queries. This requires a little preparation. How big a sample do you need? How exactly are you going to take a sample? What kind of uncertainty do you have in your result when you're done? You can afford to think about these questions for a long time if it saves tens of thousands of dollars per year.
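To make those three questions concrete, here is a minimal sketch for one simple case: estimating what fraction of rows satisfy some condition. It is an illustration under stated assumptions, not the analysis from the client project. It uses the textbook normal approximation for a proportion, the function names are made up for this example, and the `population` list is synthetic data standing in for a large table.

```python
import math
import random
from statistics import NormalDist


def sample_size_for_proportion(margin, confidence=0.95, p_guess=0.5):
    """Rows needed so a proportion estimate is within +/- margin.

    Uses the normal approximation; p_guess=0.5 is the worst case.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z**2 * p_guess * (1 - p_guess) / margin**2)


def estimate_proportion(sample, confidence=0.95):
    """Return (estimate, low, high) for the fraction of True values."""
    n = len(sample)
    p_hat = sum(sample) / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Hypothetical stand-in for a huge table: one boolean per row, about 30% True.
rng = random.Random(42)
population = [rng.random() < 0.3 for _ in range(1_000_000)]

# How big a sample? Enough for +/- 1 percentage point at 95% confidence.
n = sample_size_for_proportion(margin=0.01)

# How to take it? A simple random sample without replacement.
sample = rng.sample(population, n)

# What uncertainty remains? An approximate confidence interval.
p_hat, low, high = estimate_proportion(sample)
print(f"sampled {n} of {len(population)} rows: {p_hat:.3f} ({low:.3f} to {high:.3f})")
```

Note that the required sample size depends on the precision you want, not on the size of the table, which is why sampling pays off most at large scale. In practice you would want the database to do the sampling so that only the sampled rows are scanned; some systems provide a `TABLESAMPLE` clause for this, though what it guarantees and how it is billed vary by system.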
The post Random sampling to save money first appeared on John D. Cook.