Big data is a promising investment for firms, but embracing data can also bring confusion and potential minefields — everything from where companies should be spending money to how they should be staffing their data teams.
MIT adjunct professor Michael Stonebraker, a computer scientist, database research pioneer, and Turing Award winner, said he sees several things companies should do to build their data enterprises and, just as importantly, mistakes companies should cease or avoid.
In a talk last fall as part of the 2019 MIT Citi Conference, Stonebraker borrowed a page from David Letterman to offer 10 big data blunders he’s seen in the last decade or so. His (sometimes opinionated!) advice comes from discussions with tech and data executives during decades in the field, as well as his work with several data startups.
Here’s Stonebraker’s list, including a bonus tip.
Blunder #1: Not moving everything to the cloud.
Companies should move their data out of the building and into a public cloud, or purchase a private cloud, Stonebraker said. Why? Firms like Amazon offer cloud storage at a fraction of the cost and with better infrastructure, often with tighter security and staff that specialize in cloud management for a living.
“They're deploying servers by the millions; you're deploying them by the tens of thousands,” Stonebraker said. “They're just way further up the cost curve and are offering huge economies of scale.”
Clouds also offer elasticity — with a cloud, your company can use a thousand servers to run end-of-the-month numbers, and a scaled-back amount for everyday tasks.
Blunder #2: Not planning for artificial intelligence and machine learning to be disruptive.
Machine learning is already remaking a variety of industries, Stonebraker said, and it is going to replace some workers. “The odds that it is not disruptive in financial services is zero,” he said.
In light of this, companies should avoid being disrupted and instead be the disruptor. This means paying for AI and machine learning expertise, which is in short supply. “There’s going to be an arms race,” he said of the competition to hire talent. “Get going on it as quick as you can.”
Blunder #3: Not solving your real data science problem.
Leaders often feel like they are on top of data science, and things like algorithm development, because they’ve hired data scientists. But data scientists typically spend most of their time analyzing and cleaning data and integrating it with other sources, Stonebraker said.
For example, a machine learning expert at iRobot told Stonebraker that she spent 90% of her time working on data discovery, integration, and cleaning. Of the 10% left of her time, she said she spent 90% of that fixing data cleaning errors — which left about 1% of her time to the job she was hired for, Stonebraker said.
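The arithmetic behind that 1% figure, written out using only the two percentages quoted above:

```latex
\[
\underbrace{(1 - 0.90)}_{\text{time left after discovery, integration, cleaning}}
\times
\underbrace{(1 - 0.90)}_{\text{share of that time not spent fixing cleaning errors}}
= 0.10 \times 0.10 = 0.01 = 1\%
\]
```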
These tasks are important — “without clean data, or clean enough data, your data science is worthless,” he said.
But it’s also important to realize how data scientists are actually spending their time. “They are in the data integration business, and so you might as well admit that that’s what your data scientists do,” he said. The best way to address this, he said, is to have a clear strategy for dealing with data cleaning and integration, and to have a chief data officer on staff.
Blunder #4: Believing that traditional data integration techniques will solve blunder #3.
Many err in the belief that traditional solutions will help address data cleaning and integration, Stonebraker said, specifically ETL (extract, transform, load) and master data management processes. The ETL process requires intensive human effort, Stonebraker said, and takes a lot of time and gets too expensive if you have more than 20 data sources. These processes also require a global data model at the outset, while today’s enterprises are agile and evolve quickly. The technology is brittle and not going to scale, he said.
Once you’ve run ETL, you need to match records to find out which ones go together, often using rule systems, which also don’t scale.
As an example, Stonebraker pointed to General Electric, which wanted to classify 20 million spending transactions. Its staff initially wrote about 500 rules, which took care of classifying only about 10% of the transactions, or roughly 2 million. GE partnered with Tamr, an enterprise data mastering company co-founded by Stonebraker. Tamr built a machine learning model to classify the remaining 18 million records.
“Machine learning is going to take over in this space,” Stonebraker said. “It’s okay to use rules to generate training data. Don’t try to use it for big problems.”
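To make the "rules to generate training data" idea concrete, here is a minimal Python sketch. It is not Tamr's or GE's method: the transactions, the keyword rules, and the word-overlap scoring are all invented for illustration, and a real system would use a proper machine learning model. The point is only the shape of the workflow: a few rules label what they can, and a model learned from those labels handles records the rules miss.

```python
from collections import Counter, defaultdict

# Hypothetical spending transactions (invented for illustration).
transactions = [
    "dell laptop purchase",
    "airline ticket chicago",
    "hotel stay chicago",
    "laptop docking station",
    "taxi to chicago airport",
    "office catering lunch",
]

# Hypothetical hand-written rules: keyword -> category.
# They only cover records that contain one of these exact keywords.
RULES = {"dell": "hardware", "airline": "travel", "hotel": "travel"}


def apply_rules(text):
    """Return the category of the first matching rule, or None."""
    words = set(text.split())
    for keyword, category in RULES.items():
        if keyword in words:
            return category
    return None


# Step 1: let the rules label whatever they can. This becomes training data.
labeled = [(text, apply_rules(text)) for text in transactions]
training = [(text, cat) for text, cat in labeled if cat is not None]
unlabeled = [text for text, cat in labeled if cat is None]

# Step 2: "train" a toy model by counting the words seen in each category.
word_counts = defaultdict(Counter)
for text, category in training:
    word_counts[category].update(text.split())


def predict(text):
    """Pick the category whose training words overlap most with the text.

    Returns None when no word overlaps, so the record can go to a human.
    """
    scores = {
        category: sum(counts[word] for word in text.split())
        for category, counts in word_counts.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Step 3: classify the records the rules could not reach.
predictions = {text: predict(text) for text in unlabeled}
for text, category in predictions.items():
    print(f"{text!r} -> {category}")
```

In this toy data the rules label three of the six records. The learned word counts then classify two more, and the last record, which shares no words with the training data, is left unclassified for a person to review.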
Blunder #5: Believing data warehouses will solve all your problems.
Data warehouses can solve some big data problems — but not all of them. Warehouses don’t work for things like text, images, and video, Stonebraker said. Instead, use data warehouses for what they’re good for: customer-facing, structured data from a few data sources.
Many companies have bought into traditional data warehouse technology that costs up to seven figures a year, Stonebraker said. “Get rid of the high-price spread and just remember, always, that your warehouse is going to move to the cloud,” he said.
Blunder #6: Believing that Hadoop/Spark will solve all your problems.
Many companies have invested in Hadoop, the open-source software collection from Apache, or Spark, Apache’s analytics engine for big data processing. Neither should be the answer for everything or everyone, Stonebraker said.
“In my opinion you should be looking at best-of-breed technologies, not the lowest common denominator,” Stonebraker said. This is especially true for high-level functions, or a company’s “secret sauce,” the special elements that are the key to success. “Spark and Hadoop are useless on data integration,” Stonebraker asserted, which is where data scientists spend a lot of time.
What do you do with your Hadoop/Spark cluster? Companies can repurpose it for a data lake or for data integration, or even throw it away — the lifetime for most hardware is about three years.
Blunder #7: Believing that data lakes will solve all your problems.
Conventional wisdom assumes that if a company loads all its data into a data lake — a centralized repository for all data — they’ll be able to correlate all their data sets. But they often end up with data swamps, not data lakes.
Independently constructed data sets are never “plug-compatible,” Stonebraker said. Things like semantics, units, and time granularity might not match: one data set might call something salary, the other wages; one might use euros, the other dollars; one might have gross salary before taxes, the other net. Duplicates have to be removed, and spellings might vary from one set to another.
For example, Stonebraker said, human resources databases need to account for employees working in two different locations. If the two sets of records are simply added together, staff will be overcounted by the number of duplicates. “The net result is your analytics will be garbage, and your machine learning models will fail. Garbage in, garbage out,” he said.
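The overcounting problem can be illustrated with a short Python sketch. The records below are invented, and treating a normalized email address as the employee identifier is an assumption made for this example. Real curation systems have to match records without such a convenient key.

```python
# Hypothetical HR extracts from two offices (invented for illustration).
boston = [
    {"name": "Ana Ruiz", "email": "ana.ruiz@example.com"},
    {"name": "Ben Cho", "email": "ben.cho@example.com"},
]
chicago = [
    # Same person as in Boston, entered with different spelling and casing.
    {"name": "ana ruiz ", "email": "Ana.Ruiz@example.com"},
    {"name": "Dee Park", "email": "dee.park@example.com"},
]


def employee_key(record):
    """Assumption: a trimmed, lowercased email identifies one employee."""
    return record["email"].strip().lower()


# Naive approach: add the two data sets together.
naive_headcount = len(boston) + len(chicago)

# Curated approach: keep one record per employee key.
unique_employees = {employee_key(r): r for r in boston + chicago}
deduped_headcount = len(unique_employees)

# The naive count is too high by exactly the number of duplicates.
overcount = naive_headcount - deduped_headcount

print(naive_headcount, deduped_headcount, overcount)
```

Here the naive headcount is 4 and the deduplicated headcount is 3, so the simple sum overcounts staff by the one duplicate record.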
Companies need to clean their lake data with a data curation system that will solve these problems. “This problem has been around since I’ve been an adult and it’s getting easier by applying machine learning and modern techniques,” Stonebraker said (see Blunder #4), but it's still not easy and companies should put their best staff on the problem. “Don’t use your homebrew system,” he said of in-house technology, which is often outdated. Usually the best data curation systems come from startups, he said.
Blunder #8: Outsourcing your new stuff to big data analytics services firms like Palantir, IBM, and Mu Sigma.
Typical enterprises spend about 95% of the IT budget on running legacy code, like maintenance, and often have their best people keeping the lights on. “Shiny new stuff gets outsourced, often because there is no appropriate talent internally, or because your best people are stuck keeping your accounts receivable system up,” Stonebraker said. This is a bad outcome — maintenance is boring, he said, and so creative people quit, and companies lose talent that could be working on new things.
New tools shouldn’t be outsourced, Stonebraker said. Other things should, like maintenance. And while you’re at it, don’t run your own email system, he said.
Blunder #9: Succumbing to the “Innovator’s Dilemma.”
In his classic book “The Innovator’s Dilemma,” Harvard Business School professor Clayton Christensen said successful companies know when to abandon legacy systems, even if it means drastic changes or potentially losing customers.
“You have to make all kinds of bets on the future, and a bunch of them are going to require you to give up at least some piece of your current business model and reinvent yourself,” Stonebraker said. “You simply have to be willing to do that in any high tech field.”
Blunder #10: Not paying up for a few rocket scientists.
To address all of the above issues, companies need to invest in some highly skilled employees, Stonebraker said. Human resources won’t like what you’re paying, and “they’re not going to wear suits,” Stonebraker said, but don’t drive them away. “They will be your guiding lights.”
(Bonus) Blunder #11: Working for a company that is not trying to do something about the “sins of the past.”
If you work for a company that’s falling into any of the above blunders, then you should be fixing it or looking for a new job, Stonebraker said.