Strategic Insights and Clickworthy Content Development

Category: Big Data

How Valuable Is Your Company’s Data?

Companies are amassing tremendous volumes of data, which they consider their greatest asset, or at least one of their greatest assets. Yet, few business leaders can articulate what their company’s data is worth.

Successful data-driven digital natives understand the value of their data, and their valuations depend on sound applications of that data. Increasingly, venture capitalists, financial analysts, and board members will expect the leaders of startups, public companies, and other organizations to explain the value of their data in terms of opportunities, top-line growth, bottom-line improvement, and risk.

For example, venture capital firm Mercury Fund recently analyzed SaaS startup valuations based on market data that its team has observed. According to Managing Director Aziz Gilani, the team confirmed that SaaS company valuations, which range from 5x to 11x revenue, depend on the underlying metrics of the company. The variable that determines whether those companies land in the top or bottom half of the spectrum is the company’s annual recurring revenue (ARR) growth rate, which reflects how well a company understands its customers.

Mercury Fund’s most successful companies scrutinize their unit economics “under a microscope” to optimize customer interactions in a capital-efficient manner and maximize their revenue growth rates.

For other companies, the calculus is far less straightforward; in fact, it can be very complicated.

Direct value

When business leaders and managers ponder the value of data, their first thought is usually direct monetization: selling the data they already have.

“[I]t’s a question of the holy grail because we know we have a lot of data,” said David Schatsky, managing director at Deloitte. “[The first thought is] let’s go off and monetize it, but they have to ask themselves the fundamental questions right now of how they’re going to use it: How much data do they have? Can they get at it? And, can they use it in the way they have in mind?”

Data-driven digital natives have a better handle on the value of their data than the typical enterprise because their business models depend on collecting data, analyzing that data and then monetizing it. Usually, considerable testing is involved to understand the market’s perception of value, although a shortcut is to observe how similar companies are pricing their data.

“As best as I can tell, there’s no manual on how to value data but there are indirect methods. For example, if you’re doing deep learning and you need labeled training data, you might go to a company like CrowdFlower and they’d create the labeled dataset and then you’d get some idea of how much that type of data is worth,” said Ben Lorica, chief data officer at O’Reilly Media. “The other thing to look at is the valuation of startups that are valued highly because of their data.”

Observation can be especially misleading for those who fail to consider the differences between their organization and the organizations they’re observing. The business models may differ, the audiences may differ, and the amount of data the organization has and the usefulness of that data may differ. Yet, a common mistake is to assume that because Facebook or Amazon did something, what they did is a generally-applicable template for success.

However, there’s no one magic formula for valuing data because not all data is equally valuable, usable or available.

“The first thing I look at is the data [a client has] that could be turned into data-as-a-service and if they did that, what is the opportunity the value [offers] for that business,” said Sanjay Srivastava, chief digital officer at global professional services firm Genpact.

Automation value

More rote, repeatable tasks are being automated using chatbots, robotic process automation (RPA), and AI. The question is: what is the value of the work employees do in the absence of automation, and what would their work be worth if parts of their jobs were automated and they had more time for higher-value tasks?

“That’s another shortcut to valuing the data that you already have,” said O’Reilly’s Lorica.
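As a rough, hypothetical illustration of that shortcut, the value of automating part of a rote task can be approximated from the hours it frees and what that time costs or could earn; every figure in the sketch below is invented.

```python
# Back-of-the-envelope sketch of automation value (all figures are hypothetical).
employees = 40                      # staff performing the rote task
hours_per_week = 6                  # hours each spends on the task today
automation_rate = 0.7               # share of those hours a bot could absorb
loaded_hourly_cost = 55.0           # fully loaded cost per employee-hour (USD)
uplift_per_redeployed_hour = 20.0   # extra value created when time shifts to higher-value work
weeks_per_year = 48

hours_freed = employees * hours_per_week * automation_rate * weeks_per_year
labor_value_recovered = hours_freed * loaded_hourly_cost
redeployment_uplift = hours_freed * uplift_per_redeployed_hour

print(f"Hours freed per year: {hours_freed:,.0f}")
print(f"Labor value recovered: ${labor_value_recovered:,.0f}")
print(f"Additional value from redeployed time: ${redeployment_uplift:,.0f}")
```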

Recombinant value

Genpact also advances the concept of “derivative opportunity value,” which means creating an opportunity or an entirely new business model by combining a company’s data with external data.

For example, weather data by zip code can be combined with data about prevalent weeds by zip code and available core seed attributes by zip code. Agri-food companies use such data to determine which pesticides to use and to optimize crops for a specific region.

“The idea is it’s not just selling weather data as a service, that’s a direct opportunity,” said Srivastava. “The derivative opportunity value is about enhancing the value of agriculture and what value we can drive.”
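As a minimal sketch of that kind of recombination, the snippet below joins three hypothetical datasets on zip code; the file names and columns are assumptions for illustration, not a description of any vendor’s actual data.

```python
import pandas as pd

# Hypothetical datasets keyed by zip code (file names and columns are illustrative).
weather = pd.read_csv("weather_by_zip.csv")        # zip_code, rainfall_mm, avg_temp_c
weeds = pd.read_csv("weeds_by_zip.csv")            # zip_code, dominant_weed
seeds = pd.read_csv("seed_attributes_by_zip.csv")  # zip_code, seed_variety, drought_tolerance

# Blend the three sources into one view per zip code.
combined = (
    weather.merge(weeds, on="zip_code", how="inner")
           .merge(seeds, on="zip_code", how="inner")
)

# A simple derived signal an agri-food analyst might compute from the blended data.
combined["drought_risk"] = (combined["rainfall_mm"] < 300) & (combined["drought_tolerance"] < 3)
print(combined.head())
```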

It is also possible to run an A/B test with and without a new dataset to quantify the value the new data adds to the mix.
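One hedged way to run that comparison is to train the same model twice, once with and once without the candidate data, and compare performance on a holdout set; the dataset names and columns below are assumptions, and the features are presumed numeric.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

internal = pd.read_csv("customers.csv")       # hypothetical internal dataset with a "churned" label
external = pd.read_csv("external_data.csv")   # hypothetical purchased dataset keyed by customer_id
enriched = internal.merge(external, on="customer_id", how="left")

def holdout_auc(frame, target="churned"):
    """Train on 70% of the rows and score on the rest (assumes numeric feature columns)."""
    X = frame.drop(columns=[target, "customer_id"])
    y = frame[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    model = GradientBoostingClassifier().fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

print(f"AUC without the new data: {holdout_auc(internal):.3f}")
print(f"AUC with the new data:    {holdout_auc(enriched):.3f}")
# The lift, translated into retained revenue or avoided cost, is one proxy for the dataset's value.
```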

Algorithmic value

Netflix and Amazon use recommendation engines to drive value. For example, Netflix increases its revenue and stickiness by matching content with a customer’s tastes and viewing habits. Similarly, Amazon recommends products, including those that others have also viewed or purchased. In doing so, Amazon successfully increases average order values through cross-selling and upselling.
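The mechanics behind “customers who bought this also bought” can be as simple as counting how often products appear together in past orders. The sketch below uses made-up baskets and is not how Amazon or Netflix actually implement their engines.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical order histories: each inner list is one customer's basket.
orders = [
    ["laptop", "mouse", "laptop_bag"],
    ["laptop", "mouse"],
    ["phone", "phone_case", "charger"],
    ["laptop", "laptop_bag"],
]

# Count how often each pair of products appears in the same basket.
pair_counts = defaultdict(int)
for basket in orders:
    for a, b in combinations(sorted(set(basket)), 2):
        pair_counts[(a, b)] += 1

def recommend(product, top_n=3):
    """Return the products most often bought alongside the given product."""
    scores = defaultdict(int)
    for (a, b), count in pair_counts.items():
        if a == product:
            scores[b] += count
        elif b == product:
            scores[a] += count
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("laptop"))  # e.g. ['laptop_bag', 'mouse']
```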

“Algorithmic value modeling is the most exciting,” said Srivastava. “For example, the more labeled data I can provide on rooftops that have been damaged by Florida hurricanes, the more pictures I have of the damage caused by the hurricanes and the more information I have about claim settlements, the better my data engine will be.”

For that use case, the trained AI system can automatically provide an insurance claim value based on a photograph associated with a particular claim.
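A minimal sketch of that use case, assuming the damage photographs have already been reduced to numeric features (for example, by a pretrained image model) and paired with historical settlement amounts; the file names and shapes are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical training data: rows are past claims, columns are image-derived damage
# features (e.g., share of roof affected, debris score), target is the settled amount.
X = np.load("claim_image_features.npy")   # shape (n_claims, n_features), assumed to exist
y = np.load("claim_settlements.npy")      # shape (n_claims,)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# As the quote suggests, more labeled examples generally make these estimates better.
print("Holdout R^2:", model.score(X_test, y_test))
print(f"Suggested claim value for a new photo: ${model.predict(X_test[:1])[0]:,.0f}")
```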

Risk-of-loss value

If a company using an external data source were to lose access to that data source, what economic impact would that have? Further, given the very real possibility of cyberattacks and cyberterrorism, what would the value of lost or corrupted data be? Points to consider include the financial impact, which may span actual losses, opportunity costs, regulatory fines, and litigation settlements. If the company has cybersecurity insurance, the policy’s coverage limit may differ from the actual claim settlement value and from the overall cost to the company.
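One hedged way to frame those points is as an annualized expected-loss figure; every number below is hypothetical and would need to come from the company’s own risk assessment.

```python
# Hypothetical annualized risk-of-loss estimate for a critical dataset.
prob_loss_per_year = 0.05        # estimated chance of loss or corruption in a given year
direct_loss = 2_000_000          # rebuild and recovery cost (USD)
opportunity_cost = 1_500_000     # revenue forgone while the data is unavailable
regulatory_exposure = 750_000    # expected fines and litigation settlements
insurance_recovery = 1_000_000   # expected payout, capped by the policy limit

expected_annual_loss = prob_loss_per_year * (
    direct_loss + opportunity_cost + regulatory_exposure - insurance_recovery
)
print(f"Expected annual loss: ${expected_annual_loss:,.0f}")
```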

A bigger risk than data loss is the failure to use data to drive value, according to Genpact’s Srivastava.

There’s no silver bullet

No single equation can accurately assess the value of a company’s data. The value of data depends on several factors, including the usability, accessibility and cleanliness of the data. Other considerations are how the data is applied to business problems and what the value of the data would be if it were directly monetized, combined with other data, or used in machine learning to improve outcomes.

Further, business leaders should consider not only what the value of their company’s data is today, but the potential value of new services, business models or businesses that could be created by aggregating data, using internal data or, more likely, using a combination of internal and external data. In addition, business leaders should contemplate the risk of data loss, corruption or misuse.

While there’s no standard playbook for valuing data, expect data valuation and the inability to value data to have a direct impact on startup, public company, and merger and acquisition target valuations.

Big Data: The Interdisciplinary Vortex

As seen in InformationWeek.

Getting the most from data requires information sharing across departmental boundaries. Even though information silos remain common, CIOs and business leaders in many organizations are cooperating to enable cross-functional data sharing to improve business process efficiencies, lower costs, reduce risks, and identify new opportunities.

Interdepartmental data sharing can take a company only so far, however, as evidenced by the number of companies using (or planning to use) external data. To get to the next level, some organizations are embracing interdisciplinary approaches to big data.

Why Interdisciplinary Problem-Solving May Be Overlooked

Breaking down departmental barriers isn’t easy. There are the technical challenges of accessing, cleansing, blending, and securing data, as well as very real cultural habits that are difficult to change.

Today’s businesses are placing greater emphasis on data scientists, business analysts, and data-savvy staff members. Some of them also employ or retain mathematicians and statisticians, although they may not have considered tapping other forms of expertise that could enable different, and perhaps more accurate, forms of data analysis and spark new innovation.

“Thinking of big data as one new research area is a misunderstanding of the entire impact that big data will have,” said Dr. Wolfgang Kliemann, associate VP for research at Iowa State University. “You can’t help but be interdisciplinary because big data is affecting all kinds of things including agriculture, engineering, and business.”

Although interdisciplinary collaboration is mature in many scientific and academic circles, applying non-traditional talent to big data analysis is a stretch for most businesses.

But there are exceptions. For example, Ranker, a platform for lists and crowdsourced rankings, employs a chief data scientist who is also a moral psychologist.

“I think psychology is particularly useful because the interesting data today is generated by people’s opinions and behaviors,” said Ravi Iyer, chief data scientist at Ranker. “When you’re trying to look at the error that’s associated with any method of data collection, it usually has something to do with a cognitive bias.”

Ranker has been working with a UC Irvine professor in the cognitive sciences department who studies the wisdom of crowds.

“We measure things in different ways and understand the psychological biases each method of data creates. Diversity of opinion is the secret to both our algorithms and the philosophy behind the algorithms,” said Iyer. “Most of the problems you’re trying to solve involve people. You can’t just think of it as data, you have to understand the problem area you’re trying to solve.”

Why Interdisciplinary Problem-Solving Will Become More Common

Despite the availability of new research methods, online communities, and social media streams, products still fail and big-name companies continue to make high-profile mistakes. They have more data available than ever before, but there may be a problem with the data, the analysis, or both. Alternatively, the outcome may fall short of what is possible.

“A large retail chain is interested in figuring out how to optimize supply management, so they collect the data from sales, run it through a big program, and say, ‘this is what we need.’ This approach leads to improvements for many companies,” said Kliemann. “The question is, if you use this specific program and approach, what is your risk of not having the things you need at a given moment? The way we do business analytics these days, that question cannot be answered.”

One mistake is failing to understand the error structure of the data. With that information, it’s possible to identify missing pieces of data, the possible courses of action, and the risk associated with a particular strategy.

“You need new ideas under research, ideas of data models, [to] understand data errors and how they propagate through models,” said Kliemann. “If you don’t understand the error structure of your data, you make predictions that are totally worthless.”
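What “propagating errors through a model” can look like in practice: perturb the inputs according to an assumed error distribution and observe the spread of the output. The retail-style numbers and the 8% error level below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical point forecast of weekly demand per store, plus an assumed
# +/-8% (one standard deviation) error in the underlying sales data.
point_forecast = np.array([1200, 950, 1780, 640], dtype=float)
relative_error = 0.08

# Monte Carlo: resample the inputs under the error model and re-run the
# (deliberately trivial) downstream calculation -- here, total units to stock.
simulations = rng.normal(loc=point_forecast,
                         scale=relative_error * point_forecast,
                         size=(10_000, point_forecast.size))
total_stock = simulations.sum(axis=1)

low, high = np.percentile(total_stock, [5, 95])
print(f"Point answer: {point_forecast.sum():.0f} units")
print(f"90% interval once data error is propagated: {low:.0f} to {high:.0f} units")
```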

Already, organizations are adapting their approaches to accommodate the growing volume, velocity, and variety of data. In the energy sector, cheap sensors, cheap data storage, and fast networks are enabling new data models that would have been impossible just a few years ago.

“Now we can ask ourselves questions such as if we have variability in wind, solar, and other alternative energies, how does it affect the stability of a power system? [We can also ask] how we can best continue building alternative energies that make the system better instead of jeopardizing it,” said Kliemann.

Many universities are developing interdisciplinary programs focused on big data to spur innovation and educate students entering the workforce about how big data can affect their chosen field. As the students enter the workforce, they will influence the direction and culture of the companies for which they work. Meanwhile, progressive companies are teaming up with universities with the goal of applying interdisciplinary approaches to real-world big data challenges.

In addition, the National Science Foundation (NSF) is trying to accelerate innovation through Big Data Regional Innovation Hubs. The initiative encourages federal agencies, private industry, academia, state and local governments, nonprofits, and foundations to develop and participate in big data research and innovation projects across the country. Iowa State University is one of about a dozen universities in the Midwestern region working on a proposal.

In short, interdisciplinary big data problem-solving will likely become more common in industry as organizations struggle to understand the expanding universe of data. Although interdisciplinary problem-solving is alive and well in academia and in many scientific research circles, most businesses are still trying to master interdepartmental collaboration when it comes to big data.

Six Characteristics of Data-Driven Rock Stars

As seen in InformationWeek

Data is being used in and across more functional aspects of today’s organizations. Wringing the most business value out of the data requires a mix of roles that may include data scientists, business analysts, data analysts, IT, and line-of-business titles. As a result, more resumes and job descriptions include data-related skills.

A recent survey by technology career site Dice revealed that nine of the top 10 highest-paying IT jobs require big data skills. On the Dice site, searches and job postings that include big data skills have increased 39% year-over-year, according to Dice president Shravan Goli. Some of the top-compensated skills include big data, data scientist, data architect, Hadoop, HBase, MapReduce, and Pig, and pay for those skills ranges from more than $116,000 to more than $127,000, according to data Dice provided to InformationWeek.

However, the gratuitous use of such terms can cloud the main issue, which is whether the candidate and the company can turn that data into specific, favorable outcomes — whether that’s increasing the ROI of a pay-per-click advertising campaign or building a more accurate recommendation engine.

If data skills are becoming necessary for more roles in an organization, it follows that not all data-driven rock stars are data scientists. Although data scientists are considered the black belts, it is possible for people in other roles to distinguish themselves based on their superior understanding and application of data. Regardless of a person’s title or position in an organization, there are some traits common to data-driven rock stars that have more to do with attitudes and behaviors than with technologies, tools, and methods. Here are six of them. [Note to readers: This originally appeared as a slideshow.]

They Understand Data

Of course data-driven rock stars are expected to have a keener understanding of data than their peers, but what exactly does that mean? Whether a data scientist or a business professional, the person should know where the data came from, the quality of it, the reliability of it, and what methods can be used to analyze it, appropriate to the person’s role in the company.

How they use numbers is also telling. Rather than presenting a single number to “prove” that a certain course of action is the right one, a data-driven rock star is more likely to compare the risks and benefits of alternative courses of action so business leaders can make better-informed decisions.

“‘Forty-two’ is not a good answer,” said Wolfgang Kliemann, associate VP for research at Iowa State University. “‘Forty-two, under the following conditions and with a probability of 1.2% chance that something else may happen,’ is a better answer.”
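In code, the difference between “forty-two” and “forty-two, under the following conditions” is often just reporting an interval alongside the point estimate. The bootstrap sketch below uses simulated data purely to illustrate the habit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed metric, e.g., daily conversions from a campaign.
observations = rng.poisson(lam=42, size=60)
point_estimate = observations.mean()

# Bootstrap resampling attaches an uncertainty statement to the single number.
boot_means = [rng.choice(observations, size=observations.size, replace=True).mean()
              for _ in range(5_000)]
low, high = np.percentile(boot_means, [2.5, 97.5])

print(f"Point answer: {point_estimate:.1f}")
print(f"Better answer: {point_estimate:.1f}, with a 95% interval of {low:.1f} to {high:.1f}")
```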

They’re Curious

Data-driven rock stars are genuinely curious about what data indicates and does not indicate. Their curiosity inspires them to explore data, whether toggling between data visualizations, drilling down into data, correlating different pieces of data, or experimenting with an alternative algorithm. The curiosity may be inspired by data itself, a particular problem, or problem-solving methods that have been used in a similar or different context.

Data scientists are expected to be curious because their job involves scientific exploration. Highly competitive organizations hire them to help uncover opportunities, risks, behaviors, and other things that were previously unknown. Meanwhile, some of those companies are encouraging “out of the box” thinking from business leaders and employees to fuel innovation, which increasingly includes experimenting with data. Some businesses even offer incentives for data-related innovation.

They Actively Collaborate with Others

The data value chain has a lot of pieces. No one person understands everything there is to know about data structure, data management, analytical methods, statistical analysis, business considerations, and other factors such as privacy and security. Although data-driven rock stars tend to know more about such issues than their peers, they don’t operate in isolation because others possess knowledge they need. For example, data scientists need to be able to talk to business leaders and business leaders have to know something about data. Similarly, a data architect or data analyst may not have the ability to manipulate, explore, understand, and dig through large data sets, but a data scientist could dig through and discover patterns and then bring in statistical and programming knowledge to create forward-looking products and services, according to Dice president Shravan Goli.

They Try to Avoid Confirmation Bias

Data can be used to support almost any position, including a person’s preexisting opinion. Data-driven rock stars are aware of confirmation bias, so they are more likely to try to avoid it. Even when the term itself is unfamiliar, they know it is not a best practice to disregard or omit evidence simply because it differs from their own opinions.

“People like to think that the perspective they bring is the only perspective or the best perspective. I’m probably not immune to that myself,” said Ravi Iyer, chief data scientist at Ranker, a platform for lists and crowdsourced rankings. “They have their algorithms and don’t appreciate experiments or the difference between exploratory and confirmatory research. I don’t think they respect the traditional scientific method as such.”

The Data Science Association’s Data Science Code of Professional Conduct has a rule dedicated specifically to evidence, data quality, and evidence quality. Several of its subsections are relevant to confirmation bias. Among the prohibited behaviors are failing to “disclose any and all data science results or engage in cherry-picking” and failing to “disclose failed experiments or disconfirming evidence known to the data scientist to be directly adverse to the position of the client.”

They Update Their Skill Sets

Technology, tools, techniques, and available data are always evolving. The data-driven rock star is motivated to continually expand his or her knowledge base through learning, which may involve attending executive education programs, training programs, online courses, boot camps, or meetups, depending on the person’s role in the company.

“I encourage companies to think about growing their workforce because there aren’t enough people graduating with data science degrees,” said Dice president Shravan Goli. “You have to create a pathway for people who are smart, data-driven, and have the ability to analyze patterns so they have to add a couple more skills.”

Job descriptions and resumes increasingly include more narrowly defined skills because it is critical to understand which specific types of big data and analytical skills a candidate possesses. A data-driven rock star understands the technologies, tools, and methods of her craft as well as when and how to apply them.

They’re Concerned About Business Impact

With so much data available and so many ways of analyzing it, it’s easy to get caught up in the technical issues or the tasks at hand while losing sight of the goal: using data in a way that positively impacts the business. A data-driven rock star understands that.

Making a business impact requires three things, according to IDC adjunct research adviser Fred McGee: having a critical mass of data available in a timely manner, using analytics to glean insights, and applying those insights in a manner that advances business objectives.

A data-driven rock star understands the general business objectives as well as the specific objective to which analytical insights are being applied. Nevertheless, some companies are still falling short of their goals. Three-quarters of data analytics leaders from major companies recently told McKinsey & Company that, despite using advanced analytics, their companies had improved revenue and costs by less than 1%.

How Corporate Culture Impedes Data Innovation

As seen in InformationWeek

Corporate culture moves slower than tech

Competing in today’s data-intensive business environment requires unprecedented organizational agility and the ability to drive value from data. Although businesses have allocated significant resources to collecting and storing data, their abilities to analyze it, act upon it, and use it to unlock new opportunities are often stifled by cultural impediments.

While the need to update technology may be obvious, it may be less obvious that corporate cultures must also adapt to changing times. The necessary adjustments to business values, business practices, and leadership strategies can be uncomfortable and difficult to manage, especially when they conflict with the way the company operated in the past.

If your organization isn’t realizing the kind of value from its big data and analytics investments that it should be, the problem may have little to do with technology. Even with the most effective technologies in place, it’s possible to limit the value they provide by clinging to old habits.

Here are five ways that cultural issues can negatively affect data innovation:

1. The Vision And Culture Are At Odds

Data-driven aspirations and “business as usual” may well be at odds. What served a company well up to a certain point may not serve the company well going forward.

“You need to serve the customer as quickly as possible, and that may conflict with the way you measured labor efficiencies or productivity in the past,” explained Ken Gilbert, director of business analytics at the University of Tennessee Office of Research and Economic Development, in an interview with InformationWeek.

Companies able to realize the most benefit from their data are aligning their visions, corporate mindsets, performance measurement, and incentives to effect widespread cultural change. They are also more transparent than similar organizations, meaning that a wide range of personnel has visibility into the same data, and data is commonly shared among departments, or even across the entire enterprise.

“Transparency doesn’t come naturally,” Gilbert said. “Companies don’t tend to share information as much as they should.”

Encouraging exploration is also key. Companies that give data access to more executives, managers, and employees than they did in the past also have to remove limits that may be driven by old habits. For example, some businesses discourage employees from exploring the data and sharing their original observations.

2. Managers Need Analytics Training

Companies that are training their employees in ways to use analytical tools may not be reaching managers and executives who choose not to participate because they are busy or consider themselves exempt. In the most highly competitive companies, executives, managers, and employees are expected to be — or become — data savvy.

Getting the most from BI and big data analytics means understanding what the technology can do, and how it can be used to best achieve the desired business outcomes. There are many executive programs that teach business leaders how to compete with business analytics and big data, including the Harvard Business School Executive Education program.

3. Expectations Are Inconsistent

This problem is not always obvious. While it’s clear the value of BI and big data analytics is compromised when the systems are underutilized, less obvious are inconsistent expectations about how people within the organization should use data.

“Some businesses say they’re data-driven, but they’re not actually acting on that. People respond to what they see rather than what they hear,” said Gilbert. “The big picture should be made clear to everybody — including how you intend to grow the business and how analytics fits into the overall strategy.”

4. Fiefdoms Restrict Data Sharing

BI and analytics have moved out from the C-suite, marketing, and manufacturing to encompass more departments, but not all organizations are taking advantage of the intelligence that can be derived from cross-functional data sharing. An Economist Intelligence Unit survey of 530 executives around the world revealed that information-sharing issues represented the biggest obstacle to becoming a data-driven organization.

“Some organizations supply data on a need-to-know basis. There’s a belief that somebody in another area doesn’t need to know how my area is performing when they really do,” Gilbert said. “If you want to use data as the engine of business growth, you have to integrate data from internal and external sources across lines, across corporate boundaries.”

5. Little-Picture Implementations

Data is commonly used to improve the efficiency or control the costs of a particular business function. However, individual departmental goals may not align with the strategic goal of the organization, which is typically to increase revenue, Gilbert said.

“If the company can understand what the customer values, and build operational systems to better deliver, that is the company that’s going to win. If the company is being managed in pieces, you may save a dime in one department that costs the company a dollar in revenue.”

Hadoop is Now a General-purpose Platform

As seen in SD Times

Apache Hadoop adoption is accelerating among enterprises and advanced computing environments as the project, related projects, and ecosystem continue to expand. While there were valid reasons to avoid the 1.x versions, skeptics are reconsidering since Hadoop 2 (particularly the latest 2.2.0 version) provides a viable choice for a wider range of users and uses.

“The Hadoop 1.x generation was not easy to deploy or easy to manage,” said Juergen Urbanski, former chief technologist of T-Systems, the IT consulting division of Deutsche Telekom. “The many moving parts that make up a Hadoop cluster were difficult for users to configure. Fortunately, Hadoop 2 fills in many of the gaps. Manageability is a key expectation, particularly for the more critical business use cases.”

Hadoop 2.2.0 adds the YARN resource-management framework to the core set of Hadoop modules, which include the Hadoop Common set of utilities, the Hadoop Distributed File System (HDFS), and Hadoop MapReduce for parallel processing. Other improvements include enhancements to HDFS, binary compatibility for Map/Reduce applications built on Hadoop 1.x, and support for running Hadoop on Windows.

Meanwhile, Hadoop-related projects and commercial products are proliferating along with the ecosystem. Collectively, the new Hadoop capabilities provide a more palatable and workable solution, not only for enterprise developers, business analysts, and IT, but also for a larger community of data scientists.

“There are many technologies that are helping Hadoop realize its potential as being a more general-purpose platform for computing,” said Doug Cutting, co-creator of Hadoop. “We started out as a batch processing system. People used it to do computations on large data sets that they couldn’t do before, and they could do it affordably. Now there’s an ever-increasing amount of data processing that organizations can do using this one platform.”

YARN expands the possibilities
The limitations of Map/Reduce were the genesis of Apache Hadoop NextGen MapReduce (a.k.a. YARN), according to Arun Murthy, release manager for Hadoop 2.

“It was apparent as early as 2008 that Map/Reduce was going to become a limiting factor because it’s just one algorithm,” he said. “If you’re trying to do things like machine learning and modeling, Map/Reduce is not the right algorithm to do it.”

Rather than replacing Map/Reduce altogether, the Hadoop team supplemented it with YARN, which provides resource management and fault tolerance as base primitives in the platform while allowing end users to process and track the data in different ways.

“The architecture had to be more general-purpose than Map/Reduce,” said Murthy. “We kept the good parts of Map/Reduce, such as scale and simple APIs, but we had to allow other things to coexist on the same platform.”

The original Hadoop MapReduce was based on the Google Map/Reduce paper, while Hadoop HDFS was based on the Google File System paper. HDFS provides a mechanism to store huge amounts of heterogeneous data cheaply; Map/Reduce enables highly efficient parallel processing.

“Map/Reduce is a mature concept that comes from LISP and functional programming,” said Murthy. “Google scaled Map/Reduce out in a massive way while keeping a real simple interface for the end user so the end user does not have to deal with the nitty-gritty details of scheduling, resource management, fault tolerance, network partitions, and other crazy stuff. It allowed the end user to just deal with the business logic.”
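To make the “end user only writes the business logic” point concrete, here is a hedged word-count example in the style of Hadoop Streaming: the mapper and reducer below are the only code a user would supply, while the platform handles scheduling, fault tolerance, and moving data between them. Paths, the streaming jar location, and cluster details are placeholders.

```python
#!/usr/bin/env python3
# mapper.py -- reads raw text on stdin and emits "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums the counts for each word; Hadoop delivers keys to the reducer sorted.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

An illustrative invocation, assuming the scripts are executable and the input already sits in HDFS, looks something like: hadoop jar hadoop-streaming.jar -file mapper.py -file reducer.py -mapper mapper.py -reducer reducer.py -input /data/books -output /data/wordcount. Everything about cluster sizing, scheduling, and recovery from failed tasks is left to Hadoop.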

Because YARN is an open framework, users are free to use algorithms other than Map/Reduce. In addition, applications can run on and integrate with it.

“The scientific and security computing communities depend on Open MPI technologies, which weren’t even an option in Hadoop 1,” said Edmon Begoli, CTO of analytics consulting firm PYA Analytics. “The architecture of Hadoop 2 and YARN allows you to plug in your own resource manager and your own parallel processing algorithms. People in the high-performance computing community have been talking about YARN enthusiastically for years.”

HDFS: Aspirin for other headaches
Some CIOs have been reluctant to bring Hadoop into the enterprise because there have been too many barriers to entry, although Hadoop 2 improvements are turning the tide.

“I think two of the deal breakers were NameNode federation and the Quorum Journal Manager, which is basically a failover for the HDFS NameNode,” said Jonathan Ellis, project chair for Apache Cassandra. “Historically, if your NameNode went down, you were basically screwed because you’d lose some amount of data.”

Hadoop 2 introduces the Quorum Journal Manager, where changes to the NameNodes are recorded to replicated machines to avoid data loss, he said. NameNode federation allows a pool of NameNodes to share responsibility for an HDFS cluster.

“NameNode federation is a bit of a hack because each NameNode still only knows about the file set it owns, so at the client level you have to somehow teach the client to look for some files on one NameNode and other files on another NameNode,” said Ellis.

HDFS is nevertheless an economically feasible way to store terabytes or even petabytes of data. Facebook has a single cluster that stores more than 100PB on Hadoop, according to Murthy.

“It’s amazing how much data you can store on Hadoop,” he said. “But you have to interact with the data, interrogate it, and come up with insights. That’s where YARN comes in. Now you have a general-purpose data operating system, and on top of it you can run applications like Apache Storm.”

John Haddad, senior director of product marketing at Informatica, said the Hadoop 2 improvements allow his organization to run more types of applications and workloads.

“Various teams can run a variety of different applications on the cluster concurrently,” he said. “Hadoop 1 lacked some of the security, high availability and flexibility necessary to have different applications, different types of workloads, and more than one organization or team submitting jobs to the cluster.”

Gearing up for prime time
The number and types of Hadoop open-source projects and commercial offerings are expanding rapidly. Hadoop-related projects include HBase, a highly scalable distributed database; the Hive data warehouse infrastructure; the Pig language and framework for parallel computing; and Ambari, which provisions, manages and monitors Apache Hadoop clusters.

“It seems like we’ve got 20 or 30 new projects every week,” said Cutting. “We have all these separate, independent projects that work together, so they’re interdependent but under separate control so the ecosystem can evolve.”

Meanwhile, solution providers are building products for or integrating their products with Hadoop. Collectively, Hadoop improvements, open-source projects and compatible commercial products are allowing organizations to tailor it to their needs, rather than having to shoehorn what they are doing into a limited set of capabilities. And the results are impressive.

For example, Oak Ridge National Laboratory used Hadoop to help the Centers for Medicare & Medicaid Services identify tens of millions of dollars in overpayments and fraudulent transactions in just three weeks.

“Using only two or three engineers, we were able to approach and understand the data from different angles using Hive on Hadoop because it allowed us to write SQL-like queries and use a machine-learning library or run straight Map/Reduce queries,” said PYA Analytics’ Begoli. “In the traditional warehousing world, the same project would have taken months unless you had a very expensive data warehouse platform and very expensive technology consulting resources to help you.”

The groundswell of innovation is enabling Hadoop to move beyond its batch-processing roots to include real-time and near-real-time analytics.

Skeptics are doing a double take
Hadoop 2 is converting more skeptics than Hadoop 1 because it’s more mature, it’s easier (but not necessarily easy) to implement, it has more options, and its community is robust.

“You can bring Hadoop into your organization and not worry about vendor lock-in or what happens if the provider disappears,” said Murthy. “We have contributions from about 2,000 people at this point.”

There are also significant competitive pressures at work. Organizations that have adopted Hadoop are improving the effectiveness of things like fraud detection, portfolio management, ad targeting, search, and customer behavior by combining structured and unstructured data from internal and external sources that commonly include social networks, mobile devices and sensors.

“We’re seeing organizations start off with basic things like data warehouse optimization, and then move on to other cool and interesting things that can drive more revenue from the company,” said Informatica’s Haddad.

For example, Yahoo has been deploying YARN in production for a year, and the throughput of the YARN clusters has more than doubled. According to Murthy, Yahoo’s 35,000-node cluster now processes 130 to 150 jobs per day versus 50 to 60 before YARN.

“When you’ve got 2x over 35,000 to 40,000 nodes, that’s phenomenal,” he said. “It’s a pretty compelling story to tell a CIO that if you just upgrade your software from Hadoop 1 to Hadoop 2, you’ll see 2x throughput improvements in your jobs.”

Of course, Hadoop 2.2.0 isn’t perfect. Nothing is. And some question what Hadoop will become as it continues to evolve.

Hadoop co-creator Cutting said the beauty of Hadoop as an open-source project is that new things can replace old things naturally. That prospect somewhat concerns PYA Analytics’ Begoli, however.

“I’m concerned about the explosion of frameworks because it happened with Java and it’s happening with JavaScript,” he said. “When everyone is contributing something, it can be too much or the original vision can be diluted. On the other hand, a lot of brilliant teams are contributing to Hadoop. There are management tools, SQL tools, third-party tools and a lot of other things that are being incubated to deliver advanced capabilities.”

While Hadoop’s full impact has yet to be realized, Hadoop 2 is a major step forward.

Well-known Hadoop implementations

Amazon Web Services: Amazon Elastic MapReduce uses Hadoop to provide a quick, easy, and cost-effective way to distribute and process large amounts of data across a resizable cluster of Amazon EC2 instances. It can be used to analyze click-stream data, process vast amounts of genomic data and other large scientific data sets, and process logs generated by web and mobile applications.