Dataworks Summit – Big Data meets multi-cloud

‘The network is the computer’ was the mantra of the early days of connected systems, but it took the Internet to fully realize the concept. In today’s era of smart sensors, cheap storage and sophisticated algorithms, an apt aphorism might be ‘the data is the business’ in that business decisions, new services and product strategies are fueled by the analysis of massive amounts of mundane data. The ability to collect, store and analyze such routine data as transaction records, system logs, sensor readings and location information with increasing granularity has the potential to turn what was formerly lost or ignored information into valuable business assets.

The organizations that are most adept at spinning the digital straw into gold find themselves at a significant competitive advantage. Aside from the advances in core infrastructure, perhaps nothing has been as responsible for the rise of data-inspired business decisions as the Hadoop ecosystem of open source distributed data storage and processing software.

The extent of their collective effect on the industry was manifest at the recent Dataworks Summit, led by one of the two pure-play purveyors of Hadoop-related technology, Hortonworks. Founded by some of the originators of Hadoop at Yahoo, Hortonworks used the event to share its vision of a data backplane that links data from anywhere an enterprise might have or generate it, whether in on-premise clusters, cloud storage services or connected devices.

Hortonworks’ grand unification theory

Hortonworks executives shared details of product updates, roadmaps and strategy that position Hadoop and related open source products as the foundation of what it calls a connected data platform, one that encompasses both stored data (whether in a Hadoop cluster or an S3 bucket) and streaming data. Unifying these in a scalable, extensible system makes it possible to wrap the data stores with software modules that handle management and policy enforcement; various types of data analysis, including stream and real-time processing, traditional OLAP and machine/deep learning; and custom applications.

Like virtually every IT vendor, Hortonworks has embraced the model of hybrid cloud infrastructure. At Dataworks it unveiled updates to HDP (Data Platform), HDF (Data Flow) and DPS (DataPlane Service) that improve their integration with cloud platforms, including Azure and Google Cloud, through the use of containers or API-level integration with services like HDInsight.

For those not among Hortonworks’ 1,400 customers and unfamiliar with its products and nomenclature:

  • HDP is a packaged Hadoop distribution (think Red Hat Enterprise Linux) with a central management system (the YARN project) that is designed for processing stored data (at rest).
  • HDF is an assemblage of several open source projects (Apache Kafka, Storm, NiFi and Druid) designed to process and analyze streaming data in real time from sources such as IoT devices and log files.
  • DPS is a services management layer that links HDP and HDF data sources with management, replication/backup, discovery and analytics services.
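To make the in-motion side of this portfolio concrete, Kafka (one of the HDF components) can be exercised entirely from the command line. The sketch below is illustrative, not taken from Hortonworks' materials: it assumes a broker already running on localhost:9092, a hypothetical topic name, and the newer `--bootstrap-server` flag (older HDF-era Kafka releases used `--zookeeper` for topic commands).

```shell
# Create a topic for device telemetry (topic name and sizing are illustrative)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic device-telemetry --partitions 3 --replication-factor 1

# Publish a sample sensor reading into the stream
echo '{"device":"sensor-42","temp":21.5}' | \
  kafka-console-producer.sh --bootstrap-server localhost:9092 \
  --topic device-telemetry

# Tail the stream the way a downstream processor (Storm, NiFi) would see it
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic device-telemetry --from-beginning
```

In a real HDF deployment, the consumer side would be a NiFi flow or Storm topology rather than a console tool, but the producer/broker/consumer pattern is the same.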

More broadly, Hortonworks’ portfolio occupies a hierarchy that starts with separate products designed for stored (at-rest) and streaming (in-motion) data. Sitting above these is the data plane service that pulls both native Hadoop and connected data sources into a single control plane that provides environment-wide management, security and access-policy enforcement, and application access via APIs.

Hortonworks’ approach is decidedly different from that of legacy storage vendors that still primarily build systems for raw database volumes and file shares. Instead, Hortonworks uses the distributed data design popularized by Hadoop, which spreads data across dozens or hundreds of nodes and exposes it to applications such as data science IDEs and data governance software through a controller and APIs.

The design works exceptionally well for new applications using unstructured data types like log and text files, images and sensor readings; for the broadest appeal, however, it must still access data outside the distributed environment. Data integration is an increasingly important feature now that organizations are using cloud services like S3, EBS and Azure Files to hold business information. Hortonworks addresses integration through connectors to various storage types and services that allow applications either to access the data in situ or to copy it to the Hadoop distributed store. Indeed, the latter is preferred in most cases to improve performance and data management, meaning that over time, more of an organization’s information ends up in Hadoop distributed storage.
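Both access patterns described above can be driven with Hadoop's stock tooling. A minimal sketch, assuming the S3A connector is configured with valid credentials; the bucket and path names are hypothetical:

```shell
# Access cloud data in situ: Hadoop-aware tools can address the object
# store directly through an s3a:// URI
hadoop fs -ls s3a://example-bucket/transaction-logs/

# Or copy it into HDFS for better locality and repeated analysis,
# the pattern the article notes is usually preferred
hadoop distcp s3a://example-bucket/transaction-logs/ \
  hdfs:///data/transaction-logs/
```

The trade-off is the one the article describes: in-situ access avoids duplication, while the distcp copy pays a one-time transfer cost in exchange for faster, repeated local processing.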

Recently a more important category of data-hungry application using machine or deep learning has prompted changes to the Hadoop architecture. AI apps are built using specialized software frameworks like TensorFlow or Caffe that can be tedious to install and configure for first-timers. Thus, bundling these into a portable container offers convenience and reproducibility for the developers and IT teams using and deploying AI development environments. Likewise, ML and DL algorithms are highly parallelizable, which means that performance is significantly better using GPUs. These two factors have led the Hadoop community to add support for containerized workloads and GPU instances, which Hortonworks now supports in its HDP 3.0 release. These additions will also allow Hadoop environments to take advantage of cloud GPU instances and container orchestration services, particularly those based on Kubernetes like Azure AKS or Google Cloud GKE.
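The container and GPU support mentioned above surfaces in Hadoop 3.x (the basis of HDP 3.0) as a Docker container runtime plus a schedulable `yarn.io/gpu` resource. A rough sketch using YARN's distributed-shell test application; the image name is illustrative and the exact flags depend on how the cluster is configured:

```shell
# Launch one Docker container with a GPU attached; nvidia-smi simply
# verifies that the container can see the device
yarn jar "$HADOOP_HOME"/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-*.jar \
  -jar "$HADOOP_HOME"/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-*.jar \
  -shell_env YARN_CONTAINER_RUNTIME_TYPE=docker \
  -shell_env YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=tensorflow/tensorflow:latest-gpu \
  -shell_command nvidia-smi \
  -container_resources memory-mb=4096,vcores=2,yarn.io/gpu=1 \
  -num_containers 1
```

In practice the `-shell_command` would be a training script inside a framework image, but the same resource request is what lets YARN schedule ML/DL jobs onto GPU-equipped nodes, on-premise or in cloud GPU instances.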

While valuable in the short term, these are mere stepping stones on the path to Hortonworks’ broader strategy of creating a multi-cloud “data fabric” in which the Hadoop ecosystem of more than 25 open source projects provides the analysis capabilities and programming interfaces for data-driven applications. Hortonworks’ big vision is to enable its DPS control plane to actively manage workloads across on-premise and multi-cloud environments, perhaps even dynamically bursting applications from one to another to meet demand spikes or mitigate infrastructure failures. Indeed, such a unified data control plane would allow applications to mix and match deployment environments to optimize performance based on their compute capabilities and data locality.

My take

Hortonworks presented a compelling vision, but like all technology companies with a relatively narrow domain, its vision is based on a particular worldview that doesn’t always align with the messy conditions in the field, where enterprises rely on 30-year-old databases and COBOL applications that aren’t easily migrated to cloud services and distributed data systems.

Nevertheless, it’s hard to argue with the numbers Hortonworks’ CFO presented, which showed 65 percent annual revenue growth over the past four years and customers that, on average, increase their Hortonworks spending by 20 to 25 percent a quarter. Still, that’s coming off a tiny revenue base of $46 million in 2014. For perspective, based on last year’s revenues, SAP is more than 100 times the size of Hortonworks, while Oracle is almost 150 times as large.

If Hortonworks’ grand unified theory of data spanning cloud and on-premise infrastructure sounds familiar, it should, since it’s one shared by virtually every significant storage vendor paying homage to the notion of hybrid cloud. For example, as I have detailed over the last two years, Veritas is trying to reboot itself from its niche in data backup and replication into a pan-cloud data management company. Likewise, EMC struggled to create a coherent hybrid cloud message as it tried to bridge the gap between legacy storage systems and cloud services before its acquisition by Dell, which continues to grapple with the same problem.

There’s an inherent appeal to the lock-in-resistant multi-cloud vision; however, I wonder how well the idea of redesigning one’s entire data strategy and infrastructure will be received outside the core of big data adopters focused on particular problem scenarios.

Longer term, like all pan-cloud strategies that work by inserting an abstraction layer between the infrastructure and applications or data, I have doubts about its viability as cloud users move up the product hierarchy from IaaS to PaaS and application services. Yes, those services risk lock-in since they aren’t compatible across providers, but the advantages of using things like Google Spanner and BigQuery, Azure Cognitive Services and AWS Athena or Data Pipeline seem overwhelming. The real value of the cloud can only be realized when organizations stop treating it like an off-premise managed VM server farm and storage array and start using it as an information utility with individually metered services on demand. Building a multi-cloud system that relies on the lowest common denominator is a backward-looking strategy.

Ideally, Hortonworks and others will be able to find the sweet spot and enable platform agnosticism without crippling access to innovative higher-level cloud services. The sizable, engaged developer community and project ecosystem that’s built up around Hadoop and was on display at the Dataworks Summit is cause for hope since the open source ethos fosters innovation and gravitates to the best technological solutions. The convergence of big data, cloud and AI will be a trend to watch over the coming years.
