Pure Storage wants to work with data gravity, not against it – Blocks and Files

Pure Storage CEO Charles Giancarlo expressed two remarkable views in an interview with Blocks & Files – that hyperconverged infrastructure does not exist in hyperscaler data centers and that data needs to be virtualized.

He made many striking points, but these two stood out. First, we asked him whether running applications in the public cloud makes the distinction between DAS (Direct-Attached Storage) and external storage redundant. He said, “Generally, the public cloud is designed with disaggregated storage in mind…with DAS used for server boot drives.”

Storage systems are connected to compute by high-speed Ethernet networks.

Charles Giancarlo

This is more efficient than creating virtual SANs or filers by aggregating each server’s DAS into HCI (hyperconverged infrastructure). HCI was generally a good approach back in the 2000s, when network speeds were around 1Gbps, but “now with 100Gbps and 400Gbps coming, disaggregated elements can be used and it is more effective”.

The use of HCI is limited, according to Giancarlo, by scaling difficulties: the larger an HCI cluster becomes, the more of its resources go to internal housekeeping rather than to running applications.

Faster networking is a factor in a second point he made, about data virtualization: “Networking was virtualized 20 years ago. Compute was virtualized 15 years ago, but storage is still very physical. Initially, networking was not fast enough to share storage. It’s no longer the case now.” He noted that apps are becoming containerized (cloud-native) and therefore able to run anywhere.

He mentioned that large petabyte-scale datasets have data gravity; moving them takes time. With Kubernetes and containers in mind, Pure will soon have Fusion for traditional workloads and Portworx Data Services (PDS) for cloud-native workloads. Both will be generally available in June.

What does this mean? Fusion is Pure’s way of federating all Pure devices – on-premises and off-premises hardware/software arrays, i.e. software arrays in the public cloud – under a cloud-like hyperscaler consumption model. PDS, on the other hand, provides on-demand deployment of databases in a Kubernetes cluster. Fusion is a self-service, standalone SaaS management plane, and PDS is likewise a SaaS offering for data services.

Think of a customer’s Pure infrastructure, on-premises and off-premises, combined to form resource pools and presented for public cloud-like use, with classes of service, workload placement and load balancing.

Giancarlo said “datasets will be managed through policies” in an orchestrated way, with one benefit being the elimination of uncontrolled copying.

He said, “DBMS and unstructured data can be replicated 10 or even 20 times for development, testing, analytics, archiving and other reasons. How do people keep track of them all? Dataset management will be automated in Pure.”

Suppose there is a 1PB dataset in a London data center and an application in New York needs it in order to run analysis routines. Do you move the data to New York?

Giancarlo said, “Don’t move the [petabyte-level] database. Move megabytes of application code instead.”
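The arithmetic behind that advice is straightforward. A back-of-the-envelope sketch in Python (the link speeds and the ~200MB application image size are illustrative assumptions, not figures from Pure):

# Rough transfer-time comparison: moving a 1PB dataset vs moving the application.
# Link speeds and payload sizes are assumptions for illustration only.

def transfer_time_seconds(payload_bytes: float, link_gbps: float) -> float:
    """Idealised time to push a payload over a link, ignoring latency and overhead."""
    return (payload_bytes * 8) / (link_gbps * 1e9)

PETABYTE = 1e15      # the 1PB dataset sitting in London
APP_IMAGE = 200e6    # ~200MB containerised application image (assumed size)

for gbps in (10, 100):
    dataset_hours = transfer_time_seconds(PETABYTE, gbps) / 3600
    app_seconds = transfer_time_seconds(APP_IMAGE, gbps)
    print(f"{gbps} Gbps link: dataset ≈ {dataset_hours:,.1f} h, app image ≈ {app_seconds:.2f} s")

# 10 Gbps link:  dataset ≈ 222.2 h (more than nine days), app image ≈ 0.16 s
# 100 Gbps link: dataset ≈ 22.2 h,                        app image ≈ 0.02 s

Even on an uncontended 100Gbps link, the dataset takes the better part of a day to move; the application image moves in well under a second.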

A containerized application can run anywhere. Kubernetes (Portworx) can be used to instantiate it in the London datacenter. In effect, you accept the limitations imposed by data gravity and work with it, moving lightweight containers to heavyweight datasets and not the other way around. You create a snapshot of the dataset in London, and the relocated containerized application code works on the snapshot rather than on the original raw data.

When the application’s work is complete, the snapshot is deleted and excessive data copying is avoided.
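A minimal sketch of how that workflow could look on a Kubernetes cluster in the London datacenter, assuming a CSI driver (such as Portworx) that supports VolumeSnapshots. The namespace, storage class, snapshot and image names are hypothetical illustrations, not Pure’s actual Fusion or PDS interfaces:

# Sketch: run the analytics job next to the data, on a snapshot clone.
# All resource names, the storage class and the container image are assumed.
from kubernetes import client, config

config.load_kube_config()

# 1. Clone the dataset by creating a PVC from an existing VolumeSnapshot,
#    so the job works on a copy rather than on the original volume.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="dataset-clone"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        storage_class_name="px-analytics",                       # assumed class
        data_source=client.V1TypedLocalObjectReference(
            api_group="snapshot.storage.k8s.io",
            kind="VolumeSnapshot",
            name="london-dataset-snap",                          # assumed snapshot
        ),
        resources=client.V1ResourceRequirements(requests={"storage": "1Pi"}),
    ),
)
client.CoreV1Api().create_namespaced_persistent_volume_claim("analytics", pvc)

# 2. Ship megabytes of application code: a Job pinned to the London region
#    mounts the clone and runs the analysis where the data already lives.
job = client.V1Job(
    metadata=client.V1ObjectMeta(name="run-analytics"),
    spec=client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                node_selector={"topology.kubernetes.io/region": "eu-london"},
                containers=[client.V1Container(
                    name="analytics",
                    image="registry.example.com/analytics:1.0",  # assumed image
                    volume_mounts=[client.V1VolumeMount(
                        name="dataset", mount_path="/data", read_only=True)],
                )],
                volumes=[client.V1Volume(
                    name="dataset",
                    persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                        claim_name="dataset-clone"),
                )],
            ),
        ),
    ),
)
client.BatchV1Api().create_namespaced_job("analytics", job)

# 3. When the job completes, delete the Job and the clone PVC so no extra
#    copy of the dataset lingers.

Because the claim is created from a snapshot, the job never touches the original volume, and deleting the claim afterwards removes the extra copy – which is the behaviour Giancarlo describes.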

Of course, the data must still be copied for disaster recovery reasons. Replication can be used for that, because DR is not as time-sensitive as an analytics application that needs results in seconds rather than the hours it takes a dataset to crawl through a 3,500-mile network pipe.

Giancarlo asserted, “With Pure Fusion, you can configure this per policy – and follow data sovereignty requirements.”

He said ideas of information lifecycle management need to be updated to dataset lifecycle management. In his view, Pure should be applicable to very large-scale dataset environments – the kind handled by Infinidat and VAST Data. Giancarlo called them newcomers and said they were vendors Pure keeps an eye on, although he added that Pure does not come across them very often in customer bids.

Referring to this high-end market, Giancarlo said: “We clearly want to reach the very large-scale environments that our systems have not yet reached. We intend to change this with specific strategies.” He gave no further details. We asked about mainframe connectivity and he said it was relatively low on Pure’s priority list: “Maybe through M&A, but we don’t want to fragment the product line.”

Pure’s main competition comes from traditional vendors such as Dell EMC, Hitachi Vantara, HPE, IBM and NetApp. “Our biggest competitive advantage,” he said, “is that we think data storage is cutting-edge technology and our competitors think it’s a commodity… That changes how you invest in the market.”

For example, it is better to have a cohesive set of products than many separate products to cover every need. Take that, Dell EMC. It also makes it worthwhile to invest in building your own flash drives rather than using commodity SSDs.

Our conclusion is that Pure is bringing the cloud-like storage infrastructure and consumption model to the on-premises world, using the containerization movement to its advantage. It will provide data infrastructure management facilities to virtualize datasets and overcome data gravity by moving compute (applications) to data instead of vice versa. Expect announcements on progress on this path at the Pure Accelerate event in June.

Lance B. Holton