The silent revolution: the point where flash storage takes over for HDD

The post The silent revolution: the point where flash storage takes over for HDD appeared first on Mirantis | Pure Play Open Cloud.
For years the old paradigm held true: if you wanted fast, you bought flash, but it cost you. If you wanted cheap, large HDD-based servers were the go-to, and the old standby for reasonably fast block storage, the 2TB x 24 chassis, was ubiquitous. For years it looked like flash would be relegated to performance tiers. But is this actually true? I’ve been suspicious for some time, but a few days ago I did a comparison for an internal project, and what I saw surprised even me.
Flash storage technology and comparative cost
Flash storage has developed at a breakneck pace in recent years. Not only have devices become more resilient and faster, but there is also a new interface to consider. Traditional SSDs are SATA-based or, in relatively rare cases, SAS-based, which severely limits the performance envelope of the devices. SATA SSDs top out at about 550MB/s maximum throughput and offer around 50k small-block input/output operations per second (IOPS), regardless of the speed of the actual chips inside the device.
This limitation is due to the data transfer speed of the bus and the need to translate each storage access request to a disk-based protocol (SATA/SAS) and, inside the SSD, back to a memory protocol, with the same translation happening in reverse when data is read.
Enter Non-Volatile Memory express (NVMe). This ‘interface’ is essentially a direct connection of the flash storage to PCIe lanes. A configuration of 4 lanes per NVMe is common, though technology exists to multiplex NVMes so more devices can be attached than there are PCIe lanes available. 
NVMe devices typically top out above 2GB/s, and can offer several hundred thousand IOPS – theoretically. They also consume a lot more CPU when operating in a software defined storage environment, which limits performance somewhat. However, in practical application they are still much faster than traditional SSDs – at what is usually a very moderate cost delta. 
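As a rough sanity check on these interface ceilings, here is a back-of-the-envelope sketch. The encoding overheads are the standard ones for SATA III and PCIe 3.0; the function names are ours, and these are theoretical maxima, which is why shipping SATA devices land around 550MB/s in practice:

```python
# Theoretical interface ceilings behind the throughput numbers above.

def sata3_max_mb_s():
    # SATA III: 6 Gbit/s line rate; 8b/10b encoding leaves 80% for payload
    return 6000 * 0.8 / 8  # ~600 MB/s

def pcie3_x4_max_mb_s():
    # PCIe 3.0: ~985 MB/s usable per lane (8 GT/s, 128b/130b encoding),
    # and a typical NVMe device gets 4 lanes
    return 4 * 985  # 3940 MB/s

print(sata3_max_mb_s())     # ~600 MB/s
print(pcie3_x4_max_mb_s())  # 3940 MB/s
```

The ~7x gap between the two ceilings is why NVMe devices can top 2GB/s while SATA SSDs cannot, no matter how fast the flash behind the interface is.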
If the performance of SATA SSDs is insufficient for a specific use case, moving to SAS SSDs is usually not worth the expense: NVMe devices offer much better performance and are usually no more expensive than their SAS counterparts, so moving directly to NVMe is preferable.
One more note: if NVMe devices operate with the same number of CPU cores as SATA SSDs, they are still somewhat faster and financially very comparable. The calculations below allocate more CPU cores to NVMe for performance-oriented applications.
Let’s look at how the numbers work out for different situations.
Small Environments
Let’s have a look at a 100TB environment with increasing performance requirements. The following table compares HDDs, SSDs, and NVMe; street prices are in thousands of US dollars, and IOPS are rough estimates:


| 100TB | HDD 6TB 12/4U cost [x1000 US$] | HDD 2TB 20/2U cost [x1000 US$] | SSD layout | SSD cost [x1000 US$] | NVMe layout | NVMe cost [x1000 US$] |
|---|---|---|---|---|---|---|
| 10k IOPS | 132 | 135 | 5x 10x 7.68TB | 91 | 5x 4x 15.36TB | 102 |
| 30k IOPS | 345 | 271 | 5x 10x 7.68TB | 91 | 5x 4x 15.36TB | 102 |
| 50k IOPS | 559 | 441 | 5x 10x 7.68TB | 91 | 5x 4x 15.36TB | 102 |
| 100k IOPS | 1,117 | 883 | 5x 10x 7.68TB | 91 | 5x 4x 15.36TB | 102 |
| 200k IOPS | 2,206 | 1,767 | 7x 14x 3.84TB | 113 | 5x 10x 7.68TB | 113 |
| 500k IOPS | 5,530 | 4,419 | 14x 14x 1.92TB | 168 | 7x 14x 3.84TB | 133 |
| 1000k IOPS | 11,034 | 8,804 | 42x 14x 1.92TB | 470 | 13x 14x 2TB | 168 |

In this relatively small cluster, as expected, HDDs are no longer viable. The more IOPS required, the more extra capacity must be purchased to provide enough spindles. This culminates in a completely absurd $11 million for a 1000K IOPS cluster built on 6TB hard disks.
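The spindle-count effect can be sketched roughly: an HDD cluster must be sized for whichever is larger, the capacity target or the IOPS target. The per-disk IOPS figure and replication factor below are illustrative assumptions of ours, not the exact inputs behind the tables:

```python
# Rough HDD cluster sizing: disks needed is the larger of the
# capacity-driven count and the IOPS-driven count.
# Assumptions (illustrative): ~150 random IOPS per HDD, 3x replication.
import math

def hdd_count(capacity_tb, iops_target, disk_tb=6, disk_iops=150, replicas=3):
    for_capacity = math.ceil(capacity_tb * replicas / disk_tb)
    for_iops = math.ceil(iops_target / disk_iops)
    return max(for_capacity, for_iops)

print(hdd_count(100, 10_000))     # 67 disks -- IOPS already dominates capacity
print(hdd_count(100, 1_000_000))  # 6667 disks -- wildly overprovisioned capacity
```

With these assumptions, hitting 1M IOPS on 6TB spindles means buying two orders of magnitude more disks than the capacity requires, which is exactly the cost explosion visible in the table.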
Middle of the Road
Of course, we all know that larger amounts of SSD storage are more expensive, so let’s quadruple storage requirements and see where we get. HDDs should become more viable, wouldn’t you think?

| 400TB | HDD 6TB 12/4U cost [x1000 US$] | HDD 2TB 20/2U cost [x1000 US$] | SSD layout | SSD cost [x1000 US$] | NVMe layout | NVMe cost [x1000 US$] |
|---|---|---|---|---|---|---|
| 10k IOPS | 250 | 510 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
| 30k IOPS | 405 | 510 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
| 50k IOPS | 655 | 510 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
| 100k IOPS | 1,311 | 883 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
| 200k IOPS | 2,593 | 1,767 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
| 500k IOPS | 6,495 | 4,419 | 14x 14x 7.68TB | 326 | 7x 14x 15.36TB | 348 |
| 1000k IOPS | 12,961 | 8,804 | 27x 14x 3.84TB | 437 | 14x 14x 7.68TB | 413 |

Surprise! Again we find that HDD is only viable for the slower speed requirements of archival storage. Note that the 15.36TB NVMe solution is not much more expensive than the SSD solution!
A note about chassis: to get good performance out of NVMe devices, far more CPU cores are needed than in HDD-based solutions. A rule of thumb is four OSDs per NVMe and two cores per OSD. This means that stuffing 24 NVMes into a 2U chassis and calling it a day is not going to provide exceptional performance. We recommend 1U chassis with 5-8 NVMe devices to reduce bottlenecking on the OSD code itself. (I’m also assuming that the network connectivity is up to transporting the enormous amount of data traffic.)
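The chassis rule of thumb above works out as follows (a minimal sketch; the function name is ours, and the ratios are the rules of thumb just stated):

```python
# CPU sizing for NVMe storage nodes: four OSDs per NVMe device,
# two cores per OSD (rule-of-thumb figures from the text above).

def cores_for_nvme(nvme_count, osds_per_nvme=4, cores_per_osd=2):
    return nvme_count * osds_per_nvme * cores_per_osd

print(cores_for_nvme(24))  # 192 cores -- unrealistic for a single 2U node
print(cores_for_nvme(6))   # 48 cores -- feasible in a 1U server
```

A 24-device 2U box would need roughly 192 cores to keep its NVMes busy, while a 1U node with 5-8 devices stays within what commodity CPUs actually offer.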
Petabyte Scale
If we enter petabyte scale, hard disks become slightly more viable, but at this scale (we are talking 64 4U nodes) the sheer physical size of the hard disk based cluster can become a problem:

| 1PB | HDD 6TB 12/4U cost [x1000 US$] | HDD 2TB 20/2U cost [x1000 US$] | SSD layout | SSD cost [x1000 US$] | NVMe layout | NVMe cost [x1000 US$] |
|---|---|---|---|---|---|---|
| 10k IOPS | 453 | 1,257 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
| 30k IOPS | 453 | 1,257 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
| 50k IOPS | 488 | 1,257 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
| 100k IOPS | 1,850 | 1,257 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
| 200k IOPS | 2,101 | 1,767 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
| 500k IOPS | 4,619 | 4,365 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |
| 1000k IOPS | 10,465 | 8,720 | 34x 14x 7.68TB | 789 | 17x 14x 15.36TB | 871 |

Note: The performance data for SSD and NVMe OSDs is conservatively estimated; actual performance will vary with the use case.
So what do we learn from all this? 
The days of HDD are numbered. 
For most use cases, SSD is superior even today. Moreover, SSD and NVMe costs per unit of capacity are still nosediving. SSD/NVMe-based nodes also make for much more compact installations and are far less vulnerable to vibration, dust, and heat.
The health question
Of course, cost isn’t the only issue. SSDs do wear. The current crop is far more resilient over the long term than SSDs from a couple of years ago, but they will still eventually wear out. Unlike HDDs, however, SSDs are not prone to sudden catastrophic failure triggered by a mechanical event or marginal manufacturing tolerances.
The good news, then, is that in almost all cases SSDs do not fail suddenly. They develop bad blocks, which are transparently replaced with fresh blocks from an invisible capacity reserve on the device; wear leveling handles all of this automatically. You will not see capacity degradation until the reserve runs out of blocks.
You can check the health of an SSD using SMART (smartmontools on Linux), which shows how many blocks have been reallocated, as well as the drive’s relative health as a percentage of the overall reserve capacity.
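As a sketch of what such a check might look like, here is a small parser for `smartctl -A` output from smartmontools. Note that SMART attribute names and numbering vary by vendor, so treat the attribute names below as common examples rather than universal:

```python
# Sketch: extract SMART wear indicators from `smartctl -A` text output.
# Attribute names (e.g. Media_Wearout_Indicator) are vendor-specific examples.
import re

def parse_smart_attrs(smartctl_output):
    """Map SMART attribute names to their normalized and raw values."""
    attrs = {}
    for line in smartctl_output.splitlines():
        m = re.match(
            r"\s*\d+\s+(\S+)\s+0x\w+\s+(\d+)\s+(\d+)\s+(\d+)\s+\S+\s+\S+\s+\S+\s+(\d+)",
            line,
        )
        if m:
            name, value, worst, thresh, raw = m.groups()
            attrs[name] = {"value": int(value), "raw": int(raw)}
    return attrs

sample = """
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   097   097   000    Old_age   Always       -       0
"""
attrs = parse_smart_attrs(sample)
print(attrs["Media_Wearout_Indicator"]["value"])  # 97 -- roughly 97% of rated life left
```

In practice you would feed it the live output of `smartctl -A /dev/nvme0n1` (or the relevant device node) and alert when the wear indicator drops toward the drive’s threshold.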
Bonus round: SSD vs 10krpm
In the world of low latency and high IOPS, the answer from HDD manufacturers is to bump up the rotational speed of the drives. Unfortunately, while this does make them faster, it also makes them more mechanically complex, more thermally stressed, and in a word: expensive.
SSDs are naturally faster and mechanically simple. They also — traditionally at least — were more expensive than the 10krpm disks, which is why storage providers have still been selling NASes and SANs with 10 or 15krpm disks. (I know this from experience, as I used to run high performance environments for a web content provider.)
Now have a look at this:

| Device type | Cost [US$] | Cost/GB [US$] |
|---|---|---|
| HDD 1.8TB SAS 10krpm Seagate | 370 | 0.21 |
| SSD 1.92TB Micron SATA | 335 | 0.17 |
| NVMe 2.0TB Intel | 399 | 0.20 |
| HDD 0.9TB Seagate | 349 | 0.39 |

In other words, 10krpm drives are obsolete not only from the cost/performance ratio, but even from the cost/capacity ratio! The 15krpm drives are even worse. The hard disks in this sector have no redeeming qualities; they are more expensive, drastically slower, more mechanically complex, and cost enormous amounts of money to run.
So why is there so much resistance to moving beyond them? I have heard two main arguments against SSDs:
Lifespan: With today’s wear leveling, this issue has largely evaporated. Yes, it is possible to wear out an SSD, but have a look at the math: a read-optimized SSD is rated for about one Drive Write Per Day (DWPD), that is, one write of the device’s whole capacity per day, over 5 years. Let’s compare this with a 1.8TB 10krpm HDD. A workload that averages 70MB/s (with a mix of small and large operations) at a 70/30 read/write ratio writes about 21MB/s, or 1.81TB/day, which is roughly one full device write per day on this HDD.
In other words, you won’t wear out the SSD under the same conditions within 5 years. If you step up to a 3 DWPD (mixed-use) drive, it still costs less than the HDD (about US$350), and you will have enough endurance even for very write-heavy workloads.
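The endurance math above can be written out as a quick sketch. The function name and the 1.92TB example capacity are our own; the workload figures are the ones just quoted:

```python
# How many device-writes per day does an average workload generate?

def dwpd_needed(avg_mb_s, write_fraction, capacity_tb):
    """Drive writes per day implied by an average mixed workload."""
    written_tb_per_day = avg_mb_s * write_fraction * 86_400 / 1_000_000
    return written_tb_per_day / capacity_tb

# 70 MB/s average traffic, 30% writes, on a 1.92TB SSD:
demand = dwpd_needed(70, 0.3, 1.92)  # just under one device write per day
print(demand)
```

Since the result stays below 1.0, even this fairly busy workload fits within a 1 DWPD read-optimized rating over the drive’s 5-year warranty period.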
TCO: It is true that an SSD uses more power as throughput increases; most top out at about 3x the power consumption of a comparable HDD when driven hard. But they also provide ~10x the throughput and >100x the small-block IOPS of the HDD. If the SSD is ambling along at the sedate pace of a 10krpm HDD, it consumes less power than the HDD. And if you stress the SSD’s performance envelope, you would need a large number of HDDs to match a single SSD, which would not even be in the same ballpark in either initial cost or TCO.
In other words, imagine having to stand up a whole node with 20 HDDs to match the performance of a single $350 mixed-use SSD that consumes 20W at full tilt. You would have to buy a $4,000 server with 20 HDDs at $370 each, which would, by the way, consume an average of maybe 300W.
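To make the TCO point concrete, here is an illustrative three-year cost sketch using the rough price and power figures above; the electricity rate is our assumption, and cooling, rack space, and maintenance are ignored:

```python
# Illustrative 3-year cost: hardware price plus electricity.
# Assumption (ours): US$0.15 per kWh; other costs ignored.

def three_year_cost(hw_cost_usd, avg_watts, usd_per_kwh=0.15):
    energy_usd = avg_watts / 1000 * 24 * 365 * 3 * usd_per_kwh
    return hw_cost_usd + energy_usd

ssd = three_year_cost(350, 20)                    # one mixed-use SSD at full tilt
hdd_node = three_year_cost(4000 + 20 * 370, 300)  # server chassis plus 20 HDDs
print(round(ssd), round(hdd_node))                # 429 12583
```

Even before counting cooling and rack space, the HDD node costs roughly 30x as much over three years for comparable performance.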
So as you can see, an SSD is the better deal, even from a purely financial perspective, whether you drive it hard or not.
Of course there are always edge cases. Ask us about your specific use case, and we can do a tailored head-to-head comparison for your specific application.
So what’s next?
We are already nearing the point where NVMe will supersede the SATA or SAS interface in SSDs. So the SSDs, which came out on top when we started this discussion, are already on their way out again.
NVMe has the advantage of being an interface created specifically for flash memory. It does not pretend to be an HDD, as SAS and SATA do, so it does not need to translate a disk access protocol into a flash memory protocol internally and back again on the way out. You can see the difference by looking at the performance envelope of the devices.
New flash memory technologies keep pushing the performance envelope, and the interface increasingly hampers performance, so the shift from SAS/SATA to NVMe is imminent. NVMe comes in multiple form factors: one closely resembles a 2.5” HDD for hot-swap purposes, and another (M.2, which resembles a memory module) serves internal storage that does not need hot-swap capability. Intel’s ruler design and Supermicro’s M.2 carriers will further increase storage density with NVMe devices.
On the horizon, new technologies such as Intel Optane again increase performance and resilience to wear, though currently at a much higher cost than traditional flash.
Maybe a few years from now everything will be nonvolatile memory and we can simply cut power to the devices. Either way, we will see further increases in density, performance, and reliability, and further decreases in cost.
Welcome to the future of storage!
Source: Mirantis
