Safe FC max_xfer_size settings (AIX 6.1 TL6 and VIOS 2.2)

AlexDeGruven Wut? Meechigan Icrontian
This isn't PC hardware, and is more directed at @rootwyrm than anything else, but I thought I'd put it here for posterity and for the possibility that there are some closet IBM nerds in the house outside of Phil and me.

On a data warehouse box, we're doing some deep I/O tuning to push as much performance out of the hardware as we can. We've adjusted disk queues, which has helped with I/O, and the clients are working on parallelizing their workload more than they ever have in the past (some of their workloads were using less than 10% of the box while it sat otherwise idle). This is dedicated hardware, so our ultimate goal is to push it as hard as we can without blowing the storage subsystems out of the water completely (HP EVAs backing SVC).

I'm currently adjusting the max_xfer_size parameter on the VIO's FC adapters. Considering we're working with PCIe 2.0 8Gbit dual-port cards (running at 4Gbit to match the SAN fabric), I'm pretty sure we can push this memory-driven parameter fairly far before negatively impacting the system (i.e. creating a boot failure).
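
For reference, this is roughly how I'm checking the current values before touching anything (fcs0, fscsi0, and hdisk4 are just example device names, not what's actually on our frames):

    # Current FC adapter settings on the VIOS (from the root shell via oem_setup_env):
    lsattr -El fcs0 -a max_xfer_size -a lg_term_dma -a num_cmd_elems

    # And the client LPAR side: protocol device settings and per-LUN queue depth
    lsattr -El fscsi0 -a fc_err_recov -a dyntrk
    lsattr -El hdisk4 -a queue_depth -a max_transfer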

So my question is: for anyone that has tuned this, how hard have you pushed it?

At this point, since we're past TL2, I'm not touching lg_term_dma. These settings were separated a while ago, and are no longer dependent on each other.

Comments

  • RootWyrm Icrontian
    edited December 2012
    Ohai!

    It depends entirely on the backing disk and controller there. Generally, as a rule of thumb, IBM tends toward leaving it at 0x100000 (the 1MB default) and tuning queue_depth and num_cmd_elems instead.
    The problem is you've got what I refer to as the 'Good vs. Bad.' SVC's good, EVA's bad. If it's anything short of an 8400, you're going to run out of headroom very quickly with any sort of tuning. Presuming the EVA's shared, you will end up tuning to get the warehouse out of the way of everything else.

    Rule of thumb on SVC for 8G4 and later nodes with 5.1 and later is to set queue_depth as (base queue * number of backing controllers * tuned backing disk queue) / 2. So for example on DS4800's I've tuned being fronted by a set of 8G4s, that'd be:
    (Base Queue 32 * 4 Controllers) * Tuned Backing Disk Queue 64 = 8192; 8192 / 2 = 4096
    For FC4/8, your maximum num_cmd_elems is about 200 per port, BUT in a VIOS environment it should not exceed 256 to prevent contention stalling. (This presumes a VIOS on a 550 or larger.)
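
    The arithmetic sketched out, if it helps (these are the DS4800/8G4 numbers from the example, not your EVA/CF8 gear):

        # Rule-of-thumb queue_depth: (base queue * controllers * tuned backing queue) / 2
        base_queue=32            # per-controller base queue
        controllers=4
        backing_queue=64         # tuned backing disk queue
        echo $(( base_queue * controllers * backing_queue / 2 ))    # -> 4096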

    Presuming EVA8400 with the maximum possible configuration on controllers, your controller queue depth presumption should be 24 or lower. Yes, 24. The EVA8400 is not a powerful box, and is exceptionally painful and difficult to tune for shared storage. Doubly so for the P family, as they have a whopping 8GB of cache. WOW, a whole 8GB with 4GB effective for 3.2TB! I don't have any empirical data for queue depth on those, but even CF8's and double-cache amplification don't help them that much. Presume effective queue depth per port on the HSV430's is probably still around 4. Yes, a whole 4.

    All that said, Oracle likes to disagree just to be jerks about it, tending toward recommending 8MB on Solaris and 4MB on AIX (while also living in the past and recommending 16MB DMA tuning.) Delusional? Yes. Very.
    My general recommendation there is to align your max transfer size with the database behavior (typically 4MB) and then reduce by half based on "2 Operations per Core" / "2 Operations per VIOS Core", e.g. a 4 core 2GB VIOS would start at 4MB, drop to 2MB for 8 parallel operations, and 1MB for 16 parallel operations (NEVER go below 1MB.) That's for sets in the 1-10TB range approximately, with average batch sizes of under 500GB. Emphasis on batch. For large batch operations, you start tuning max_xfer_size upward in 1MB increments.
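
    A quick sketch of how I'd apply that halving rule (my own shorthand; the 4-core VIOS and 16 parallel streams are just illustrative numbers):

        db_xfer_mb=4        # typical Oracle large-transfer size, in MB
        vios_cores=4
        parallel_ops=16     # concurrent large I/O streams
        xfer=$db_xfer_mb
        ratio=$(( parallel_ops / vios_cores ))
        # Halve the transfer size for each doubling of parallelism per VIOS core,
        # but never drop below 1MB.
        while [ $ratio -gt 1 ] && [ $xfer -gt 1 ]; do
            xfer=$(( xfer / 2 ))
            ratio=$(( ratio / 2 ))
        done
        echo "${xfer}MB"    # 4 cores: 8 parallel ops -> 2MB, 16 parallel ops -> 1MB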

    Don't exceed 12MB (or maybe it's 16MB, I forget which) on SVC. It can actually break the caching behavior in subtle ways.

    Edit: Oh crap. I forgot to explain the DMA roughly.
    DMA: don't tune it unless you're on PCI-X/266, 533 or PCI-Express and you're doing it based on empirically tested bus contention. Because these buses have such high clocks, and serialization in the case of PCIe, tuning the DMA upward tends to have zero benefit and can actually be a detriment in some scenarios. The only cases where I've seen a need to tune lg_term_dma upward past default are when you've got seriously large backing disk - like a maxed 795 with multiple dedicated HDS VSPs or DS8k's - and a lot of operations that are EXTREMELY large individually.

    The only other time you'd turn lg_term_dma up is when you're doing a lot of very large single operations (e.g. pulling in single records - not rows or columns, but individual records - that are themselves very large). Otherwise, it's going to get broken down into multiple operations where lg_term_dma tuning wouldn't help anyway. (The default is 8MB.)
  • AlexDeGruven Wut? Meechigan Icrontian
    I'm planning on leaving lg_term_dma alone for the time being, and we're only going to touch it if IBM recommends it, which is doubtful, anyway.

    For queues, I'm bumping to 100/LUN, 1500 at the VIO FC, and 1400 at the LPAR FC (num_cmd_elems maxes at 2048 for the FCs, and 256 for the LUNs). Our current settings of 60/LUN, 1024 VIO FC, and 1000 LPAR FC haven't caused a stir with the SAN guys yet, so I think we're pretty safe to go higher.
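
    Roughly what the changes look like as commands, for posterity (fcs0 and hdisk4 are placeholders for the real adapter and LUN names; -P stages the change for the next boot):

        # VIOS FC adapters (from oem_setup_env / root):
        chdev -l fcs0 -a num_cmd_elems=1500 -P

        # Client LPAR FC adapters and LUNs:
        chdev -l fcs0 -a num_cmd_elems=1400 -P
        chdev -l hdisk4 -a queue_depth=100 -P

        # Verify after the reboot:
        lsattr -El fcs0 -a num_cmd_elems
        lsattr -El hdisk4 -a queue_depth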

    I've done initial testing at 2MB for max_xfer_size, and things have been doing OK there. I'm going to bump higher (this is on test LPARs, not yet on the main warehouse) this afternoon and see if I can push it till I break it and/or start catching crap from storage. Fortunately, our SVC environment is spread wide with lots of redundancy, so I can hammer a few without hurting anyone else immediately.
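
    For anyone curious, the test setting itself is just something like this (fcs0 standing in for the actual adapter; 0x200000 is the 2MB value):

        chdev -l fcs0 -a max_xfer_size=0x200000 -P   # staged; takes effect at the next boot of the test LPAR
        lsattr -El fcs0 -a max_xfer_size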

    So from what I've been reading, since 6.1 TL2, max_xfer_size has no bearing on lg_term_dma. So if I'm reading that correctly, there's no danger in increasing that beyond the 8MB of the default dma setting, correct?

    I've pinged the storage team to verify model/level of the SVCs and EVAs so I can get a better picture.
  • AlexDeGruven Wut? Meechigan Icrontian
    edited December 2012
    EVA 8400
    SVC 2145-CF8 running 5.1.x
  • RootWyrm Icrontian
    Okay, you're way too high, actually. Again: going too high has negative performance impacts. That's why you do not mess with lg_term_dma.
    max_xfer_size is the maximum single-block transfer; lg_term_dma is "Long Term DMA", not Large. That's the size of the DMA buffer. On a high clock rate, highly serialized bus, it's detrimental to increase it unless you're doing very large single operations, because the two are and are not divorced from each other. max_xfer_size, again, is the single-operation limit. Long Term DMA at 8MB means it will hold 8 max_xfer_size operations in DRAM until they are flushed by one or more mechanisms. There's some funniness post-TL2 with the mechanisms so they're no longer directly related, but that gives you a rough idea.

    The SVC itself requires tuning on CF8 with 5.1, but I'd have to sit down with your storage guys, and chances are very good the baseball bat would come out. (I have very little patience there since 40% of it is first-hand experience and 60% of it is from the people who develop SVC.) Lesson One: AUTO-TIERING OFF NOW AND FOREVER AND NO SSD CACHING. EVER. UNDER ANY CIRCUMSTANCES. If you do either of those with Oracle VDisks, performance will instantly go to shit. And that's the nice way to describe it. (Doubly so if you're tampering with max_xfer_size.) The rest of the SVC tuning depends on the number of nodes. Thank gods they're EVA 8400's - up to 22GB (11GB effective.) However, they MUST have write cache mirroring on always.

    Can't share the exact algorithm (it are proprietary trade sekrit) but you should generally see a read cache overlap of around 20%+ - so 2.2GB+ giving the CF8's effective caching of around 8GB DRAM per IO Group. Which is a whole other issue.
    IO Groups should be balanced in a round-robin fashion and corrected manually where they aren't. Which means it's probably already done if it's 4 nodes, and it's almost definitely wrong if it's 6 or 8 nodes, because the auto round-robin provisioning is still badly designed and they can't be bothered to fix it. And yes, rebalancing is a huge pain, but it's also critical because you have an inverse cache scenario (the backing disk has less than or equal cache to the SVC.)

    You definitely have the queues too high. Remember: a queue is a holding space for pending operations, not actual disk I/O. Everyone makes this mistake, including myself in this case, because I presume they know the proper method. Step 1 is to tune to disk limits, which means clamping the queues way the hell down. Forget the point where disk performance decreases - you deliberately want to drive it down to the point where you're seeing disk wait on high random loads. THEN you start turning back up testing against actual workload. SVCs will give false results on sequential testing by design without extreme and difficult levels of tuning on the SVC and backing disk. In an inverse cache scenario you should be looking for tuned queue to give you more than 200% for reads up to 8GB and around 150% for longer than 8GB presuming throughput of not less than 400MB/s on cache operations. (I forget what exactly the EVA8400 taps out at.. but let's just say it's not much by comparison to a lot of the stuff I work on. It's not well suited to a lot of I/O intensive operations - and especially not shared storage - though it's a damn sight safer than the DS5k.)
  • AlexDeGruven Wut? Meechigan Icrontian
    edited December 2012
    Oh, we're definitely seeing queues fill up under normal workloads, even at 60, 1000, and 1024. We have a per-LUN limit of 80MB/sec and ~30 LUNs in play. Barring any other traffic, we're going with a max of 400MB/sec read and 100MB/sec write as our goal, as the CPU on the EVAs starts going down the tubes at much more than that, and they're shared.

    I've got the max_xfer_size at 4MB right now on my testing LPAR and running some heavy loads on it via ndisk64. I have to push it really far to get any results in sqfull right now, which is what we're looking for. In default tuning, we were hitting 20k+ on sqfull under normal operations.

    Under original tuning, we were getting really heavy traffic on "No Adapter Elements" and "No DMA Resource..." counters with normal workloads, which is why we bumped the queues so high and started looking at max_xfer_size.
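
    For reference, this is roughly how we're watching those counters (fcs0 and hdisk4 being stand-ins for the real device names):

        # Adapter-level resource starvation counters:
        fcstat fcs0 | grep -i -e "No DMA Resource" -e "No Adapter Elements" -e "No Command Resource"

        # Per-LUN service queue stats -- sqfull is the service-queue-full count:
        iostat -D hdisk4 5 3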

    The problem with this particular warehouse is that it's starting to move into repository workloads with rapidly updating data (weekly now, with some people clamoring for daily) and massive ad-hoc queries, so we have lots of disparate workloads happening simultaneously on top of the long-running loads.

    And to reiterate: We're not touching lg_term_dma.
  • RootWyrm Icrontian
    Ugh. Yeah.

    Dug up my notes on the EVA's. Unless you've got, oh, a lot more 8400's just sitting around? Forget 400MB/s. You're going to have to ramp the stripe, significantly - across cabinets, not spindles. The queue filling is because the EVA 8400 cannot service the requests; the EVA is not a tier 1 array and is only marginally a T2. (This is why we were prepared to stay DS4800 over EVA8400. An obsoleted LSI/Engenio rebrand was significantly faster.) Guarantee you that what I'm going to find if I look at the MDisk XML is the EVA 8400 pushing 20-30ms latency average. Latency basically goes completely to hell past about 50% CPU load. (Caution: PDF from SPC.) So forget 100% - you cannot exceed 50% without crippling everything else. Past 70% and fault tolerance can be put at risk - note that SVC's should similarly not be pushed past 75% CPU average. Also note that those IOPS numbers don't go up significantly between 8000 and 8400. The 8400 is a minor speed bump relatively speaking - remember that your per-spindle IOPS peak will be 180-200, subtracting up to 40% for overhead depending on Vraid, snapshots, mirrors, etcetera. The HSV450 is not a particularly powerful controller in many ways, especially compared to many of its contemporaries.

    The EVA8400 is not dissimilar to other virtualized arrays in that you don't address blocks or stripes but 'virtual pages' which tend to clash because they're a very odd size. (HDS is 42MB on USP and VSP, as I recall.) It's also an FC-AL backend meaning any given shelf is limited to 4Gbit as loops are redundant not parallel, with a limit of 6/2 * 4Gb = 1.5GB/s for shelf operations with arrays potentially spanning shelves and impacting performance.. it's messy.
    Observe on page 15 the Occupancy Alarm settings for example (Warning: PDF)

    Since you're in a shared scenario behind an SVC, it gets tricky. The SVC will not balance requests effectively because it's simply incapable of it. Increasing performance above what you're seeing is just plain going to require buying more 8400's or something else entirely. So unfortunately instead of being able to tune upward, you're probably going to have to tune downwards. Even with VIOS, excluding the IVE 10GbE, much more powerful disk setups have been brought to their knees by mere POWER 550s. The SVC has no active share-balancing mechanism - it's essentially FIFO - so you're going to have to start moving the other direction to preserve the other applications unfortunately.
  • AlexDeGruven Wut? Meechigan Icrontian
    We've actually safely pushed our most recently-installed EVA at 800MB/sec sustained before it was deployed to the rest of the environment, which is why we're trying to keep aggregate I/O from the host to 400/100 r/w. One of the other reasons we decided to bump queues so high: since the I/O is being throttled at the LUN level, we figured a larger queue would reduce overall waits, as the system wouldn't be busy continually hitting full queues as much.

    In some testing of my currently-proposed queue settings in our Q/A environment, I was able to push one LUN at the host to 30k IOPS with a sustained max service time of ~20ms, and an avgq of around 60-80. This was using ndisk64 running for 120sec sustained with 90 threads and 3MB/thread (this LUN is capped at 80MB/sec so we get a better picture of the load that we see from the warehouse).
  • RootWyrm Icrontian
    We've actually safely pushed our most recently-installed EVA at 800MB/sec sustained before it was deployed to the rest of the environment, which is why we're trying to keep aggregate I/O from the host to 400/100 r/w. One of the other reasons we decided to bump queues so high: since the I/O is being throttled at the LUN level, we figured a larger queue would reduce overall waits, as the system wouldn't be busy continually hitting full queues as much.

    In some testing of my currently-proposed queue settings in our Q/A environment, I was able to push one LUN at the host to 30k IOPS with a sustained max service time of ~20ms, and an avgq of around 60-80. This was using ndisk64 running for 120sec sustained with 90 threads and 3MB/thread (this LUN is capped at 80MB/sec so we get a better picture of the load that we see from the warehouse).

    This is the #1 error I see in installations. People bench things like the EVA with a single system usually going "well I threw some LPARs and lots of random IO at it so it's representative." Very much not true. They also run the tests too short to get a real picture.
    The two things that kill EVA more than anything else are cache runout and number of hosts. That and the throttling features. I can get more than 700MB/sec from an IBM DS4200 - which is far more than it's technically supposed to be rated for. Throw more than one host at it, that number more than halves in an instant. Throw a transfer rate exceeding cache capability at it, and you're looking at maybe 150MB/sec if you're insanely lucky and it's all sequential.

    The numbers you're seeing there are DANGEROUS. I cannot emphasize that part enough. Anywhere near enough. SVC CF8's when caching properly reduce visible latency. Hitting 20ms visible on SVC means you're exceeding 30ms on MDisk side. Exceeding 30ms means you're out of controller on the EVA. Which means the software's still under-reporting. 30K visible at 20ms means 25-27K at 30ms. The erratic performance is symptomatic of exceeding hardware capabilities, period. If you were within tolerances, it would be a flat line.

    Doing the math, that gives you:
    120 sec * 90 threads * 3MB = ~32GB; 32GB / 80MB per second = 405 seconds required per 32GB
    You only moved a total of 9.6GB/LUN - meaning it was more than 80% cache.
    You have way, way too many threads and the run was unusably short. And you ran out of burst - sustained load is going to be orders of magnitude worse. The proper methodology is to run not LESS than:

    (Backend Cache + SVC Cache * IO Group) * 4 || VIOS FC Ports * 4 Threads * Hosts || 120 Minutes per set, minimum 4
    All LUNs must be UNLIMITED for this testing.
    You also need to do at least 4 runs of ORION. It's normal for results to differ.

    Runs 1 and 3 should be sequential read and sequential write, with runs 2 and 4 being random runs at 80/20 R/W bias and 60/40 R/W bias. You're looking for 100MB/s write in parallel with 400MB/s read - and that's not happening unless you've got a good number of 8400's. Unless you're using unprotected storage (Vraid0), write costs ~2.6x read. So that means you need aggregate raw of 660MB/s (5.1Gb/s) for one host before snapshots. Should work out to roughly 11-18K IOPS - again, for a single application. However, that 100MB/s of write is subject to array amplification making the actual disk IOs N+1, as Vraid does not use specific dedicated parity - any write generates two writes within that disk group.
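
    If it helps, the write-cost arithmetic spelled out (2.6x being the Vraid write penalty quoted above; plug in your own targets):

        read_mb=400
        write_mb=100
        # Integer shell math won't take 2.6, so lean on bc for the multiplier:
        aggregate=$(echo "$read_mb + $write_mb * 2.6" | bc)
        echo "${aggregate} MB/s aggregate raw"     # -> 660 MB/s for one host, before snapshots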

    Really, all you need to know is right here from HP themselves. For OLTP at 60/40 the absolute maximum safe for the EVA 8400 is ~20K IOPS - for SVC 5.1 add 10% to that. And those benchmarks were run with maximum sized single LUNs with a maximum configuration EVA 8400 using 324 spindles. The spiking occurs in Vraid1 and Vraid5 on identical curves, indicating CPU limitation and not the spindle limitation people claim.
    That gives you an effective per-disk IOPS number of 64 IOPS for 15K RPM dual-port for OLTP workloads on Vraid5. And that's straight from HP. They officially document (but bury) how much the EVA disks are derated - 10K FC Vraid5, 53 IOPS/disk; 10K FC Vraid1, 83 IOPS/disk; 15K Vraid5, 73 IOPS/disk; 15K Vraid1, 115 IOPS/disk. According to even these derated numbers, OLTP should have at minimum another 10K IOPS in Vraid1 and at least another 5K IOPS in Vraid5.
    Those numbers work out to 242MB/s OLTP and 164MB/s OLTP capacity based on the given 8KB transfer size with the transfer rate unlimited. (IOPS * 8KB.) That's per cabinet - so reaching your desired performance will require a minimum of 2 x EVA 8400 with 324 15K spindles each in Vraid1, with an approximate 1:1 HSV450:SVC port mapping. And no other hosts attached.

    Like I said; you're going to have to tune downward or buy more cabinets with the disk guys doing the joy of manually migrating every single VDisk on the SVC twice since that remains the only way to rebalance. There's just not going to be enough resources based on all the numbers I'm seeing here.
  • AlexDeGruven Wut? Meechigan Icrontian
    edited December 2012
    And that's actually what we're aiming for. We want to tune the system to push the bottleneck entirely onto the storage subsystems. Then we let the SAN guys deal with that, heh.

    Also, we set our overall max target at 400MB/sec for the EVAs; we were only blasting at 800 as an overall stress test before deployment.
  • RootWyrm Icrontian
    Well, the problem is, defaults are going to give you that with very little tweaking, which is exactly the problem. You have no overhead, so you can't tune, because any tuning results in bashing the array to death. I'm going to presume they have two 8400's behind the SVC at least, but without knowing the Vraid behind it, it's impossible to say how far at or past the limit you are. So I'm a bit hesitant to recommend too much tuning, just because the latency's going to give you some really skewed numbers.

    Like I said; hit it with ORION a couple times, let's see what it does there.. ORION's a lot better for predicting Oracle behavior, and should be able to throw together a set of parameters that will let you pretty much destroy the backing disk. The rest of the tools aren't going to be much use here because you can't really get a true 60/40 biasing or Oracle-style behavior with parallel blocking writes.
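
    Something along these lines for the ORION runs (flag names are from memory of the 11g-era binary, so sanity-check them against orion -help; dwh.lun is a hypothetical file listing your raw LUN paths):

        # 60/40 read/write random run against the LUNs listed in dwh.lun:
        ./orion -run advanced -testname dwh -num_disks 30 -write 40 -matrix detailed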

    With a little bit of luck, should be able to leverage the parallel blocking writes and some tuning to basically cripple the backing disk though. ;)
  • AlexDeGruven Wut? Meechigan Icrontian
    Not looking for crippling, heh. Just looking to hit the barking point, then backing off a touch from there. I think the settings I'm putting in place tomorrow night will get us where we want to be.

    We have quite a few EVAs behind the SVC's, actually. Storage-wise, we're a pretty large shop. We don't pull out the big guns like 770/80/95 on the front end mainly because our workload is mostly data storage and retrieval, rather than number-crunching (much as I would like some serious number-crunching hardware, hello Watson), but our storage farm is wide and deep, which is part of why we can get away with beating on the backend more than some places could.

    We're also looking into moving to a higher tier of backend storage to alleviate some of the IO bottlenecks that the EVAs have presented us with. The EVAs were our first foray into non high-end storage (our old storage was all DS8xxx), and we've learned pretty quickly what our customers prefer.
  • RootWyrm Icrontian
    edited December 2012

    Not looking for crippling, heh. Just looking to hit the barking point, then backing off a touch from there. I think the settings I'm putting in place tomorrow night will get us where we want to be.

    We have quite a few EVAs behind the SVC's, actually. Storage-wise, we're a pretty large shop. We don't pull out the big guns like 770/80/95 on the front end mainly because our workload is mostly data storage and retrieval, rather than number-crunching (much as I would like some serious number-crunching hardware, hello Watson), but our storage farm is wide and deep, which is part of why we can get away with beating on the backend more than some places could.

    Hee hee.. not to troll, but no, EVA8400's not wide or deep.. it's kinda like XIV. Except not as lulzfail since, you know, EVA's can actually be managed and monitored and don't lose all your data in the event of a minor hiccup. But most of the stuff I work with for perspective, tends toward AMS2k's by the half dozen tuned for ~40-50K IOPS/ea (so yes, aggregates over 300K with 6+ node SVC) and multi-cluster SVC fronting 4+ DS8700's. So EVA's definitely at the low end of things. (Though it requires a hell of a lot less tuning to get that 20K out of - takes even me two to three months to get a DS4800 to over 30K IOPS and I know all the recipes!)

    That said.. the latency's indicating too much breakage somewhere in the chain, but I can't really tell you where without a lot more data. And the storage guys probably don't know how to get the statistical data out of the SVC that's required to analyze. Not dissing them - it's actually a tremendous pain in the ass to get out, much less analyze. And it's not available or accurate through the built-in performance monitoring, since we're specifically looking for MDisk spiking relative to visible VDisk spiking. MIGHT be able to get it out of the EVA's performance monitoring, presuming they haven't integrated with OpenView / Matrix. Man, talk about doing storage wrong there...

    Other problem is: fix the latency problem and you just spiked the shit out of your tuning. High latency tuning means increasing queue depth, increasing buffers, trying to pad things out, and forcing writes as hard as you can so the application has less visibility to the painful delays and less data risk. As soon as you fix the latency problem, you've now got a huge data risk because you've got umpteen dozen layers of buffering and caching based on trying to mask slow commits that aren't slow any more. (Which, trust me, can really piss off Oracle.) We can presume it's the EVA8400's because you're hitting 'em hard enough, but there's also no guarantee there aren't other factors in play.. but of course, I'm guessing the storage admins are all 'NO! MINE! YOU'RE TOO STUPID TO UNDERSTAND THIS STUFF!' (Sorry, you'll have to invest in your own cluebat there.)
    We're also looking into moving to a higher tier of backend storage to alleviate some of the IO bottlenecks that the EVAs have presented us with. The EVAs were our first foray into non high-end storage (our old storage was all DS8xxx), and we've learned pretty quickly what our customers prefer.
    Yeah, the EVA is a 10+ year old architecture that hasn't really been revamped up till now. (Sadly, the revamp basically consisted of admitting it's not capable of being high end.) The fact is that short of the DS8k, which has a large number of architecturally unique features, your choices are: HDS USP-V (now VSP), IBM DS8k, HDS VSP (formerly USP-V), and IBM DS8k. IBM will tell you XIV is just as good - nope. It's not. They'll say V7000 is just as good - nope. It's not. And I just got done assisting a site evaluating a third contender with a respectable offering on paper. They fell flat. Said site has a very large number of DS8300's fronted by multiple SVC clusters.
    HP/3PAR's got some very interesting offerings, but I've not seen latest SVC qualification data on them. The first and second generation stuff was.. really quite bad, nowhere near enough CPU and a horrible OS and very underwhelming on paper - except for bulk storage with low performance requirements. But the newer stuff is rather impressive on paper and has turned in some very impressive numbers on SPC as well. So basically I'm saying "bitchslap your VAR and/or HP rep who sold you on EVA and demand to know why they didn't present 3PAR."