Understanding I/O: Random and Sequential

Storage for DBAs: Ever been to one of those sushi restaurants where the food comes round in dishes on a conveyor belt? As each dish travels around the loop you eye it up and, as long as you can make your mind up in time, grab it. However, if you are as indecisive as me, there’s a chance it will be out of range before you come to your senses – in which case you have to wait for it to complete a further full revolution before getting another chance. And that’s assuming someone else doesn’t get to it first.

Let’s assume that it takes a dish exactly 4 minutes to complete a whole lap of the conveyor belt. And just for simplicity’s sake let’s also assume that no two dishes on the belt are identical. As a hungry diner you look in the little menu and see a particular dish which you decide you want. It’s somewhere on the belt, so how long will it take to arrive?

Probability dictates that it could be anywhere on the belt. It could be passing by right now, requiring no wait time – or it could have just passed out of reach, thus requiring 4 minutes of wait time to go all the way round again. As you follow this random method (choose from the menu then look at the belt) it makes sense that the average wait time will tend towards halfway between the min and max wait times, i.e. 2 minutes in this case. So every time you pick a dish you wait an average of 2 minutes: if you have eight dishes the odds say that you will spend (8 x 2) = 16 minutes waiting for your food. Welcome to the disk data diet, I hope you weren’t too hungry?

Now let’s consider an alternative option, where you order eight dishes from the chef and he or she places all of them sequentially (i.e. next to each other) somewhere on the conveyor belt. That location is random, so again you might have to wait anywhere between 0 and 4 minutes (an average of 2 minutes) for the first dish to pass… but the next seven will follow one after the other with no wait time. So now, in this scenario, you only had to wait 2 minutes for all eight dishes. Much better.
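The arithmetic behind the two sushi scenarios can be written down as a rough sketch. The function names are my own and purely illustrative; the only inputs are the figures used above (a 4-minute lap and eight dishes), and the model assumes each randomly chosen dish sits at a uniformly random position on the belt.

```python
def expected_wait_random(dishes: int, lap_minutes: float) -> float:
    """Each dish is somewhere random on the belt, so each one costs
    half a lap of waiting on average."""
    return dishes * lap_minutes / 2


def expected_wait_sequential(lap_minutes: float) -> float:
    """Only the first dish costs a wait (half a lap on average);
    the rest follow immediately behind it."""
    return lap_minutes / 2


LAP_MINUTES = 4.0  # time for one full lap of the belt
DISHES = 8         # number of dishes ordered

random_wait = expected_wait_random(DISHES, LAP_MINUTES)
sequential_wait = expected_wait_sequential(LAP_MINUTES)

print(f"random order:     {random_wait:.0f} minutes of waiting on average")
print(f"sequential order: {sequential_wait:.0f} minutes of waiting on average")
```

The gap between the two grows with the number of dishes: the random figure scales with the order size, while the sequential figure does not.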

I’m sure you will have seen through my analogy right from the start. The conveyor belt is a hard disk and the sushi dishes are blocks which are being eaten / read. I haven’t yet worked out how to factor a bottle of Asahi Super Dry into this story, but I’ll have one all the same thanks.

Random versus Sequential I/O

I have another article planned for later in this series which describes the inescapable mechanics of disk. For now though, I’ll outline the basics: every time you need to access a block on a disk drive, the disk actuator arm has to move the head to the correct track (the seek time), then the disk platter has to rotate to locate the correct sector (the rotational latency). This mechanical action takes time, just like the sushi travelling around the conveyor belt.

Obviously the amount of time depends on where the head was previously located and how fortunate you are with the location of the sector on the platter: if it’s directly under the head you do not need to wait, but if it just passed the head you have to wait for a complete revolution. Even on the fastest 15k RPM disk that takes 4 milliseconds (15,000 rotations per minute = 250 rotations per second, which means one rotation is 1/250th of a second or 4ms). Admittedly that’s faster than the sushi in my earlier analogy, but the chances are you will need to read or write a far larger number of blocks than I can eat sushi dishes (and trust me, on a good day I can pack a fair few away).
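The rotation arithmetic is easy to check for other spindle speeds. Below is a minimal sketch with hypothetical helper names; the 7,200 RPM figure is my own illustrative assumption and does not come from the text above. The "average" helper applies the same halfway-between-min-and-max reasoning as the conveyor belt.

```python
def rotation_time_ms(rpm: float) -> float:
    """Time for one full platter revolution, in milliseconds."""
    return 60_000.0 / rpm  # 60,000 ms per minute divided by revolutions per minute


def average_rotational_latency_ms(rpm: float) -> float:
    """On average the sector is half a revolution away from the head."""
    return rotation_time_ms(rpm) / 2


for rpm in (15_000, 7_200):  # 7,200 RPM is an assumed comparison point
    print(
        f"{rpm} RPM: full rotation {rotation_time_ms(rpm):.2f} ms, "
        f"average wait {average_rotational_latency_ms(rpm):.2f} ms"
    )
```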

What about the next block? Well, if that next block is somewhere else on the disk, you will need to incur the same penalties of seek time and rotational latency. We call this type of operation a random I/O. But if the next block happened to be located directly after the previous one on the same track, the disk head would encounter it immediately afterwards, incurring no wait time (i.e. no latency). This, of course, is a sequential I/O.

Size Matters

In my last post I described the Fundamental Characteristics of Storage: Latency, IOPS and Bandwidth (or Throughput). As a reminder, IOPS stands for I/Os Per Second and indicates the number of distinct Input/Output operations (i.e. reads or writes) that can take place within one second. You might use an IOPS figure to describe the amount of I/O created by a database, or you might use it when defining the maximum performance of a storage system. One is a real-world value and the other a theoretical maximum, but they both use the term IOPS.

When describing volumes of data, things are slightly different. Bandwidth is usually used to describe the maximum theoretical limit of data transfer, while throughput is used to describe a real-world measurement. You might say that the bandwidth is the maximum possible throughput. Bandwidth and throughput figures are usually given in units of size over units of time, e.g. Mb/sec or GB/sec. It pays to look carefully at whether the unit is using bits (b) or bytes (B), otherwise you are likely to end up looking a bit silly (sadly, I speak from experience). In the previous post we stated that IOPS and throughput were related by the following relationship:

Throughput = IOPS x I/O size

It’s time to start thinking about that I/O size now. If we read or write a single random block in one second then the number of IOPS is 1 and the I/O size is also 1 (I’m using a unit of “blocks” to keep things simple). The Throughput can therefore be calculated as (1 x 1) = 1 block / second.

Alternatively, if we wanted to read or write eight contiguous blocks from disk as a sequential operation then this again would only result in the number of IOPS being 1, but this time the I/O size is 8. The throughput is therefore calculated as (1 x 8) = 8 blocks / second.

Hopefully you can see from this example the great benefit of sequential I/O on disk systems: it allows increased throughput. Every time you increase the I/O size you get a corresponding increase in throughput, while the IOPS figure remains resolutely fixed. But what happens if you increase the number of IOPS?
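The relationship Throughput = IOPS x I/O size can be sketched in a few lines. The function names are hypothetical, and the 8 KiB block size used for the byte conversion is purely an assumption for illustration; the text above deliberately sticks to units of "blocks".

```python
def throughput_blocks_per_second(iops: float, io_size_blocks: int) -> float:
    """Throughput = IOPS x I/O size, with the I/O size measured in blocks."""
    return iops * io_size_blocks


def throughput_bytes_per_second(iops: float, io_size_blocks: int,
                                block_size_bytes: int = 8192) -> float:
    """Same relationship in bytes; the 8 KiB default block size is an assumption."""
    return throughput_blocks_per_second(iops, io_size_blocks) * block_size_bytes


# One random single-block I/O per second versus one 8-block sequential I/O per second.
single_block = throughput_blocks_per_second(iops=1, io_size_blocks=1)
eight_blocks = throughput_blocks_per_second(iops=1, io_size_blocks=8)

print(f"1 IOPS x 1 block  = {single_block} block(s) / second")
print(f"1 IOPS x 8 blocks = {eight_blocks} block(s) / second")
print(f"in bytes (assumed 8 KiB blocks): {throughput_bytes_per_second(1, 8)} bytes / second")
```

Both calls use the same IOPS figure; only the I/O size changes, which is exactly why the throughput rises while the IOPS stay fixed.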

Latency Kills Disk Performance

In the example above I described a single-threaded process reading or writing a single random block on a disk. That I/O results in a certain amount of latency, as described earlier on (the seek time and rotational latency). We know that a full rotation of a 15k RPM disk takes 4ms, so the average rotational latency is half of that, or 2ms. Let’s add another 3 milliseconds for the disk head seek time and call the average I/O latency 5ms. How many (single-threaded) random IOPS can we perform if each operation incurs an average of 5ms wait? The answer is 1 second / 5 ms = 200 IOPS. Our process is hitting a physical limit of 200 IOPS on this disk.

What do you do if you need more IOPS? With a disk system you only really have one choice: add more disks. If each spindle can drive 200 IOPS and you require 80,000 IOPS then you need (80,000 / 200) = 400 spindles. Better clear some space in that data centre, eh?
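The same sums can be expressed as a minimal sketch, assuming a single-threaded process that issues one I/O at a time and waits for each to complete. The helper names are my own; the 5ms latency and the 80,000 IOPS target are the figures used above.

```python
import math


def single_threaded_iops(latency_ms: float) -> float:
    """With one outstanding I/O at a time, IOPS is one second divided by the latency."""
    return 1000.0 / latency_ms


def spindles_needed(target_iops: float, iops_per_spindle: float) -> int:
    """Number of disks required, rounded up because you cannot buy part of a disk."""
    return math.ceil(target_iops / iops_per_spindle)


iops_per_disk = single_threaded_iops(latency_ms=5.0)
disks = spindles_needed(target_iops=80_000, iops_per_spindle=iops_per_disk)

print(f"IOPS per spindle at 5 ms latency: {iops_per_disk:.0f}")
print(f"spindles needed for 80,000 IOPS:  {disks}")
```

Because the result depends only on the latency, plugging in a smaller latency shows directly how quickly the spindle count falls.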

On the other hand, if you can perform the I/O sequentially you may be able to reduce the IOPS requirement and increase the throughput, allowing the disk system to deliver more data. I know of Oracle customers who spend large amounts of time and resources carving up and re-ordering their data in order to allow queries to perform sequential I/O. They figure that the penalty incurred from all of this preparation is worth it in the long run, as subsequent queries perform better. That’s no surprise when the alternative was to add an extra wing to the data centre to house another bunch of disk arrays, plus more power and cooling to run them. This sort of “no pain, no gain” mentality used to be commonplace because there really weren’t any other options. Until now.

[Translator's note: my own question here is whether re-partitioning and re-ordering the data really guarantees that its layout on disk ends up suited to sequential I/O.]

Flash Offers Another Way

The idea of sequential I/O doesn’t exist with flash memory, because there is no physical concept of blocks being adjacent or contiguous. Logically, two blocks may have consecutive block addresses, but this has no bearing on where the actual information is electronically stored. You might therefore say that all flash I/O is random, but in truth the principles of random I/O versus sequential I/O are disk concepts so don’t really apply. And since the latency of flash is sub-millisecond, it should be possible to see that, even for a single-threaded process, a much larger number of IOPS is possible. When we start considering concurrent operations things get even more interesting… but that topic is for another day.

Back to the sushi analogy, there is no longer a conveyor belt – the chefs are standing right in front of you. When you order a dish, it is placed in front of you immediately. Order a number of dishes and you might want to enlist the help of a few friends to eat in parallel, because the food will start arriving faster than you can eat it on your own. This is the world of flash memory, where hunger for data can be satisfied and appetites can be fulfilled. Time to break that disk diet, eh?

Looking back at the disk model, all that sitting around waiting for the sushi conveyor belt just takes too long. Sure you can add more conveyor belts or try to get all of your sushi dishes arranged in a line, but at the end of the day the underlying problem remains: it’s disk. And now that there’s an alternative, disk just seems a bit too fishy to me…
