
What every programmer should know about memory (Part 2-2) (Translated)

What Every Programmer Should Know About Memory Ulrich Drepper Red Hat, Inc. [email protected] November 21, 2007

2.2 DRAM Access Technical Details

In the section introducing DRAM we saw that DRAM chips multiplex the addresses in order to save resources. We also saw that accessing DRAM cells takes time since the capacitors in those cells do not discharge instantaneously to produce a stable signal; we also saw that DRAM cells must be refreshed. Now it is time to put this all together and see how all these factors determine how the DRAM access has to happen.

We will concentrate on current technology; we will not discuss asynchronous DRAM and its variants as they are simply not relevant anymore. Readers interested in this topic are referred to [highperfdram] and [arstechtwo]. We will also not talk about Rambus DRAM (RDRAM) even though the technology is not obsolete. It is just not widely used for system memory. We will concentrate exclusively on Synchronous DRAM (SDRAM) and its successors Double Data Rate DRAM (DDR).

Synchronous DRAM, as the name suggests, works relative to a time source. The memory controller provides a clock, the frequency of which determines the speed of the Front Side Bus (FSB) — the memory controller interface used by the DRAM chips. As of this writing, frequencies of 800MHz, 1,066MHz, or even 1,333MHz are available with higher frequencies (1,600MHz) being announced for the next generation. This does not mean the frequency used on the bus is actually this high. Instead, today’s buses are double- or quad-pumped, meaning that data is transported two or four times per cycle. Higher numbers sell so the manufacturers like to advertise a quad-pumped 200MHz bus as an “effective” 800MHz bus.

For SDRAM today each data transfer consists of 64 bits — 8 bytes. The transfer rate of the FSB is therefore 8 bytes multiplied by the effective bus frequency (6.4GB/s for the quad-pumped 200MHz bus). That sounds like a lot but it is the burst speed, the maximum speed which will never be surpassed. As we will see now the protocol for talking to the RAM modules has a lot of downtime when no data can be transmitted. It is exactly this downtime which we must understand and minimize to achieve the best performance.
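To make the arithmetic concrete, here is a minimal sketch in C, with all figures taken from the text above, that computes the burst rate of a quad-pumped 200MHz bus:

```c
/* Burst transfer rate of the FSB: bus clock, pump factor, and
 * transfer width as given in the text above. */
#include <stdio.h>

int main(void) {
    double base_clock_hz = 200e6;  /* 200MHz bus clock */
    int pump_factor = 4;           /* quad-pumped: 4 transfers per cycle */
    int bytes_per_transfer = 8;    /* 64-bit data bus */

    double transfers_per_s = base_clock_hz * pump_factor;
    double burst = transfers_per_s * bytes_per_transfer;

    printf("effective frequency: %.0f MHz\n", transfers_per_s / 1e6); /* 800 */
    printf("burst bandwidth: %.1f GB/s\n", burst / 1e9);              /* 6.4 */
    return 0;
}
```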

2.2.1 Read Access Protocol

Figure 2.8 shows the activity on some of the connectors of a DRAM module which happens in three differently colored phases. As usual, time flows from left to right. A lot of details are left out. Here we only talk about the bus clock, RAS and CAS signals, and the address and data buses. A read cycle begins with the memory controller making the row address available on the address bus and lowering the RAS signal. All signals are read on the rising edge of the clock (CLK) so it does not matter if the signal is not completely square as long as it is stable at the time it is read. Setting the row address causes the RAM chip to start latching the addressed row.

The CAS signal can be sent after tRCD (RAS-to-CAS Delay) clock cycles. The column address is then transmitted by making it available on the address bus and lowering the CAS line. Here we can see how the two parts of the address (more or less halves, nothing else makes sense) can be transmitted over the same address bus.

Now the addressing is complete and the data can be transmitted. The RAM chip needs some time to prepare for this. The delay is usually called CAS Latency (CL). In Figure 2.8 the CAS latency is 2. It can be higher or lower, depending on the quality of the memory controller, motherboard, and DRAM module. The latency can also have half values. With CL=2.5 the first data would be available at the first falling edge in the blue area.

With all this preparation to get to the data it would be wasteful to only transfer one data word. This is why DRAM modules allow the memory controller to specify how much data is to be transmitted. Often the choice is between 2, 4, or 8 words. This allows filling entire lines in the caches without a new RAS/CAS sequence. It is also possible for the memory controller to send a new CAS signal without resetting the row selection. In this way, consecutive memory addresses can be read from or written to significantly faster because the RAS signal does not have to be sent and the row does not have to be deactivated (see below). Keeping the row “open” is something the memory controller has to decide. Speculatively leaving it open all the time has disadvantages with real-world applications (see [highperfdram]). Sending new CAS signals is only subject to the Command Rate of the RAM module (usually specified as Tx, where x is a value like 1 or 2; it will be 1 for high-performance DRAM modules which accept new commands every cycle).
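The cycle accounting can be summarized in a small sketch. This is an illustration only, not a hardware model; the tRCD and CL values are the ones from Figure 2.8, and the benefit of keeping the row open is that the RAS command and its tRCD delay drop out:

```c
/* First-word read latency in bus cycles, closed row vs. open row. */
#include <stdio.h>

static int first_word_latency(int t_rcd, int cl, int row_open) {
    /* Closed row: RAS, wait tRCD, CAS, wait CL.
       Open row:   CAS, wait CL (no new RAS needed). */
    return row_open ? cl : t_rcd + cl;
}

int main(void) {
    printf("closed row: %d cycles\n", first_word_latency(2, 2, 0)); /* 4 */
    printf("open row:   %d cycles\n", first_word_latency(2, 2, 1)); /* 2 */
    return 0;
}
```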

In this example the SDRAM spits out one word per cycle. This is what the first generation does. DDR is able to transmit two words per cycle. This cuts down on the transfer time but does not change the latency. In principle, DDR2 works the same although in practice it looks different. There is no need to go into the details here. It is sufficient to note that DDR2 can be made faster, cheaper, more reliable, and more energy efficient (see [ddrtwo] for more information).

2.2.2 Precharge and Activation

Figure 2.8 does not cover the whole cycle. It only shows parts of the full cycle of accessing DRAM. Before a new RAS signal can be sent the currently latched row must be deactivated and the new row must be precharged. We can concentrate here on the case where this is done with an explicit command. There are improvements to the protocol which, in some situations, allow this extra step to be avoided. The delays introduced by precharging still affect the operation, though.

Figure 2.9 shows the activity starting from one CAS signal to the CAS signal for another row. The data requested with the first CAS signal is available as before, after CL cycles. In the example two words are requested which, on a simple SDRAM, takes two cycles to transmit. Alternatively, imagine four words on a DDR chip.

Even on DRAM modules with a command rate of one the precharge command cannot be issued right away. It is necessary to wait as long as it takes to transmit the data. In this case it takes two cycles. This happens to be the same as CL but that is just a coincidence. The precharge signal has no dedicated line; instead, some implementations issue it by lowering the Write Enable (WE) and RAS lines simultaneously. This combination has no useful meaning by itself (see [micronddr] for encoding details).

Once the precharge command is issued it takes tRP (Row Precharge time) cycles until the row can be selected. In Figure 2.9 much of the time (indicated by the purplish color) overlaps with the memory transfer (light blue). This is good! But tRP is larger than the transfer time and so the next RAS signal is stalled for one cycle.

If we were to continue the timeline in the diagram we would find that the next data transfer happens 5 cycles after the previous one stops. This means the data bus is only in use two cycles out of seven. Multiply this with the FSB speed and the theoretical 6.4GB/s for an 800MHz bus become 1.8GB/s. That is bad and must be avoided. The techniques described in Section 6 help to raise this number. But the programmer usually has to do her share.
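The 1.8GB/s figure follows directly from the two-out-of-seven duty cycle; a one-line check:

```c
/* Sustained rate when the data bus is busy 2 cycles out of 7. */
#include <stdio.h>

int main(void) {
    double burst_gb_s = 6.4;   /* quad-pumped 200MHz bus, 8 bytes/transfer */
    int busy = 2, total = 7;
    printf("%.1f GB/s\n", burst_gb_s * busy / total);  /* ~1.8 */
    return 0;
}
```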

There is one more timing value for an SDRAM module which we have not discussed. In Figure 2.9 the precharge command was only limited by the data transfer time. Another constraint is that an SDRAM module needs time after a RAS signal before it can precharge another row (denoted as tRAS). This number is usually pretty high, on the order of two or three times the tRP value. This is a problem if, after a RAS signal, only one CAS signal follows and the data transfer is finished in a few cycles. Assume that in Figure 2.9 the initial CAS signal was preceded directly by a RAS signal and that tRAS is 8 cycles. Then the precharge command would have to be delayed by one additional cycle since the sum of tRCD, CL, and tRP (since it is larger than the data transfer time) is only 7 cycles.
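A sketch mirroring the arithmetic of this example; the concrete values tRCD=2, CL=2, tRP=3 are the ones read off Figure 2.9 as described above:

```c
/* tRAS constraint: the cycle budget from RAS to the next RAS,
 * using tRP rather than the transfer time since tRP is larger. */
#include <stdio.h>

int main(void) {
    int t_rcd = 2, cl = 2, t_rp = 3, t_ras = 8;

    int budget = t_rcd + cl + t_rp;               /* 7 cycles */
    int stall = t_ras > budget ? t_ras - budget : 0;

    printf("extra delay: %d cycle(s)\n", stall);  /* 1 */
    return 0;
}
```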

DDR modules are often described using a special notation: w-x-y-z-T. For instance: 2-3-2-8-T1. This means:

w  2   CAS Latency (CL)
x  3   RAS-to-CAS delay (tRCD)
y  2   RAS Precharge (tRP)
z  8   Active to Precharge delay (tRAS)
T  T1  Command Rate

There are numerous other timing constants which affect the way commands can be issued and are handled. Those five constants are in practice sufficient to determine the performance of the module, though.
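For illustration, the five constants of a 2-3-2-8-T1 module map naturally onto a small structure; the names below are mine, not a standard API:

```c
/* The w-x-y-z-T timing constants of a "2-3-2-8-T1" module. */
#include <stdio.h>

struct dram_timings {
    int cl;     /* w: CAS Latency */
    int t_rcd;  /* x: RAS-to-CAS delay */
    int t_rp;   /* y: RAS precharge time */
    int t_ras;  /* z: active-to-precharge time */
    int t_cmd;  /* T: command rate */
};

int main(void) {
    struct dram_timings m = { 2, 3, 2, 8, 1 };   /* 2-3-2-8-T1 */
    printf("CL=%d tRCD=%d tRP=%d tRAS=%d T%d\n",
           m.cl, m.t_rcd, m.t_rp, m.t_ras, m.t_cmd);
    return 0;
}
```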

It is sometimes useful to know this information for the computers in use to be able to interpret certain measurements. It is definitely useful to know these details when buying computers since they, along with the FSB and SDRAM module speed, are among the most important factors determining a computer’s speed.

The very adventurous reader could also try to tweak a system. Sometimes the BIOS allows changing some or all these values. SDRAM modules have programmable registers where these values can be set. Usually the BIOS picks the best default value. If the quality of the RAM module is high it might be possible to reduce the one or the other latency without affecting the stability of the computer. Numerous overclocking websites all around the Internet provide ample documentation for doing this. Do it at your own risk, though, and do not say you have not been warned.

2.2.3 Recharging

A mostly-overlooked topic when it comes to DRAM access is recharging. As explained in Section 2.1.2, DRAM cells must constantly be refreshed. This does not happen completely transparently for the rest of the system. At times when a row {Rows are the granularity this happens with despite what [highperfdram] and other literature says (see [micronddr]).} is recharged no access is possible. The study in [highperfdram] found that “[s]urprisingly, DRAM refresh organization can affect performance dramatically”.

Each DRAM cell must be refreshed every 64ms according to the JEDEC specification. If a DRAM array has 8,192 rows this means the memory controller has to issue a refresh command on average every 7.8125µs (refresh commands can be queued so in practice the maximum interval between two requests can be higher). It is the memory controller’s responsibility to schedule the refresh commands. The DRAM module keeps track of the address of the last refreshed row and automatically increases the address counter for each new request.
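The refresh arithmetic is simple enough to check directly:

```c
/* 64ms JEDEC refresh window spread over 8,192 rows. */
#include <stdio.h>

int main(void) {
    double window_us = 64.0 * 1000.0;  /* 64ms in microseconds */
    int rows = 8192;
    printf("average refresh interval: %.4f us\n", window_us / rows); /* 7.8125 */
    return 0;
}
```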

There is really not much the programmer can do about the refresh and the points in time when the commands are issued. But it is important to keep this part of the DRAM life cycle in mind when interpreting measurements. If a critical word has to be retrieved from a row which currently is being refreshed the processor could be stalled for quite a long time. How long each refresh takes depends on the DRAM module.

2.2.4 Memory Types

It is worth spending some time on the current and soon-to-be current memory types in use. We will start with SDR (Single Data Rate) SDRAMs since they are the basis of the DDR (Double Data Rate) SDRAMs. SDRs were pretty simple. The memory cells and the data transfer rate were identical.

In Figure 2.10 the DRAM cell array can output the memory content at the same rate it can be transported over the memory bus. If the DRAM cell array can operate at 100MHz, the data transfer rate of the bus of a single cell is thus 100Mb/s. The frequency f for all components is the same. Increasing the throughput of the DRAM chip is expensive since the energy consumption rises with the frequency. With a huge number of array cells this is prohibitively expensive. {Power = Dynamic Capacity × Voltage² × Frequency.} In reality it is even more of a problem since increasing the frequency usually also requires increasing the voltage to maintain stability of the system. DDR SDRAM (called DDR1 retroactively) manages to improve the throughput without increasing any of the involved frequencies.
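To see why raising the frequency is so costly, here is the footnoted equation evaluated for a hypothetical case; the capacitance is an arbitrary unit and the 10% voltage increase is an assumption, chosen only to show the compounding effect:

```c
/* Power = Dynamic Capacity * Voltage^2 * Frequency. */
#include <stdio.h>

int main(void) {
    double c = 1.0;             /* arbitrary capacitance unit */
    double v = 1.8, f = 100e6;  /* baseline voltage and frequency */

    double p_base = c * v * v * f;
    /* Doubling f while bumping the voltage 10% for stability: */
    double p_fast = c * (1.1 * v) * (1.1 * v) * (2 * f);

    printf("power grows by %.2fx\n", p_fast / p_base);  /* ~2.42x */
    return 0;
}
```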

The difference between SDR and DDR1 is, as can be seen in Figure 2.11 and guessed from the name, that twice the amount of data is transported per cycle. I.e., the DDR1 chip transports data on the rising and falling edge. This is sometimes called a “double-pumped” bus. To make this possible without increasing the frequency of the cell array a buffer has to be introduced. This buffer holds two bits per data line. This in turn requires that, in the cell array in Figure 2.7, the data bus consists of two lines. Implementing this is trivial: one only has to use the same column address for two DRAM cells and access them in parallel. The changes to the cell array to implement this are also minimal.

The SDR DRAMs were known simply by their frequency (e.g., PC100 for 100MHz SDR). To make DDR1 DRAM sound better the marketers had to come up with a new scheme since the frequency did not change. They came up with a name which contains the transfer rate in bytes a DDR module (they have 64-bit busses) can sustain:

100MHz × 64bit × 2 = 1,600MB/s

Hence a DDR module with 100MHz frequency is called PC1600. With 1600 > 100 all marketing requirements are fulfilled; it sounds much better although the improvement is really only a factor of two. { I will take the factor of two but I do not have to like the inflated numbers.}
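The naming rule is mechanical; a minimal sketch (the rounding in real product names, such as PC2100 for a 133MHz module, is marketing and not computed here):

```c
/* DDR1 rating in MB/s: frequency (MHz) * 8 bytes * pump factor 2. */
#include <stdio.h>

static int ddr1_rating(int freq_mhz) {
    return freq_mhz * 8 * 2;
}

int main(void) {
    printf("PC%d\n", ddr1_rating(100));  /* PC1600 */
    printf("PC%d\n", ddr1_rating(133));  /* 2128, sold as PC2100 */
    return 0;
}
```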

To get even more out of the memory technology DDR2 includes a bit more innovation. The most obvious change that can be seen in Figure 2.12 is the doubling of the frequency of the bus. Doubling the frequency means doubling the bandwidth. Since this doubling of the frequency is not economical for the cell array it is now required that the I/O buffer gets four bits in each clock cycle which it then can send on the bus. This means the changes to the DDR2 modules consist of making only the I/O buffer component of the DIMM capable of running at higher speeds. This is certainly possible and will not require measurably more energy; it is just one tiny component and not the whole module. The names the marketers came up with for DDR2 are similar to the DDR1 names, only that in the computation of the value the factor of two is replaced by four (we now have a quad-pumped bus). Table 2.1 shows the names of the modules in use today.

There is one more twist to the naming. The FSB speed used by CPU, motherboard, and DRAM module is specified by using the effective frequency. I.e., it factors in the transmission on both flanks of the clock cycle and thereby inflates the number. So, a 133MHz module with a 266MHz bus has an FSB “frequency” of 533MHz.

The specification for DDR3 (the real one, not the fake GDDR3 used in graphics cards) calls for more changes along the lines of the transition to DDR2. The voltage will be reduced from 1.8V for DDR2 to 1.5V for DDR3. Since the power consumption equation is calculated using the square of the voltage this alone brings a 30% improvement. Add to this a reduction in die size plus other electrical advances and DDR3 can manage, at the same frequency, to get by with half the power consumption. Alternatively, with higher frequencies, the same power envelope can be hit. Or with double the capacity the same heat emission can be achieved.
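The 30% figure follows from the V² term alone:

```c
/* Power ratio from the voltage reduction 1.8V -> 1.5V. */
#include <stdio.h>

int main(void) {
    double v2 = 1.8, v3 = 1.5;
    double ratio = (v3 * v3) / (v2 * v2);
    printf("DDR3/DDR2 power: %.2f (~%.0f%% saving)\n",
           ratio, (1.0 - ratio) * 100.0);  /* 0.69, ~31% */
    return 0;
}
```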

The cell array of DDR3 modules will run at a quarter of the speed of the external bus which requires an 8 bit I/O buffer, up from 4 bits for DDR2. See Figure 2.13 for the schematics.

Initially DDR3 modules will likely have slightly higher CAS latencies just because the DDR2 technology is more mature. This would cause DDR3 to be useful only at frequencies which are higher than those which can be achieved with DDR2, and, even then, mostly when bandwidth is more important than latency. There is already talk about 1.3V modules which can achieve the same CAS latency as DDR2. In any case, the possibility of achieving higher speeds because of faster buses will outweigh the increased latency.

One possible problem with DDR3 is that, for 1,600Mb/s transfer rate or higher, the number of modules per channel may be reduced to just one. In earlier versions this requirement held for all frequencies, so one can hope that the requirement will at some point be lifted for all frequencies. Otherwise the capacity of systems will be severely limited.

Table 2.2 shows the names of the expected DDR3 modules. JEDEC agreed so far on the first four types. Given that Intel’s 45nm processors have an FSB speed of 1,600Mb/s, the 1,866Mb/s is needed for the overclocking market. We will likely see more of this towards the end of the DDR3 lifecycle.

All DDR memory has one problem: the increased bus frequency makes it hard to create parallel data busses. A DDR2 module has 240 pins. All connections to data and address pins must be routed so that they have approximately the same length. Even more of a problem is that, if more than one DDR module is to be daisy-chained on the same bus, the signals get more and more distorted for each additional module. The DDR2 specification allows only two modules per bus (aka channel), the DDR3 specification only one module for high frequencies. With 240 pins per channel a single Northbridge cannot reasonably drive more than two channels. The alternative is to have external memory controllers (as in Figure 2.2) but this is expensive.

What this means is that commodity motherboards are restricted to hold at most four DDR2 or DDR3 modules. This restriction severely limits the amount of memory a system can have. Even old 32-bit IA-32 processors can handle 64GB of RAM and memory demand even for home use is growing, so something has to be done.

One answer is to add memory controllers into each processor as explained in Section 2. AMD does it with the Opteron line and Intel will do it with their CSI technology. This will help as long as the reasonable amount of memory a processor is able to use can be connected to a single processor. In some situations this is not the case and this setup will introduce a NUMA architecture and its negative effects. For some situations another solution is needed.

Intel’s answer to this problem for big server machines, at least for the next years, is called Fully Buffered DRAM (FB-DRAM). The FB-DRAM modules use the same components as today’s DDR2 modules which makes them relatively cheap to produce. The difference is in the connection with the memory controller. Instead of a parallel data bus FB-DRAM utilizes a serial bus (Rambus DRAM had this back when, too, and SATA is the successor of PATA, as is PCI Express for PCI/AGP). The serial bus can be driven at a much higher frequency, reverting the negative impact of the serialization and even increasing the bandwidth. The main effects of using a serial bus are

  1. more modules per channel can be used.
  2. more channels per Northbridge/memory controller can be used.
  3. the serial bus is designed to be fully-duplex (two lines).

An FB-DRAM module has only 69 pins, compared with the 240 for DDR2. Daisy chaining FB-DRAM modules is much easier since the electrical effects of the bus can be handled much better. The FB-DRAM specification allows up to 8 DRAM modules per channel.

Compared with the connectivity requirements of a dual-channel Northbridge it is now possible to drive 6 channels of FB-DRAM with fewer pins: 2×240 pins versus 6×69 pins. The routing for each channel is much simpler which could also help reducing the cost of the motherboards.

Fully duplex parallel busses are prohibitively expensive for the traditional DRAM modules; duplicating all those lines is too costly. With serial lines (even if they are differential, as FB-DRAM requires) this is not the case and so the serial bus is designed to be fully duplexed, which means, in some situations, that the bandwidth is theoretically doubled by this alone. But it is not the only place where parallelism is used for bandwidth increase. Since an FB-DRAM controller can run up to six channels at the same time the bandwidth can be increased even for systems with smaller amounts of RAM by using FB-DRAM. Where a DDR2 system with four modules has two channels, the same capacity can be handled via four channels using an ordinary FB-DRAM controller. The actual bandwidth of the serial bus depends on the type of DDR2 (or DDR3) chips used on the FB-DRAM module.

We can summarize the advantages like this:

                DDR2     FB-DRAM
Pins            240      69
Channels        2        6
DIMMs/Channel   2        8
Max Memory      16GB     192GB
Throughput      ~10GB/s  ~40GB/s
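The totals in the table can be derived from the per-channel limits; the 4GB-per-DIMM figure below is an assumption consistent with the 16GB/192GB maxima shown:

```c
/* Pin counts and capacity maxima for DDR2 vs. FB-DRAM. */
#include <stdio.h>

int main(void) {
    int dimm_gb = 4;  /* assumed DIMM size */

    /* DDR2: 2 channels x 2 DIMMs, 240 pins per channel */
    printf("DDR2:    %d pins, %dGB max\n", 2 * 240, 2 * 2 * dimm_gb);
    /* FB-DRAM: 6 channels x 8 DIMMs, 69 pins per channel */
    printf("FB-DRAM: %d pins, %dGB max\n", 6 * 69, 6 * 8 * dimm_gb);
    return 0;
}
```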

There are a few drawbacks to FB-DRAMs if multiple DIMMs on one channel are used. The signal is delayed—albeit minimally—at each DIMM in the chain, which means the latency increases. But for the same amount of memory with the same frequency FB-DRAM can always be faster than DDR2 and DDR3 since only one DIMM per channel is needed; for large memory systems DDR simply has no answer using commodity components.

2.2.5 Conclusions

This section should have shown that accessing DRAM is not an arbitrarily fast process. At least not fast compared with the speed the processor is running and with which it can access registers and cache. It is important to keep in mind the differences between CPU and memory frequencies. An Intel Core 2 processor running at 2.933GHz and a 1.066GHz FSB have a clock ratio of 11:1 (note: the 1.066GHz bus is quad-pumped). Each stall of one cycle on the memory bus means a stall of 11 cycles for the processor. For most machines the actual DRAMs used are slower, thus increasing the delay. Keep these numbers in mind when we are talking about stalls in the upcoming sections.
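The 11:1 ratio is easy to reproduce; remember that the 1.066GHz figure is the effective (quad-pumped) frequency, so the base bus clock is a quarter of it:

```c
/* CPU cycles lost per memory-bus cycle stall. */
#include <stdio.h>

int main(void) {
    double cpu_hz = 2.933e9;
    double fsb_base_hz = 1.066e9 / 4;  /* quad-pumped bus */
    printf("ratio: %.0f:1\n", cpu_hz / fsb_base_hz);  /* ~11:1 */
    return 0;
}
```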

The timing charts for the read command have shown that DRAM modules are capable of high sustained data rates. Entire DRAM rows could be transported without a single stall. The data bus could be kept occupied 100%. For DDR modules this means two 64-bit words transferred each cycle. With DDR2-800 modules and two channels this means a rate of 12.8GB/s.

But, unless designed this way, DRAM access is not always sequential. Non-continuous memory regions are used which means precharging and new RAS signals are needed. This is when things slow down and when the DRAM modules need help. The sooner the precharging can happen and the RAS signal sent the smaller the penalty when the row is actually used.

Hardware and software prefetching (see Section 6.3) can be used to create more overlap in the timing and reduce the stall. Prefetching also helps shift memory operations in time so that there is less contention at later times, right before the data is actually needed. This is a frequent problem when the data produced in one round has to be stored and the data required for the next round has to be read. By shifting the read in time, the write and read operations do not have to be issued at basically the same time.
