How to check a motherboard's PCIe Max Payload Size

PCIe Study Notes
Author: mantis_1984 | Source: repost

These notes were added to while studying PCIe on the DM8168 and PCIe theory in general. They are scattered knowledge points, with some personal interpretation, kept for future review.
References:《PCI Express体系结构导读》,《PCI Express系统体系结构标准教材》, the DM8168 documentation, and material gathered online.
2. The DM8168 PCIe bus
3. RC (Root Complex)
The RC's main function is similar to the HOST bridge on the PCI bus, but it adds many capabilities on top of it. The RC can be understood as a PCIe bus controller; in the x86 architecture the RC not only manages data access to PCIe devices but also covers access control, error handling, virtualization and more (p. 100,《PCI Express体系结构导读》).
4. The key reference clock REFCLK+/-, 100 MHz
A PCIe device uses REFCLK+ and REFCLK- as its local reference clock, fixed at 100 MHz. In a DM8168 system, two 100 MHz clocks must be provided: one for the DM8168 to detect and synchronize PCIe devices, and one for the PCIe slot as the EP's local reference clock. The two clocks must come from the same source so that attached PCIe devices stay synchronized with the system.
5. WAKE#, JTAG, PRSNT1#, PRSNT2#, SMCLK, SMDAT
WAKE#: when a PCIe device has entered sleep and needs waking, it uses the WAKE# signal to make the processor restore power to it, achieving wake-up.
JTAG: the common debug interface; on the PCIe connector it consists of the TRST#, TCK, TDI, TDO and TMS signals.
PRSNT1#, PRSNT2#: a pair of signals related to PCIe hot-plug. DM816x devices do not support hot-plug.
SMCLK, SMDAT: the SMBus clock and data signals; see pp. 105-106 of the《导读》.
6. PCIe bus layering
The PCIe bus uses serial links and transfers data in packets. Sending and receiving a packet traverses several layers: the Transaction Layer, the Data Link Layer and the Physical Layer, similar in spirit to the OSI seven-layer model used in networking.
PCIe gives interconnected devices high-speed, high-performance, point-to-point, dual-simplex differential signaling links: data is transmitted on one signal pair and received on another.
7. PCIe device reset
The PCIe bus defines two broad classes of reset: Conventional Reset and FLR (Function-Level Reset).
Conventional Reset comes in three forms: Cold, Warm and Hot Reset. Cold Reset is the power-on reset driven through the PERST# pin, similar to the resistor-capacitor reset of the 9650's peripheral chips; I understand it as a hard reset. Warm Reset resembles a watchdog reset: after the system is up and running, the PCIe device is reset via a watchdog or similar mechanism. It counts as a global reset of the PCIe device, which restarts afterwards. Hot Reset is a software reset: for example, system software sets a bit in a bridge's Bridge Control Register and that bridge resets the PCIe devices below it.
FLR, i.e. Function-Level Reset, differs from the conventional resets above. For example, when only the network-related logic of one function on a PCIe NIC needs resetting, the conventional resets (Cold, Warm, Hot) cannot reset just that part, whereas FLR is exactly such a partial reset. The DM8168 does not support FLR (see sprugx8b.pdf).
8. North bridge
The north bridge mainly handles data exchange between the CPU and memory and controls AGP and PCI traffic passing through it; it is the main determinant of a motherboard's performance. As chip integration rose it absorbed other functions as well: the Athlon 64 integrated the memory controller into the CPU, and NVIDIA dropped the south bridge in its NF3 250, NF4 and similar chipsets, moving gigabit Ethernet, SATA control and so on into the north bridge. Mainstream north bridge brands include VIA, NVIDIA and SiS. The quality of these chips is not decided by the motherboard maker, but which chip a board vendor uses directly determines the board's performance: among VIA chips, for instance, performance ranks KT600 > KT400A > KT333 > KT266A. On the mainstream AMD platform the chipset options include KT600, NF2, K8T800 and NF3; for Intel platforms there are 915, 865PE, PT880, 845PE, 848P and so on.
9. South bridge
The south bridge mainly controls I/O and other peripheral interfaces, IDE devices, and add-on functions. Common examples are VIA's parts, Intel's ICH4, ICH5 and ICH6, and NVIDIA's MCP, MCP-T and MCP RAID. Here brand-name boards do not differ greatly from ordinary ones, though brand-name boards remain many people's first choice for their build quality; and it cannot be ruled out that some lower-quality boards, to survive the competition, adopt a more capable south bridge to win on features.
10. FSB (Front Side Bus)
The FSB is the data bus between the CPU and the north bridge, also called the front side bus. For the P4, FSB frequency = CPU base clock × 4. This parameter is the channel through which the processor exchanges data with the motherboard: the north bridge connects the highest-throughput components such as memory and graphics and links to the south bridge, and the CPU reaches memory and graphics through the FSB and the north bridge. Since the FSB is the CPU's main channel to the outside world, its transfer capability strongly affects overall performance; without a fast enough FSB, even the strongest CPU cannot noticeably speed up the machine. Peak bandwidth depends on the width of the data transferred in parallel and the transfer frequency: bandwidth = (bus frequency × data width) ÷ 8. Front side bus frequencies seen on PCs include 266 MHz, 333 MHz, 400 MHz, 533 MHz and 800 MHz; the higher the frequency, the greater the CPU-to-north-bridge transfer capability and the better the CPU can be utilized.
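A quick sanity check of that formula (the values are illustrative): an 800 MHz FSB with a 64-bit data path gives
# echo "800 * 64 / 8" | bc
6400
i.e. 6400 MB/s, about 6.4 GB/s of peak bandwidth.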
11. Switch
Because PCIe uses point-to-point links, one PCIe link connects exactly one device. When multiple EPs must hang off one PCIe link, a Switch is used for expansion. A standard Switch has one upstream port and several downstream ports. The upstream port connects to the RC or to the downstream port of another Switch, while the downstream ports connect to EPs, PCIe-to-PCI bridges, or the upstream port of a further Switch. A Switch is thus a means of link expansion.
12. VC and port arbitration
13. PCIe-to-PCI/PCI-X bridges
A PCIe-to-PCI bridge converts the PCIe bus to a PCI bus so that PCIe and PCI devices can interconnect; think of it as a converter.
A PCIe-to-PCI-X bridge works the same way.
14. TLP (Transaction Layer Packet)
A TLP is a transaction-layer packet. When the processor or another PCIe device accesses a PCIe device, the data is first encapsulated by the Transaction Layer into one or more TLPs, which then travel down through the PCIe layers and out onto the bus. The TLP concept is similar to a UDP or TCP packet in the TCP/IP stack. See pp. 155-156 of《PCI Express体系结构导读》for the exact structure.
15. TLP routing
TLP routing is how a TLP chooses its path through Switches or PCIe bridges to finally reach an EP or the RC. The PCIe bus defines three routing methods: address-based routing, ID-based routing and implicit routing.
16. PCI Express Capability, PCI Express Extended Capabilities, Power Management Capability
Sections 17.4.8-17.4.11 of sprugx8b.pdf cover the DM81xx PCIe register configuration; the registers themselves are defined in section 4.3, pp. 123-133, of《PCI Express体系结构导读》.
17. The Max_Payload_Size parameter
Max_Payload_Size determines the largest payload a TLP may carry. The PCIe bus allows a Max_Payload_Size of up to 4096 B, but many PCIe devices cannot support a payload that large; in practice a PCIe device's Max_Payload_Size is typically 128 B, 256 B or 512 B.
There is also Max_Payload_Size_Supported, which states the largest Max_Payload_Size the device could use. The two ends of the link negotiate, and system software sets the value actually used. See p. 128 of the《导读》.
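On Linux both values can be read with lspci, which is also the usual way to answer the question in this post's title. A minimal sketch (the device address 04:00.0 is an example; the output matches the OCZ card dumped later in this post):
# lspci -vvv -s 04:00.0 | grep MaxPayload
DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <1us, L1 <8us
MaxPayload 256 bytes, MaxReadReq 256 bytes
The first line (from DevCap) is Max_Payload_Size_Supported; the second (from DevCtl) is the value actually in use on the link.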
18. The LTSSM state machine
See p. 218 of the《导读》; section 17.4.3 of sprugx8b.pdf also touches on it.
19. MSI, MSI-X and legacy interrupt mechanisms
Unlike interrupt requests raised on the INTx pins, the MSI and MSI-X mechanisms deliver interrupt requests to the processor as memory-write TLPs; in other words the interrupt is embedded in a TLP packet. Different processors interpret a device's MSI packets differently, but a PCIe device always raises an MSI interrupt the same way: it writes the Message Data value to the Message Address recorded in its MSI/MSI-X Capability structure, forming a memory-write TLP that it sends to the processor.
Some PCIe devices, such as the DM8168, also support the legacy interrupt mode, which raises interrupts by sending Assert_INTx and Deassert_INTx message packets, i.e. virtual interrupt wires (INTx). The PCIe world still contains many PCI devices attached through PCIe bridges, and those devices may not support MSI/MSI-X, so INTx signaling remains necessary.
See p. 263 of the《导读》.
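The Message Address/Data pair the OS programmed is visible in the device's MSI capability; a hedged illustration with lspci (the device address and values are examples taken from the lspci dumps later in this post):
# lspci -vvv -s 41:00.0 | grep -A 2 'MSI: Enable'
Capabilities: [48] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: fee000d8
Data: 0000
A memory write of that Data to that Address is exactly the TLP the device emits to raise the interrupt.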
21. type 0, type 1
22. PERST#, PWRGD#
PERST# is the global reset signal, supplied by the processor system, which must provide it to PCIe slots and PCIe devices. A PCIe device uses this signal to reset its internal logic: while it is asserted, the device performs its reset. The Cold Reset and Warm Reset styles of PCIe reset are implemented through this signal. See p. 103 of the《导读》.
23. inbound, outbound
These terms appear when PCIe devices and system memory access each other: outbound is the CPU-to-device direction; inbound is the Device -> RC (CPU side) direction. By this definition devices are always external; there is no such thing as an internal device. When the CPU reads or writes registers on the RC side, that stays within the on-chip system, so it is neither inbound nor outbound.
Put simply: the CPU reading or writing a PCI BAR bus address is outbound; a device reading or writing host main memory is inbound.
http://blog.csdn.net/JuanA1/article/details/6695939
24. Lane Reversal, Polarity Inversion
The PCIe bus allows the Lanes at the two ends of a link to be connected out of order; the PCIe specification calls this "Lane Reversal". Within a given Lane, the polarity of the differential pair may also be swapped; the specification calls that "Polarity Inversion".
25. PCI enumeration
Enumeration starts at the Host/PCI bridge, probing and scanning to "enumerate" every device on the first PCI bus and record it. If one of those devices is a PCI-PCI bridge, the secondary PCI bus behind that bridge is probed and scanned in turn, recursing until every PCI device in the system has been found. The result is a PCI tree in memory representing all the buses and devices: each PCI device (including bridge devices) is represented by a pci_dev structure, and each PCI bus by a pci_bus structure.
http://blog.csdn.net/linuxdrivers/article/details/5849698
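The tree that enumeration builds can be inspected directly; a sketch (the output is abbreviated and purely illustrative):
# lspci -tv
-[0000:00]-+-00.0  Host bridge
           +-1c.0-[04]----00.0  SCSI storage controller (an endpoint behind a root port)
           \-1f.0  ISA bridge
Each bracketed bus number is a secondary bus that a bridge exposed during the scan.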
26. PCIe slot diagram
The slot diagram lists the pin functions:
hot-plug presence detect
system management bus (SMBus) clock
test clock (JTAG clock)
system management bus (SMBus) data
test data output (JTAG)
test mode select (JTAG)
+3.3 V supply
test mode select (JTAG)
test reset (JTAG reset)
+3.3 V supply
3.3 V auxiliary power
+3.3 V supply
link-active signal
PCIe reset signal
reference clock (differential pair)
lane 0 transmit
lane 0 receive
hot-plug presence detect
【Xilinx Technology Group】
Q: Under SP6, with the generated PIO testbench and the core running as RP, I find the cfg_dev_control_maxpayload parameter cannot be set and stays at 128B. How do I set the RP's type registers so that cfg_dev_control_maxpayload can be configured to 256B or 512B?
Expert reply:
There are two MPS (Max Payload Size) fields: one in the Device Capability register, which states how large an MPS the device supports, and one in the Device Control register, which is the MPS actually used in operation. The latter is negotiated by the two ends of the link and written by system software, i.e. the RC; you cannot configure it yourself.
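As a concrete illustration of "written by system software": on a Linux host the Device Control register can be read, and in principle rewritten, with setpci. This is a hedged sketch only; the address is an example, and raising MPS on one end of a live link can hang the machine:
# setpci -s 04:00.0 CAP_EXP+8.w          (read Device Control; bits 7:5 encode MPS: 000b=128B, 001b=256B, 010b=512B)
# setpci -s 04:00.0 CAP_EXP+8.w=20:e0    (would set MPS=256B; both link partners must agree)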
宝存 Shannon PCI-E SSD vs OCZ RevoDrive3 X2 PCI-E SSD on CentOS 6.5 2.6.32-431.el6.x86_64

Today I got hold of a 1.2 TB SSD from Shanghai 宝存 (Shannon). I have been using an OCZ RevoDrive3 X2 all along, so here is a comparison of the two.
Thanks to Shanghai 宝存 for providing the SSD card for testing.
First, the official performance figures from 宝存:
Shannon Direct-IO SSD G2 spec sheet (table in the original post): user capacity, random read IOPS, random write IOPS, 4 KB random read latency, 4 KB random write latency and non-operating temperature, for the 800GB/1200GB model (PCIe 2.0 x8, half-height half-length) and the 1600GB/3200GB model (PCIe 2.0 x8, full-height half-length).
The test environments for the two SSDs being compared:
Host: DELL R720xd
CPU: 8-core Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz
SSD: shannon 1.2TB
OS: CentOS 6.5 x64, kernel 2.6.32-431.el6.x86_64
SSD driver information
# modinfo shannon
filename:       /lib/modules/2.6.32-431.el6.x86_64/extra/shannon/shannon.ko
alias:          pci:v000275sv*sd*bc*sc*i*
alias:          pci:v000265sv*sd*bc*sc*i*
alias:          pci:v000010EEdsv*sd*bc*sc*i*
vermagic:       2.6.32-431.el6.x86_64 SMP mod_unload modversions
parm:           shannon_sector_size:int
parm:           shannon_debug_level:int
parm:           shannon_force_rw:int
parm:           shannon_major:int
parm:           shannon_auto_attach:int
PCI interface information
# lspci -vvvv -s 41:00.0
41:00.0 Mass storage controller: Device 1cb0:0275
Subsystem: Device 1cb0:0275
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 150
Region 0: Memory at d40fc000 (32-bit, non-prefetchable) [size=16K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1+,D2+,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [48] MSI: Enable+ Count=1/1 Maskable- 64bit+
Address: fee000d8
Data: 0000
Capabilities: [60] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 unlimited
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 256 bytes, MaxReadReq 4096 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s, Latency L0 unlimited, L1 unlimited
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range B, TimeoutDis-, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [9c] MSI-X: Enable- Count=1 Masked-
Vector table: BAR=0 offset=
PBA: BAR=0 offset=
Capabilities: [fc] #00 [0000]
Capabilities: [100 v1] Device Serial Number 00-00-00-01-01-00-0a-35
Kernel driver in use: shannon
Kernel modules: shannon
Host: DELL R610
CPU: 8-core Intel(R) Xeon(R) CPU
SSD: OCZ RevoDrive3 X2 960G
OS: CentOS 5.8 x64, kernel 2.6.18-308.el5
SSD driver information
# modinfo ocz10xx
filename:       /lib/modules/2.6.18-308.el5/extra/ocz10xx.ko
version:        2.3.1.1977
license:        Proprietary
description:    OCZ Linux driver
author:         OCZ Technology Group, Inc.
srcversion:     27F0A3AF2BD189FDFA8ED54
alias:          pci:v01084sv*sd*bc*sc*i*
alias:          pci:v01083sv*sd*bc*sc*i*
alias:          pci:v01044sv*sd*bc*sc*i*
alias:          pci:v01043sv*sd*bc*sc*i*
alias:          pci:v01042sv*sd*bc*sc*i*
alias:          pci:v01041sv*sd*bc*sc*i*
alias:          pci:v01022sv*sd*bc*sc*i*
alias:          pci:v01021sv*sd*bc*sc*i*
alias:          pci:v01080sv*sd*bc*sc*i*
vermagic:       2.6.18-308.4.1.el5 SMP mod_unload gcc-4.1
parm:           ocz_msi_enable: Enable MSI Support for OCZ VCA controllers (default=0) (int)
PCI interface information
# lspci -vvvv -s 04:00.0
04:00.0 SCSI storage controller: OCZ Technology Group, Inc. RevoDrive 3 X2 PCI-Express SSD 240 GB (Marvell Controller) (rev 02)
Subsystem: OCZ Technology Group, Inc. RevoDrive 3 X2 PCI-Express SSD 240 GB (Marvell Controller)
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0, Cache Line Size: 64 bytes
Interrupt: pin A routed to IRQ 122
Region 0: Memory at df1a0000 (64-bit, non-prefetchable) [size=128K]
Region 2: Memory at df1c0000 (64-bit, non-prefetchable) [size=256K]
Expansion ROM at df100000 [disabled] [size=64K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1+ D2- AuxCurrent=375mA PME(D0+,D1+,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 0000
Data: 0000
Capabilities: [70] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 4096 bytes, PhantFunc 0, Latency L0s <1us, L1 <8us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal+ Fatal+ Unsupported+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 256 bytes, MaxReadReq 256 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s L1, Latency L0 <512ns, L1 <64us
ClockPM- Surprise- LLActRep- BwNot-
LnkCtl: ASPM L0s Enabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis+
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis+
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-, Selectable De-emphasis: -6dB
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ NonFatalErr+
AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
Capabilities: [140 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
Status: NegoPending- InProgress-
Kernel driver in use: ocz10xx
Kernel modules: ocz10xx
How to install the shannon driver:
1. Using a pre-built package
Find the driver rpm that matches the kernel version:
# uname -r
2.6.32-431.el6.x86_64
shannon-2.6.32-431.el6.x86_64.x86_64-v2.6-9.x86_64.rpm
2. Or build from source
If the package set contains no build for your kernel, build directly from the source rpm:
shannon-v2.6-9.src.rpm
# rpm -ivh shannon-v2.6-9.src.rpm
# cd /root/rpmbuild
drwxr-xr-x 2 root root 4096 Jun 19 21:00 SOURCES
drwxr-xr-x 2 root root 4096 Jun 19 21:00 SPECS
# cd SPECS/
-rw-rw-r--. 1 spike spike 7183 May 21 17:10 shannon-driver.spec
# rpmbuild -bb shannon-driver.spec
drwxr-xr-x 3 root root 4096 Jun 19 21:05 BUILD
drwxr-xr-x 2 root root 4096 Jun 19 21:05 BUILDROOT
drwxr-xr-x 3 root root 4096 Jun 19 21:05 RPMS
drwxr-xr-x 2 root root 4096 Jun 19 21:00 SOURCES
drwxr-xr-x 2 root root 4096 Jun 19 21:00 SPECS
drwxr-xr-x 2 root root 4096 Jun 19 21:05 SRPMS
drwxr-xr-x 2 root root 4096 Jun 19 21:05 x86_64
# cd x86_64/
-rw-r--r-- 1 root root 401268 Jun 19 21:05 shannon-2.6.32-431.el6.x86_64.x86_64-v2.6-9.x86_64.rpm
# rpm -ivh shannon-2.6.32-431.el6.x86_64.x86_64-v2.6-9.x86_64.rpm
Once installed, the rpmbuild directory can be deleted.
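A quick check that the module is actually available after installation (a sketch; it assumes the rpm installed the module under /lib/modules as shown above):
# depmod -a
# modprobe shannon
# lsmod | grep shannon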
How to use shannon's command-line management tools on Linux
They can monitor the SSD's status, erase the SSD's data, change the reserved capacity, and more.
(Because shannon's data has not been added to the smartmontools database, smartctl cannot be used to inspect this SSD.)
Even the latest smartctl I downloaded cannot read the shannon's information:
# /opt/smartmontools-6.2/sbin/smartctl -A /dev/dfa
smartctl 6.2 r3841 [x86_64-linux-2.6.32-431.el6.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
/dev/dfa: Unable to detect device type
Please specify device type with the -d option.
Use smartctl -h to get a usage summary
Let's see which tools the shannon driver package ships with:
# rpm -qa|grep shan
shannon-2.6.32-431.el6.x86_64.x86_64-v2.6-9.x86_64
# rpm -ql shannon-2.6.32-431.el6.x86_64.x86_64-v2.6-9.x86_64
/lib/modules/2.6.32-431.el6.x86_64/extra/shannon/Module.symvers
/lib/modules/2.6.32-431.el6.x86_64/extra/shannon/shannon.ko
/lib/udev/rules.d/60-persistent-storage-shannon.rules
/usr/bin/shannon-attach
/usr/bin/shannon-beacon
/usr/bin/shannon-bugreport
/usr/bin/shannon-detach
/usr/bin/shannon-format
/usr/bin/shannon-status
The management tools, one by one:
1. Attach the SSD card so the operating system can use it. Normally not needed, unless you detached it earlier.
# /usr/bin/shannon-attach -h
Usage: shannon-attach [OPTIONS] [device]
Attaches Direct-IO PCIe SSD card and makes it available to the operating system.
-h, --help
Display this usage message.
-n, --readonly
Attach the Direct-IO drive and set access mode to readonly.
-r, --reduced-write
Attach the Direct-IO drive and set access mode to reduced-write.
If no -n or -r option is given, set access mode to normal readwrite.
Device node for the Direct-IO PCIe SSD card (/dev/sctx).
2. LED beacon, generally used to physically tell one SSD from another when several are installed.
# /usr/bin/shannon-beacon -h
Usage: shannon-beacon [OPTIONS] [device]
Lights the Direct-IO PCIe SSD card's yellow LED to locate the device.
This utility always turns the LED on, unless you specifically use the -o option.
-h, --help
Display this usage message.
Light on the yellow LED.
Turn off the yellow LED.
Device node for the Direct-IO PCIe SSD card (/dev/sctx).
3. Collect a bug report. For example:
# /usr/bin/shannon-bugreport -h
usage: shannon-bugreport
# /usr/bin/shannon-bugreport
$hostname$
Linux-2.6.32-431.el6.x86_64
Copying files into system...
Copying files into system...
Copying files into system...
Copying files into proc...
Copying files into proc/self...
Copying files into log...
Copying files into log...
Copying files into log...
Copying files into log...
Dumping dmesg ...
Copying files into sys...
Copying files into sys...
Copying files into sys...
Copying files into sys...
Dumping lspci -vvvvv...
Dumping uname -a...
Dumping hostname ...
Dumping ps aux...
Dumping ps aux --sort start_time...
Dumping pstree ...
Dumping lsof ...
Dumping w ...
Dumping lsmod ...
Dumping dmidecode ...
Dumping sar -A...
Dumping sar -r...
Dumping sar ...
Dumping iostat -dmx 1 5...
Dumping vmstat 1 5...
Dumping top -bc -d1 -n5...
Copying files into disk...
Copying files into disk...
Dumping df -h...
Dumping pvs ...
Dumping vgs ...
Dumping lvs ...
Dumping dmsetup table...
Dumping dmsetup status...
Gathering information using shannon-status...
Dumping numactl --hardware...
Dumping numactl --show...
Dumping numastat ...
Copying files into debug...
Building tarball...
The result is stored as a tarball under /tmp, convenient for sending the vendor data to analyze.
Tarball: /tmp/shannon-bugreport-326.tar.gz
Plz send it to our customer service, including steps to reproduce the problem.
All the information would help us address the issue tremendously.
4. Remove an SSD device from the operating system, i.e. detach it.
# /usr/bin/shannon-detach -h
Usage: shannon-detach [OPTIONS] [device]
Detaches and removes the corresponding /dev/dfx Direct-IO block device.
-h, --help
Display this usage message.
Device node for the Direct-IO PCIe SSD card (/dev/sctx).
5. Format the SSD device: probe the current physical capacity, erase it, set the minimum logical access unit, set the user-visible capacity, and so on (a usage sketch follows the help text below).
Note that an SSD's physical capacity is generally larger than its usable capacity; the reserve is used for error correction and for replacing bad blocks. Flash cells tolerate a limited number of erase cycles; beyond the limit a cell is physically damaged and becomes a bad block, and bad blocks need spare capacity to replace and repair them.
shannon reserves roughly 27%, a value you can see in the status command; it is consistent with the capacities reported later: (1632.37 - 1200.00) / 1632.37 ≈ 26.5%, i.e. about 27%.
With a fair number of bad blocks the SSD still works, but once the reserve is nearly used up, the drive heads toward scrap as bad blocks keep accumulating.
So when bad blocks pile up, you can shrink the user-visible capacity to keep 27% or more in reserve and keep the SSD usable, until eventually every block has gone bad and nothing more can be written.
# /usr/bin/shannon-format -h
Usage: shannon-format [OPTIONS] [device]
Direct-IO PCIe SSD card is pre-formated before shipped to the customer.
This tool can perform a re-format as needed.
Re-format will erase all your data on the drive.
Please use shannon-detach to detach block device before using this tool,
-h, --help
Display this usage message.
-p, --probe
Probe current physical capacity and advertised user capacity.
-e, --erase
Erase all data and re-format the drive without changing setting.
-y, --always-yes
Auto-answer "yes" to all queries from this tool (i.e. bypass prompts).
-i DELAY, --interrupt=DELAY
Set interrupt delay (unit: us).
-b SIZE, --logical-block=SIZE
Set logical block size, i.e the minimum access unit from host.
-s CAPACITY, --capacity=CAPACITY
Set user capacity as a specific size(in TB, GB, or MB)
or as a percentage(such as 70%) of the advertised capacity.
-o CAPACITY, --overformat=CAPACITY
Over-format user capacity (to greater than the advertised capacity)
as a specific size(in TB, GB, or MB)
or as a percentage(e.g. 70%) of physical capacity.
-a, --advertised
Set user capacity to advertised capacity directly.
Warning: -s, -o , -a options are mutually exclusive!
Device node for the Direct-IO PCIe SSD card (/dev/sctx).
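A hedged usage sketch of shrinking user capacity with the documented -s flag (the device nodes and percentage are examples; this erases all data, so detach the block device first as the help text says):
# /usr/bin/shannon-detach /dev/dfa
# /usr/bin/shannon-format -s 70% /dev/scta
Setting user capacity to 70% of the advertised size leaves correspondingly more physical capacity in reserve for bad-block replacement.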
6. As mentioned, the shannon SSD is not in the smartmontools database; fortunately the vendor provides this tool to inspect the SSD's status, including remaining life, bad-block ratio, reserved space, temperature sensor values, and so on.
# /usr/bin/shannon-status -h
Usage: shannon-status [OPTIONS] [device]
Shows information about Direct-IO PCIe card(s) installed on the system.
-h, --help
Display this usage message.
-r SECS, --refresh=SECS
Set refresh interval of monitoring (unit: second, default: 2 seconds).
Find all Direct-IO drive(s), and provide basic information of them.
-m, --monitor
If given, this tool will open a monitoring window,
which dynamically shows detailed information of the specified drive.
-p, --print
Generate key=value format output for easier parsing.
Device node for the Direct-IO PCIe SSD card (/dev/sctx).
# /usr/bin/shannon-status -a
Found Shannon PCIE SSD card /dev/scta:
Direct-IO drive scta at PCI Address:41:00:0:
PCI address; lspci can show the corresponding interface and driver details.
Model:sh-shannon-pcie-ssd, SN: b014
Device state: attached as disk /dev/dfa, Access mode: readwrite
Firmware version: 0c321351, Driver version: 2.6.9
Firmware version and driver version
Vendor:1cb0, Device:0275, Sub vendor:1cb0, Sub device:0275
Flash manufacturer: 98, Flash id: 3a
Channels: 7, Lunsets in channel: 8, Luns in lunset: 2, Available luns: 112
Eblocks in lun: 2116, Pages in eblock: 256, Nand page size: 32768
Logical sector: 512, Physical sector: 4096
Logical sector size and the SSD's physical sector size (bytes); align partitions to the physical sector size, i.e. 4K alignment
User capacity: 1200.00 GB/1117.59 GiB
User-visible capacity
Physical capacity: 1632.37 GB/1520.26 GiB
Physical capacity; you can see that roughly 400 GB is reserved for repairing bad blocks
Overprovision: 27%, warn at 10%
Error correction: 35 bits per 880 bytes codeword
Error-correction information
Controller internal temperature: 71 degC, max 77 degC
Temperature sensor readings
Controller board temperature: 53 degC, max 59 degC
NAND Flash temperature: 53 degC, max 63 degC
Internal voltage: 1001 mV, max 1028 mV
Auxiliary voltage: 1795 mV, max 1804 mV
Media status: 1.0760% bad block
Media status: about 1% of blocks are already bad.
Power on hours: 9 hours 15 minutes, Power cycles: 3
Power-on time and power-cycle count
Lifetime data volumes:
Host write data  : 13774.36 GB / 12828.37 GiB
Host read data   : 5172.11 GB / 4816.90 GiB
Total write data : 19974.61 GB / 18602.80 GiB
Write amplifier
Estimated life left: 99% left
Estimated remaining life, a handy reference for deciding when to replace the drive.
Totally found 1 Direct-IO PCIe SSD card on this system.
Key-value output format; the information is much the same as above.
# /usr/bin/shannon-status -p /dev/scta
drive=/dev/scta
pci_address=41:00:0
model=sh-shannon-pcie-ssd
serial_number=b014
device_state=attached as disk /dev/dfa
access_mode=readwrite
firmware_version=0c321351
driver_version=2.6.9
vendor_id=1cb0
device_id=0275
subsystem_vendor_id=1cb0
subsystem_device_id=0275
flash_manufacturer=98
flash_id=3a
channels=7
lunsets_in_channel=8
luns_in_lunset=2
available_luns=112
eblocks_in_lun=2116
pages_in_eblock=256
nand_page_size=32768
logical_sector=512
physical_sector=4096
user_capacity=1200.00 GB
physical_capacity=1632.37 GB
overprovision=27%
error_correction=35 bits per 880 bytes codeword
controller_temp=71 degC
controller_temp_max=77 degC
board_temp=53 degC
board_temp_max=59 degC
flash_temp=53 degC
flash_temp_max=63 degC
internal_voltage=1004 mV
internal_voltage_max=1028 mV
auxiliary_voltage=1790 mV
auxiliary_voltage_max=1804 mV
bad_block_percentage=1.0760%
power_on_hours=9 hours 15 minutes
power_cycles=3
host_write_data=13801.32 GB
host_read_data=5172.11 GB
total_write_data=20001.57 GB
write_amplifier=1.4493
estimated_life_left=99%
Example of real-time monitoring:
# /usr/bin/shannon-status -m -r 1 /dev/scta
Direct-IO PCIe SSD Card Monitor Program
Commands: q|Q g|G m|M main window
We are now monitoring disk 'scta' at PCI:41:00:0
Capacity: 1200.00 GB/1117.59 GiB, Block size: 4096, Overprovision: 27%
Power on hours           : 9 hours 20 minutes
Power cycles
Controller internal temp : 72 degC, max 77 degC
Controller board temp    : 53 degC, max 59 degC
NAND Flash temperature   : 53 degC, max 63 degC
Internal voltage         : 990 mV, max 1028 mV
Auxiliary voltage        : 1781 mV, max 1804 mV
Free block count
Host write data          : 13987.89 GB / 13027.24 GiB
Write Bandwith           : 733.258 MB/s / 699.289 MiB/s
Write IOPS               : 89.509 K
Avg write latency        : 0.013 ms
Host read data           : 5172.11 GB / 4816.90 GiB
Read Bandwith            : 0.000 MB/s / 0.000 MiB/s
Avg read latency         : 0.000 ms
Total write data         : 20188.16 GB / 18801.69 GiB
Total write Bandwith     : 733.258 MB/s / 699.289 MiB/s
Write amplifier          : life 1.443, transient 1.000
Buffer write percentage
Use lspci to view the corresponding PCI interface information; it is the same 41:00.0 listing shown earlier.
Now for the key part. Two things are tested: first, fsync performance, which directly affects database checkpoints, xlog flushes and so on; second, PostgreSQL read/write performance, with the test model given further below.
1. Testing fsync performance.
Mind physical-block alignment when partitioning; this test uses 4K alignment.
According to the 宝存 engineer yesterday, the 宝存 SSD does not have the following problem.
Also, test after more than 50% of the capacity has been written, because OCZ has this issue: once the drive is more than half full, performance drops.
Finally, CPU differences may keep IOPS below the device's limit; in that case start more test processes until iostat reports %util = 100 for the block device.
Here two processes are run in parallel, each fsync'ing 8 KB blocks. (To test 4 KB blocks, rebuild PostgreSQL with --with-wal-blocksize=4; the supported block sizes are 1, 2, 4, 8, 16, 32 and 64. Alternatively use dd, with obs setting the block size and oflag=sync,nonblock,noatime forcing synchronous writes; a sketch follows.)
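A minimal dd-based stand-in for pg_test_fsync, under the assumption of a scratch file /ssd/ddtest (divide the elapsed time by the count to get usecs per synchronous 8 KB write):
# dd if=/dev/zero of=/ssd/ddtest bs=8k count=10000 oflag=sync,nonblock,noatime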
OCZ test results
# /data_ssd0/pgsql9.3.4/bin/pg_test_fsync -f /ssd/1
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync
96 usecs/op
997 usecs/op
532.699 ops/sec
1877 usecs/op
fsync_writethrough
153 usecs/op
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync
323 usecs/op
939.144 ops/sec
1065 usecs/op
481.663 ops/sec
2076 usecs/op
fsync_writethrough
296 usecs/op
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write
225 usecs/op
8kB open_sync writes
320 usecs/op
4kB open_sync writes
521 usecs/op
2kB open_sync writes
978.837 ops/sec
1022 usecs/op
1kB open_sync writes
518.352 ops/sec
1929 usecs/op
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close
493.002 ops/sec
2028 usecs/op
write, close, fsync
522.625 ops/sec
1913 usecs/op
Non-Sync'ed 8kB writes:
5 usecs/op
# /data_ssd0/pgsql9.3.4/bin/pg_test_fsync -f /ssd/2
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync
305 usecs/op
930 usecs/op
496.757 ops/sec
2013 usecs/op
fsync_writethrough
169 usecs/op
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync
302 usecs/op
994 usecs/op
524.735 ops/sec
1906 usecs/op
fsync_writethrough
320 usecs/op
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write
226 usecs/op
8kB open_sync writes
321 usecs/op
4kB open_sync writes
523 usecs/op
2kB open_sync writes
978.444 ops/sec
1022 usecs/op
1kB open_sync writes
551.965 ops/sec
1812 usecs/op
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close
553.674 ops/sec
1806 usecs/op
write, close, fsync
601.139 ops/sec
1664 usecs/op
Non-Sync'ed 8kB writes:
5 usecs/op
Two processes already reach OCZ's fsync bottleneck.
宝存 (Shannon) test results
# /opt/pgsql9.3.4/bin/pg_test_fsync -f /ssd/1
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync
27 usecs/op
26 usecs/op
30 usecs/op
fsync_writethrough
21 usecs/op
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync
46 usecs/op
47 usecs/op
58 usecs/op
fsync_writethrough
43 usecs/op
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write
44 usecs/op
8kB open_sync writes
44 usecs/op
4kB open_sync writes
74 usecs/op
2kB open_sync writes
432.588 ops/sec
2312 usecs/op
1kB open_sync writes
269.104 ops/sec
3716 usecs/op
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close
64 usecs/op
write, close, fsync
64 usecs/op
Non-Sync'ed 8kB writes:
5 usecs/op
# /opt/pgsql9.3.4/bin/pg_test_fsync -f /ssd/2
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync
27 usecs/op
26 usecs/op
30 usecs/op
fsync_writethrough
21 usecs/op
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync
46 usecs/op
48 usecs/op
57 usecs/op
fsync_writethrough
43 usecs/op
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write
44 usecs/op
8kB open_sync writes
44 usecs/op
4kB open_sync writes
73 usecs/op
2kB open_sync writes
462.776 ops/sec
2161 usecs/op
1kB open_sync writes
260.950 ops/sec
3832 usecs/op
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close
65 usecs/op
write, close, fsync
64 usecs/op
Non-Sync'ed 8kB writes:
5 usecs/op
It looks like two fsync processes put no pressure at all on the 宝存, so three processes were run; the results follow.
# /opt/pgsql9.3.4/bin/pg_test_fsync -f /ssd/1
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync
33 usecs/op
37 usecs/op
52 usecs/op
fsync_writethrough
36 usecs/op
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync
71 usecs/op
63 usecs/op
80 usecs/op
fsync_writethrough
74 usecs/op
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write
53 usecs/op
8kB open_sync writes
72 usecs/op
4kB open_sync writes
115 usecs/op
2kB open_sync writes
347.167 ops/sec
2880 usecs/op
1kB open_sync writes
269.654 ops/sec
3708 usecs/op
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close
73 usecs/op
write, close, fsync
76 usecs/op
Non-Sync'ed 8kB writes:
6 usecs/op
# /opt/pgsql9.3.4/bin/pg_test_fsync -f /ssd/2
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync
34 usecs/op
37 usecs/op
51 usecs/op
fsync_writethrough
36 usecs/op
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync
71 usecs/op
62 usecs/op
80 usecs/op
fsync_writethrough
73 usecs/op
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write
54 usecs/op
8kB open_sync writes
73 usecs/op
4kB open_sync writes
117 usecs/op
2kB open_sync writes
336.883 ops/sec
2968 usecs/op
1kB open_sync writes
314.388 ops/sec
3181 usecs/op
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close
73 usecs/op
write, close, fsync
78 usecs/op
Non-Sync'ed 8kB writes:
7 usecs/op
# /opt/pgsql9.3.4/bin/pg_test_fsync -f /ssd/3
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync
33 usecs/op
38 usecs/op
51 usecs/op
fsync_writethrough
37 usecs/op
Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync
71 usecs/op
63 usecs/op
79 usecs/op
fsync_writethrough
73 usecs/op
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write
53 usecs/op
8kB open_sync writes
75 usecs/op
4kB open_sync writes
117 usecs/op
2kB open_sync writes
349.729 ops/sec
2859 usecs/op
1kB open_sync writes
311.943 ops/sec
3206 usecs/op
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close
72 usecs/op
write, close, fsync
78 usecs/op
Non-Sync'ed 8kB writes:
7 usecs/op
Aggregating the one-8KB-write results from all processes, the 宝存 1.2 TB far outstrips the OCZ RevoDrive3 X2 960G.
(Comparison table in the original post: the open_datasync and fsync_writethrough rates for each card.)
2. Database TPS test (of limited value for judging the SSD itself, since the bottleneck may be the CPU)
Database version: 9.3.4
Database configuration, chiefly the WAL fsync method; based on the results above I picked the best method for each card: open_datasync for 宝存, open_sync for OCZ (which does not support open_datasync).
grep "^[a-z]" postgresql.conf
listen_addresses = '0.0.0.0'
# what IP address(es)
port = 5432
# (change requires restart)
max_connections = 100
# (change requires restart)
unix_socket_directories = '.'
# comma-separated list of directories
shared_buffers = 2048MB
# min 128kB
maintenance_work_mem = 512MB
vacuum_cost_delay = 10
# 0-100 milliseconds
vacuum_cost_limit = 10000
# 1-10000 credits
bgwriter_delay = 10ms
# 10-10000ms between rounds
wal_level = hot_standby
# minimal, archive, or hot_standby
synchronous_commit = on
wal_sync_method = open_datasync
# the default is the first option
wal_buffers = 16384kB
# min 32kB, -1 sets based on shared_buffers
checkpoint_segments = 128
# in logfile segments, min 1, 16MB each
effective_cache_size = 24000MB
log_destination = 'csvlog'
# Valid values are combinations of
logging_collector = on
# Enable capturing of stderr and csvlog
log_truncate_on_rotation = on
# If on, an existing log file with the
log_timezone = 'PRC'
autovacuum = on
# Enable autovacuum subprocess?
log_autovacuum_min_duration = 0 # -1 disables, 0 logs all actions and
datestyle = 'iso, mdy'
timezone = 'PRC'
lc_messages = 'C'
# locale for system error message
lc_monetary = 'C'
# locale for monetary formatting
lc_numeric = 'C'
# locale for number formatting
lc_time = 'C'
# locale for time formatting
default_text_search_config = 'pg_catalog.english'
postgres=# create table test (id int primary key, info text, crt_time timestamp);
CREATE TABLE
postgres=# create or replace function f(v_id int) returns void as $$
begin
  update test set info=md5(now()::text), crt_time=now() where id=v_id;
  if not found then
    insert into test values (v_id, md5(now()::text), now());
  end if;
exception when others then
  return;
end;
$$ language plpgsql;
CREATE FUNCTION
$ vi test.sql
\setrandom vid 1
select f(:vid);
pgbench -M prepared -n -r -f ./test.sql -c 12 -j 4 -T 300
Test result comparison
pgbench -M prepared -n -r -f ./test.sql -c 12 -j 4 -T 300 -h /data_ssd1/test -p 5432 -U postgres postgres
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 12
number of threads: 4
duration: 300 s
number of transactions actually processed: 6215372
(including connections establishing)
(excluding connections establishing)
statement latencies in milliseconds:
\setrandom vid 1
select f(:vid);
There is a gap versus asynchronous-commit efficiency.
pgbench -M prepared -n -r -f ./test.sql -c 12 -j 4 -T 30 -h /data_ssd1/test -p 5432 -U postgres postgres
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 12
number of threads: 4
duration: 30 s
number of transactions actually processed: 1284258
(including connections establishing)
(excluding connections establishing)
statement latencies in milliseconds:
\setrandom vid 1
select f(:vid);
$ /opt/pgsql9.3.4/bin/pgbench -M prepared -n -r -f ./test.sql -c 12 -j 4 -T 300 -h /mnt/pg_root -p 5432 -U postgres postgres
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 12
number of threads: 4
duration: 300 s
number of transactions actually processed:
(including connections establishing)
(excluding connections establishing)
statement latencies in milliseconds:
\setrandom vid 1
select f(:vid);
Still a fairly large gap versus asynchronous commit (synchronous_commit = off).
$ /opt/pgsql9.3.4/bin/pgbench -M prepared -n -r -f ./test.sql -c 12 -j 4 -T 30 -h /mnt/pg_root -p 5432 -U postgres postgres
transaction type: Custom query
scaling factor: 1
query mode: prepared
number of clients: 12
number of threads: 4
duration: 30 s
number of transactions actually processed: 1740723
(including connections establishing)
(excluding connections establishing)
statement latencies in milliseconds:
\setrandom vid 1
select f(:vid);
Judging by the asynchronous-commit runs, the server hosting the OCZ has the weaker CPU and clearly trails the 宝存 host.
Also, in the database TPS results the gap between OCZ and 宝存 is nowhere near as large as in the direct fsync test, because the bottleneck is no longer I/O but the CPU.
Test data summary:
(Note the test machines' CPUs run at 2.0 GHz and 2.5 GHz respectively, so OCZ's numbers can be scaled by about 1.25. When I have time I will put the OCZ card into the same server, on a PCI-E slot of the same speed, and retest; treat the current numbers as reference only.)
(Note too that the database test is not a pure I/O test; much of the cost is CPU, so it is also for reference only.)
(The fsync test fsyncs a single 8 KB block.)
About this 宝存 SSD's flash-cell lifetime: I have had it for only a day, so it has not been written much yet. Below is its current health status.
If it keeps being used at the current intensity (basically abuse), then by the vendor's endurance figure (capacity × 10000) it would be worn out in roughly 200 days.
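The arithmetic behind that estimate, as a sketch (the endurance formula is the vendor's; the write volume comes from the status output below). About 48235 GB were written in roughly 21 power-on hours, i.e. about 55000 GB/day, against an endurance budget of 1200 GB × 10000 = 12,000,000 GB:
# echo "1200 * 10000 / (48235 / 21 * 24)" | bc -l
217.68...
so roughly 200-odd days at this rate.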
Direct-IO drive scta at PCI Address:41:00:0:
Model:sh-shannon-pcie-ssd, SN: b014
Device state: attached as disk /dev/dfa, Access mode: readwrite
Firmware version: 0c321351, Driver version: 2.6.9
Vendor:1cb0, Device:0275, Sub vendor:1cb0, Sub device:0275
Flash manufacturer: 98, Flash id: 3a
Channels: 7, Lunsets in channel: 8, Luns in lunset: 2, Available luns: 112
Eblocks in lun: 2116, Pages in eblock: 256, Nand page size: 32768
Logical sector: 512, Physical sector: 4096
User capacity: 1200.00 GB/1117.59 GiB
Physical capacity: 1632.37 GB/1520.26 GiB
Overprovision: 27%, warn at 10%
Error correction: 35 bits per 880 bytes codeword
Controller internal temperature: 70 degC, max 77 degC
Controller board temperature: 52 degC, max 59 degC
NAND Flash temperature: 52 degC, max 63 degC
Internal voltage: 999 mV, max 1028 mV
Auxiliary voltage: 1787 mV, max 1807 mV
Media status: 1.0760% bad block
Power on hours: 21 hours 9 minutes, Power cycles: 3
Lifetime data volumes:
Host write data  : 41516.25 GB / 38665.02 GiB
Host read data   : 5172.11 GB / 4816.90 GiB
Total write data : 48235.25 GB / 44922.58 GiB
Write amplifier
Estimated life left: 99% left
Finally, a brief look at common SSD application scenarios (a configuration sketch follows this list):
1. As a filesystem second-level cache, e.g. ZFS's L2ARC (the bigger the better, though it only holds non-dirty data) and SLOG (within 10 GB is usually enough; mirroring is recommended).
2. Similar second-level caching with Facebook's flashcache.
3. As the database statistics directory (stats_temp_directory).
4. As a tablespace for the database's hot tables.
5. As the operating system's swap partition.
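For use case 3, a hedged postgresql.conf sketch (the mount point is an example):
stats_temp_directory = '/ssd/pg_stats_tmp'		# keep the heavily rewritten statistics temp files on the SSD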
Usage notes:
1. Align partitions to the physical access unit.
2. $SRC/contrib/pg_test_fsync/pg_test_fsync.c
#define XLOG_BLCKSZ_K	(XLOG_BLCKSZ / 1024)