Contents:

  1) TCM Userspace Design
    a) Background
    b) Benefits
    c) Design constraints
    d) Implementation overview
      i. Mailbox
      ii. Command ring
      iii. Data Area
    e) Device discovery
    f) Device events
    g) Other contingencies
  2) Writing a user pass-through handler
    a) Discovering and configuring TCMU uio devices
    b) Waiting for events on the device(s)
    c) Managing the command ring
  3) Command filtering and pass_level
  4) A final note

TCM Userspace Design
--------------------

TCM is another name for LIO, an in-kernel iSCSI target (server).
Existing TCM targets run in the kernel. TCMU (TCM in Userspace)
allows userspace programs to be written which act as iSCSI targets.
This document describes the design.

The existing kernel provides modules for different SCSI transport
protocols. TCM also modularizes the data storage. There are existing
modules for file, block device, RAM or using another SCSI device as
storage. These are called "backstores" or "storage engines". These
built-in modules are implemented entirely as kernel code.

Background:

In addition to modularizing the transport protocol used for carrying
SCSI commands ("fabrics"), the Linux kernel target, LIO, also modularizes
the actual data storage. These are referred to as "backstores"
or "storage engines". The target comes with backstores that allow a
file, a block device, RAM, or another SCSI device to be used for the
local storage needed for the exported SCSI LUN. Like the rest of LIO,
these are implemented entirely as kernel code.

These backstores cover the most common use cases, but not all. One new
use case that other non-kernel target solutions, such as tgt, are able
to support is using Gluster's GLFS or Ceph's RBD as a backstore. The
target then serves as a translator, allowing initiators to store data
in these non-traditional networked storage systems, while still only
using standard protocols themselves.

If the target is a userspace process, supporting these is easy. tgt,
for example, needs only a small adapter module for each, because the
modules just use the available userspace libraries for RBD and GLFS.

Adding support for these backstores in LIO is considerably more
difficult, because LIO is entirely kernel code. Instead of undertaking
the significant work to port the GLFS or RBD APIs and protocols to the
kernel, another approach is to create a userspace pass-through
backstore for LIO, "TCMU".

Benefits:

In addition to allowing relatively easy support for RBD and GLFS, TCMU
will also allow easier development of new backstores. TCMU combines
with the LIO loopback fabric to become something similar to FUSE
(Filesystem in Userspace), but at the SCSI layer instead of the
filesystem layer. A SUSE, if you will.

The disadvantage is that there are more distinct components to
configure, and potentially to malfunction. This is unavoidable, but
hopefully not fatal if we're careful to keep things as simple as
possible.

Design constraints:

- Good performance: high throughput, low latency
- Cleanly handle the cases where userspace:
   1) never attaches
   2) hangs
   3) dies
   4) misbehaves
- Allow future flexibility in user & kernel implementations
- Be reasonably memory-efficient
- Simple to configure & run
- Simple to write a userspace backend

Implementation overview:

The core of the TCMU interface is a memory region that is shared
between kernel and userspace. Within this region are: a control area
(mailbox); a lockless producer/consumer circular buffer for commands
to be passed up, and status returned; and an in/out data buffer area.

TCMU uses the pre-existing UIO subsystem. UIO allows device driver
development in userspace, and this is conceptually very close to the
TCMU use case, except instead of a physical device, TCMU implements a
memory-mapped layout designed for SCSI commands. Using UIO also
benefits TCMU by handling device introspection (e.g. a way for
userspace to determine how large the shared region is) and signaling
mechanisms in both directions.

There are no embedded pointers in the memory region. Everything is
expressed as an offset from the region's starting address. This allows
the ring to still work if the user process dies and is restarted with
the region mapped at a different virtual address.

See target_core_user.h for the struct definitions.
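
Because every reference is an offset from the start of the region, a
handler usually resolves offsets in one place. The following is a
minimal sketch of such a helper; region_ptr() and its bounds check are
hypothetical and not part of the TCMU interface.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: turn an offset into the shared region into a
 * pointer, or return NULL if [off, off + len) falls outside the mapping.
 */
static void *region_ptr(void *map, size_t map_len, uint64_t off, size_t len)
{
    if (off > map_len || len > map_len - off)
        return NULL;
    return (char *)map + off;
}
```

Checking the range before dereferencing keeps a corrupted offset from
taking the handler outside its own mapping.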

The Mailbox:

The mailbox is always at the start of the shared memory region, and
contains a version, details about the starting offset and size of the
command ring, and head and tail offsets to be used by the kernel and
userspace (respectively) to put commands on the ring, and indicate
when the commands are completed.

version - 1 (userspace should abort if otherwise)
flags - none yet defined.
cmdr_off - The offset of the start of the command ring from the start
of the memory region, to account for the mailbox size.
cmdr_size - The size of the command ring. This does *not* need to be a
power of two.
cmd_head - Modified by the kernel to indicate when a command has been
placed on the ring.
cmd_tail - Modified by userspace to indicate when it has completed
processing of a command.
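
A handler may want to sanity-check these fields right after mmap().
The sketch below uses a simplified stand-in struct, example_mailbox,
purely for illustration; the authoritative layout is struct
tcmu_mailbox in target_core_user.h, and the specific checks are
assumptions about what a cautious handler would verify.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the mailbox; the real layout is in
 * target_core_user.h and may differ in field types and alignment. */
struct example_mailbox {
    uint16_t version;
    uint16_t flags;
    uint32_t cmdr_off;
    uint32_t cmdr_size;
    uint32_t cmd_head;
    uint32_t cmd_tail;
};

/* Return 0 if the mailbox looks usable for a mapping of map_len bytes,
 * -1 otherwise. */
static int example_mailbox_check(const struct example_mailbox *mb,
                                 size_t map_len)
{
    if (mb->version != 1)
        return -1;  /* unknown interface version: refuse to continue */
    if (mb->cmdr_size == 0 || mb->cmdr_off < sizeof(*mb))
        return -1;  /* ring must exist and start after the mailbox */
    if (mb->cmdr_off > map_len || mb->cmdr_size > map_len - mb->cmdr_off)
        return -1;  /* ring must lie inside the mapping */
    if (mb->cmd_head >= mb->cmdr_size || mb->cmd_tail >= mb->cmdr_size)
        return -1;  /* head and tail are offsets within the ring */
    return 0;
}
```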

The Command Ring:

Commands are placed on the ring by the kernel incrementing
mailbox.cmd_head by the size of the command, modulo cmdr_size, and
then signaling userspace via uio_event_notify(). Once the command is
completed, userspace updates mailbox.cmd_tail in the same way and
signals the kernel via a 4-byte write(). When cmd_head equals
cmd_tail, the ring is empty -- no commands are currently waiting to be
processed by userspace.
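
The head/tail arithmetic described above can be sketched as follows.
The function names are hypothetical; only the modulo-cmdr_size update
and the "head equals tail means empty" rule come from the text.

```c
#include <stdint.h>

/* Advance a ring offset by len bytes, wrapping at ring_size. */
static uint32_t ring_advance(uint32_t off, uint32_t len, uint32_t ring_size)
{
    return (uint32_t)(((uint64_t)off + len) % ring_size);
}

/* The ring is empty when the consumer has caught up with the producer. */
static int ring_empty(uint32_t head, uint32_t tail)
{
    return head == tail;
}

/* Bytes between tail and head: entries posted but not yet completed. */
static uint32_t ring_pending_bytes(uint32_t head, uint32_t tail,
                                   uint32_t ring_size)
{
    return head >= tail ? head - tail : ring_size - tail + head;
}
```

For example, with a ring size of 1024, advancing offset 1000 by a
48-byte entry wraps around to offset 24.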

TCMU commands start with a common header containing "len_op", a 32-bit
value that stores the length, as well as the opcode in the lowest
unused bits. Currently only two opcodes are defined, TCMU_OP_PAD and
TCMU_OP_CMD. When userspace encounters a command with PAD opcode, it
should skip ahead by the number of bytes given in the length. (The
kernel inserts PAD entries to ensure each CMD entry fits contiguously
into the circular buffer.)
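
To illustrate how a length and an opcode can share one 32-bit value,
here is a small sketch. The 3-bit mask and the numeric opcode values
are assumptions made for the example; the authoritative definitions
(TCMU_OP_MASK, TCMU_OP_PAD, TCMU_OP_CMD and the accessor helpers) are
in target_core_user.h.

```c
#include <stdint.h>

#define EX_OP_MASK 0x7u  /* assumption: opcode occupies the low 3 bits */

enum ex_opcode { EX_OP_PAD = 0, EX_OP_CMD = 1 };  /* assumed values */

/* Extract the opcode from a packed len_op value. */
static uint32_t ex_get_op(uint32_t len_op)
{
    return len_op & EX_OP_MASK;
}

/* Extract the entry length; the opcode bits are cleared. */
static uint32_t ex_get_len(uint32_t len_op)
{
    return len_op & ~EX_OP_MASK;
}

/* Pack a length and opcode. This only round-trips when len is a
 * multiple of 8, which is why the low bits are "unused" by the length. */
static uint32_t ex_make_len_op(uint32_t len, uint32_t op)
{
    return (len & ~EX_OP_MASK) | (op & EX_OP_MASK);
}
```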

When userspace handles a CMD, it finds the SCSI CDB (Command Descriptor
Block) via tcmu_cmd_entry.req.cdb_off. This is an offset from the
start of the overall shared memory region, not the entry. The data
in/out buffers are accessible via the req.iov[] array. Note that
each iov.iov_base is also an offset from the start of the region.
TCMU currently does not support BIDI operations.
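
Since each iov_base holds an offset rather than a pointer, it has to be
rebased onto the mapping before use. The sketch below copies the data
described by an iovec array into a flat buffer; gather_iovs() is a
hypothetical helper, and it assumes the offsets have already been
validated against the mapping size.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Copy the data described by iov[] (whose iov_base fields hold offsets
 * from the start of the region, not pointers) into dst.
 * Returns the number of bytes copied, stopping when dst is full.
 */
static size_t gather_iovs(void *map, const struct iovec *iov, int iov_cnt,
                          void *dst, size_t dst_len)
{
    size_t copied = 0;

    for (int i = 0; i < iov_cnt && copied < dst_len; i++) {
        size_t off = (size_t)(uintptr_t)iov[i].iov_base;
        size_t n = iov[i].iov_len;

        if (n > dst_len - copied)
            n = dst_len - copied;
        memcpy((char *)dst + copied, (char *)map + off, n);
        copied += n;
    }
    return copied;
}
```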

When completing a command, userspace sets rsp.scsi_status, and
rsp.sense_buffer if necessary. Userspace then increments
mailbox.cmd_tail by entry.hdr.length (mod cmdr_size) and signals the
kernel via the UIO method, a 4-byte write to the file descriptor.
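
When a command fails, the sense buffer is normally filled with sense
data in the fixed format defined by the SCSI Primary Commands
specification. The following sketch shows one way to do that;
fill_fixed_sense() is a hypothetical helper, and the caller is assumed
to pass the real size of rsp.sense_buffer as defined in
target_core_user.h.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Fill buf (at least 18 bytes) with fixed-format sense data.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int fill_fixed_sense(uint8_t *buf, size_t buf_len,
                            uint8_t key, uint8_t asc, uint8_t ascq)
{
    if (buf_len < 18)
        return -1;
    memset(buf, 0, buf_len);
    buf[0] = 0x70;        /* response code: current error, fixed format */
    buf[2] = key & 0x0f;  /* sense key */
    buf[7] = 10;          /* additional sense length: bytes 8..17 */
    buf[12] = asc;        /* additional sense code */
    buf[13] = ascq;       /* additional sense code qualifier */
    return 0;
}
```

For instance, an unsupported opcode is commonly reported with sense key
0x05 (ILLEGAL REQUEST) and additional sense code 0x20.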

The Data Area:

This is shared-memory space after the command ring. The organization
of this area is not defined in the TCMU interface, and userspace
should access only the parts referenced by pending iovs.

Device Discovery:

Other devices may be using UIO besides TCMU. Unrelated user processes
may also be handling different sets of TCMU devices. TCMU userspace
processes must find their devices by scanning sysfs
class/uio/uio*/name. For TCMU devices, these names will be of the
format:

  tcm-user/<hba_num>/<device_name>/<subtype>/<path>

where "tcm-user" is common for all TCMU-backed UIO devices. <hba_num>
and <device_name> allow userspace to find the device's path in the
kernel target's configfs tree. Assuming the usual mount point, it is
found at:

  /sys/kernel/config/target/core/user_<hba_num>/<device_name>

This location contains attributes such as "hw_block_size", that
userspace needs to know for correct operation.

<subtype> will be a userspace-process-unique string to identify the
TCMU device as expecting to be backed by a certain handler, and <path>
will be an additional handler-specific string for the user process to
configure the device, if needed. The name cannot contain ':', due to
LIO limitations.
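
Splitting the UIO name into its components might look like the sketch
below. The struct, the function names and the buffer sizes are
hypothetical; the sketch also assumes all five components are present
(the path may be empty after the final '/').

```c
#include <stdio.h>
#include <string.h>

struct ex_uio_name {
    unsigned int hba_num;
    char device_name[64];
    char subtype[64];
    char path[128];
};

/* Parse a UIO name; returns 0 on success, -1 if it is not a TCMU name. */
static int ex_parse_uio_name(const char *name, struct ex_uio_name *out)
{
    int consumed = 0;

    if (sscanf(name, "tcm-user/%u/%63[^/]/%63[^/]/%n",
               &out->hba_num, out->device_name, out->subtype,
               &consumed) != 3 || consumed == 0)
        return -1;
    /* Everything after the fourth '/' is the handler-specific path. */
    snprintf(out->path, sizeof(out->path), "%s", name + consumed);
    return 0;
}

/* Build the configfs directory for the device, assuming the usual
 * configfs mount point given in the text above. */
static int ex_configfs_dir(const struct ex_uio_name *n, char *buf, size_t len)
{
    int rc = snprintf(buf, len, "/sys/kernel/config/target/core/user_%u/%s",
                      n->hba_num, n->device_name);
    return (rc < 0 || (size_t)rc >= len) ? -1 : 0;
}
```

A handler would compare the parsed subtype against its own known
string and ignore any device that does not match.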

For all devices so discovered, the user handler opens /dev/uioX and
calls mmap():

  mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0)

where size must be equal to the value read from
/sys/class/uio/uioX/maps/map0/size.
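
The scan over the UIO name files described above can be sketched with
glob(3). ex_count_tcmu_devices() is a hypothetical helper that only
counts matching devices; a real handler would also check the subtype
and then open and map each device it owns.

```c
#include <glob.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Count UIO devices whose name file matches the given glob pattern and
 * whose name starts with "tcm-user/".
 * Returns the count, or -1 if glob() itself fails.
 */
static int ex_count_tcmu_devices(const char *pattern)
{
    glob_t g;
    int count = 0;
    int rc = glob(pattern, 0, NULL, &g);

    if (rc == GLOB_NOMATCH)
        return 0;  /* no UIO devices at all */
    if (rc != 0)
        return -1;
    for (size_t i = 0; i < g.gl_pathc; i++) {
        char name[256];
        FILE *f = fopen(g.gl_pathv[i], "r");

        if (!f)
            continue;
        if (fgets(name, sizeof(name), f) &&
            strncmp(name, "tcm-user/", 9) == 0)
            count++;
        fclose(f);
    }
    globfree(&g);
    return count;
}
```

It would typically be called as
ex_count_tcmu_devices("/sys/class/uio/uio*/name").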

Device Events:

If a new device is added or removed, a notification will be broadcast
over netlink, using a generic netlink family name of "TCM-USER" and a
multicast group named "config". This will include the UIO name as
described in the previous section, as well as the UIO minor
number. This should allow userspace to identify both the UIO device and
the LIO device, so that after determining the device is supported
(based on subtype) it can take the appropriate action.

Other contingencies:

Userspace handler process never attaches:

- TCMU will post commands, and then abort them after a timeout period
  (30 seconds).

Userspace handler process is killed:

- It is still possible to restart and re-connect to TCMU
  devices. Command ring is preserved. However, after the timeout period,
  the kernel will abort pending tasks.

Userspace handler process hangs:

- The kernel will abort pending tasks after a timeout period.

Userspace handler process is malicious:

- The process can trivially break the handling of devices it controls,
  but should not be able to access kernel memory outside its shared
  memory areas.


Writing a user pass-through handler (with example code)
-------------------------------------------------------

A user process handling a TCMU device must support the following:

a) Discovering and configuring TCMU uio devices
b) Waiting for events on the device(s)
c) Managing the command ring: Parsing operations and commands,
   performing work as needed, setting response fields (scsi_status and
   possibly sense_buffer), updating cmd_tail, and notifying the kernel
   that work has been finished

First, consider instead writing a plugin for tcmu-runner. tcmu-runner
implements all of this, and provides a higher-level API for plugin
authors.

TCMU is designed so that multiple unrelated processes can manage TCMU
devices separately. All handlers should make sure to only open their
devices, based upon a known subtype string.

a) Discovering and configuring TCMU UIO devices:

(error checking omitted for brevity; the device is hardcoded as uio0)

  #include <fcntl.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int fd, dev_fd;
  ssize_t ret;
  char buf[256];
  unsigned long long map_len;
  void *map;

  fd = open("/sys/class/uio/uio0/name", O_RDONLY);
  ret = read(fd, buf, sizeof(buf));
  close(fd);
  buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

  /* we only want uio devices whose name is a format we expect */
  if (strncmp(buf, "tcm-user", 8))
    exit(-1);

  /* Further checking for subtype also needed here */

  fd = open("/sys/class/uio/uio0/maps/map0/size", O_RDONLY);
  ret = read(fd, buf, sizeof(buf));
  close(fd);
  buf[ret-1] = '\0'; /* null-terminate and chop off the \n */

  /* base 0 lets strtoull() accept a "0x"-prefixed size */
  map_len = strtoull(buf, NULL, 0);

  dev_fd = open("/dev/uio0", O_RDWR);
  map = mmap(NULL, map_len, PROT_READ|PROT_WRITE, MAP_SHARED, dev_fd, 0);

b) Waiting for events on the device(s)

  while (1) {
    char buf[4];

    /* UIO delivers a 4-byte event count; the read blocks until
       the kernel signals the device */
    int ret = read(dev_fd, buf, 4);

    handle_device_events(dev_fd, map);
  }
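
A handler that serves several devices cannot block in read() on just
one of them. One possible approach, sketched below, is to poll() all
of the UIO descriptors and drain the event counter only from those
that are ready. ex_wait_devices() and its fixed limit of 16
descriptors are hypothetical choices for the example.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Wait until at least one of nfds descriptors is readable, then read
 * the 4-byte event counter from each ready one. ready[i] is set to 1
 * for each descriptor that should be serviced.
 * Returns the number of ready descriptors, 0 on timeout, -1 on error.
 */
static int ex_wait_devices(const int *fds, int *ready, int nfds,
                           int timeout_ms)
{
    struct pollfd pfds[16];
    int n;

    if (nfds < 0 || nfds > 16)
        return -1;
    for (int i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
        ready[i] = 0;
    }
    n = poll(pfds, (nfds_t)nfds, timeout_ms);
    if (n <= 0)
        return n;
    n = 0;
    for (int i = 0; i < nfds; i++) {
        uint32_t count;

        if ((pfds[i].revents & POLLIN) &&
            read(fds[i], &count, sizeof(count)) == (ssize_t)sizeof(count)) {
            ready[i] = 1;
            n++;
        }
    }
    return n;
}
```

The caller would then run its ring-processing function for every index
whose ready flag is set.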

c) Managing the command ring

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <unistd.h>
  #include <linux/target_core_user.h>

  /* SCSI_NO_SENSE and SCSI_CHECK_CONDITION are the handler's own
     definitions; their values must follow the SCSI specifications. */

  int handle_device_events(int fd, void *map)
  {
    struct tcmu_mailbox *mb = map;
    char *ring = (char *)mb + mb->cmdr_off;
    struct tcmu_cmd_entry *ent = (void *)(ring + mb->cmd_tail);
    int did_some_work = 0;

    /* Process events from cmd ring until we catch up with cmd_head */
    while (ent != (void *)(ring + mb->cmd_head)) {

      if (tcmu_hdr_get_op(&ent->hdr) == TCMU_OP_CMD) {
        /* cdb_off is relative to the region start, not the entry */
        uint8_t *cdb = (uint8_t *)mb + ent->req.cdb_off;
        bool success = true;

        /* Handle command here. */
        printf("SCSI opcode: 0x%x\n", cdb[0]);

        /* Set response fields */
        if (success)
          ent->rsp.scsi_status = SCSI_NO_SENSE;
        else {
          /* Also fill in rsp.sense_buffer here */
          ent->rsp.scsi_status = SCSI_CHECK_CONDITION;
        }
      }
      else {
        /* Do nothing for PAD entries */
      }

      /* update cmd_tail */
      mb->cmd_tail = (mb->cmd_tail + tcmu_hdr_get_len(&ent->hdr)) % mb->cmdr_size;
      ent = (void *)(ring + mb->cmd_tail);
      did_some_work = 1;
    }

    /* Notify the kernel that work has been finished */
    if (did_some_work) {
      uint32_t buf = 0;

      write(fd, &buf, 4);
    }

    return 0;
  }


Command filtering and pass_level
--------------------------------

TCMU supports a "pass_level" option with valid values of 0 or 1. When
the value is 0 (the default), nearly all SCSI commands received for
the device are passed through to the handler. This allows maximum
flexibility but increases the amount of code required by the handler,
to support all mandatory SCSI commands. If pass_level is set to 1,
then only IO-related commands are presented, and the rest are handled
by LIO's in-kernel command emulation. The commands presented at level
1 include all versions of:

  READ
  WRITE
  WRITE_VERIFY
  XDWRITEREAD
  WRITE_SAME
  COMPARE_AND_WRITE
  SYNCHRONIZE_CACHE
  UNMAP
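
A handler written for pass_level 1 may still want to reject anything
outside this set defensively. The sketch below maps the command names
above to their standard CDB opcodes; ex_is_io_opcode() is hypothetical,
and the exact set the kernel forwards is decided by the kernel, not by
this table. Variable-length CDBs (opcode 0x7f) are not covered.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if the CDB opcode belongs to the I/O set listed above. */
static bool ex_is_io_opcode(uint8_t opcode)
{
    switch (opcode) {
    case 0x08: case 0x28: case 0xa8: case 0x88: /* READ (6/10/12/16) */
    case 0x0a: case 0x2a: case 0xaa: case 0x8a: /* WRITE (6/10/12/16) */
    case 0x2e: case 0xae: case 0x8e:            /* WRITE AND VERIFY (10/12/16) */
    case 0x53:                                  /* XDWRITEREAD (10) */
    case 0x41: case 0x93:                       /* WRITE SAME (10/16) */
    case 0x89:                                  /* COMPARE AND WRITE */
    case 0x35: case 0x91:                       /* SYNCHRONIZE CACHE (10/16) */
    case 0x42:                                  /* UNMAP */
        return true;
    default:
        return false;
    }
}
```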


A final note
------------

Please be careful to return codes as defined by the SCSI
specifications. These are different from some values defined in the
scsi/scsi.h include file. For example, CHECK CONDITION's status code
is 2, not 1.
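
One way to avoid mixing up the two sets of values is for the handler
to define the status codes itself, as in this sketch. The enum and
function names are hypothetical; the numeric values are the status
codes from the SCSI Architecture Model.

```c
#include <stdint.h>

/* Status byte values as defined by the SCSI specifications. */
enum ex_sam_status {
    EX_SAM_GOOD                 = 0x00,
    EX_SAM_CHECK_CONDITION      = 0x02,
    EX_SAM_BUSY                 = 0x08,
    EX_SAM_RESERVATION_CONFLICT = 0x18,
    EX_SAM_TASK_SET_FULL        = 0x28
};

/* Pick the status byte to store in the response for a command. */
static uint8_t ex_status_for(int success)
{
    return success ? EX_SAM_GOOD : EX_SAM_CHECK_CONDITION;
}
```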