TuxOnIce 3.0 Internal Documentation.
Updated to 26 March 2009

1. Introduction.

TuxOnIce 3.0 is an addition to the Linux Kernel, designed to allow the
user to quickly shut down and quickly boot a computer, without needing
to close documents or programs. It is equivalent to the hibernate
facility in some laptops. This implementation, however, requires no
special BIOS or hardware support.

The code in these files is based upon the original implementation
prepared by Gabor Kuti and additional work by Pavel Machek and a host
of others. This code has been substantially reworked by Nigel
Cunningham, again with the help and testing of many others, not the
least of whom is Michael Frank. At its heart, however, the operation is
essentially the same as Gabor's version.

2. Overview of operation.

The basic sequence of operations is as follows:

a. Quiesce all other activity.
b. Ensure enough memory and storage space are available, and attempt
   to free memory/storage if necessary.
c. Allocate the required memory and storage space.
d. Write the image.
e. Power down.

There are a number of complicating factors which mean that things are
not as simple as the above would imply, however...

o The activity of each process must be stopped at a point where it will
  not be holding locks necessary for saving the image, or unexpectedly
  restart operations due to something like a timeout and thereby make
  our image inconsistent.

o It is desirable that we sync outstanding I/O to disk before calculating
  image statistics. This reduces corruption if one should suspend but
  then not resume, and also makes later parts of the operation safer (see
  below).

o We need to get as close as we can to an atomic copy of the data.
  Inconsistencies in the image will result in inconsistent memory contents
  at resume time, and thus in instability of the system and/or file system
  corruption. This would appear to imply a maximum image size of one half
  of the amount of RAM, but we have a solution... (again, below).

o In 2.6, we choose to play nicely with the other suspend-to-disk
  implementations.

3. Detailed description of internals.

a. Quiescing activity.

Safely quiescing the system is achieved using three separate but related
aspects.

First, we note that the vast majority of processes don't need to run during
suspend. They can be 'frozen'. We therefore implement a refrigerator
routine, which processes enter and in which they remain until the cycle is
complete. Processes enter the refrigerator via try_to_freeze() invocations
at appropriate places. A process cannot be frozen in any old place. It
must not be holding locks that will be needed for writing the image or
freezing other processes. For this reason, userspace processes generally
enter the refrigerator via the signal handling code, and kernel threads at
the place in their event loops where they drop locks and yield to other
processes or sleep.
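
As an illustration (this is not code from TuxOnIce itself, and the thread
body and work function are hypothetical), a freezer-aware kernel thread
typically calls try_to_freeze() at the point in its loop where it holds no
locks:

#include <linux/kthread.h>
#include <linux/freezer.h>
#include <linux/sched.h>

/* Hypothetical worker loop for a freezable kernel thread. */
static int example_thread(void *data)
{
	set_freezable();	/* opt in to the freezer */

	while (!kthread_should_stop()) {
		/*
		 * Enter the refrigerator here if a freeze has been
		 * requested; no locks are held at this point.
		 */
		try_to_freeze();

		do_example_work(data);	/* hypothetical work function */
		schedule_timeout_interruptible(HZ);
	}
	return 0;
}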

The task of freezing processes is complicated by the fact that there can be
interdependencies between processes. Freezing process A before process B may
mean that process B cannot be frozen, because it ends up stopped waiting for
process A rather than in the refrigerator. This issue is seen where
userspace waits on freezable kernel threads or fuse filesystem threads. To
address this issue, we implement the following algorithm for quiescing
activity:

- Freeze filesystems (including fuse - userspace programs starting
  new requests are immediately frozen; programs already running
  requests complete their work before being frozen in the next
  step)
- Freeze userspace
- Thaw filesystems (this is safe now that userspace is frozen and no
  fuse requests are outstanding).
- Invoke sys_sync (a no-op on fuse).
- Freeze filesystems
- Freeze kernel threads

If we need to free memory, we thaw kernel threads and filesystems, but not
userspace. We can then free caches without worrying about deadlocks due to
swap files being on frozen filesystems or such like.
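
The ordering above can be summarised in the following sketch. The function
names are illustrative only, not TuxOnIce's actual symbols:

/* Illustrative outline of the quiescing sequence described above. */
static int quiesce_system(void)
{
	freeze_filesystems();	/* in-flight fuse requests finish first */
	if (freeze_userspace())
		goto fail;
	thaw_filesystems();	/* safe: no new fuse requests can start */
	sys_sync();		/* flush dirty data; a no-op for fuse */
	freeze_filesystems();
	if (freeze_kernel_threads())
		goto fail;
	return 0;
fail:
	thaw_everything();
	return -EBUSY;
}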

b. Ensure enough memory & storage are available.

We have a number of constraints to meet in order to be able to successfully
suspend and resume.

First, the image will be written in two parts, described below. One of these
parts needs to have an atomic copy made, which of course implies a maximum
size of one half of the amount of system memory. The other part ('pageset')
is not atomically copied, and can therefore be as large or small as desired.

Second, we have constraints on the amount of storage available. In these
calculations, we may also consider any compression that will be done. The
cryptoapi module allows the user to configure an expected compression ratio.

Third, the user can specify an arbitrary limit on the image size, in
megabytes. This limit is treated as a soft limit, so that we don't fail the
attempt to suspend if we cannot meet this constraint.
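
As a rough sketch of how these constraints combine (the names and the exact
accounting are illustrative; the real calculation also covers metadata, the
extra pages allowance and per-module overheads):

/* Illustrative estimate of storage needed for the image, in pages. */
static unsigned long storage_needed(unsigned long pageset1_pages,
				    unsigned long pageset2_pages,
				    unsigned int expected_ratio_pct,
				    unsigned long header_pages)
{
	unsigned long data = pageset1_pages + pageset2_pages;

	/* Apply the user-configured expected compression ratio. */
	data = data * expected_ratio_pct / 100;

	return data + header_pages;
}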

c. Allocate the required memory and storage space.

Having done the initial freeze, we determine whether the above constraints
are met, and seek to allocate the metadata for the image. If the constraints
are not met, or we fail to allocate the required space for the metadata, we
seek to free the amount of memory that we calculate is needed and try again.
We allow up to four iterations of this loop before aborting the cycle. If we
do fail, it should only be because of a bug in TuxOnIce's calculations.

These steps are merged together in the prepare_image function, found in
prepare_image.c. The functions are merged because of the cyclical nature
of the problem of calculating how much memory and storage is needed. Since
the data structures containing the information about the image must
themselves take memory and use storage, the amount of memory and storage
required changes as we prepare the image. Since the changes are not large,
only one or two iterations will be required to achieve a solution.

The recursive nature of the algorithm is minimised by keeping userspace
frozen while preparing the image, and by the fact that our records of which
pages are to be saved and which pageset they are saved in use bitmaps (so
that changes in the number or fragmentation of the pages to be saved don't
feed back via changes in the amount of memory needed for metadata). The
recursiveness is thus limited to any extra slab pages allocated to store the
extents that record storage used, and the effects of seeking to free memory.
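
The retry loop can be sketched as follows; the function names are
illustrative, and the real logic lives in prepare_image.c:

/* Illustrative outline of the image preparation retry loop. */
static int prepare_image_sketch(void)
{
	int tries;

	for (tries = 0; tries < 4; tries++) {
		if (constraints_met() && allocate_metadata() == 0)
			return 0;	/* ready to write the image */

		/* Thaw kernel threads/filesystems and free memory. */
		free_some_memory(amount_still_needed());
	}
	return -ENOMEM;		/* should only indicate a calculation bug */
}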

d. Write the image.

We previously mentioned the need to create an atomic copy of the data, and
the half-of-memory limitation that is implied in this. This limitation is
circumvented by dividing the memory to be saved into two parts, called
pagesets.

Pageset2 contains most of the page cache - the pages on the active and
inactive LRU lists that aren't needed or modified while TuxOnIce is
running, so they can be safely written without an atomic copy. They are
therefore saved first and reloaded last. While saving these pages,
TuxOnIce carefully ensures that the work of writing the pages doesn't make
the image inconsistent. With the support for Kernel (Video) Mode Setting
going into the kernel at the time of writing, we need to check for pages
on the LRU that are used by KMS, and exclude them from pageset2. They are
atomically copied as part of pageset1.

Once pageset2 has been saved, we prepare to do the atomic copy of remaining
memory. As part of the preparation, we power down drivers, thereby providing
them with the opportunity to have their state recorded in the image. The
amount of memory allocated by drivers for this is usually negligible, but if
DRI is in use, video drivers may require significant amounts. Ideally we
would be able to query drivers while preparing the image as to the amount of
memory they will need. Unfortunately no such mechanism exists at the time of
writing. For this reason, TuxOnIce allows the user to set an
'extra_pages_allowance', which is used to seek to ensure sufficient memory
is available for drivers at this point. TuxOnIce also lets the user set this
value to 0. In this case, a test driver suspend is done while preparing the
image, and the difference (plus a margin) is used instead. TuxOnIce will also
automatically restart the hibernation process (twice at most) if it finds
that the extra pages allowance is not sufficient. It will then use what was
actually needed (plus a margin, again). Failure to hibernate should thus
be an extremely rare occurrence.

Having suspended the drivers, we save the CPU context before making an
atomic copy of pageset1, resuming the drivers and saving the atomic copy.
After saving the two pagesets, we just need to save our metadata before
powering down.

As we mentioned earlier, the contents of pageset2 pages aren't needed once
they've been saved. We therefore use them as the destination of our atomic
copy. In the unlikely event that pageset1 is larger, extra pages are
allocated while the image is being prepared. This is normally only a real
possibility when the system has just been booted and the page cache is
small.

This is where we need to be careful about syncing, however. Pageset2 will
probably contain filesystem metadata. If this is overwritten with pageset1
and then a sync occurs, the filesystem will be corrupted - at least until
resume time and another sync of the restored data. Since there is a
possibility that the user might not resume or (may it never be!) that
TuxOnIce might oops, we do our utmost to avoid syncing filesystems after
copying pageset1.

e. Power down.

Powering down uses standard kernel routines. TuxOnIce supports powering down
using the ACPI S3, S4 and S5 methods or the kernel's non-ACPI power-off.
Supporting suspend to ram (S3) as a power off option might sound strange,
but it allows the user to quickly get their system up and running again if
the battery doesn't run out (we just need to re-read the overwritten pages)
and if the battery does run out (or the user removes power), they can still
resume.

4. Data Structures.

TuxOnIce uses three main structures to store its metadata and configuration
information:

a) Pageflags bitmaps.

TuxOnIce records which pages will be in pageset1, pageset2, the destination
of the atomic copy and the source of the atomically restored image using
bitmaps. The code used is that written for swsusp, with small improvements
to match TuxOnIce's requirements.

The pageset1 bitmap is thus easily stored in the image header for use at
resume time.

As mentioned above, using bitmaps also means that the amount of memory and
storage required for recording the above information is constant. This
greatly simplifies the work of preparing the image. In earlier versions of
TuxOnIce, extents were used to record which pages would be stored. In that
case, however, eating memory could result in greater fragmentation of the
lists of pages, which in turn required more memory to store the extents and
more storage in the image header. These could in turn require further
freeing of memory, and another iteration. All of this complexity is removed
by having bitmaps.

Bitmaps also make a lot of sense because TuxOnIce only ever iterates
through the lists. There is therefore no cost to not being able to find the
nth page in order 0 time. We only need to worry about the cost of finding
the (n+1)th page, given the location of the nth page. Bitwise optimisations
help here.
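
For illustration, sequential iteration over such a bitmap might look like
the sketch below. It uses the kernel's generic find_first_bit()/
find_next_bit() helpers rather than TuxOnIce's own pageflags code, and the
callback is hypothetical:

#include <linux/bitops.h>

/*
 * Walk every page marked in 'bitmap' (one bit per pfn; illustrative
 * layout - TuxOnIce's real bitmaps are chunked, not one flat array).
 */
static void for_each_marked_page(unsigned long *bitmap,
				 unsigned long max_pfn)
{
	unsigned long pfn = find_first_bit(bitmap, max_pfn);

	while (pfn < max_pfn) {
		handle_page(pfn);	/* hypothetical callback */
		pfn = find_next_bit(bitmap, max_pfn, pfn + 1);
	}
}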

b) Extents for block data.

TuxOnIce supports writing the image to multiple block devices. In the case
of swap, multiple partitions and/or files may be in use, and we happily use
them all (with the exception of compcache pages, which we allocate but do
not use). This use of multiple block devices is accomplished as follows:

Whatever the actual source of the allocated storage, the destination of the
image can be viewed in terms of one or more block devices, and on each
device, a list of sectors. To simplify matters, we only use contiguous,
PAGE_SIZE aligned sectors, like the swap code does.

Since sector numbers on each bdev may well not start at 0, it makes much
more sense to use extents here. Contiguous ranges of pages can thus be
represented in the extents by contiguous values.

Variations in block size are taken account of in transforming this data
into the parameters for bio submission.

We can thus implement a layer of abstraction wherein the core of TuxOnIce
doesn't have to worry about which device we're currently writing to or
where in the device we are. It simply requests that the next page in the
pageset or header be written, leaving the details to this lower layer.
The lower layer remembers where in the sequence of devices and blocks each
pageset starts. The header always starts at the beginning of the allocated
storage.

So extents are:

struct extent {
	unsigned long minimum, maximum;
	struct extent *next;
};

These are combined into chains of extents for a device:

struct extent_chain {
	int size;	/* size of the extent, i.e. sum of (max - min + 1) */
	int allocs, frees;
	char *name;
	struct extent *first, *last_touched;
};

For each bdev, we need to store a little more info:

struct suspend_bdev_info {
	struct block_device *bdev;
	dev_t dev_t;
	int bmap_shift;
	int blocks_per_page;
};

The dev_t is used to identify the device in the stored image. As a result,
we expect devices at resume time to have the same major and minor numbers
as they had while suspending. This is primarily a concern where the user
utilises LVM for storage, as they will need to dmsetup their partitions in
such a way as to maintain this consistency at resume time.

bmap_shift and blocks_per_page apply the effects of variations in blocks
per page settings for the filesystem and underlying bdev. For most
filesystems, these are the same, but for xfs, they can have independent
values.
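
As a simplified illustration of how these two fields might be applied (this
is an assumed reading of their roles, not the exact arithmetic in the
TuxOnIce low-level I/O code):

/*
 * Illustrative only: convert a filesystem block number returned by
 * bmap() into a device sector for bio submission. This sketch assumes
 * bmap_shift is the shift from filesystem blocks to sectors, and that
 * blocks_per_page (PAGE_SIZE divided by the filesystem block size) is
 * the number of consecutive blocks occupied by one page.
 */
static sector_t block_to_sector(struct suspend_bdev_info *info,
				sector_t fs_block)
{
	return fs_block << info->bmap_shift;
}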

Combining these two structures together, we have everything we need to
record what devices and what blocks on each device are being used to
store the image, and to submit I/O using submit_bio().

The last elements in the picture are a means of recording how the storage
is being used.

We do this first and foremost by implementing a layer of abstraction on
top of the devices and extent chains which allows us to view however many
devices there might be as one long storage tape, with a single 'head' that
tracks a 'current position' on the tape:

struct extent_iterate_state {
	struct extent_chain *chains;
	int num_chains;
	int current_chain;
	struct extent *current_extent;
	unsigned long current_offset;
};

That is, *chains points to an array of size num_chains of extent chains.
For the filewriter, this is always a single chain. For the swapwriter, the
array is of size MAX_SWAPFILES.

current_chain, current_extent and current_offset thus point to the current
index in the chains array (and into a matching array of struct
suspend_bdev_info), the current extent in that chain (to optimise access),
and the current value in the offset.
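
A sketch of advancing the 'head' to the next storage page, using the
structures above, might look like this (illustrative code, not the actual
iterator in TuxOnIce):

/*
 * Illustrative: step the iterate state forward one page-sized slot,
 * moving to the next extent or the next chain as necessary.
 * Returns 0 on success, -ENOSPC when the tape is exhausted.
 */
static int tape_next_page(struct extent_iterate_state *state)
{
	if (state->current_offset < state->current_extent->maximum) {
		state->current_offset++;
		return 0;
	}

	if (state->current_extent->next) {
		state->current_extent = state->current_extent->next;
		state->current_offset = state->current_extent->minimum;
		return 0;
	}

	/* Move to the first extent of the next chain, if any. */
	while (++state->current_chain < state->num_chains) {
		struct extent_chain *chain =
			&state->chains[state->current_chain];
		if (chain->first) {
			state->current_extent = chain->first;
			state->current_offset = chain->first->minimum;
			return 0;
		}
	}
	return -ENOSPC;
}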

The image is divided into three parts:

- The header
- Pageset 1
- Pageset 2

The header always starts at the first device and first block. We know its
size before we begin to save the image because we carefully account for
everything that will be stored in it.

The second pageset (LRU) is stored first. It begins on the next page after
the end of the header.

The first pageset is stored second. Its start location is only known once
pageset2 has been saved, since pageset2 may be compressed as it is written.
This location is thus recorded at the end of saving pageset2. It is also
page aligned.

Since this information is needed at resume time, and the location of extents
in memory will differ at resume time, this needs to be stored in a portable
way:

struct extent_iterate_saved_state {
	int chain_num;
	int extent_num;
	unsigned long offset;
};

We can thus implement a layer of abstraction wherein the core of TuxOnIce
doesn't have to worry about which device we're currently writing to or
where in the device we are. It simply requests that the next page in the
pageset or header be written, leaving the details to this layer, and
invokes the routines to remember and restore the position, without having
to worry about the details of how the data is arranged on disk or such like.
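
For illustration, converting the in-memory iterate state into this portable
form could look like the following sketch (names and helper are
illustrative, not TuxOnIce's real routines):

/* Illustrative: record the current position in a portable form. */
static void tape_save_position(struct extent_iterate_state *state,
			       struct extent_iterate_saved_state *saved)
{
	struct extent *ext = state->chains[state->current_chain].first;
	int num = 0;

	/* Count how far along the chain the current extent is. */
	while (ext && ext != state->current_extent) {
		ext = ext->next;
		num++;
	}

	saved->chain_num = state->current_chain;
	saved->extent_num = num;
	saved->offset = state->current_offset;
}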

c) Modules

One aim in designing TuxOnIce was to make it flexible. We wanted to allow
for the implementation of different methods of transforming a page to be
written to disk and different methods of getting the pages stored.

In early versions (the betas and perhaps Suspend1), compression support was
inlined in the image writing code, and the data structures and code for
managing swap were intertwined with the rest of the code. A number of people
had expressed interest in implementing image encryption, and alternative
methods of storing the image.

In order to achieve this, TuxOnIce was given a modular design.

A module is a single file which encapsulates the functionality needed
to transform a pageset of data (encryption or compression, for example),
or to write the pageset to a device. The former type of module is called
a 'page-transformer', the latter a 'writer'.

Modules are linked together in pipeline fashion. There may be zero or more
page transformers in a pipeline, and there is always exactly one writer.
The pipeline follows this pattern:

	---------------------------------
	|          TuxOnIce Core        |
	---------------------------------
	                |
	                |
	---------------------------------
	|       Page transformer 1      |
	---------------------------------
	                |
	                |
	---------------------------------
	|       Page transformer 2      |
	---------------------------------
	                |
	                |
	---------------------------------
	|            Writer             |
	---------------------------------

During the writing of an image, the core code feeds pages one at a time
to the first module. This module performs whatever transformations it
implements on the incoming data, completely consuming the incoming data and
feeding output in a similar manner to the next module.

All routines are SMP safe, and the final result of the transformations is
written with an index (provided by the core) and size of the output by the
writer. As a result, we can have multithreaded I/O without needing to
worry about the sequence in which pages are written (or read).

During reading, the pipeline works in the reverse direction. The core code
calls the first module with the address of a buffer which should be filled.
(Note that the buffer size is always PAGE_SIZE at this time.) This module
will in turn request data from the next module and so on down until the
writer is made to read from the stored image.

Part of the definition of the structure of a module thus looks like this:

int (*rw_init) (int rw, int stream_number);
int (*rw_cleanup) (int rw);
int (*write_chunk) (struct page *buffer_page);
int (*read_chunk) (struct page *buffer_page, int sync);

It should be noted that the _cleanup routine may be called before the
full stream of data has been read or written. While writing the image,
the user may (depending upon settings) choose to abort suspending, and
if we are in the midst of writing the last portion of the image, a portion
of the second pageset may be reread. This may also happen if an error
occurs and we seek to abort the process of writing the image.
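
To illustrate the shape of these hooks, a do-nothing page transformer might
look like the sketch below. It simply hands each chunk on to the next module
in the pipeline; the next_module_* helpers are hypothetical stand-ins for
however the real pipeline chains modules together:

/* Illustrative pass-through page transformer. */
static int passthrough_rw_init(int rw, int stream_number)
{
	return 0;	/* nothing to set up */
}

static int passthrough_rw_cleanup(int rw)
{
	return 0;	/* may be called before the stream completes */
}

static int passthrough_write_chunk(struct page *buffer_page)
{
	/* Hand the untransformed page to the next module. */
	return next_module_write_chunk(buffer_page);
}

static int passthrough_read_chunk(struct page *buffer_page, int sync)
{
	/* Ask the next module (ultimately the writer) to fill the buffer. */
	return next_module_read_chunk(buffer_page, sync);
}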

The modular design is also useful in a number of other ways. It provides
a means whereby we can add support for:

- providing overall initialisation and cleanup routines;
- serialising configuration information in the image header;
- providing debugging information to the user;
- determining memory and image storage requirements;
- dis/enabling components at run-time;
- configuring the module (see below);

...and routines for writers specific to their work:

- Parsing a resume= location;
- Determining whether an image exists;
- Marking a resume as having been attempted;
- Invalidating an image.

Since some parts of the core - the user interface and storage manager
support - have use for some of these functions, they are registered as
'miscellaneous' modules as well.

d) Sysfs data structures.

This brings us naturally to support for configuring TuxOnIce. We desired to
provide a way to make TuxOnIce as flexible and configurable as possible.
The user shouldn't have to reboot just because they now want to hibernate to
a file instead of a partition, for example.

To accomplish this, TuxOnIce implements a very generic means whereby the
core and modules can register new sysfs entries. All TuxOnIce entries use
a single _store and _show routine, both of which are found in
tuxonice_sysfs.c in the kernel/power directory. These routines handle the
most common operations - getting and setting the values of bits, integers,
longs, unsigned longs and strings in one place, and allow overrides for
customised get and set options as well as side-effect routines for all
reads and writes.

When combined with some simple macros, a new sysfs entry can then be defined
in just a couple of lines:

SYSFS_INT("progress_granularity", SYSFS_RW, &progress_granularity, 1,
	  2048, 0, NULL),

This defines a sysfs entry named "progress_granularity" which is rw and
allows the user to access an integer stored at &progress_granularity, giving
it a value between 1 and 2048 inclusive.
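
For example, a module might collect several such entries into a table and
register them at init time. Only the SYSFS_INT lines below follow the
pattern shown above; the surrounding structure name, the second entry and
its variable are assumptions for illustration, not taken from the source:

/* Illustrative only: grouping sysfs entries for one module. */
static struct example_sysfs_data example_entries[] = {
	SYSFS_INT("progress_granularity", SYSFS_RW,
		  &progress_granularity, 1, 2048, 0, NULL),
	SYSFS_INT("expected_compression", SYSFS_RW,
		  &expected_compression, 0, 99, 0, NULL),	/* assumed */
};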

Sysfs entries are registered under /sys/power/tuxonice, and entries for
modules are located in a subdirectory named after the module.