This is tar.info, produced by Makeinfo version 3.12f from tar.texi. START-INFO-DIR-ENTRY * tar: (tar). Making tape (or disk) archives. END-INFO-DIR-ENTRY This file documents GNU `tar', a utility used to store, backup, and transport files. Copyright (C) 1992, 1994, 1995, 1996, 1997, 1999 Free Software Foundation, Inc. Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies. Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one. Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the Foundation. This file documents GNU `tar', which is a utility used to store, backup, and transport files. `tar' is a tape (or disk) archiver. This manual documents the release 1.13. File: tar.info, Node: Date input formats, Next: Formats, Prev: Choosing, Up: Top Date input formats ****************** Our units of temporal measurement, from seconds on up to months, are so complicated, asymmetrical and disjunctive so as to make coherent mental reckoning in time all but impossible. Indeed, had some tyrannical god contrived to enslave our minds to time, to make it all but impossible for us to escape subjection to sodden routines and unpleasant surprises, he could hardly have done better than handing down our present system. It is like a set of trapezoidal building blocks, with no vertical or horizontal surfaces, like a language in which the simplest thought demands ornate constructions, useless particles and lengthy circumlocutions. 
Unlike the more successful patterns of language and science, which enable us to face experience boldly or at least level-headedly, our system of temporal calculation silently and persistently encourages our terror of time. ... It is as though architects had to measure length in feet, width in meters and height in ells; as though basic instruction manuals demanded a knowledge of five different languages. It is no wonder then that we often look into our own immediate past or future, last Tuesday or a week from Sunday, with feelings of helpless confusion. ... -- Robert Grudin, `Time and the Art of Living'. This section describes the textual date representations that GNU programs accept. These are the strings you, as a user, can supply as arguments to the various programs. The C interface (via the `getdate' function) is not described here. Although the date syntax here can represent any possible time since zero A.D., computer integers are not big enough for such a (comparatively) long time. The earliest date semantically allowed on Unix systems is midnight, 1 January 1970 UCT. * Menu: * General date syntax:: Common rules. * Calendar date item:: 19 Dec 1994. * Time of day item:: 9:20pm. * Timezone item:: EST, DST, BST, UCT, AHST, ... * Day of week item:: Monday and others. * Relative item in date strings:: next tuesday, 2 years ago. * Pure numbers in date strings:: 19931219, 1440. * Authors of getdate:: Bellovin, Salz, Berets, et al. File: tar.info, Node: General date syntax, Next: Calendar date item, Prev: Date input formats, Up: Date input formats General date syntax =================== A "date" is a string, possibly empty, containing many items separated by whitespace. The whitespace may be omitted when no ambiguity arises. The empty string means the beginning of today (i.e., midnight). Order of the items is immaterial. 
A date string may contain many flavors of items:

   * calendar date items
   * time of the day items
   * time zone items
   * day of the week items
   * relative items
   * pure numbers.

We describe each of these item types in turn, below.

A few numbers may be written out in words in most contexts. This is most useful for specifying day of the week items or relative items (see below). Here is the list: `first' for 1, `next' for 2, `third' for 3, `fourth' for 4, `fifth' for 5, `sixth' for 6, `seventh' for 7, `eighth' for 8, `ninth' for 9, `tenth' for 10, `eleventh' for 11 and `twelfth' for 12. Also, `last' means exactly -1.

When a month is written this way, it is still considered to be written numerically, instead of being "spelled in full"; this changes the allowed strings.

Alphabetic case is completely ignored in dates. Comments may be introduced between round parentheses, as long as included parentheses are properly nested. Hyphens not followed by a digit are currently ignored. Leading zeros on numbers are ignored.

File: tar.info, Node: Calendar date item, Next: Time of day item, Prev: General date syntax, Up: Date input formats

Calendar date item
==================

A "calendar date item" specifies a day of the year. It is specified differently, depending on whether the month is specified numerically or literally. All these strings specify the same calendar date:

     1972-09-24     # ISO 8601.
     72-9-24        # This century assumed by default.
     72-09-24       # Leading zeros are ignored.
     9/24/72        # Common U.S. writing.
     24 September 1972
     24 Sept 72     # September has a special abbreviation.
     24 Sep 72      # Three-letter abbreviations always allowed.
     Sep 24, 1972
     24-sep-72
     24sep72

The year can also be omitted. In this case, the last specified year is used, or the current year if none. For example:

     9/17
     sep 17

Here are the rules.

For numeric months, the ISO 8601 format `YEAR-MONTH-DAY' is allowed, where YEAR is any positive number, MONTH is a number between 01 and 12, and DAY is a number between 01 and 31.
A leading zero must be present if a number is less than ten. If YEAR is less than 100, then 1900 is added to it to force a date in this century. The construct `MONTH/DAY/YEAR', popular in the United States, is accepted. Also `MONTH/DAY', omitting the year. Literal months may be spelled out in full: `January', `February', `March', `April', `May', `June', `July', `August', `September', `October', `November' or `December'. Literal months may be abbreviated to their first three letters, possibly followed by an abbreviating dot. It is also permitted to write `Sept' instead of `September'. When months are written literally, the calendar date may be given as any of the following: DAY MONTH YEAR DAY MONTH MONTH DAY YEAR DAY-MONTH-YEAR Or, omitting the year: MONTH DAY File: tar.info, Node: Time of day item, Next: Timezone item, Prev: Calendar date item, Up: Date input formats Time of day item ================ A "time of day item" in date strings specifies the time on a given day. Here are some examples, all of which represent the same time: 20:02:0 20:02 8:02pm 20:02-0500 # In EST (Eastern U.S. Standard Time). More generally, the time of the day may be given as `HOUR:MINUTE:SECOND', where HOUR is a number between 0 and 23, MINUTE is a number between 0 and 59, and SECOND is a number between 0 and 59. Alternatively, `:SECOND' can be omitted, in which case it is taken to be zero. If the time is followed by `am' or `pm' (or `a.m.' or `p.m.'), HOUR is restricted to run from 1 to 12, and `:MINUTE' may be omitted (taken to be zero). `am' indicates the first half of the day, `pm' indicates the second half of the day. In this notation, 12 is the predecessor of 1: midnight is `12am' while noon is `12pm'. (This is the zero-oriented interpretation of `12am' and `12pm', as opposed to the old tradition derived from Latin which uses `12m' for noon and `12pm' for midnight.) 
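The `am'/`pm' rules above can be sketched in a few lines of Python. This is only an illustration of the rules as stated in this section, not the `getdate' implementation; the function name `parse_time_of_day' is hypothetical, and timezone corrections such as `-0500' are deliberately not handled.

```python
import re

# Hypothetical helper (not part of getdate): HOUR[:MINUTE[:SECOND]] with an
# optional am/pm suffix, as described in the text above.
_TIME_RE = re.compile(
    r"^(\d{1,2})(?::(\d{1,2}))?(?::(\d{1,2}))?\s*(am|pm|a\.m\.|p\.m\.)?$",
    re.IGNORECASE,
)


def parse_time_of_day(text):
    """Return (hour, minute, second) on a 24-hour clock."""
    match = _TIME_RE.match(text.strip())
    if match is None:
        raise ValueError("not a time of day item: %r" % text)
    hour = int(match.group(1))
    minute = int(match.group(2) or 0)   # omitted :MINUTE is taken as zero
    second = int(match.group(3) or 0)   # omitted :SECOND is taken as zero
    meridian = match.group(4)
    if meridian is None:
        # 24-hour form: HOUR runs from 0 to 23 and :MINUTE is required.
        if match.group(2) is None or hour > 23:
            raise ValueError("bad 24-hour time: %r" % text)
    else:
        # am/pm form: HOUR runs from 1 to 12, and 12 is the predecessor of 1.
        if not 1 <= hour <= 12:
            raise ValueError("bad 12-hour time: %r" % text)
        hour %= 12                      # 12am -> 0, 12pm -> 0 (then + 12)
        if meridian.lower().startswith("p"):
            hour += 12
    if minute > 59 or second > 59:
        raise ValueError("minute or second out of range: %r" % text)
    return hour, minute, second
```

With this sketch, `20:02:0', `20:02' and `8:02pm' all yield the same triple, midnight is `12am' and noon is `12pm'.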
The time may alternatively be followed by a timezone correction, expressed as `SHHMM', where S is `+' or `-', HH is a number of zone hours and MM is a number of zone minutes. When a timezone correction is given this way, it forces interpretation of the time in UTC, overriding any previous specification for the timezone or the local timezone. The MINUTE part of the time of the day may not be elided when a timezone correction is used. This is the only way to specify a timezone correction by fractional parts of an hour. Either `am'/`pm' or a timezone correction may be specified, but not both. File: tar.info, Node: Timezone item, Next: Day of week item, Prev: Time of day item, Up: Date input formats Timezone item ============= A "timezone item" specifies an international timezone, indicated by a small set of letters. Any included period is ignored. Military timezone designations use a single letter. Currently, only integral zone hours may be represented in a timezone item. See the previous section for a finer control over the timezone correction. Here are many non-daylight-savings-time timezones, indexed by the zone hour value. +000 `GMT' for Greenwich Mean, `UT' or `UTC' for Universal (Coordinated), `WET' for Western European and `Z' for militaries. +100 `WAT' for West Africa and `A' for militaries. +200 `AT' for Azores and `B' for militaries. +300 `C' for militaries. +400 `AST' for Atlantic Standard and `D' for militaries. +500 `E' for militaries and `EST' for Eastern Standard. +600 `CST' for Central Standard and `F' for militaries. +700 `G' for militaries and `MST' for Mountain Standard. +800 `H' for militaries and `PST' for Pacific Standard. +900 `I' for militaries and `YST' for Yukon Standard. +1000 `AHST' for Alaska-Hawaii Standard, `CAT' for Central Alaska, `HST' for Hawaii Standard and `K' for militaries. +1100 `L' for militaries and `NT' for Nome. +1200 `IDLW' for International Date Line West and `M' for militaries. 
-100 `CET' for Central European, `FWT' for French Winter, `MET' for Middle European, `MEWT' for Middle European Winter, `N' for militaries and `SWT' for Swedish Winter. -200 `EET' for Eastern European, USSR Zone 1 and `O' for militaries. -300 `BT' for Baghdad, USSR Zone 2 and `P' for militaries. -400 `Q' for militaries and `ZP4' for USSR Zone 3. -500 `R' for militaries and `ZP5' for USSR Zone 4. -600 `S' for militaries and `ZP6' for USSR Zone 5. -700 `T' for militaries and `WAST' for West Australian Standard. -800 `CCT' for China Coast, USSR Zone 7 and `U' for militaries. -900 `JST' for Japan Standard, USSR Zone 8 and `V' for militaries. -1000 `EAST' for East Australian Standard, `GST' for Guam Standard, USSR Zone 9 and `W' for militaries. -1100 `X' for militaries. -1200 `IDLE' for International Date Line East, `NZST' for New Zealand Standard, `NZT' for New Zealand and `Y' for militaries. Here are many DST timezones, indexed by the zone hour value. Also, by following a non-DST timezone by the string `DST' in a separate word (that is, separated by some whitespace), the corresponding DST timezone may be specified. 0 `BST' for British Summer. +400 `ADT' for Atlantic Daylight. +500 `EDT' for Eastern Daylight. +600 `CDT' for Central Daylight. +700 `MDT' for Mountain Daylight. +800 `PDT' for Pacific Daylight. +900 `YDT' for Yukon Daylight. +1000 `HDT' for Hawaii Daylight. -100 `MEST' for Middle European Summer, `MESZ' for Middle European Summer, `SST' for Swedish Summer and `FST' for French Summer. -700 `WADT' for West Australian Daylight. -1000 `EADT' for Eastern Australian Daylight. -1200 `NZDT' for New Zealand Daylight. File: tar.info, Node: Day of week item, Next: Relative item in date strings, Prev: Timezone item, Up: Date input formats Day of week item ================ The explicit mention of a day of the week will forward the date (only if necessary) to reach that day of the week in the future. 
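The "forward only if necessary" rule can be illustrated with a short Python sketch. It is an assumption-laden illustration, not the `getdate' code: the name `forward_to_weekday' is hypothetical, and only full English day names are recognized.

```python
import datetime

# Day names in the order used by datetime.date.weekday() (Monday is 0).
DAY_NAMES = ["monday", "tuesday", "wednesday", "thursday",
             "friday", "saturday", "sunday"]


def forward_to_weekday(start, day_name):
    """Move START forward, only if necessary, to reach DAY_NAME."""
    target = DAY_NAMES.index(day_name.lower())
    # Zero when START already falls on the requested day, so no move occurs.
    days_ahead = (target - start.weekday()) % 7
    return start + datetime.timedelta(days=days_ahead)


monday = datetime.date(1994, 12, 19)            # 19 Dec 1994 was a Monday
same_day = forward_to_weekday(monday, "Monday")
next_sunday = forward_to_weekday(monday, "Sunday")
```

Here `same_day' is still 19 Dec 1994, while `next_sunday' is 25 Dec 1994: the date never moves backward.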
Days of the week may be spelled out in full: `Sunday', `Monday', `Tuesday', `Wednesday', `Thursday', `Friday' or `Saturday'. Days may be abbreviated to their first three letters, optionally followed by a period. The special abbreviations `Tues' for `Tuesday', `Wednes' for `Wednesday' and `Thur' or `Thurs' for `Thursday' are also allowed.

A number may precede a day of the week item to move forward supplementary weeks. It is best used in expressions like `third monday'. In this context, `last DAY' or `next DAY' is also acceptable; they move one week before or after the day that DAY by itself would represent.

A comma following a day of the week item is ignored.

File: tar.info, Node: Relative item in date strings, Next: Pure numbers in date strings, Prev: Day of week item, Up: Date input formats

Relative item in date strings
=============================

"Relative items" adjust a date (or the current date if none) forward or backward. The effects of relative items accumulate. Here are some examples:

     1 year
     1 year ago
     3 years
     2 days

The unit of time displacement may be selected by the string `year' or `month' for moving by whole years or months. These are fuzzy units, as years and months are not all of equal duration. More precise units are `fortnight' which is worth 14 days, `week' worth 7 days, `day' worth 24 hours, `hour' worth 60 minutes, `minute' or `min' worth 60 seconds, and `second' or `sec' worth one second. An `s' suffix on these units is accepted and ignored.

The unit of time may be preceded by a multiplier, given as an optionally signed number. Unsigned numbers are taken as positively signed. No number at all implies 1 for a multiplier. Following a relative item by the string `ago' is equivalent to preceding the unit by a multiplier with value -1.

The string `tomorrow' is worth one day in the future (equivalent to `day'), the string `yesterday' is worth one day in the past (equivalent to `day ago').
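As an illustration of how relative items accumulate, here is a minimal Python sketch limited to the fixed-length units listed above. The fuzzy units `year' and `month' are left out on purpose, since their length depends on the calendar; the function name `relative_offset' and the table `UNIT_SECONDS' are hypothetical and do not come from `getdate'.

```python
import datetime
import re

# Length in seconds of each precise unit named in the text.
UNIT_SECONDS = {
    "fortnight": 14 * 86400, "week": 7 * 86400, "day": 86400,
    "hour": 3600, "minute": 60, "min": 60, "second": 1, "sec": 1,
}


def relative_offset(text):
    """Sum the relative items in TEXT into a single timedelta."""
    items = []        # one signed number of seconds per relative item
    multiplier = 1    # no number at all implies 1
    for word in text.lower().split():
        if re.fullmatch(r"[+-]?\d+", word):
            multiplier = int(word)
        elif word == "ago":
            if not items:
                raise ValueError("`ago' must follow a relative item")
            items[-1] = -items[-1]      # same as a multiplier of -1
        elif word == "tomorrow":
            items.append(86400)
        elif word == "yesterday":
            items.append(-86400)
        else:
            # An `s' suffix on a unit is accepted and ignored.
            if word.endswith("s") and word[:-1] in UNIT_SECONDS:
                word = word[:-1]
            if word not in UNIT_SECONDS:
                raise ValueError("unsupported unit: %r" % word)
            items.append(multiplier * UNIT_SECONDS[word])
            multiplier = 1
    return datetime.timedelta(seconds=sum(items))
```

For example, `relative_offset("3 days 2 hours")' adds both displacements, and `relative_offset("1 week ago")' is seven days in the past.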
The strings `now' or `today' are relative items corresponding to zero-valued time displacement; these strings come from the fact that a zero-valued time displacement represents the current time when not otherwise changed by previous items. They may be used to stress other items, as in `12:00 today'. The string `this' also has the meaning of a zero-valued time displacement, but is preferred in date strings like `this thursday'.

When a relative item causes the resulting date to cross the boundary between DST and non-DST (or vice-versa), the hour is adjusted according to the local time.

File: tar.info, Node: Pure numbers in date strings, Next: Authors of getdate, Prev: Relative item in date strings, Up: Date input formats

Pure numbers in date strings
============================

The precise interpretation of a pure decimal number depends on the context in the date string.

If the decimal number is of the form YYYYMMDD and no other calendar date item (*note Calendar date item::.) appears before it in the date string, then YYYY is read as the year, MM as the month number and DD as the day of the month, for the specified calendar date.

If the decimal number is of the form HHMM and no other time of day item appears before it in the date string, then HH is read as the hour of the day and MM as the minute of the hour, for the specified time of the day. MM can also be omitted.

If both a calendar date and a time of day appear to the left of a number in the date string, but no relative item, then the number overrides the year.

File: tar.info, Node: Authors of getdate, Prev: Pure numbers in date strings, Up: Date input formats

Authors of `getdate'
====================

`getdate' was originally implemented by Steven M. Bellovin (`smb@research.att.com') while at the University of North Carolina at Chapel Hill. The code was later tweaked by a couple of people on Usenet, then completely overhauled by Rich $alz (`rsalz@bbn.com') and Jim Berets (`jberets@bbn.com') in August, 1990.
Various revisions for the GNU system were made by David MacKenzie, Jim Meyering, and others.

This chapter was originally produced by François Pinard (`pinard@iro.umontreal.ca') from the `getdate.y' source code, and then edited by K. Berry (`kb@cs.umb.edu').

File: tar.info, Node: Formats, Next: Media, Prev: Date input formats, Up: Top

Controlling the Archive Format
******************************

* Menu:

* Portability::                 Making `tar' Archives More Portable
* Compression::                 Using Less Space through Compression
* Attributes::                  Handling File Attributes
* Standard::                    The Standard Format
* Extensions::                  GNU Extensions to the Archive Format
* cpio::                        Comparison of `tar' and `cpio'

File: tar.info, Node: Portability, Next: Compression, Prev: Formats, Up: Formats

Making `tar' Archives More Portable
===================================

Creating a `tar' archive on a particular system that is meant to be useful later on many other machines and with other versions of `tar' is more challenging than you might think. `tar' archive formats have been evolving since the first versions of Unix. Many such formats are around, and are not always compatible with each other. This section discusses a few problems, and gives some advice about making `tar' archives more portable.

One golden rule is simplicity. For example, limit your `tar' archives to contain only regular files and directories, avoiding other kinds of special files. Do not attempt to save sparse files or contiguous files as such. Let's discuss a few more problems, in turn.

* Menu:

* Portable Names::              Portable Names
* dereference::                 Symbolic Links
* old::                         Old V7 Archives
* posix::                       POSIX archives
* Checksumming::                Checksumming Problems

File: tar.info, Node: Portable Names, Next: dereference, Prev: Portability, Up: Portability

Portable Names
--------------

Use _straight_ file and directory names, made up of printable ASCII characters, avoiding colons, slashes, backslashes, spaces, and other _dangerous_ characters.
Avoid deep directory nesting. To account for oldish System V machines, limit your file and directory names to 14 characters or fewer.

If you intend your `tar' archives to be read under MSDOS, you should not rely on case distinction for file names, and you might use the GNU `doschk' program to help you further diagnose illegal MSDOS names, which are even more limited than System V's.

File: tar.info, Node: dereference, Next: old, Prev: Portable Names, Up: Portability

Symbolic Links
--------------

Normally, when `tar' archives a symbolic link, it writes a block to the archive naming the target of the link. In that way, the `tar' archive is a faithful record of the filesystem contents. `--dereference' (`-h') is used with `--create' (`-c'), and causes `tar' to archive the files symbolic links point to, instead of the links themselves. When this option is used and `tar' encounters a symbolic link, it archives the linked-to file, instead of simply recording the presence of a symbolic link.

The name under which the file is stored in the file system is not recorded in the archive. To record both the symbolic link name and the file name in the system, archive the file under both names. If all links were recorded automatically by `tar', an extracted file might be linked to a file name that no longer exists in the file system.

If a linked-to file is encountered again by `tar' while creating the same archive, an entire second copy of it will be stored. (This _might_ be considered a bug.)

So, for portable archives, do not archive symbolic links as such, and use `--dereference' (`-h'): many systems do not support symbolic links, and moreover, your distribution might be unusable if it contains unresolved symbolic links.

File: tar.info, Node: old, Next: posix, Prev: dereference, Up: Portability

Old V7 Archives
---------------

Certain old versions of `tar' cannot handle additional information recorded by newer `tar' programs.
To create an archive in V7 format (not ANSI), which can be read by these old versions, specify the `--old-archive' (`-o') option in conjunction with `--create' (`-c'). `tar' also accepts `--portability' for this option. When you specify it, `tar' leaves out information about directories, pipes, fifos, contiguous files, and device files, and specifies file ownership by group and user IDs instead of group and user names.

When updating an archive, do not use `--old-archive' (`-o') unless the archive was created using this option.

In most cases, a _new_ format archive can be read by an _old_ `tar' program without serious trouble, so this option should seldom be needed. On the other hand, most modern `tar's are able to read old format archives, so it might be safer for you to always use `--old-archive' (`-o') for your distributions.

File: tar.info, Node: posix, Next: Checksumming, Prev: old, Up: Portability

GNU `tar' and POSIX `tar'
-------------------------

GNU `tar' was based on an early draft of the POSIX 1003.1 `ustar' standard. GNU extensions to `tar', such as the support for file names longer than 100 characters, use portions of the `tar' header record which were specified in that POSIX draft as unused. Subsequent changes in POSIX have allocated the same parts of the header record for other purposes. As a result, GNU `tar' is incompatible with the current POSIX spec, and with `tar' programs that follow it.

We plan to reimplement these GNU extensions in a new way which is upward compatible with the latest POSIX `tar' format, but we don't know when this will be done.

In the meantime, there is simply no telling what might happen if you read a GNU `tar' archive, which uses the GNU extensions, using some other `tar' program. So if you want to read the archive with another `tar' program, be sure to write it using the `--old-archive' option (`-o').

Traditionally, old `tar's have a limit of 100 characters for file names.
GNU `tar' attempted two different approaches to overcome this limit, using and extending a format specified by a draft of some P1003.1. The first way was not that successful, and involved `@MaNgLeD@' file names, or such; while a second approach used `././@LongLink' and other tricks, yielding better success. In theory, GNU `tar' should be able to handle file names of practically unlimited length. So, if GNU `tar' fails to dump and retrieve files having names longer than 100 characters, then there is indeed a bug in GNU `tar'.

But, being strictly POSIX, the limit was still 100 characters. For various other purposes, GNU `tar' used areas left unassigned in the POSIX draft. POSIX later revised the P1003.1 `ustar' format by assigning previously unused header fields, in such a way that the upper limit for file name length was raised to 256 characters. However, the actual POSIX limit oscillates between 100 and 256, depending on the precise location of slashes in the full file name (this is rather ugly). Since GNU `tar' uses the same fields for quite other purposes, it became incompatible with the latest POSIX standards.

For longer or non-fitting file names, we plan to use yet another set of GNU extensions, but this time, complying with the provisions POSIX offers for extending the format, rather than conflicting with it. Whenever an archive uses the old GNU `tar' extension format or POSIX extensions, whether for very long file names or other specialities, this archive becomes non-portable to other `tar' implementations. In fact, anything can happen. The most forgiving `tar's will merely unpack the file using a wrong name, and maybe create another file named something like `@LongName', with the true file name in it. `tar's not protecting themselves may crash with a segmentation violation!

Compatibility concerns make all this more difficult, as we will have to support _all_ these things together, for a while.
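To see why the POSIX limit "oscillates" with the location of slashes, consider the following Python sketch. It assumes the `ustar' header layout as commonly documented (a 100-byte name field and a 155-byte prefix field, joined by an implied `/'); those widths are not stated in this section, and the function name `split_ustar_name' is hypothetical. Names are assumed to be plain ASCII so that characters and bytes coincide.

```python
NAME_SIZE = 100     # assumed width of the ustar `name' field
PREFIX_SIZE = 155   # assumed width of the ustar `prefix' field


def split_ustar_name(path):
    """Return (prefix, name) if PATH fits a ustar header, else None."""
    if len(path) <= NAME_SIZE:
        return "", path                 # fits in the name field alone
    # Otherwise the name must be cut at a slash, which is not itself stored.
    for index, char in enumerate(path):
        if char != "/":
            continue
        prefix, name = path[:index], path[index + 1:]
        if 0 < len(prefix) <= PREFIX_SIZE and 0 < len(name) <= NAME_SIZE:
            return prefix, name
    return None                         # no slash in a usable position


fits = split_ustar_name("a" * 155 + "/" + "b" * 100)       # 256 characters
too_long = split_ustar_name("a" * 156 + "/" + "b" * 99)    # also 256
no_slash = split_ustar_name("c" * 101)                     # only 101
```

Under these assumptions `fits' is a valid split, while `too_long' and `no_slash' are both `None': a 256-character name is representable only when a slash falls in exactly the right place, and a 101-character name without a slash is not representable at all.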
GNU `tar' should be able to produce and read true POSIX format files, while being able to detect old GNU `tar' formats, besides old V7 format, and process them conveniently. It would take years before this whole area stabilizes...

There are plans to raise this 100-character limit to 256, and yet produce POSIX conformant archives. Past 256, I do not know yet if GNU `tar' will go non-POSIX again, or merely refuse to archive the file.

There are plans for GNU `tar' to support the latest POSIX format more fully, while being able to read old V7 format, GNU (semi-POSIX plus extension), as well as full POSIX. One may ask if there is part of the POSIX format that we still cannot support. This simple question has a complex answer. Maybe, on closer inspection, some strong limitations will pop up, but until now, nothing sounds too difficult (but see below). I only have these few pages of POSIX telling about `Extended tar Format' (P1003.1-1990 - section 10.1.1), and there are references to other parts of the standard I do not have, which should normally enforce limitations on stored file names (I suspect things like fixing what `/' and `<NUL>' mean). There are also some points which the standard does not make clear; existing practice will then drive what I should do.

POSIX mandates that, when a file name cannot fit within 100 to 256 characters (the variance comes from the fact a `/' is ideally needed as the 156th character), or a link name cannot fit within 100 characters, a warning should be issued and the file _not_ be stored. Unless some `--posix' option is given (or `POSIXLY_CORRECT' is set), I suspect that GNU `tar' should disobey this specification, and automatically switch to using GNU extensions to overcome file name or link name length limitations.

There is a problem, however, which I have not yet studied closely.
Given a truly POSIX archive with names having more than 100 characters, I guess that GNU `tar' up to 1.11.8 will process it as if it were an old V7 archive, and be fooled by some fields which are coded differently. So, the question is to decide if the next generation of GNU `tar' should produce POSIX format by default, whenever possible, producing archives older versions of GNU `tar' might not be able to read correctly. I fear that we will have to suffer such a choice one of these days, if we want GNU `tar' to go closer to POSIX. We can rush it. Another possibility is to produce the current GNU `tar' format by default for a few years, but have GNU `tar' versions from some 1.POSIX and up able to recognize all three formats, and let older GNU `tar' fade out slowly. Then, we could switch to producing POSIX format by default, with not much harm to those still having (very old at that time) GNU `tar' versions prior to 1.POSIX.

POSIX format cannot represent very long names, volume headers, splitting of files in multi-volumes, sparse files, and incremental dumps; these would all be disallowed if `--posix' is given or `POSIXLY_CORRECT' is set. Otherwise, if `tar' is given long names, or `-[VMSgG]', then it should automatically go non-POSIX. I think this is easily granted without much discussion.

Another point is that only `mtime' is stored in POSIX archives, while GNU `tar' currently also stores `atime' and `ctime'. If we want GNU `tar' to go closer to POSIX, my choice would be to drop `atime' and `ctime' support on average. On the other hand, I perceive that full dumps or incremental dumps need `atime' and `ctime' support, so for those special applications, POSIX has to be avoided altogether.

A few users requested that `--sparse' (`-S') be always active by default. I think that before replying to them, we have to decide if we want GNU `tar' to go closer to POSIX on average, while producing files. My choice would be to go closer to POSIX in the long run.
Besides possible double reading, I do not see any reason not to try saving files as sparse when creating archives which are neither POSIX nor old-V7, so the actual `--sparse' (`-S') would become selected by default when producing such archives, whatever the reason is. So, `--sparse' (`-S') alone might be redefined to force GNU-format archives, and recover its previous meaning from this fact.

GNU-format as it exists now can easily fool other POSIX `tar's, as it uses fields which POSIX considers to be part of the file name prefix. I wonder if it would not be a good idea, in the long run, to try changing GNU-format so that any added field (like `ctime', `atime', file offset in subsequent volumes, or sparse file descriptions) is wholly and always pushed into an extension block, instead of using space in the POSIX header block. I could manage to do that portably between future GNU `tar's. So other POSIX `tar's might at least be able to provide more or less correct listings for the archives produced by GNU `tar', if not able to process them otherwise.

Using these projected extensions might induce older `tar's to fail. We would use the same approach as for POSIX. I'll put out a `tar' capable of reading POSIXier, yet extended archives, but will not produce this format by default, in GNU mode. In a few years, when newer GNU `tar's have flooded out `tar' 1.11.X and previous, we could switch to producing POSIXier extended archives, with no real harm to users, as almost all existing GNU `tar's will be ready to read POSIXier format. In fact, I'll do both changes at the same time, in a few years, and just prepare `tar' for both changes, without effecting them, from 1.POSIX. (Both changes: 1--using POSIX convention for getting over 100 characters; 2--avoiding mangling POSIX headers for GNU extensions, using only POSIX mandated extension techniques).
So, a future `tar' will have a `--posix' flag forcing the usage of truly POSIX headers, and so, producing archives previous GNU `tar' will not be able to read. So, _once_ a pretest announces that feature, it would be particularly useful for users to test how exchangeable archives are between GNU `tar' with `--posix' and other POSIX `tar's.

In a few years, when GNU `tar' produces POSIX headers by default, `--posix' will have a strong meaning and will disallow GNU extensions. But in the meantime, for a long while, `--posix' in GNU `tar' will not disallow GNU extensions like `--label=ARCHIVE-LABEL' (`-V ARCHIVE-LABEL'), `--multi-volume' (`-M'), `--sparse' (`-S'), or very long file or link names. However, `--posix' with GNU extensions will use POSIX headers with reserved-for-users extensions to headers, and I will be curious to know how well or badly POSIX `tar's will react to these.

GNU `tar' prior to 1.POSIX, and after 1.POSIX without `--posix', generates and checks `ustar  ', with two suffixed spaces. This is sufficient for older GNU `tar' not to recognize POSIX archives, and consequently, wrongly decide those archives are in old V7 format. It is a useful bug for me, because GNU `tar' has other POSIX incompatibilities, and I need to segregate GNU `tar' semi-POSIX archives from truly POSIX archives, for GNU `tar' should be somewhat compatible with itself, while migrating closer to latest POSIX standards. So, I'll be very careful about how and when I will do the correction.

File: tar.info, Node: Checksumming, Prev: posix, Up: Portability

Checksumming Problems
---------------------

SunOS and HP-UX `tar' fail to accept archives created using GNU `tar' and containing non-ASCII file names, that is, file names having characters with the eighth bit set, because they use signed checksums, while GNU `tar' uses unsigned checksums while creating archives, as per POSIX standards. On reading, GNU `tar' computes both checksums and accepts either.
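The difference between the two checksums can be shown with a small Python sketch. It assumes the conventional `tar' header layout, which this section does not spell out: a 512-byte header block whose 8-byte checksum field starts at offset 148 and is counted as spaces while summing. The function name `header_checksums' is hypothetical.

```python
BLOCK_SIZE = 512      # assumed size of a tar header block
CHKSUM_OFFSET = 148   # assumed offset of the checksum field
CHKSUM_SIZE = 8       # assumed width of the checksum field


def header_checksums(block):
    """Return (unsigned_sum, signed_sum) for one header block."""
    if len(block) != BLOCK_SIZE:
        raise ValueError("a header block must be %d bytes" % BLOCK_SIZE)
    data = bytearray(block)
    # The checksum field itself is treated as if it were filled with spaces.
    data[CHKSUM_OFFSET:CHKSUM_OFFSET + CHKSUM_SIZE] = b" " * CHKSUM_SIZE
    unsigned_sum = sum(data)
    # A compiler with signed `char' sees bytes 128..255 as -128..-1.
    signed_sum = sum(b - 256 if b >= 128 else b for b in data)
    return unsigned_sum, signed_sum


ascii_header = bytearray(BLOCK_SIZE)
ascii_header[0:4] = b"abcd"           # pure ASCII name: both sums agree
latin1_header = bytearray(BLOCK_SIZE)
latin1_header[0:3] = b"\xe9t\xe9"     # two bytes with the eighth bit set
```

For `ascii_header' the two sums are equal, so either kind of `tar' accepts the archive. For `latin1_header' they differ by 256 for each byte with the eighth bit set, which is why a reader that wants to accept both kinds of archive must compare the stored value against both sums.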
It is somewhat worrying that a lot of people may go around doing backups of their files using faulty (or at least non-standard) software, not learning about it until it's time to restore their missing files with an incompatible file extractor, or vice versa.

GNU `tar' computes checksums both ways, and accepts either on read, so GNU `tar' can read Sun tapes even with their wrong checksums. GNU `tar' produces the standard checksum, however, raising incompatibilities with Sun. That is to say, GNU `tar' has not been modified to _produce_ incorrect archives to be read by buggy `tar's. I've been told that more recent Sun `tar' now reads standard archives, so maybe Sun did a similar patch, after all?

The story seems to be that when Sun first imported `tar' sources on their system, they recompiled it without realizing that the checksums were computed differently, because of a change in the default signing of `char's in their compiler. So they started computing checksums wrongly. When they later realized their mistake, they merely decided to stay compatible with it, and with themselves afterwards. Presumably, but I do not really know, HP-UX has chosen to make their `tar' archives compatible with Sun's. The current standards do not favor Sun `tar' format. In any case, it now falls on the shoulders of SunOS and HP-UX users to get a `tar' able to read the good archives they receive.

File: tar.info, Node: Compression, Next: Attributes, Prev: Portability, Up: Formats

Using Less Space through Compression
====================================

* Menu:

* gzip::                        Creating and Reading Compressed Archives
* sparse::                      Archiving Sparse Files

File: tar.info, Node: gzip, Next: sparse, Prev: Compression, Up: Compression

Creating and Reading Compressed Archives
----------------------------------------

_(This message will disappear, once this node is revised.)_

`-z'
`--gzip'
`--ungzip'
     Filter the archive through `gzip'.
Some format parameters must be taken into consideration when modifying an archive. Compressed archives cannot be modified.

You can use `--gzip' and `--gunzip' on physical devices (tape drives, etc.) and remote files as well as on normal files; data to or from such devices or remote files is reblocked by another copy of the `tar' program to enforce the specified (or default) record size. The default compression parameters are used; if you need to override them, avoid the `--gzip' (`--gunzip', `--ungzip', `-z') option and run `gzip' explicitly. (Or set the `GZIP' environment variable.)

The `--gzip' (`--gunzip', `--ungzip', `-z') option does not work with the `--multi-volume' (`-M') option, or with the `--update' (`-u'), `--append' (`-r'), `--concatenate' (`--catenate', `-A'), or `--delete' operations.

It is not accurate to say that GNU `tar' works in concert with `gzip' in a way similar to `zip', say. Certainly, `tar' and `gzip' can be run with a single call, as in:

     $ tar cfz archive.tar.gz subdir

to save all of `subdir' into a `gzip''ed archive. Later you can do:

     $ tar xfz archive.tar.gz

to explode and unpack.

The difference is that the whole archive is compressed. With `zip', archive members are compressed individually. `tar''s method yields better compression. On the other hand, one can view the contents of a `zip' archive without having to decompress it. As for the `tar' and `gzip' tandem, you need to decompress the archive to see its contents. However, this may be done without needing disk space, by using pipes internally:

     $ tar tfz archive.tar.gz

About corrupted compressed archives: `gzip''ed files have no redundancy, for maximum compression. The adaptive nature of the compression scheme means that the compression tables are implicitly spread all over the archive. If you lose a few blocks, the dynamic construction of the compression tables becomes unsynchronized, and there is little chance that you could recover later in the archive.
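The trade-off between compressing the whole archive and compressing each member on its own can be tried out with a small Python sketch. It relies on the standard `tarfile' and `gzip' modules rather than on GNU `tar' itself, and compressing each member separately with `gzip' is only a stand-in for what `zip' does. The sample data from `make_members' is hypothetical, and deliberately favorable to whole-archive compression: every member holds the same hard-to-compress content.

```python
import gzip
import io
import random
import tarfile


def make_members(count=20, size=1000, seed=0):
    """Hypothetical sample data: files that all share the same content."""
    payload = bytes(random.Random(seed).randrange(256) for _ in range(size))
    return {"subdir/file%02d" % i: payload for i in range(count)}


def whole_archive_size(members):
    """Size of a tar archive compressed as a whole (tar plus gzip)."""
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as archive:
        for name, data in members.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return len(gzip.compress(raw.getvalue()))


def per_member_size(members):
    """Total size when every member is compressed separately."""
    return sum(len(gzip.compress(data)) for data in members.values())


members = make_members()
whole = whole_archive_size(members)
separate = per_member_size(members)
print("whole archive:", whole, "bytes; per member:", separate, "bytes")
```

On data like this the whole archive comes out smaller, because redundancy shared between members is only visible to a compressor that sees them all in one stream. The price is the one described above: nothing past a damaged spot can be decompressed independently.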
There are pending suggestions for having a per-volume or per-file compression in GNU `tar'. This would allow for viewing the contents without decompression, and for resynchronizing decompression at every volume or file, in case of corrupted archives. Doing so, we might lose some compressibility. But this would make recovery easier. So, there are pros and cons. We'll see!

`-Z'
`--compress'
`--uncompress'
     Filter the archive through `compress'. Otherwise like `--gzip' (`--gunzip', `--ungzip', `-z').

`--use-compress-program=PROG'
     Filter through PROG (must accept `-d').

`--compress' (`--uncompress', `-Z') stores an archive in compressed format. This option is useful in saving time over networks and space in pipes, and when storage space is at a premium. `--compress' (`--uncompress', `-Z') causes `tar' to compress when writing the archive, or to uncompress when reading the archive.

To perform compression and uncompression on the archive, `tar' runs the `compress' utility. `tar' uses the default compression parameters; if you need to override them, avoid the `--compress' (`--uncompress', `-Z') option and run the `compress' utility explicitly. It is useful to be able to call the `compress' utility from within `tar' because the `compress' utility by itself cannot access remote tape drives.

The `--compress' (`--uncompress', `-Z') option will not work in conjunction with the `--multi-volume' (`-M') option or the `--append' (`-r'), `--update' (`-u') and `--delete' operations. *Note Operations::, for more information on these operations.

If there is no compress utility available, `tar' will report an error. *Please note* that the `compress' program may be covered by a patent, and therefore we recommend you stop using it.

`--compress'
`--uncompress'
`-Z'
     When this option is specified, `tar' will compress (when writing an archive), or uncompress (when reading an archive).
     Used in conjunction with the `--create' (`-c'), `--extract' (`--get', `-x'), `--list' (`-t') and `--compare' (`--diff', `-d') operations.

You can have archives be compressed by using the `--gzip' (`--gunzip', `--ungzip', `-z') option. This arranges for `tar' to use the `gzip' program to compress or uncompress the archive when writing or reading it.

To use the older, obsolete, `compress' program, use the `--compress' (`--uncompress', `-Z') option. The GNU Project recommends you not use `compress', because there is a patent covering the algorithm it uses. You could be sued for patent infringement merely by running `compress'.

     I have one question, or maybe it's a suggestion if there isn't a way to do it now. I would like to use `--gzip' (`--gunzip', `--ungzip', `-z'), but I'd also like the output to be fed through a program like GNU `ecc' (actually, right now that's `exactly' what I'd like to use :-)), basically adding ECC protection on top of compression. It seems as if this should be quite easy to do, but I can't work out exactly how to go about it. Of course, I can pipe the standard output of `tar' through `ecc', but then I lose (though I haven't started using it yet, I confess) the ability to have `tar' use `rmt' for its I/O (I think).

     I think the most straightforward thing would be to let me specify a general set of filters outboard of compression (preferably ordered, so the order can be automatically reversed on input operations, and with the options they require specifiable), but beggars shouldn't be choosers and anything you decide on would be fine with me.

     By the way, I like `ecc' but if (as the comments say) it can't deal with loss of block sync, I'm tempted to throw some time at adding that capability. Supposing I were to actually do such a thing and get it (apparently) working, do you accept contributed changes to utilities like that? (Leigh Clayton `loc@soliton.com', May 1995).
Isn't that exactly the role of the `--use-compress-program=PROG' option? I never tried it myself, but I suspect you may want to write a PROG script or program able to filter stdin to stdout the way you want. It should recognize the `-d' option, for when extraction is needed rather than creation.

It has been reported that if one writes compressed data (through the `--gzip' (`--gunzip', `--ungzip', `-z') or `--compress' (`--uncompress', `-Z') options) to a DLT and tries to use the DLT compression mode, the data will actually get bigger and one will end up with less space on the tape.

File: tar.info, Node: sparse, Prev: gzip, Up: Compression

Archiving Sparse Files
----------------------

_(This message will disappear, once this node revised.)_

`-S'
`--sparse'
     Handle sparse files efficiently.

This option causes all files to be put in the archive to be tested for sparseness, and handled specially if they are. The `--sparse' (`-S') option is useful when many `dbm' files, for example, are being backed up. Using this option dramatically decreases the amount of space needed to store such a file.

In later versions, this option may be removed, and the testing and treatment of sparse files may be done automatically with any special GNU options. For now, it is an option needing to be specified on the command line with the creation or updating of an archive.

Files in the filesystem occasionally have "holes." A hole in a file is a section of the file's contents which was never written. The contents of a hole read as all zeros. On many operating systems, actual disk storage is not allocated for holes, but they are counted in the length of the file. If you archive such a file, `tar' could create an archive larger than the disk space the original occupies. To have `tar' attempt to recognize the holes in a file, use `--sparse' (`-S').
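A hole is easy to produce and to observe. The following Python sketch writes a file by seeking past its end before writing, then compares the length of the file with the space allocated for it. It assumes that `st_blocks' counts units of 512 bytes, which holds on most Unix systems but is not guaranteed, and the helper names are hypothetical. Whether the never-written region is really left unallocated depends on the file system.

```python
import os
import tempfile

BLOCK_UNIT = 512  # assumption: st_blocks counts 512-byte units


def uses_less_space_than_length(size, blocks):
    """True when the allocated space is smaller than the file length."""
    return blocks * BLOCK_UNIT < size


def write_file_with_hole(path, hole_bytes=1 << 20):
    """Seek past the end before writing, leaving a never-written region."""
    with open(path, "wb") as f:
        f.seek(hole_bytes)
        f.write(b"end")


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "holey")
    write_file_with_hole(path)
    st = os.stat(path)
    with open(path, "rb") as f:
        head = f.read(16)

length = st.st_size
blocks = getattr(st, "st_blocks", None)  # not available on every platform
print("length:", length, "hole reads as zeros:", head == bytes(16))
if blocks is not None:
    print("allocated:", blocks * BLOCK_UNIT,
          "looks sparse:", uses_less_space_than_length(length, blocks))
```

The length always includes the hole, and reading the hole always yields zeros; only the allocated space tells the two kinds of file apart.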
When you use the `--sparse' (`-S') option, then, for any file using less disk space than would be expected from its length, `tar' searches the file for consecutive stretches of zeros. It then records in the archive for the file where the consecutive stretches of zeros are, and only archives the "real contents" of the file. On extraction (using `--sparse' (`-S') is not needed on extraction) any such files have holes created wherever the continuous stretches of zeros were found. Thus, if you use `--sparse' (`-S'), `tar' archives won't take more space than the original.

A file is sparse if it contains blocks of zeros whose existence is recorded, but that have no space allocated on disk. When you specify the `--sparse' (`-S') option in conjunction with the `--create' (`-c') operation, `tar' tests all files for sparseness while archiving. If `tar' finds a file to be sparse, it uses a sparse representation of the file in the archive. *Note create::, for more information about creating archives.

`--sparse' (`-S') is useful when archiving files, such as dbm files, likely to contain many nulls. This option dramatically decreases the amount of space needed to store such an archive.

*Please Note:* Always use `--sparse' (`-S') when performing file system backups, to avoid archiving the expanded forms of files stored sparsely in the system.

Even if your system has no sparse files currently, some may be created in the future. If you use `--sparse' (`-S') while making file system backups as a matter of course, you can be assured the archive will never take more space on the media than the files take on disk (otherwise, archiving a disk filled with sparse files might take hundreds of tapes).

`tar' ignores the `--sparse' (`-S') option when reading an archive.

`--sparse'
`-S'
     Files stored sparsely in the file system are represented sparsely in the archive. Use in conjunction with write operations.
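The idea of recording where the stretches of zeros are, and archiving only the "real contents", can be sketched in a few lines of Python. This is an illustration of the principle only: it does not reproduce the representation GNU `tar' actually writes into the archive, the 512-byte scanning granularity is an assumption, and the function names are hypothetical. The sketch also rebuilds the zeros in memory, whereas a real extractor would leave holes in the file instead.

```python
BLOCK = 512  # scanning granularity; an assumption of this sketch


def data_segments(content, block=BLOCK):
    """Return (offset, length) pairs for the stretches that are not all zeros."""
    segments = []
    start = None
    for offset in range(0, len(content), block):
        chunk = content[offset:offset + block]
        if not any(chunk):  # an all-zero block
            if start is not None:
                segments.append((start, offset - start))
                start = None
        elif start is None:
            start = offset
    if start is not None:
        segments.append((start, len(content) - start))
    return segments


def pack(content):
    """Keep the total length and only the "real contents" with their offsets."""
    stored = [(off, content[off:off + n]) for off, n in data_segments(content)]
    return len(content), stored


def unpack(length, stored):
    """Rebuild the file; the gaps between segments come back as zeros."""
    out = bytearray(length)
    for off, data in stored:
        out[off:off + len(data)] = data
    return bytes(out)


sample = b"A" * 512 + bytes(4096) + b"B" * 100
length, stored = pack(sample)
print(data_segments(sample), sum(len(data) for _, data in stored))
```

Note that `data_segments' has to look at every byte of the file in order to find the zeros.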
However, users should be well aware that at archive creation time, GNU `tar' still has to read the whole disk file to locate the "holes", and so, even if sparse files use little space on disk and in the archive, they may sometimes require an inordinate amount of time for reading and examining all-zero blocks of a file. Although it works, it's painfully slow for a large (sparse) file, even though the resulting tar archive may be small. (One user reports that dumping a `core' file of over 400 megabytes, but with only about 3 megabytes of actual data, took about 9 minutes on a Sun SPARCstation ELC, with full CPU utilisation.)

This reading is required in all cases, whether or not the `--sparse' (`-S') option is used, so by merely _not_ using the option, you are not saving time(1).

Programs like `dump' do not have to read the entire file; by examining the file system directly, they can determine in advance exactly where the holes are and thus avoid reading through them. The only data they need read are the actual allocated data blocks. GNU `tar' uses a more portable and straightforward archiving approach; it would be fairly difficult for it to do otherwise. Elizabeth Zwicky writes to `comp.unix.internals', on 1990-12-10:

     What I did say is that you cannot tell the difference between a hole and an equivalent number of nulls without reading raw blocks. `st_blocks' at best tells you how many holes there are; it doesn't tell you _where_. Just as programs may, conceivably, care what `st_blocks' is (care to name one that does?), they may also care where the holes are (I have no examples of this one either, but it's equally imaginable).

     I conclude from this that good archivers are not portable. One can arguably conclude that if you want a portable program, you can in good conscience restore files with as many holes as possible, since you can't get it right.

---------- Footnotes ----------

(1) Well! We should say the whole truth, here.
When `--sparse' (`-S') is selected while creating an archive, the current `tar' algorithm requires sparse files to be read twice, not once. We hope to develop a new archive format for saving sparse files in which one pass will be sufficient.