ZIP (file format)
From EDeskWiki
The ZIP file format is a popular data compression and archival format. A ZIP file contains one or more files that have been compressed to reduce their size, or stored as-is.
The format was originally designed by Phil Katz for PKZIP. However, many software utilities other than PKZIP itself are now available to create, modify, or open (unzip, decompress) ZIP files, notably WinZip, BOMArchiveHelper, KGB Archiver, PicoZip, Info-ZIP, WinRAR, IZArc, 7-Zip, ALZip and TUGZip. Microsoft has included built-in ZIP support (under the name "compressed folders") in later versions of its Windows operating system. Apple has included built-in ZIP support in Mac OS X 10.3 and later via the BOMArchiveHelper utility.
ZIP files generally use the file extensions ".zip" or ".ZIP" and the MIME media type application/zip. Some software uses the ZIP file format as a wrapper for a large number of small items in a specific structure. When this is done, a different file extension is generally used. Examples of this usage are Java JAR files, id Software .pk3/.pk4 files, package files for StepMania and Winamp/Windows Media Player skins, XPInstall, and the OpenDocument and Office Open XML office formats. Both OpenDocument and Office Open XML use the JAR file format internally, so their files can easily be uncompressed and compressed using tools for ZIP files.
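Because such containers are ordinary ZIP archives, a generic ZIP library can read them regardless of the file extension. The following is a minimal sketch using Python's standard `zipfile` module; the entry names and contents are illustrative assumptions, not a valid Java program.

```python
import io
import zipfile

# Build a tiny JAR-like container in memory (contents are illustrative only).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as jar:
    jar.writestr("META-INF/MANIFEST.MF", "Manifest-Version: 1.0\n")
    jar.writestr("com/example/Hello.class", b"\xca\xfe\xba\xbe")

# Any ZIP reader can open it, whatever extension the file would carry on disk.
buf.seek(0)
is_zip = zipfile.is_zipfile(buf)
with zipfile.ZipFile(buf) as jar:
    names = jar.namelist()

print(is_zip, names)
```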
History
Early history
The ZIP file format was originally created by Phil Katz, founder of PKWARE, after a prolonged legal dispute between PKWARE and System Enhancement Associates (SEA) over the trademark "ARC" (short for "Archive") and the file extension .ARC.
PKWARE's first archive product, PKARC, borrowed heavily from SEA's published code, and improved on it by converting SEA's ARC C code into hand-optimised assembler, which was much faster. PKARC also used the ".ARC" file extension. SEA contended that Katz had based his product on their code and trademark, and thus ought to license the code from them and pay royalties. PKWARE refused. SEA brought a successful copyright infringement lawsuit against Phil Katz and PKWARE. After suit was brought, Katz briefly released a relabeled version of PKARC named PKPAK in a futile effort to invalidate the suit.
During settlement, Katz still refused to pay license fees to SEA, instead agreeing to pay SEA's legal expenses and stop selling PKARC. He then went on to create his own file format, which is known worldwide now as the ZIP format (commonly called a "ZIP file"). The ZIP format was more resistant to data loss than the ARC format because of redundant catalog storage; it also was more flexible than ARC, providing room for additional optional compression algorithms and future expansion. Along with the new format, PKZIP included at least one compression algorithm more efficient than any supported by ARC. Once PKZIP was released, many users abandoned ARC because of its slower speed and less effective compression, and because Katz had successfully put forth the idea that he was the "good guy" who was being treated unfairly by an "evil corporation".
In January 1989, Katz publicly released technical documentation on the ZIP file format, making it an open format, along with the first version of his PKZIP archiver.
The name zip (meaning speed) was suggested by Katz's friend Robert Mahoney. They wanted to imply that their product would be faster than ARC and other compression formats of the time.
Moving beyond the command line
In the mid-1990s, as more new computers included graphical user interfaces, more users were not comfortable with the command-line operation of PKZIP. Seeing an opportunity, shareware authors began pitching compression and archival programs with graphical user interfaces. Many of these used the ZIP format. WinZip was among the most popular. PKWARE also offered a graphical version of PKZIP. These programs were easier to learn than the older command-line equivalents, but users still had to learn a specialized tool with its own interface for file archival and compression.
In the late 1990s, various file managers started integrating support for the ZIP format into their user interfaces. Even earlier, Norton Commander and its clones such as Volkov Commander for DOS had started that trend, which remains the norm for "Commander-like" or orthodox file managers such as Midnight Commander for Linux and UNIX-like systems and Total Commander (previously Windows Commander) for Windows. The KDE file manager (kfm) supported the ZIP format very early; ZIP support was first added to Windows Explorer with the Plus! enhancement package for Windows 98 and was later included in Windows Me and Windows XP; ZIP support is also built into the Mac OS Finder (as of Mac OS X, via the BOMArchiveHelper utility), the Nautilus file manager used by GNOME, and the Konqueror file manager of newer versions of KDE. By 2002, all major desktop environments included ZIP file support in their file managers: a ZIP file is typically presented as a directory or folder, so that files are copied into and out of it in the same manner as any other folder and the compression is handled in a way largely transparent to the user. This has eliminated the need to learn a specialized tool and interface for file archival and compression.
Technical information
ZIP is a fairly simple archive format that compresses every file separately. Compressing files separately allows individual files to be retrieved without reading through other data; in theory, it may also allow better compression by using different algorithms for different files. The caveat is that archives containing a large number of small files end up significantly larger than if the files were compressed together as a single stream (the classic example of the latter is the common tar.gz archive, which consists of a TAR archive compressed using gzip).
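The size penalty of per-file compression can be illustrated with a small experiment. The sketch below uses Python's `zlib` module as a stand-in for both approaches (ZIP itself stores raw DEFLATE streams with its own headers, so the absolute numbers differ), and the sample "files" are made-up data chosen only to show the effect.

```python
import zlib

# Hypothetical sample data: 200 small, similar "files" (not from the article).
files = [f"record {i}: status=ok, owner=example\n".encode() * 3 for i in range(200)]

# ZIP-style: each file is compressed on its own, so redundancy shared
# between files cannot be exploited.
separate_total = sum(len(zlib.compress(data, 9)) for data in files)

# tar.gz-style ("solid"): concatenate everything, then compress once.
solid_total = len(zlib.compress(b"".join(files), 9))

print("compressed separately:", separate_total, "bytes")
print("compressed as one stream:", solid_total, "bytes")
```

The trade-off is that the solid stream has to be decompressed from the start to reach a file near the end, whereas the separately compressed files can each be extracted on their own.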
The specification for ZIP indicates that files can be stored either uncompressed or using a variety of compression algorithms. However, in practice, ZIP is almost always used with Katz's DEFLATE algorithm, except when files being added are already compressed or are resistant to compression.
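As a minimal sketch of this choice, Python's standard `zipfile` module lets each entry be written either stored or deflated. The entry names and data below are assumptions made up for the illustration; random bytes stand in for data that resists compression.

```python
import io
import os
import zipfile

text = b"hello zip " * 1000      # highly compressible (10,000 bytes)
noise = os.urandom(10_000)       # effectively incompressible

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    # Compress the text with DEFLATE, but store the random data as-is.
    zf.writestr("text.txt", text, compress_type=zipfile.ZIP_DEFLATED)
    zf.writestr("noise.bin", noise, compress_type=zipfile.ZIP_STORED)

# Read the archive back and record (method, original size, size in archive).
with zipfile.ZipFile(buf) as zf:
    sizes = {
        info.filename: (info.compress_type, info.file_size, info.compress_size)
        for info in zf.infolist()
    }

for name, (method, original, packed) in sizes.items():
    print(f"{name}: method {method}, {original} -> {packed} bytes")
```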
ZIP supports a simple password-based symmetric encryption system which is known to be seriously flawed. In particular, it is vulnerable to known-plaintext attacks, which are in some cases made worse by poor implementations of random number generators.[1] ZIP also supports spreading archives across multiple removable disks (generally floppy disks, but other removable media can also be used).
New features, including new compression and encryption (e.g. AES) methods, have been added to ZIP in more recent times, but these are not supported by many tools and are not in wide use.
The original ZIP format limited the uncompressed size of a file, the compressed size of a file, and the total size of the archive to 4 GB each. In version 4.5 of the specification, PKWARE introduced the "ZIP64" format extensions to get around these limitations.
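The 4 GB ceiling follows from the original headers storing sizes in 32-bit fields, while ZIP64 adds 64-bit counterparts. The sketch below only demonstrates the field widths with Python's `struct` module; the helper name is hypothetical, and the special marker values a real ZIP64 writer must place in the 32-bit fields are deliberately left out.

```python
import struct

def fits_in_32_bits(size: int) -> bool:
    """Return True if `size` can be packed into a little-endian 32-bit field."""
    try:
        struct.pack("<I", size)
        return True
    except struct.error:
        return False

largest_classic = 2**32 - 1     # largest value a 32-bit size field can hold
four_gib = 2**32                # one byte too many for the original format

print(fits_in_32_bits(largest_classic))   # fits the original 4-byte field
print(fits_in_32_bits(four_gib))          # needs a wider field

# A 64-bit field, as used by the ZIP64 extensions, holds the value easily.
zip64_field = struct.pack("<Q", four_gib)
print(len(zip64_field), "bytes")
```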
The FAT filesystem of DOS only has a timestamp resolution of two seconds; ZIP file records mimic this. As a result, the built-in timestamp resolution of files in a ZIP archive is only two seconds, though extra fields can be used to store more accurate timestamps.
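The two-second resolution comes from the DOS layout, which packs date and time into two 16-bit values and leaves only five bits for the seconds. The following sketch encodes and decodes a timestamp in that layout; the function names are hypothetical and the sample date is arbitrary.

```python
from datetime import datetime

def to_dos_datetime(dt: datetime) -> tuple[int, int]:
    """Pack a datetime into 16-bit DOS date and time fields."""
    dos_date = ((dt.year - 1980) << 9) | (dt.month << 5) | dt.day
    # Only 5 bits remain for seconds, so they are stored divided by two.
    dos_time = (dt.hour << 11) | (dt.minute << 5) | (dt.second // 2)
    return dos_date, dos_time

def from_dos_datetime(dos_date: int, dos_time: int) -> datetime:
    """Unpack 16-bit DOS date and time fields into a datetime."""
    return datetime(
        1980 + (dos_date >> 9),
        (dos_date >> 5) & 0x0F,
        dos_date & 0x1F,
        dos_time >> 11,
        (dos_time >> 5) & 0x3F,
        (dos_time & 0x1F) * 2,
    )

original = datetime(2007, 2, 14, 9, 30, 17)
restored = from_dos_datetime(*to_dos_datetime(original))
print(original, "->", restored)   # the odd second is rounded down to 16
```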
The Info-ZIP implementations of the ZIP format add support for Unix filesystem features, such as user and group IDs, file permissions, and symbolic links. The Apache Ant implementation is aware of them to the extent that it can create files with predefined Unix permissions.
The Info-ZIP Windows tools also support NTFS filesystem permissions, and will attempt to translate between NTFS permissions and Unix permissions when extracting files. This can produce undesirable combinations, e.g. .exe files being created on NTFS volumes with executable permission denied.
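As a rough illustration of how Unix permissions can travel inside an archive, the sketch below uses Python's standard `zipfile` module. It relies on the convention, assumed here rather than taken from the text above, that archives marked as created on Unix keep the file mode in the upper 16 bits of each entry's external attributes field; the entry name and script contents are made up.

```python
import io
import stat
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    info = zipfile.ZipInfo("run.sh")
    info.create_system = 3                              # 3 marks the entry as made on Unix
    info.external_attr = (stat.S_IFREG | 0o755) << 16   # regular file, rwxr-xr-x
    zf.writestr(info, "#!/bin/sh\necho hello\n")

# Read the entry back and recover the mode from the upper 16 bits.
with zipfile.ZipFile(buf) as zf:
    mode = zf.getinfo("run.sh").external_attr >> 16

print(stat.filemode(mode))
```

An extractor running on a filesystem without Unix-style modes has to map or ignore these bits, which is where mismatched permissions of the kind described above can arise.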
Compression methods
The size-comparison figures below were produced from the contents of ftp://ftp.kernel.org/pub/linux/kernel/v2.6/linux-2.6.9.tar.bz2 using maximum compression.
- Shrinking (method 1)
- Shrinking is a variant of LZW with a few minor tweaks. As such it was affected by the LZW patent issue. It was never clear if the patent covered unshrinking but some open source projects (for example Info-ZIP) decided to play it safe and not include unshrinking support in the default builds.
- Reducing (methods 2-5)
- Reducing involves compressing repeated byte sequences and then applying a probability-based encoding to the result.
- Imploding (method 6)
- Imploding involves compressing repeated byte sequences with a sliding window, then compressing the result using multiple Shannon-Fano trees.
- Tokenizing (method 7)
- This method number is reserved. The PKWARE specification does not define an algorithm for it. This is because the format was developed (as a non-proprietary open specification) by a third party other than PKWARE for specialized usage.[citation needed]
- Deflate and enhanced deflate (methods 8 and 9)
- These methods use the well-known deflate algorithm. Deflate allows a window up to 32 KiB. Enhanced deflate allows a window up to 64 KiB. The enhanced version performs slightly better but is not as widely supported.
- Sizes for comparison (using PKZIP 8.00.0038 for Windows):
- Deflate: 52.1 MiB
- Enhanced deflate: 51.8 MiB
- PKWARE Data Compression Library Imploding (method 10)
- The official ZIP format specification gives no further information on this.
- Size for comparison: 61.6 MiB (PKZIP 8.00.0038 for Windows in binary mode).
- Method 11
- This method number is reserved by PKWARE.
- Bzip2 (method 12)
- This method uses the well-known bzip2 algorithm. This algorithm performs better than deflate but is not widely supported, particularly by Windows-based tools.
- Size for comparison: 50.6 MiB (PKZIP 8.00.0038 for Windows).
- Note that although both the original (tar inside bzip2, 34.6 MiB) and the comparison version (bzip2 as a ZIP method, 50.6 MiB) use the same compression algorithm, the ZIP version is 46% larger. This demonstrates the compression ratio advantage of a solid archive over ZIP's strategy of compressing each individual file separately when used to archive many (16,448) files.
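The method numbers listed above are recorded per entry, so a reader can inspect which one was used. The sketch below writes the same made-up sample text with three methods supported by Python's standard `zipfile` module (0 for stored, 8 for deflate, 12 for bzip2) and reads back the method number and compressed size. The resulting sizes depend on the input and are not comparable to the kernel-source figures quoted above.

```python
import io
import zipfile

# Hypothetical sample text; real-world ratios depend heavily on the input.
data = b"".join(f"line {i}: the quick brown fox\n".encode() for i in range(5000))

results = {}
for name, method in [
    ("stored", zipfile.ZIP_STORED),
    ("deflate", zipfile.ZIP_DEFLATED),
    ("bzip2", zipfile.ZIP_BZIP2),
]:
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=method) as zf:
        zf.writestr("sample.txt", data)
    # Re-open the archive and read the method number stored for the entry.
    with zipfile.ZipFile(buf) as zf:
        info = zf.getinfo("sample.txt")
        results[name] = (info.compress_type, info.compress_size)

for name, (method_id, size) in results.items():
    print(f"{name}: method {method_id}, {size} bytes")
```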
External links
- Technical specifications of the PKZIP file formats from info-ZIP
- Current file format specification from PKWARE (including many recent features that are not widely supported)
- Original specification for the first version of the format
- List of ZIP-related resources, libraries and sources
- 18 Years of ZIP format: Happy Birthday at The Data Compression News Blog
- Comparison of the performance of various methods of data compression (in French)
- Archive and compressed file types
