- Hybrid: (lossy+correction)
TOOGAM's Software Archive: Multimedia Files: Audio Codecs has a section about compression techniques that gain potentially high compression ratios by making some data loss optional. A fairly small but usable file is created. Also, a “correction” file can be used to add in whatever data was lost during the process of obtaining the nicer compression ratio. The usable (small, lossy) file, combined with the correction file, is sufficient to generate the original file, so no loss is forced by the process.
This seems far preferable to forcing a person either to accept loss or to keep an additional copy of the entire original set of data.
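The split into a small usable file plus a correction file can be sketched as follows. This is only a toy illustration, assuming non-negative integer samples and a made-up scheme that drops the low bits of each sample; the function names and sample values are hypothetical, and no real audio codec is claimed to work this way.

```python
def split_hybrid(samples, dropped_bits=4):
    """Split non-negative integer samples into a lossy stream and a correction stream."""
    mask = (1 << dropped_bits) - 1
    lossy = [s >> dropped_bits for s in samples]   # coarse values: fewer bits each
    correction = [s & mask for s in samples]       # exactly the bits that were dropped
    return lossy, correction


def approximate(lossy, dropped_bits=4):
    """What a player would use when only the small lossy stream is available."""
    return [v << dropped_bits for v in lossy]


def restore(lossy, correction, dropped_bits=4):
    """Lossy stream plus correction stream gives back the original exactly."""
    return [(v << dropped_bits) | c for v, c in zip(lossy, correction)]


samples = [1000, 1003, 1011, 998, 1020, 1017]   # made-up sample values
lossy, correction = split_hybrid(samples)
print(approximate(lossy))            # close to the original, but not identical
print(restore(lossy, correction))    # identical to the original
```

The lossy stream alone is usable, and nothing is lost as long as the correction stream is kept alongside it.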
Note that TOOGAM's Software Archive was meant for MS-DOS and platforms that were typically compatible with it, prior to emulation. Therefore, unless virtualization/emulation is used, many of the implementations listed at TOOGAM's Software Archive: Multimedia Files: Audio Codecs may not work as well in other environments, such as operating systems on a computer using an x64 chip.
- Lossless data
- General Data
- Widely compatible software
Zip files are wonderful, mostly due to their very widespread support. Perhaps one of the most prominent examples is Info-Zip 5.12's UnZipSfx.Exe file, which was only about 12KB and required very little RAM. Info-Zip's UnZip page had, for a long time, touted the software as likely being “The Third Most Portable Program in the World!” (This was before some other competitors joined, such as the Linux kernel, DOOM, and clones of games like Bastet (including earlier incarnations such as TETЯIS, the variation of the title used by Atari... or shall we say Ataяi?).) Having freely available source code, early on, surely helped cement Zip files as a widely supported standard.
7-Zip can often create some of the smallest possible Zip-compatible files (and gzip-compatible files), and also supports using the LZMA2 compression algorithms by using a format that is widely called the 7-Zip format, which uses the
Although, before blindly trusting the files it makes, do be familiar with this warning: TOOGAM's Software Archive: Archives: 7-Zip Warning.
For zip files, creating very small files may be done with something like:
7za a -tzip -mx=9 -mfb=256 -mpass=15
- Required variations
On Microsoft Windows systems, 7-Zip does not typically place itself in the path by default, and so a longer reference to 7-Zip's executable may be needed.
a -tzip -mx=9 -mfb=256 -mpass=15
On a 64-bit version of Microsoft Windows, the following may be needed if the 32-bit version of 7-Zip is installed, and if the 64-bit version of 7-Zip is not installed:
"C:\Program Files (x86)\7-Zip\7z.exe" a -tzip -mx=9 -mfb=256 -mpass=15
Actually, the maximum value for -mfb may be a tiny bit higher: 258 (depending on what file format is used). The created file would be slightly smaller if the following is supported by the 7-Zip command line, and is then used:
7za a -tzip -mx=9 -mfb=258 -mpass=15
For those who are new to command line user interfaces, one of the trickiest parts may be to simply find the exact name of the executable to use. The following example shows how this might be doable in Unix:
7za a -tzip -mx=9 -mfb=256 -mpass=15
This may Zip up all files in the current directory and subdirectories. If that is not what is wanted, just add, to the end of the command line, any filespec (“file specification”) that is needed. The following example shows such filespecs being used, as well as how to reference 32-bit 7-Zip if it is installed on a machine using a 64-bit version of Microsoft Windows:
"C:\Program Files (x86)\7-Zip\7z.exe" a -tzip -mx=9 -mfb=256 -mpass=15
This is all based on TOOGAM's online help for the 7-Zip command line.
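For those who would rather script this, the search for the executable and the assembly of the command line can be sketched in Python. The list of candidate executable names, the 64-bit install path, and the `out.zip` and `*.txt` arguments are assumptions for illustration; the switches themselves are the ones shown above. The sketch only builds the argument list and does not run 7-Zip.

```python
import os
import shutil

# Assumed candidate names; which ones exist depends on the installed package.
CANDIDATE_NAMES = ["7z", "7za", "7zr", "7zz"]
# Typical Windows install locations (7-Zip does not add itself to the PATH).
WINDOWS_PATHS = [
    r"C:\Program Files\7-Zip\7z.exe",
    r"C:\Program Files (x86)\7-Zip\7z.exe",
]


def find_7zip():
    """Return the first 7-Zip executable found, or None if none is found."""
    for name in CANDIDATE_NAMES:
        path = shutil.which(name)
        if path:
            return path
    for path in WINDOWS_PATHS:
        if os.path.isfile(path):
            return path
    return None


def build_zip_command(exe, archive, filespecs=(), fast_bytes=258):
    """Build the argument list for a maximum-effort Zip-format archive."""
    return [exe, "a", "-tzip", "-mx=9", f"-mfb={fast_bytes}", "-mpass=15",
            archive, *filespecs]


exe = find_7zip() or "7za"   # fall back to a plain name if nothing was found
print(build_zip_command(exe, "out.zip", ["*.txt"]))
```

Passing the resulting list to `subprocess.run` would execute it; keeping the arguments as a list avoids quoting problems with paths such as `C:\Program Files (x86)`.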
Licensing: The 7-Zip home page for the Lempel-Ziv-Markov chain algorithm (“LZMA”) Software Development Kit (“SDK”) states that one of the features of version 4.62 is that “LZMA SDK is placed in the public domain.” 7-Zip itself uses LGPL 2.1(+), except that the 7z.dll file has an exception for the unRAR support, as noted by 7-Zip license. The differences are only intended to affect people who wish to create software or alter how software behaves. As clearly noted on 7-Zip's license, “You can use 7-Zip on any computer, including a computer in a commercial organization. You don't need to register or pay for 7-Zip.”
- Further reductions
- Removing directories
To further shrink zip files: After zipping up data, the zip file might be shrinkable by removing unneeded data about subdirectories. The savings will be small (e.g. 116 bytes per subdirectory). (With 7-Zip, this unneeded data is in the zip file if compressing subdirectories and their contents, but specifying a filename from a subdirectory will not cause the unneeded directory entry.) Listing the contents of a zip file can show whether a directory is part of the zip file. Using text mode (command line) software will probably show this more clearly than software that renders the zip file contents with a graphical interface.
If subdirectories were used, then the zip file could be made a bit smaller by deleting the subdirectory entries. The files will still be stored under their subdirectory paths, and so software will typically create the necessary subdirectories when extracting, but the zip file won't contain the metadata (like file times) for the subdirectories themselves. If that metadata is not desired, removing it makes the zip file a tiny (generally insignificant) bit smaller. This process can be done using Info-Zip's Zip, by using: “
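The effect of dropping directory entries can be sketched with Python's standard `zipfile` module (not Info-Zip or 7-Zip). The archive contents and the `docs/` name are made up for illustration, and the exact number of bytes saved depends on which extra fields the archiver stored.

```python
import io
import zipfile


def make_sample_zip():
    """Build a small in-memory zip with an explicit directory entry."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(zipfile.ZipInfo("docs/"), b"")          # the directory entry
        zf.writestr("docs/readme.txt", b"hello " * 100)     # a file inside it
    return buf.getvalue()


def strip_directory_entries(zip_bytes):
    """Copy every file entry into a new zip, skipping directory entries."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.is_dir():
                continue
            dst.writestr(info, src.read(info))
    return out.getvalue()


def names(zip_bytes):
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return zf.namelist()


original = make_sample_zip()
stripped = strip_directory_entries(original)
print(names(original), len(original))
print(names(stripped), len(stripped))
```

The file keeps its `docs/` path prefix in the stripped archive, so extraction still recreates the subdirectory; only the separate entry for the directory itself is gone.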
- Use Zopfli
As a real example, a zip file was made (using 7-Zip) of Freeexe version 0.1 and that Zip file was 4,726,980 bytes. After using Advzip's Zopfli code for just one iteration, the file was reduced to 4,721,510 bytes. Using that same code on the latest ZIP file for 10 iterations reduced the file to 4,721,441 bytes. Using the same code for 10,000 iterations on the latest ZIP file resulted in a new ZIP file size of 4,721,025 bytes. So after 10,011 iterations, the file was 4,721,025 of the original 4,726,980 bytes, which is still over 99.87% of the original file size, having saved 5,955 bytes (after the code likely ran for hours).
As a real example, a zip file was made (using 7-Zip) of Freeexe version 0.1 and that Zip file was 4,727,045 bytes. After using Advzip's Zopfli code for just one iteration, the file was reduced to 4,721,603 bytes, saving 5,442 bytes. Using that same code on the latest ZIP file for 10 iterations reduced the file to 4,721,533 bytes, creating savings of another 70 bytes (for 5,512 bytes saved), which would save one block if the file were stored or transmitted in blocks of 128 bytes, but not if stored/transmitted in blocks of 256 bytes or larger. (Most modern file systems will use a “block”, called an “allocation unit”, which is a multiple of 512 bytes.) Using the same code for 10,000 iterations on the latest ZIP file resulted in a new ZIP file size of 4,721,131 bytes. So after 10,011 iterations, the file was 4,721,131 of the original 4,727,045 bytes, which is still over 99.87% of the original file size, having saved 5,914 bytes (after the code likely ran for 9 hours, 4 minutes, 43.99 seconds on a 2.2GHz computer).
A similar process on the related (Freeexe version 0.1) source code archive reduced it from 28,855,016 to 28,837,057 bytes, having shaved off 17,959 bytes.
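The arithmetic behind those figures, including the remark about block sizes, can be checked with a few lines of Python. The byte counts are the ones quoted above for the 4,727,045 byte archive; the variable and function names are just for illustration.

```python
ORIGINAL = 4_727_045       # zip made by 7-Zip
AFTER_1 = 4_721_603        # after 1 Zopfli iteration
AFTER_11 = 4_721_533       # after 10 more iterations
AFTER_10011 = 4_721_131    # after 10,000 more iterations


def blocks(size, block_size):
    """Number of fixed-size blocks needed to hold `size` bytes (rounded up)."""
    return -(-size // block_size)


print("saved by first pass:", ORIGINAL - AFTER_1)
print("saved by next 10 passes:", AFTER_1 - AFTER_11)
print("total saved:", ORIGINAL - AFTER_10011)
print("remaining fraction: %.4f%%" % (100 * AFTER_10011 / ORIGINAL))

for block_size in (128, 256, 512):
    saved = blocks(AFTER_1, block_size) - blocks(AFTER_11, block_size)
    print(f"{block_size}-byte blocks saved by the 70-byte reduction: {saved}")
```

With these numbers, the 70-byte reduction crosses a block boundary only for 128-byte blocks, which matches the statement above.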
Each time the Advzip program was run, the program did not provide any output until the requested number of iterations completed. That means that for minutes, if not numerous hours, there was no progress report.
Brotli announcement notes that Brotli “allows us to get 20–26% higher compression ratios over Zopfli.” One big advantage to using Zopfli is compatibility with certain existing file formats. (That might be the only major, significant advantage of using Zopfli over newer formats like Zstandard or Brotli.)
Zopfli's source code is available.
- Using Zopfli in advzip
For users of Microsoft Windows, or DOS, Zopfli is built into AdvZip, which is part of AdvanceCOMP (or, at least, new enough versions). Grab AdvanceCOMP. (AdvanceCOMP Releases provides the more direct hyperlinks, which can also be found by ignoring some of the earlier hyperlinks to other packages, and scrolling down sufficiently on the AdvanceCOMP Download section).
Extract AdvanceComp and run:
Update: That didn't work when tested. (Maybe that was meant for an older version? Or, maybe it was just a typo?) Try this instead:
There is nothing special about the number 10,000. It simply helps to have larger numbers, so that may provide better results than 1,000.
- Other tools
There may be other ways to shrink zip files even further. DeflOpt has been known to accomplish reductions of files made by 7-Zip. Zopfli issue: DeflOpt beats it also mentioned Defluff.
Once GPLv3, after version 7.00 this software became public domain using the license from Unlicense.org.
Compared to old ZIP programs, this uses up quite a lot of memory (e.g., it can use multiple gigabytes). It can also produce smaller files.
- To compress:
(-m5 will specify the compression method to use.)
Results on a 5,322,831,872 byte Windows Server 2016 installer ISO file were: 4,367,258,102 bytes using
while using the executable file designed for compatibility with 32-bit Microsoft Windows cleanly crashed. (The operating system didn't close the program, but the program did detect an error, displayed an error message, and closed.)
- Other PAQ options
This was noticed from a graphic on the ZPAQ page, which showed ZPAQ creating a slightly smaller file, but pcompress was apparently pretty close in size, and notably faster, using “
The following would be more consistent with documentation:
pcompress -v -l 14 -s 60m
-v enables verbose mode, showing each file.
On ZPAQ's chart, the speed shown was roughly identical to Info-Zip's “
ZPAQ's page noted pcompress was available for Linux, and apparently not available for Microsoft Windows. moinag's post about a Windows port said it was being worked on, back in 2012. It hadn't been made as of 2016 (Forum post regarding a Windows binary).
- Some Other (Newer) Options
(This section was made for solutions that are relatively new, or at least newly discovered.)
Quora: Brotli and Zstandard (with answers provided by famed compression analyst Matt Mahoney, and by Joe Duarte), 7-Zip-zstd's Github.com page (a home page for a program that modified 7-Zip to support Zstandard, Brotli, and more) provided a picture showing LZMA2 providing the highest compression ratio over Zstandard, Brotli, and others. Zstandard announcement showed zstd having a compression ratio of 3.14, while xz had a 4.31 compression ratio. So there is some evidence that these newer compression standards might not top the compression ratio of the LZMA2 compression method that can be found in 7-Zip and xz. (Instead, the focus of some of the newer variations may be the speed tradeoff.)
The good news is that the compressed data is small.
It is also “open source”.
RFC 7932: Brotli Compressed Data Format has a Copyright Notice which says, “Code Components extracted from” (RFC 7932) “must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions”, citing http://trustee.ietf.org/license-info which redirects to IETF: “Trust Legal Provisions (TLP)”. IETF Trust Legal Provisions 5.0 (HTML format) was published and effective starting March 25, 2015, making that the version “in effect on the date of publication of” RFC 7932 (quoting the copyright notice from that RFC).
GitHub: Google's Brotli notes, “Brotli is open-sourced under the MIT License, see the LICENSE file.” (Brotli License file). Wikipedia's page on Brotli notes some code implementing Brotli, created by Mark Adler, who was “one of the co-authors of the zlib/gzip compression format and library. Adler's implementation was released under the terms of the similarly permissive Apache license”.
The format has been touted as being excellent for web content. With support by major web browsers (both Chrome and Firefox) and an IETF document resembling an RFC document, the format seems to have obtained some stability and popularity beyond the PPM format.
There is also (at least some) support by nginx, Apache, and IIS 7 (64-bit).
The bad news about brotli is that it is slow at compressing. It also doesn't seem to support keeping track of metadata, such as filenames. That is not much of a problem, as such metadata can be captured using another tool, like tar (a traditional way to store metadata in Unix), and then that archive can be compressed.
Google has released the source code. Trying to find an officially created executable for Microsoft Windows was an unfruitful search, although some programs by third parties are available. An OpenBSD port of Brotli has been started, but is outdated (version 0.3.0 when Brotli version 0.6.0 has been seen), and isn't part of the OpenBSD package set (nothing is at OpenPorts.se) as of this writing (June 17, 2016).
Still, this review is being written in June 2016, less than a year since Google's announcement. So, hopefully some of these limits will dissipate over time.
- Brotli announcement mentions Brötli refers to food.
- Info. Eric Lawrence: “New Year’s Diet: Brotli Compression” noted “Mozilla beat Google to the finish and Firefox” supported Brotli before Google's web support. Eric Lawrence: “New Year’s Diet: Brotli Compression” also posted about a star graphic using Brotli (useful for testing). Also, IIS info on Brotli support mentions IIS 7 (64-bit).
info by Matt Mahoney (who has run a compression comparison site for many years) seems to suggest using -q 11 -w 24. A comment about brotli platform, by Matt Mahoney notes, “Windows and Linux versions of brotli compress to different sizes on most files”. An author notes that for --quality, “The supported range is 0 to 11, and 11 is the default.” Reported issue notes -q 9 can compress stronger than
- Not .bro, per Slashdot article: Google/Mozilla Engineers Nix .bro File Type As Offensive. (Apparently .brotli is preferred.)
- It works on lots of types of data. However, there may be some special support for common data types. Brotli dictionary commentary.
- IETF document
- Brotli releases
Bulat's post provides some executable files (
). EricLaw's Brotli page also hyperlinks to an “Authenticode-signed” Brotli.exe file (related blog post on performance). Cran R Project's page on Brotli has “Windows binaries” available, which claim to be version 1.0 but that seems to be a package version number, not Brotli's (since, at the time of this writing, the latest release of Brotli is 0.6.0). Those are compressed archives containing DLL files (not .exe files).
It seems there is no default extension, but people have been favoring a six letter extension (“.brotli”)
To see help:
brotli -q 11 -w 24 -v -f -i
Depending on the version, some of that may need to be changed, e.g., using --out. Check the command line help for details. Likewise, expanded parameters may need to be --force. Window size tends to be -w but may not be universally supported. “--repeat” has also been seen.
- Zstandard / Zstd
GitHub: Facebook: ZStd: README.md file notes, “Zstandard is dual-licensed under BSD and GPLv2.”
Tino Reichardt's 7-Zip with support for Zstandard, Brotli, Lz4, Lz5 and Lizard Compression is claimed to have lost the interest of the author, by forum post that promoted TC4Shell Modern7z. These projects provide ZStandard support as a plug-in for 7-Zip.
Like some other software (7-Zip, and xz), and contrary to what the word “ZIP” in the software's name might suggest, the archives created by this software do not appear to have any sort of compatibility with PKZip (which created the common ZIP file format), unlike gzip (which is incompatible with PKZip but does seem to have some similarities in the software's archives).
The software version 0.09a (alpha) identified its home page as http://nanozip.net but it seems that domain name is no longer being dedicated to that software. (Presumably this indicates that the domain name was probably allowed to expire, and has been lost.) Still, this older software has been able to provide some significant compression
Add to an archive. e.g.:
nz.exe a -cc -v -m1.4g
-cc represents one of the compression algorithms. Another option is -cO (which is a capital letter “o”, not a zero). -m1.4g represents a limitation on how much memory to use, and this amount was chosen simply from a single example that was seen at MattMahoney.net: Data Compression, Generic Compression Benchmark.
Results on a 5,322,831,872 byte Windows Server 2016 installer ISO file were: 4,804,999,774 bytes using “-cO” and 5,931,787,906 bytes using “
The README identifies itself as “Long Range ZIP or Lzma RZIP”.
lrzip's home page specifies “-Uzp 1 -L 9” for maximum compression. Specifying the “-v” option a couple of times can make it more verbose.
lrzip -Uzvvp 1 -L 9
The input filename is specified. The output filename will have .lrz added (e.g., “filename.lrz”) unless something is specified using options like “-o” or “-O” or “
This software may be available for Linux-based operating systems rather easily. (In Debian, a simple “
” works nicely.) For Microsoft Windows, the situation isn't quite so easy...
One option for Microsoft Windows may be Cygwin. This may require the following files:
(At the moment, rather direct download hyperlinks are not available from right here. These files were obtained by installing Cygwin, which ended up installing a number of additional files.)
That includes over 5MB of DLL files. All of those files can probably be obtained from these packages. (Perhaps some of these packages are not necessary.)
- Install bash 4.4.12-3 (automatically added)
- Install coreutils 8.25-3 (automatically added)
- Install cygwin 2.11.2-1 (automatically added)
- Install libattr1 2.4.46-1 (automatically added)
- Install libbz2_1 1.0.6-3 (automatically added)
- Install libgcc1 7.3.0-3 (automatically added)
- Install libgmp10 6.1.2-1 (automatically added)
- Install libiconv2 1.14-3 (automatically added)
- Install libintl8 0.19.8.1-2 (automatically added)
- Install liblzo2_2 2.10-1 (automatically added)
- Install libncursesw10 6.0-12.20171125 (automatically added)
- Install libreadline7 7.0.3-3 (automatically added)
- Install libstdc++6 7.3.0-3 (automatically added)
- Install lrzip 0.631-1
- Install terminfo 6.0-12.20171125 (automatically added)
- Install tzdata 2018e-1 (automatically added)
- Install zlib0 1.2.11-1 (automatically added)
Another option may be WeSaySo.co.uk lrzip.zip for Win32 (unverified), which offers what appears to be v0.23 of the program. This is an old version, from an unknown source.
- Other solutions for lossless compressing
Other options for compressing files exist. Some of these options may even make (typically slightly) smaller archives. However, they may have some disadvantage(s), such as not being quite as freely available/distributable/modifiable or being far less compatible. For instance, a PAQ program has often been released with some tiny improvement while not even being compatible with previous versions of the same program. (A ZPAQ specification may improve in that respect.) Also, in order to decompress the data, some of these options may require very large amounts of RAM compared to a standard Zip file.
MaximumCompression.com summary of Multiple Files (sorted by Compression Ratio) shows some of the results created with compression programs listed at MaximumCompression.com. MattMahoney.net: Data Compression, Generic Compression Benchmark also shows some results of some rather random data.
TOOGAM's Software Archive: Archivers lists some software to help handle a variety of formats.
The gzip standard seems to generally provide inferior compression compared to Zip. Its one advantage is that it is supported fairly widely, including by several web servers. RFC 2616 (HTTP 1.1 standard) section 3.5: “Content Coding” specified compression formats including gzip. However, the “deflate” content coding has compatibility issues: Wikipedia's article for Gzip: “Derivatives and other users” section states, “a server has no way to detect whether a client will correctly handle” a specific deflate implementation, as some code (including MS IE 6 through 8) supports RFC 1951 (“DEFLATE Compressed Data Format Specification version 1.3”) instead of correctly supporting RFC 1950 (“ZLIB Compressed Data Format Specification version 3.3”). (Such issues may leave the Zip format as being the best widely-compatible format, although even Zip has had multiple compression methods and some software will not support some of the more obscure variations.)
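The difference between the three wrappers can be seen with Python's standard `zlib` module, where the `wbits` argument selects raw DEFLATE (RFC 1951), the zlib wrapper (RFC 1950), or the gzip wrapper (RFC 1952). This is a small sketch with a made-up payload, not a model of any particular browser or server.

```python
import zlib


def deflate(data, wbits):
    """Compress with DEFLATE, choosing the wrapper through wbits."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return compressor.compress(data) + compressor.flush()


payload = b"The same text, repeated. " * 40   # made-up sample data

raw = deflate(payload, -15)       # RFC 1951: bare DEFLATE stream, no header
zlib_wrapped = deflate(payload, 15)   # RFC 1950: 2-byte header + Adler-32 trailer
gzip_wrapped = deflate(payload, 31)   # RFC 1952: gzip header + CRC-32/size trailer

print(len(raw), len(zlib_wrapped), len(gzip_wrapped))
print(zlib_wrapped[:2].hex(), gzip_wrapped[:2].hex())

# A decoder expecting the zlib wrapper rejects a raw DEFLATE stream.
try:
    zlib.decompress(raw)          # default wbits expects RFC 1950
except zlib.error as exc:
    print("rejected:", exc)

# Each form decodes fine when the decoder is told which wrapper to expect.
print(zlib.decompress(raw, -15) == payload)
```

Because the compressed payload is the same and only the wrapper differs, a receiver has to know (or guess) which form the sender meant by “deflate”.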
The gzip format supports only a single file. If there is a desire to store multiple files into a compressed archive, the common approach is to first bundle the files using another archive format, most commonly the tar format, and then to compress the resulting archive with gzip. Upon doing so, the file extension of .tgz (so the filename looks like filename.tgz) is commonly used. On filesystems that support long filenames, another file extension sequence is .tar.gz (so the filename looks like filename.tar.gz), which indicates the exact same file format. So, use those extensions interchangeably.
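The two layers can be seen with Python's standard `tarfile` and `gzip` modules. This is a minimal in-memory sketch; the member names and contents are made up for illustration.

```python
import gzip
import io
import tarfile


def make_tgz(files):
    """Bundle a {name: bytes} mapping into tar, compressed with gzip."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


archive = make_tgz({"a.txt": b"first file\n", "b.txt": b"second file\n"})

# Outer layer: one gzip stream (it would be named .tgz or .tar.gz on disk).
print(archive[:2].hex())

# Inner layer: a plain tar archive holding both files.
tar_bytes = gzip.decompress(archive)
print(tar_bytes[257:262])

with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
    print(tar.getnames())
```

The gzip layer sees only one input (the tar archive), which is how the single-file limit of gzip is worked around.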
Info-Zip has released
(for many platforms).
7-Zip also supports gzip. The command line support is quite similar to handling Zip archives: Simply use
Furthermore, the gzip format supports omitting the filename. With 7-Zip, a *.gz file can be created without the file's metadata by using something like:
7za a -tgzip -mx=9 -mfb=258 -mpass=15 -si
(On DOS-based systems, like Microsoft Windows traditional command line, there is typically no built-in
command, so the
command would be more equivalent. However,
command may be available instead. Also, the command may be
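Whether a gzip file carries a stored filename can be checked by looking at the FNAME bit (0x08) of the flags byte in its header, as defined by RFC 1952. The sketch below uses Python's standard `gzip` module rather than 7-Zip, and the name `notes.txt` is made up.

```python
import gzip
import io

FNAME = 0x08   # gzip header flag bit: an original filename follows the header


def gzip_bytes(data, stored_name=""):
    """Compress data to gzip; an empty stored_name writes no filename field."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=stored_name, mode="wb", fileobj=buf, mtime=0) as gz:
        gz.write(data)
    return buf.getvalue()


def has_filename(gz_data):
    """True if the gzip header says an original filename is stored."""
    return bool(gz_data[3] & FNAME)   # byte 3 of the header is the flags byte


data = b"some text worth compressing\n" * 20
named = gzip_bytes(data, "notes.txt")
anonymous = gzip_bytes(data)

print(has_filename(named), len(named))
print(has_filename(anonymous), len(anonymous))
```

The anonymous file is smaller by the length of the name plus its terminating zero byte, and both decompress to the same data.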
Wikipedia's article for Gzip: “Derivatives and other users” section states, “AdvanceCOMP and 7-Zip can produce
-compatible files, using an internal DEFLATE implementation with better compression ratios than
itself”. (Hyperlinks removed and styling added to quoted text.) AdvanceComp with
: documentation notes “The AdvanceCOMP distribution also includes the
tool. This tool performs the same recompression optimization on
Based on reading from Brian Lindholm's “New Options in the World of File Compression” article at Linux Gazette, it appears that this file extension probably indicates the file was made with “lzma_alone”, which is related to the 7-ZIP SDK (which entered “public domain” status with version 4.62). The files it creates may be smaller than “xz” files, but also contain fewer details that can help with integrity checking.
The article also indicates that the software may be a bit more challenging to use, including build difficulty and documentation clarity. For those who don't care about something like a 17KB difference of about 64-65 MB of uncompressed data, the “xz” format may be a bit easier to work with.
Wikipedia's article for “xz”, “History” section says, “Although the original 7-Zip program, which implements LZMA2 compression, is able to produce small files at the cost of speed, it also created its own unique archive format which was made primarily for Windows and did not support Unix” as well. Brian Lindholm's “New Options in the World of File Compression” article at Linux Gazette notes, “Unfortunately, with its Windows-based roots, the .7z file format made no provision for Unix-style permissions, user/group information, access control lists, or other such information. These limitations are show-stoppers for people doing backups on multi-user systems.”
So, basically xz is like 7-Zip's LZMA2 support, but with nicer support for some Unix filesystem features.
XZ Utils home page notes, “XZ Utils are the successor to LZMA Utils.” “The most interesting parts of XZ Utils (e.g. liblzma) are in the public domain. You can do whatever you want with the public domain parts.”
The BZip2 format gained some popularity for producing smaller compressed files than the gzip and zip software that was widely available at the time. Some hindrances to more widespread adoption included incompatibility with existing formats, and slower speed on older hardware. The slower speed is likely negligible with today's computing resources, but was more significant in 1996 when Bzip2 first came out.
At this time, Bzip2 is unlikely to gain popularity, as people who really want maximum compatibility are more likely to gravitate to Zip or gzip, and people who really want smaller files are more likely to gravitate to newer formats, like those that use LZMA2.
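A quick way to compare these families on one's own data is Python's standard library, which includes DEFLATE (`zlib`), bzip2 (`bz2`), and xz/LZMA2 (`lzma`). This is a rough sketch with made-up sample data; which format wins, and by how much, depends heavily on the input, so no ranking is claimed here.

```python
import bz2
import lzma
import zlib


def compare(data):
    """Return the compressed size in bytes for each format at maximum effort."""
    return {
        "deflate (zlib, level 9)": len(zlib.compress(data, 9)),
        "bzip2 (level 9)": len(bz2.compress(data, 9)),
        "xz / LZMA2 (preset 9)": len(lzma.compress(data, preset=9)),
    }


# Made-up sample: structured text with some variation from line to line.
sample = "\n".join(f"record {i}: value={i * i % 97}" for i in range(5000)).encode()

print("original:", len(sample))
for label, size in compare(sample).items():
    print(f"{label}: {size}")
```

Running the same function over a representative file from a real workload gives a better basis for choosing a format than any general claim.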
Actually, tar does not compress files. However, it does create file archives similar to some compression formats like ZIP files. This allows a person to have a single file that can be extracted to multiple files, providing many of the same conveniences as other file archiving formats. The
command typically supports extensions from the “IEEE P1003 “Portable Operating System” standard” that are related to Unix-style filesystem options, “including owner and group names, and support for named pipes, fifos, continuous files, and block and character devices.” (That quote came from the manual page found from an archive at Computerized Operational Department #5 Public Domain Software: Unix tar.) Support for these Unix-style filesystem options has made the tar format quite popular among Unix users.
program supports the “tape archive” (“
*.tar”) file format. Different versions of the program may also support additional file formats, perhaps most famously the tar+gzip format (“
*.tgz” or “
*.tar.gz” files). There may be various compression options, such as
-z (lowercase) to use the
command. Other command line options might include (uppercase)
-Z to use the
-j to use the
command, and (uppercase)
-J to use the
command. However, the precise formats supported do vary between different tar implementations. (The tar command's
-j option has been known to do something else, with some tar implementation.) In some cases, the
command may run a helper program even if the relevant command line switch isn't provided. Because there have been different behaviors over the years, checking the system's “man” page is recommended.
In some cases, the first hyphen might be optional. Reading the manual pages is recommended.
Some versions of
may strip leading slashes when extracting files. This may be the default behavior. For instance, with OpenBSD's
-P command line option preserves those slashes. According to pdtar documentation, the
-A option avoided using absolute names (by stripping the leading slashes). With Solaris 10, there is no way to strip the slashes while extracting with the built-in
command. (The available
command could be used to work around this.) Since details vary, reading the manual pages is recommended. Be careful when extracting tar files with unknown file structures. If the
program's documentation doesn't mention how leading slashes are handled, a person can simply use the
-t to show the “table of contents” of the tar file. If leading slashes exist, they should be visible when performing the check. If they don't exist, then command line parameters related to leading slashes may not be needed anyway.
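The same “table of contents” check can be scripted with Python's standard `tarfile` module, which lists member names without extracting anything. This is a sketch; the sample archive and its member names are made up, and flagging `..` path components is an extra precaution that goes beyond the leading-slash issue described above.

```python
import io
import tarfile


def risky_members(tar_bytes):
    """List member names that are absolute or climb out with '..' components."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        return [name for name in tar.getnames()
                if name.startswith("/") or ".." in name.split("/")]


def make_sample_tar(member_names):
    """Build an in-memory tar whose members have exactly the given names."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in member_names:
            info = tarfile.TarInfo(name)
            info.size = 0
            tar.addfile(info, io.BytesIO(b""))
    return buf.getvalue()


sample = make_sample_tar(["docs/readme.txt", "/etc/example.conf"])
print(risky_members(sample))
```

An empty result suggests that options related to leading slashes are not needed for that archive.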
Wikipedia's article for
(computing): “Key implementations” section mentions multiple sections.
To extract a file, use:
Note that the “f” parameter must be the last parameter, because the text after the “f” parameter is expected to be a filename.
- Other option(s)
The 7-Zip program supports tar.
Based on reviewing Luke's answer to dreftymac's StackOverflow.com question (and Superrole's 8:47 comment about -aoa, provided to Luke's answer to a SuperUser.com question) and various answers to Philipp's “Unix and Linux” Stack Exchange question, it appears the following ought to work:
- tar+gzip files
7za x -aoa -si -ttar -o
- tar+bzip2 files
7z x -si -ttar
Sponsored by Dropbox, this seems to be designed to rival PAQ and other formats, by being willing to spend notable time to make a small file. However, when tested (on a Microsoft Windows Server 2016 installation 5.3GB ISO image), the software seemed to use way less memory, and didn't output a file in the end.
- Compression libraries
A person who wishes to do computer programming can use some pre-created software to easily add data compression features to another program. Some details are available from further discussion at: Techn's: coding: compressing bits.
- Other topics related to lossless compressing
Compression challenges: Thank Matt Mahoney's site about data compression for SHARND which creates rather incompressible data. The site also has a program called Better Archiver with Recursive Functionality (BARF) that says “Of course, a properly designed test” ... “is not susceptable to such tricks.” There have been other invalid, fake compressors that have been far less forthcoming in their trickery.
For example, a program that stores data inside the “deleted space” on a hard drive might initially appear to be successfully storing data without legitimately using up hard drive space. However, any operations that threaten data undeletion, such as simply writing any data (and, perhaps more likely, disk defragmentation) would legitimately erase that data needed to restore the data. This trick is also unlikely to be possible to implement using remote data storage.
The “pigeon-hole principle” “counting argument” described by Usenet comp.compression FAQ 9.2 keeps humanity from being able to achieve lossless data compression for all sets of data. (Later questions in the FAQ may document some false claims.)
In the future, some further details about the counting argument may be located and/or documented here. (These details won't necessarily be meant to provide further details for those who already understand it, but to help document the “proof” paragraph a bit more clearly, for those who don't intuitively, instantly understand the (N-1) notation of that paragraph.)
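The heart of the counting argument can be checked numerically. The sketch below is an illustration of the general pigeon-hole idea, not a restatement of the FAQ's exact wording: there are 2^n bit strings of length n, but only 2^n - 1 bit strings that are strictly shorter, so no lossless scheme can map every length-n input to a distinct shorter output.

```python
def strings_of_length(n):
    """How many distinct bit strings have exactly n bits."""
    return 2 ** n


def strings_shorter_than(n):
    """How many distinct bit strings have fewer than n bits (lengths 0..n-1)."""
    return sum(2 ** k for k in range(n))


for n in (1, 2, 8, 16):
    inputs = strings_of_length(n)
    shorter_outputs = strings_shorter_than(n)
    print(f"n={n}: {inputs} inputs, {shorter_outputs} shorter outputs, "
          f"shortfall {inputs - shorter_outputs}")
```

The shortfall is always exactly one, so at least two inputs would have to share an output, and then decompression could not tell them apart.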
Instead, only some data sets are compressible. Somewhere I have read that most possible data is actually incompressible, and only “interesting” data is typically compressible. Wikipedia's article on Data Compression: section about Lossless compression says “Lossless compression is possible because most real-world data has statistical redundancy.” For example, a boring spreadsheet only has a chance of being very interesting to anybody if the numbers match up in some meaningful way. If all of the bits were truly random, the results would have no structure, and would quickly be boring. For instance, video noise (a.k.a. television “snow”) may look interesting initially, but each image being shown is typically fairly equal, being approximately as interesting as the other images that are shown. One exact layout of the colorations is not likely to be substantially more interesting than another different precise layout of how the data shows.
Fortunately, most data that people use regularly fits the criteria of being “interesting”, which mostly just means that the data is, somehow, structured and meaningful. This means that such data is often fairly compressible. The most common exceptions are data that is already intentionally compressed (because known structure repetition is likely to have already been removed), random (or, perhaps, sufficiently pseudo-random) data, or data that is already suitably small.
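The contrast between structured and random data is easy to observe with Python's standard `zlib` module. This is a small sketch with made-up inputs; the pseudo-random bytes come from a seeded generator so the run is repeatable, and the exact sizes will differ for other data.

```python
import random
import zlib

rng = random.Random(2016)   # arbitrary seed, chosen only for repeatability
noise = bytes(rng.getrandbits(8) for _ in range(65536))

# Structured data: every line follows the same pattern.
structured = b"".join(b"row %05d, total 00042\n" % i for i in range(3000))

noise_packed = zlib.compress(noise, 9)
structured_packed = zlib.compress(structured, 9)

print("noise:", len(noise), "->", len(noise_packed))
print("structured:", len(structured), "->", len(structured_packed))
```

The pseudo-random input comes out no smaller (the format's overhead can even make it slightly larger), while the patterned input shrinks to a fraction of its size.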
Naturally, one option is to just compress the data which is on the drive. (Cause the drive to contain a zip file.) However, another approach is to use software which will try to compress an entire drive, and then to still make the drive rather usable. For details, see: compressed hard disk files. After dispelling some common misperceptions about compressed hard disk files, see making CHD images if desired.
- Specific data
Wikipedia's page about Lossless Data Compression: section about methods discusses some different implemented algorithms.
Following are some notes that may help with some older platforms. Note that TOOGAM's Software Archive was meant for MS-DOS and platforms that were typically compatible with it, prior to emulation. Therefore, unless virtualization/emulation is used, many of the implementations on that page may not work as well in other environments, such as operating systems on a computer using an x64 chip.
- [#discompr]: Discs
For ISO 9660 images, see TOOGAM's Software Archive: Archivers for optical disc images (“Lossless compression” section). Namely, at the time of this writing, there is one interesting option: Error Code Modeler (“ECM”).
- Error Code Modeler
Error Code Modeler (“ECM”) is data-specific, taking advantage of the ISO 9660 format. Some examples of savings between 15% and 19% are documented on the ECM project's home page. Similar results have been experienced elsewhere. The resulting file might not be a whole lot smaller than a compressed archive, but the results remain about as compressible on a percentage basis. Instead of just compressing unnecessary data (which creates a compressed copy of the data), this completely throws out the unnecessary data. The remaining data is still just as compressible as ever. Using this before a general compression algorithm (such as Zip) may result in the smallest file.
Then again, a 3GB+ DVD has been known to save a very small amount of space (about 150KB, if memory serves correctly). So, this does not always save a lot of space. The main ECM page says, “The space saved depends on the number of sectors with unnecessary EDC/ECC data in them, which will depend on the specific type of” disc.
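To give a rough sense of why the sector type matters, the following sketch does some simple arithmetic on the standard layout of a raw CD-ROM Mode 1 sector (2352 bytes). This is only an illustration under that assumption: it is not ECM's actual algorithm, the names are made up for this example, and it is not meant to reproduce the figures documented on the ECM project's home page.

```python
# Byte layout of one raw CD-ROM Mode 1 sector (2352 bytes in total).
MODE1_SECTOR_FIELDS = {
    "sync": 12,         # fixed synchronization pattern
    "header": 4,        # sector address and mode byte
    "user_data": 2048,  # the actual file system payload
    "edc": 4,           # error detection code
    "reserved": 8,      # zero-filled gap
    "ecc": 276,         # error correction code (P and Q parity)
}


def sector_size(fields: dict) -> int:
    """Total number of bytes in one raw sector."""
    return sum(fields.values())


def edc_ecc_fraction(fields: dict) -> float:
    """Fraction of a raw sector occupied by the EDC and ECC fields."""
    return (fields["edc"] + fields["ecc"]) / sector_size(fields)


if __name__ == "__main__":
    print(f"raw sector size: {sector_size(MODE1_SECTOR_FIELDS)} bytes")
    print(f"EDC/ECC share:   {edc_ecc_fraction(MODE1_SECTOR_FIELDS):.1%}")
```

An image that stores only the 2048 bytes of user data per sector contains no EDC/ECC fields at all, so there is nothing of this kind to throw out.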
- [#execompr]: Executable code
- Overview of executable compression
One multiplatform solution is called UPX (Ultimate Packer for eXecutables). Unix has a long history of having multiple executables be identical by using symlinks. More recently, there are solutions for creating Unix executables, such as
(seen used with BSD executables) or BusyBox (seen used with Linux executables). “Busybox replacement project” web page: section called “Related work” lists some other similar projects.
Often executable compression works flawlessly. However, some executables may fare worse. Executables that use “overlays” or external code libraries (“shared objects” in Unix, “Dynamic Link Library” files in Microsoft Windows) may either leave quite a bit of code uncompressed, or be less likely to work well. The good news is that when things break, they will most commonly break immediately: the program won't run right, and the failure will be so substantial that a program which outputs to the screen won't even get to output its first character before an error is shown. This is good news because it leaves little room for doubt. If things work at first, then they are very likely (statistically speaking) to work perfectly well.
For DOS (and perhaps Microsoft Windows), TOOGAM's Software Archive: Archivers: “Executable Compressors” section may list some additional options. Some (and perhaps all?) of those options may be older options that are simply inferior to UPX.
- Ultimate Packer for Executables
An overview of options is provided within the program. The following may show over a hundred lines of output:
The following is a method that may work rather nicely on multiple operating systems:
upx --ultra-brute --all-filters --all-methods -v -f
(According to UPX news, version 1.90 beta introduced --all-methods. According to GitHub: UPX: filter.txt, you can “try most of them with "--all-filters".” One would think “all” would mean all, but apparently not.)
There may certainly be additional variations. As an example, for Microsoft Windows executables, a --compress-icons=3 flag may be available and adding a second filespec of
may be quite impactful. So:
upx --ultra-brute -v -f --all-filters --all-methods --compress-icons=3
As another example of an option, MS-DOS executables may have a --8086 flag to enable more compatibility.
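For those who would rather script this than type the flags each time, the following is a minimal sketch in Python. The function names and parameters (build_upx_command, pack_and_test, windows_icons, dos_8086) are invented for this example and are not part of UPX. It assumes that a upx executable is on the PATH, that the DOS option is spelled --8086 as in UPX's own documentation, and that upx -t (UPX's option for testing a compressed file) is available in the installed version.

```python
import shutil
import subprocess
from typing import List


def build_upx_command(target: str, windows_icons: bool = False,
                      dos_8086: bool = False) -> List[str]:
    """Build the argument list for the aggressive UPX invocation shown above."""
    cmd = ["upx", "--ultra-brute", "--all-filters", "--all-methods", "-v", "-f"]
    if windows_icons:
        cmd.append("--compress-icons=3")  # for Microsoft Windows executables
    if dos_8086:
        cmd.append("--8086")              # for MS-DOS executables
    cmd.append(target)
    return cmd


def pack_and_test(target: str, **options: bool) -> bool:
    """Pack the target in place, then ask UPX to test the result.

    UPX overwrites the file it packs, so run this on a copy of the executable.
    """
    if shutil.which("upx") is None:
        raise FileNotFoundError("upx was not found on PATH")
    if subprocess.run(build_upx_command(target, **options)).returncode != 0:
        return False
    return subprocess.run(["upx", "-t", target]).returncode == 0


if __name__ == "__main__":
    # Only prints the command line; nothing is packed here.
    print(" ".join(build_upx_command("example.exe", windows_icons=True)))
```

Passing the test step only shows that UPX can unpack the file again. Running the packed program is still the real check, since (as noted above) a broken result usually fails immediately.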
- Video (sight)
After a multi-year process, *.PNG files are now widely supported, including support for transparencies, on popular web browsers and by other software. Converting a graphics file to PNG format (and then using optimizing tools to shrink *.PNG files) is relatively painless, and in most cases should be done. The only widely used alternative approach is to introduce quality loss by using a format such as JPEG.
- Video (moving video)
- TOOGAM's Software Archive: Multimedia Files: Audio Codecs