Compressing Bits

Hybrid: (lossy+correction)

TOOGAM's Software Archive: Multimedia Files: Audio Codecs has a section about compression techniques that gain potentially high compression ratios by making some data loss optional. A fairly small but usable file is created. Also, a “correction” file can be used to add in whatever data was lost during the process of obtaining the nicer compression ratio. The usable (small, lossy) file, combined with the correction file, is sufficient to generate the original file, so no loss is forced by the process.
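
The idea can be sketched in a few lines of Python. This is only an illustration and not any particular codec: the "lossy" step here is plain quantization, and the function names and the step size of 16 are assumptions made up for the example.

```python
def split_lossy_and_correction(samples, step=16):
    """Split integer samples into a coarse (lossy) part and a correction part."""
    # The lossy part keeps only multiples of `step`; it is usable on its own.
    lossy = [(s // step) * step for s in samples]
    # The correction part records exactly what the lossy part threw away.
    correction = [s - q for s, q in zip(samples, lossy)]
    return lossy, correction


def rebuild_original(lossy, correction):
    """Combine the lossy part and the correction part into the exact original."""
    return [q + c for q, c in zip(lossy, correction)]


samples = [0, 3, 17, 250, 255, 128, 127, 64]
lossy, correction = split_lossy_and_correction(samples)
restored = rebuild_original(lossy, correction)
print(lossy)
print(correction)
print(restored == samples)
```

Someone who only needs the small file keeps `lossy`; someone who needs the original keeps both parts, and neither person has to store a second full copy.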

This seems to be a far better approach than forcing a person either to accept loss or to keep an additional copy of the entire original set of data.

Note that TOOGAM's Software Archive was meant for MS-DOS and platforms that were typically compatible with it, prior to emulation. Therefore, unless virtualization/emulation is used, many of the implementations listed at TOOGAM's Software Archive: Multimedia Files: Audio Codecs may not work as well in other environments, such as operating systems on a computer using an x64 chip.

Lossless data
General Data
Newer option(s)

(This section was made for solutions that are relatively new, or at least newly discovered.)

brotli
effective

The good news is that the compressed data is small.

unrestricted

It is also “open source”.

support

The format has been touted as being excellent for web content. With support by major web browsers (both Chrome and Firefox) and an IETF document resembling an RFC document, the format seems to have obtained some stability and popularity beyond the PPM format.

There is also (at least some) support by nginx, Apache, and IIS 7 (64-bit).

The bad news about brotli is that it is slow at compressing. It also doesn't seem to support keeping track of metadata, such as filenames. That is a minor problem, as such metadata can be captured using another tool, like tar (a traditional way to store metadata in Unix), and then that archive can be compressed.

availability

Google has released the source code. Trying to find an officially created executable for Microsoft Windows was an unfruitful search, although some programs by third parties are available. An OpenBSD port of Brotli has been started, but is outdated (version 0.3.0 when Brotli version 0.6.0 has been seen), and isn't part of the OpenBSD package set (nothing is at OpenPorts.se) as of this writing (June 17, 2016).

Still, this review is being written in June 2016, less than a year since Google's announcement. So, hopefully some of these limits will dissipate over time.

It seems there is no default extension, but people have been favoring a six-letter extension (“.brotli”).

To see help:
brotli -h

To compress:

brotli -q 11 -w 24 -v -f -i uncompressed_file -o compr.brotli

Depending on the version, some of that may need to be changed. e.g., using --input and --output, or --in and --out. Check the command line help for details. Likewise, expanded parameters may need to be --quality, --verbose, and --force. Window size tends to be -w but may not be universally supported. “--repeat iters” has also been seen.

To uncompress:

brotli -d -i compr.brotli -o uncompressed_file
Widely compatible software
Intro

Zip files are wonderful, mostly due to their very widespread support. Perhaps one of the most prominent examples is Info-Zip 5.12's UnZipSfx.Exe file, which was only about 12KB and required very little RAM. Info-Zip's UnZip page had, for a long time, touted the software as likely being “The Third Most Portable Program in the World!” (This was before some other competitors joined, such as the Linux kernel, DOOM, and clones of games like Bastet (including earlier incarnations such as TETЯIS, the variation of the title used by Atari... or shall we say Ataяi?).) Having freely available source code, early on, surely helped cement Zip files as a widely supported standard.

7-Zip

7-Zip can often create some of the smallest possible Zip-compatible files (and gzip-compatible files), and also supports using the LZMA2 compression algorithms by using a format that is widely called the 7-Zip format, which uses the *.7z extension.

For zip files, creating very small files may be done with something like:

7za a -tzip -mx=9 -mfb=256 -mpass=15 filename.zip filespec
Required variations

On Microsoft Windows systems, 7-Zip does not typically place itself in the path by default, and so a longer reference to 7-Zip's executable may be needed.

"C:\Program Files\7-Zip\7z.exe" a -tzip -mx=9 -mfb=256 -mpass=15 filename.zip filespec

On a 64-bit version of Microsoft Windows, the following may be needed if the 32-bit version of 7-Zip is installed, and if the 64-bit version of 7-Zip is not installed:

"C:\Program Files (x86)\7-Zip\7z.exe" a -tzip -mx=9 -mfb=256 -mpass=15 filename.zip filespec

Actually, the maximum value for the -mfb may be a tiny bit higher: 257 or 258 (depending on what file format is used). The created file would be slightly smaller if the following is supported by the 7-Zip command line, and is then used:

7za a -tzip -mx=9 -mfb=258 -mpass=15 filename.zip filespec

For those who are new to command line user interfaces, one of the trickiest parts may be to simply find the exact name of the executable to use. The following example shows how this might be doable in Unix:

7za a -tzip -mx=9 -mfb=256 -mpass=15 filename.zip

This may Zip up all files in the current directory and subdirectories. If that is not what is wanted, just add, to the end of the command line, any filespec (“file specification”) that is needed. The following example shows such filespecs being used, as well as how to reference 32-bit 7-Zip if it is installed on a machine using a 64-bit version of Microsoft Windows:

"C:\Program Files (x86)\7-Zip\7z.exe" a -tzip -mx=9 -mfb=256 -mpass=15 filename.zip example.txt filetwo.png moredata.* yetmore.txt

This is all based on TOOGAM's online help for the 7-Zip command line.

Licensing: The 7-Zip's home page for the Lempel-Ziv-Markov chain algorithm (“LZMA”) Software Development Kit (“SDK”) states that one of the features of version 4.62 is that “LZMA SDK is placed in the public domain.” 7-Zip itself uses LGPL 2.1(+), except that the 7z.dll file has an exception for the unRAR support, as noted by 7-Zip license. The differences are only intended to affect people who wish to create software or alter how software behaves. As clearly noted on 7-Zip's license, “You can use 7-Zip on any computer, including a computer in a commercial organization. You don't need to register or pay for 7-Zip.”

Further reductions
Removing directories

To further shrink zip files: After zipping up data, the zip file might be shrinkable by removing unneeded data about subdirectories. The savings will be small (e.g. 116 bytes per subdirectory). (With 7-Zip, this unneeded data is in the zip file if compressing subdirectories and their contents, but specifying a filename from a subdirectory will not cause the unneeded directory entry.) Listing the contents of a zip file can show whether a directory is part of the zip file. Using text mode (command line) software will probably show this more clearly than software that renders the zip file contents with a graphical interface.

If subdirectories were used, then the zip file could be made a bit smaller by deleting the subdirectory entries. The files will still be stored within subdirectories, and so software will typically create the necessary subdirectories for files to be extracted to, but the zip file won't contain the metadata (like file times) for the subdirectories. If that metadata is not desired, removing it could make the zip file a tiny (generally insignificant) bit smaller. This process can be done using Info-Zip's Zip, by using: “zip -d filename.zip dirName”
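
The effect can be checked without any external tool. The following is a minimal sketch using Python's standard zipfile module rather than Info-Zip; the archive contents and the function names are made up for the example, and the exact number of bytes saved will vary.

```python
import io
import zipfile


def build_zip_with_dir_entry():
    """Create an in-memory zip that has an explicit entry for a subdirectory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("docs/", b"")  # explicit directory entry (metadata only)
        zf.writestr("docs/readme.txt", b"hello " * 100)
    return buf.getvalue()


def strip_dir_entries(zip_bytes):
    """Copy every entry except directory entries into a new zip."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.is_dir():
                continue  # skip the directory's own entry
            dst.writestr(info, src.read(info.filename))
    return out.getvalue()


original = build_zip_with_dir_entry()
smaller = strip_dir_entries(original)
print(len(original), len(smaller))
print(zipfile.ZipFile(io.BytesIO(smaller)).namelist())
```

The file inside the subdirectory keeps its full path, so extraction still recreates the directory.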

Use Zopfli

Zopfli's source code is available.

Using Zopfli in advzip

For users of Microsoft Windows, or DOS, Zopfli is built into AdvZip, which is part of AdvanceCOMP (or, at least, new enough versions). Grab AdvanceCOMP (AdvanceCOMP Releases provides the more direct hyperlinks, which can also be found by scrolling down sufficiently on the AdvanceCOMP Download section).

Extract AdvanceComp and run:

advzip -rkzpi 10000 file.zip

There is nothing special about the number 10,000. It simply helps to have larger numbers, so that may provide better results than 1,000.

Other tools

There may be other ways to shrink zip files even further. DeflOpt has been known to accomplish reductions of files made by 7-Zip. The Zopfli issue titled “DeflOpt beats it” also mentions Defluff.

Other solutions for lossless compressing

Other options for compressing files exist. Some of these options may even make (typically slightly) smaller archives. However, they may have some disadvantage(s), such as not being quite as freely available/distributable/modifiable or being far less compatible. For instance, a PAQ program has often been released with some tiny improvement, yet is often not even compatible with previous versions of the same program. (A ZPAQ specification may improve in that respect.) Also, in order to decompress the data, some of these options may require very large amounts of RAM compared to a standard Zip file.

MaximumCompression.com summary of Multiple Files (sorted by Compression Ratio) shows some of the results created with compression programs listed at MaximumCompression.com.

TOOGAM's Software Archive: Archivers lists some software to help handle a variety of formats.

Details about the PAQ programs (including the ZPAQ standard) are on Matt Mahoney's site about data compression.

gzip

The gzip standard seems to generally provide compression inferior to Zip. Its one advantage is that it is supported fairly widely, including by several web servers. RFC 2616 (HTTP 1.1 standard) section 3.5: “Content Coding” specified compression formats including gzip. However, the “deflate” content coding has compatibility issues: Wikipedia's article for Gzip: “Derivatives and other users” section states, “a server has no way to detect whether a client will correctly handle” a specific deflate implementation, as some code (including MS IE 6 through 8) supports RFC 1951 (“DEFLATE Compressed Data Format Specification version 1.3”) instead of correctly supporting RFC 1950 (“ZLIB Compressed Data Format Specification version 3.3”). (Such issues may leave the Zip format as being the best widely-compatible format, although even Zip has had multiple compression methods and some software will not support some of the more obscure variations.)
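
The difference between the three wrappers is easy to see with Python's standard zlib module, where the wbits argument selects the framing. This is a sketch for illustration only; the payload and the helper name are assumptions, not anything a web server or browser actually runs.

```python
import zlib


def deflate_with_wrapper(data, wbits):
    """Compress data, with wbits selecting raw, zlib, or gzip framing."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return compressor.compress(data) + compressor.flush()


payload = b"Content-Encoding example. " * 50

raw_stream = deflate_with_wrapper(payload, -15)   # RFC 1951: bare DEFLATE data
zlib_stream = deflate_with_wrapper(payload, 15)   # RFC 1950: 2-byte header + Adler-32
gzip_stream = deflate_with_wrapper(payload, 31)   # RFC 1952: gzip header + CRC-32 + size

print(len(raw_stream), len(zlib_stream), len(gzip_stream))
print(zlib_stream[:2].hex(), gzip_stream[:2].hex())

# A decoder must be told (or must guess) which framing it is looking at.
print(zlib.decompress(raw_stream, -15) == payload)
print(zlib.decompress(zlib_stream, 15) == payload)
print(zlib.decompress(gzip_stream, 31) == payload)
```

The compressed body is the same in all three cases; only the bytes wrapped around it differ, which is why a client expecting one framing can fail on another.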

The gzip format supports only a single file. If there is a desire to store multiple files in a compressed archive, the common approach is to first bundle the files using another archive format, most commonly the tar format, and then compress the resulting archive with gzip. Upon doing so, the file extension of .tgz (so the filename looks like filename.tgz) is commonly used. On filesystems that support long filenames, another common file extension sequence is .tar.gz (so the filename looks like filename.tar.gz), which indicates the exact same file format. So, those extensions can be used interchangeably.
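
As a rough illustration of the tar-then-gzip arrangement, the following sketch uses Python's standard tarfile module to build such an archive in memory. The file names and contents are invented for the example.

```python
import io
import tarfile


def make_tgz(files):
    """Bundle a {name: bytes} mapping into one gzip-compressed tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_tgz(tgz_bytes):
    """Return the regular files stored in a gzip-compressed tar archive."""
    with tarfile.open(fileobj=io.BytesIO(tgz_bytes), mode="r:gz") as archive:
        return {
            member.name: archive.extractfile(member).read()
            for member in archive.getmembers()
            if member.isfile()
        }


files = {"notes.txt": b"first file\n", "data/values.csv": b"a,b\n1,2\n"}
tgz = make_tgz(files)
print(tgz[:2].hex())          # gzip magic number on the outside
print(sorted(read_tgz(tgz)))  # tar member names on the inside
```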

Info-Zip has released gzip (for many platforms).

7-Zip also supports gzip. The command line support is quite similar to handling Zip archives: Simply use -tgzip instead of -tzip.

Furthermore, the gzip format supports not storing the filename. With 7-Zip, a *.gz file can be created without the file's metadata by using something like:

cat filename| 7za a -tgzip -mx=9 -mfb=258 -mpass=15 -si filename.gz

(On DOS-based systems, like the traditional Microsoft Windows command line, there is typically no built-in cat command; the type command is the closest equivalent and may be used instead. Also, the command may be 7z.exe instead of 7za.)
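
Whether a given *.gz file carries a stored filename can be checked by looking at the flag byte of its header. The sketch below uses Python's standard gzip module instead of 7-Zip; the helper names and the sample filename are assumptions for illustration.

```python
import gzip
import io

FNAME_FLAG = 0x08  # bit in the gzip header's FLG byte, per RFC 1952


def gzip_bytes(data, stored_name=""):
    """Compress data to gzip format, optionally recording an original filename."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=stored_name, mode="wb", fileobj=buf, mtime=0) as gz:
        gz.write(data)
    return buf.getvalue()


def has_stored_name(gz_bytes):
    """Report whether the gzip header says an original filename is present."""
    return bool(gz_bytes[3] & FNAME_FLAG)


data = b"some text worth compressing " * 20
with_name = gzip_bytes(data, "report.txt")
without_name = gzip_bytes(data)

print(has_stored_name(with_name), has_stored_name(without_name))
print(len(with_name) - len(without_name))
```

The saving is only the length of the name plus a terminating byte, so this matters mostly when many tiny files are compressed.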

Wikipedia's article for Gzip: “Derivatives and other users” section states, “AdvanceCOMP and 7-Zip can produce gzip-compatible files, using an internal DEFLATE implementation with better compression ratios than gzip itself”. (Hyperlinks removed and styling added to quoted text.) AdvanceComp with advpngidat: documentation notes “The AdvanceCOMP distribution also includes the advdef tool. This tool performs the same recompression optimization on .gz files”.

lzma

Based on reading from Brian Lindholm's “New Options in the World of File Compression” article at Linux Gazette, it appears that this file extension probably indicates the file was made with “lzma_alone”, which is related to the 7-ZIP SDK (which entered “public domain” status with version 4.62). The files it creates may be smaller than “xz” files, but also contain fewer details that can help with integrity checking.

The article also indicates that the software may be a bit more challenging to use, including build difficulty and documentation clarity. For those who don't care about something like a 17KB difference out of about 64-65 MB of uncompressed data, the “xz” format may be a bit easier to work with.
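
Python's standard lzma module can write both containers, which makes the trade-off easy to try out. This is a sketch under the assumption that the module's FORMAT_ALONE output corresponds to the legacy “.lzma” files discussed here; the sample data is invented.

```python
import lzma


def compress_both_ways(data):
    """Compress the same data into the legacy .lzma container and the .xz container."""
    # Legacy ".lzma" container: a short header and no integrity check.
    legacy = lzma.compress(data, format=lzma.FORMAT_ALONE)
    # ".xz" container: more framing, including a CRC64 integrity check.
    xz = lzma.compress(data, format=lzma.FORMAT_XZ, check=lzma.CHECK_CRC64)
    return legacy, xz


data = b"The quick brown fox jumps over the lazy dog.\n" * 500
legacy, xz = compress_both_ways(data)

print(len(legacy), len(xz))
print(xz[:6])  # the xz magic bytes
print(lzma.decompress(legacy) == data, lzma.decompress(xz) == data)
```

On typical input the legacy file comes out a little smaller, since it omits the check and index that the xz container adds.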

xz

Wikipedia's article for “xz”, “History” section says, “Although the original 7-Zip program, which implements LZMA2 compression, is able to produce small files at the cost of speed, it also created its own unique archive format which was made primarily for Windows and did not support Unix” as well. Brian Lindholm's “New Options in the World of File Compression” article at Linux Gazette notes, “Unfortunately, with its Windows-based roots, the .7z file format made no provision for Unix-style permissions, user/group information, access control lists, or other such information. These limitations are show-stoppers for people doing backups on multi-user systems.”

So, basically xz is like 7-Zip's LZMA2 support, but with nicer support for some Unix filesystem features.

XZ Utils home page notes, “XZ Utils are the successor to LZMA Utils.” “The most interesting parts of XZ Utils (e.g. liblzma) are in the public domain. You can do whatever you want with the public domain parts.”

bzip2

The BZip2 format gained some popularity for having better compression (as measured by compressed files being smaller) than the gzip and zip software that was widely available at the time. Some hindrances to more widespread adoption included incompatibility with existing formats, and slower speed on older hardware. The slower speed is likely negligible with today's computing resources, but was more significant in 1996 when Bzip2 first came out.

At this time, Bzip2 is unlikely to gain popularity, as people who really want maximum compatibility are more likely to gravitate to Zip or gzip, and people who really want smaller files are more likely to gravitate to newer formats, like those that use LZMA2.
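
A quick way to compare the three families on a particular kind of data is to run all of them over the same bytes. The following sketch uses Python's standard zlib, bz2, and lzma modules as stand-ins for the command line tools; the sample text is made up, and results on other data may rank differently.

```python
import bz2
import lzma
import zlib


def compressed_sizes(data):
    """Return the compressed size of data under three common algorithms."""
    return {
        "deflate (zip/gzip family)": len(zlib.compress(data, 9)),
        "bzip2": len(bz2.compress(data, 9)),
        "lzma2 (xz container)": len(lzma.compress(data, preset=9)),
    }


sample = b"".join(
    b"line %d: the quick brown fox jumps over the lazy dog\n" % i
    for i in range(2000)
)
sizes = compressed_sizes(sample)
for name, size in sizes.items():
    print(name, size, "of", len(sample))
```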

tar

Actually, tar does not compress files. However, it does create file archives similar to some compression formats like ZIP files. This allows a person to have a single file that can be extracted to multiple files, providing many of the same conveniences as other file archiving formats. The tar command typically supports extensions from the “IEEE P1003 “Portable Operating System” standard” that are related to Unix-style filesystem options, “including owner and group names, and support for named pipes, fifos, continuous files, and block and character devices.” (That quote came from the manual page found from an archive at Computerized Operational Department #5 Public Domain Software: Unix tar.) Support for these Unix-style filesystem options has made the tar format quite popular with Unix users.

The tar program

The tar program supports the “tape archive” (“*.tar”) file format. Different versions of the program may also support additional file formats, perhaps most famously the tar+gzip format (“*.tgz” or “*.tar.gz” files). There may be various compression options, such as -z (lowercase) to use the gzip command. Other command line options might include (uppercase) -Z to use the compress and uncompress commands, (lowercase) -j to use the bzip2 command, and (uppercase) -J to use the xz command. However, the precise formats supported do vary between different tar implementations. (The tar command's -j option has been known to do something else, with some tar implementation.) In some cases, the tar command may run a helper program even if the relevant command line switch isn't provided. Because there have been different behaviors over the years, checking the system's “man” page is recommended.

In some cases, the first hyphen might be optional. Reading the manual pages is recommended.

Some versions of tar may strip leading slashes when extracting files. This may be the default behavior. For instance, with OpenBSD's tar, the -P command line option preserves those slashes. According to pdtar documentation, the -A option avoided using absolute names (by stripping the leading slashes). With Solaris 10, there is no way to strip the slashes while extracting with the built-in tar command. (The available chroot command could be used to work around this.) Since details vary, reading the manual pages is recommended. Be careful when extracting tar files with unknown file structures. If the tar program's documentation doesn't mention how leading slashes are handled, a person can simply use the -t to show the “table of contents” of the tar file. If leading slashes exist, they should be visible when performing the check. If they don't exist, then command line parameters related to leading slashes may not be needed anyway.
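
The “table of contents” check can also be scripted. The sketch below uses Python's standard tarfile module rather than a tar command; it builds a small test archive in memory (the member names are invented) and lists any members whose names begin with a slash, without extracting anything.

```python
import io
import tarfile


def make_tar(names):
    """Build an uncompressed in-memory tar archive with one small file per name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as archive:
        for name in names:
            data = b"example\n"
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def absolute_members(tar_bytes):
    """List member names that start with a leading slash, without extracting."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:") as archive:
        return [name for name in archive.getnames() if name.startswith("/")]


tar_bytes = make_tar(["/etc/example.conf", "docs/readme.txt"])
print(absolute_members(tar_bytes))
```

An empty result suggests that options related to leading slashes are not needed for that archive.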

Wikipedia's article for tar (computing): “Key implementations” section mentions multiple implementations.

To extract a file, use: tar -xvvf filename.tar

Note that the “f” option must come last in the group of option letters, because the text after the “f” is expected to be a filename.

Other option(s)

The 7-Zip program supports tar.

Based on reviewing Luke's answer to dreftymac's StackOverflow.com question (and Superrole's 8:47 comment about -aoa, provided to Luke's answer to a SuperUser.com question) and various answers to Philipp's “Unix and Linux” Stack Exchange question, it appears the following ought to work:

tar+gzip files
7z e -so "filename.tgz" | 7za x -aoa -si -ttar -o"outDir"
tar+bzip2 files
7za e -so "filename.tb2" | 7z x -si -ttar
Compression libraries

A person who wishes to do computer programming can use some pre-created software to easily add data compression features to another program. Some details are available from further discussion at: Techn's: coding: compressing bits.

Other topics related to lossless compressing

Compression challenges: Credit goes to Matt Mahoney's site about data compression for SHARND, which creates rather incompressible data. The site also has a program called Better Archiver with Recursive Functionality (BARF), whose page says “Of course, a properly designed test” ... “is not susceptable to such tricks.” There have been other invalid, fake compressors that have been far less forthcoming about their trickery.

For example, a program that stores data inside the “deleted space” on a hard drive might initially appear to be successfully storing data without legitimately using up hard drive space. However, any operation that threatens data undeletion, such as simply writing any data (and, perhaps more likely, disk defragmentation), would actually erase the hidden data needed for restoration. This trick is also unlikely to be possible to implement using remote data storage.

The “pigeon-hole principle” “counting argument” described by Usenet comp.compression FAQ 9.2 keeps humanity from being able to achieve lossless data compression for all sets of data. (Later questions in the FAQ may document some false claims.)
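
The counting itself is simple enough to check by machine. The following sketch is an illustration of the general argument, not a transcription of the FAQ's proof: it counts the bit strings of exactly n bits and the bit strings that are strictly shorter. The function names are made up for the example.

```python
def strings_of_length(n_bits):
    """Number of distinct bit strings that are exactly n_bits long."""
    return 2 ** n_bits


def strings_shorter_than(n_bits):
    """Number of distinct bit strings with length 0 through n_bits - 1."""
    return sum(2 ** k for k in range(n_bits))  # equals 2**n_bits - 1


def must_collide(n_bits):
    """True if some inputs would have to share an output if every input shrank."""
    return strings_of_length(n_bits) > strings_shorter_than(n_bits)


for n in (1, 8, 16):
    print(n, strings_of_length(n), strings_shorter_than(n), must_collide(n))
```

There is always one more input than there are shorter outputs, so a compressor that shrinks every n-bit input must map two different inputs to the same output, and then decompression cannot tell them apart.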

In the future, some further details about the counting argument may be located and/or documented here. (These details won't necessarily be meant to provide further details to those who understand it, but to help document the “proof” paragraph a bit more clearly, for those who don't intuitively, instantly understand the (N-1) notation of that paragraph.)

Instead, only some data sets are compressible. Somewhere I have read that most possible data is actually incompressible, and only “interesting” data is typically compressible. Wikipedia's article on Data Compression: section about Lossless compression says “Lossless compression is possible because most real-world data has statistical redundancy.” For example, a boring spreadsheet only has a chance of being very interesting to anybody if the numbers match up in some meaningful way. If all of the bits were truly random, the results would have no structure, and would quickly be boring. For instance, video noise (a.k.a. television “snow”) may look interesting initially, but each image being shown is typically fairly equal, being approximately as interesting as the other images that are shown. One exact layout of the colorations is not likely to be substantially more interesting than another different precise layout of how the data shows.

Fortunately, most data that people use regularly fits the criteria of being “interesting”, which mostly just means that the data is somehow structured and meaningful. This means that such data is often fairly compressible. The most common exceptions are data that is already intentionally compressed (because known structure repetition is likely to have already been removed), random (or, perhaps, sufficiently pseudo-random) data, or data that is already quite small.
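
This difference is easy to observe. The sketch below compresses two made-up inputs of similar size with Python's standard zlib module: one that is highly structured and one that is pseudo-random. The seed, sizes, and sample record are arbitrary choices for the example.

```python
import random
import zlib


def compression_ratio(data):
    """Compressed size divided by original size (lower means more compressible)."""
    return len(zlib.compress(data, 9)) / len(data)


# Structured data: the same record repeated many times.
structured_data = b"2016-06-17,widget,42\n" * 3000

# Pseudo-random data from a seeded generator, so the run is repeatable.
rng = random.Random(2016)
random_data = bytes(rng.getrandbits(8) for _ in range(65536))

print(round(compression_ratio(structured_data), 4))
print(round(compression_ratio(random_data), 4))
```

The structured input shrinks to a tiny fraction of its size, while the pseudo-random input does not shrink at all and even picks up a few bytes of framing overhead.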

Drives

Naturally, one option is to just compress the data which is on the drive. (Cause the drive to contain a zip file.) However, another approach is to use software which will try to compress an entire drive, while still leaving the drive rather usable. For details, see: compressed hard disk files. After dispelling some common misperceptions about compressed hard disk files, see making CHD images if desired.

Specific data

Wikipedia's page about Lossless Data Compression: section about methods discusses some different implemented algorithms.

Following are some notes that may help with some older platforms. Note that TOOGAM's Software Archive was meant for MS-DOS and platforms that were typically compatible with it, prior to emulation. Therefore, unless virtualization/emulation is used, many of the implementations on that page may not work as well in other environments, such as operating systems on a computer using an x64 chip.

Discs

For ISO 9660 images, see TOOGAM's Software Archive: Archivers for optical disc images (“Lossless compression” section). Namely, at the time of this writing, there is one interesting option: Error Code Modeler (“ECM”).

Error Code Modeler

Error Code Modeler (“ECM”) is data-specific, taking advantage of the ISO 9660 format. Some examples of size reductions between 15% and 19% are documented on the ECM project's home page. Similar results have been experienced elsewhere. The resulting file might not be a whole lot smaller than a compressed archive, but the results remain about as compressible on a percentage basis. Instead of just compressing unnecessary data (which creates a compressed copy of the data), this completely throws out the unnecessary data. The remaining data is still just as compressible as ever. Using this before a general compression algorithm (such as Zip) may result in the smallest file.

Then again, a 3GB+ DVD has been known to save a very small amount of space (about 150KB, if memory serves correctly). So, this does not always save a lot of space. The main ECM page says, “The space saved depends on the number of sectors with unnecessary EDC/ECC data in them, which will depend on the specific type of” disc.

Executable code
Overview of executable compression

One multiplatform solution is called UPX (Ultimate Packer for eXecutables). Unix has a long history of having multiple executables be identical by using symlinks. More recently, there are solutions for creating Unix executables, such as crunchgen (seen used with BSD executables) or BusyBox (seen used with Linux executables). “Busybox replacement project” web page: section called “Related work” lists some other similar projects.

Often executable compression works flawlessly. However, some executables may work less flawlessly. Executables that use “overlays” or external code libraries (“shared objects” in Unix, “Dynamic Link Library” files in Microsoft Windows) may either leave quite a bit of code uncompressed, or be less likely to work well. The good news is that when things break, they will most commonly break rather immediately: the program won't run right, and this will be so substantial that a program which outputs to the screen won't even get to output its first letter/character before an error is shown. The reason this is good news is because it leaves little room for doubt. If things work, then they are very likely (statistically speaking) to work (perfectly) well.

For DOS (and perhaps Microsoft Windows), TOOGAM's Software Archive: Archivers: “Executable Compressors” section may list some additional options. Some (and perhaps all?) of those options may be older options that are simply inferior to UPX.

Ultimate Packer for Executables

An overview of options is provided within the program. The following may show over a hundred lines of output:

upx --help

The following is a method that may work rather nicely on multiple operating systems:

upx --best --ultra-brute -v -f *.exe

There may certainly be additional variations. As an example, for Microsoft Windows executables, a --compress-icons=3 flag may be available and adding a second filespec of *.dll may be quite impactful. So:

upx --best --ultra-brute -v -f --compress-icons=3 *.exe *.dll
As another example of an option, MS-DOS executables may have a --8086 flag to enable more compatibility.

Video (sight)
Graphics

After a multi-year process, *.PNG files are now widely supported, including support for transparencies, on popular web browsers and by other software. Converting a graphics file to PNG format (and then using optimizing tools to shrink *.PNG files) is relatively painless, and in most cases should be done. The only widely used alternative approach is to introduce quality loss by using a format such as JPEG.

Video (moving video)

See: TOOGAM's software archive: Multimedia files: Lossless video

Audio
TOOGAM's Software Archive: Multimedia Files: Audio Codecs