The World Wide Web: File/Data Transferring

HTTP(S): The World Wide Web

This site is basically about transferring data using the World Wide Web. This includes setting up web servers that serve content such as static files. For other details, see the sections about websites, the section about professionally providing technical services (e.g. including making a website), and file transferring methods.

Web clients
Web browsers

Some creators of operating systems are the same organizations that have released web browsers. Such operating systems are likely to come bundled with a web browser provided by that software maker. Examples include Google Chrome being bundled with Chromium OS/ChromeOS, Apple's Safari being bundled with Mac OS X, and Microsoft's Internet Explorer being included with Microsoft Windows (perhaps starting with Windows 98). Another noteworthy code base is the code behind Iceweasel, which is based on Mozilla Firefox. Google has also gained popularity with its web browser named Chrome.

Some operating systems may come with other web browsers. For a while, OpenBSD came with Lynx (and this was mentioned by OpenBSD FAQ: What is included with OpenBSD (FAQ 1.8), which listed Lynx). Running lynx would run that web browser.

Other web browser software is available from TOOGAM's software archive: Web browsers

Further information about browsing such files may be seen in the section about web browsing.

Software designed to get (even if it doesn't render) a file

One way to accomplish this is to use WebDAV, which is a standard that uses HTTP. Using this might (or might not) be a bit more complex, so it is described in another section: section about WebDAV clients.

[#curl]: cURL

This software may be downloaded from cURL's download page. In the right frame of the page, at the top of the “Packages” section, there is a list of operating systems that can be used to jump to a specific part of the page, like cURL downloads for 32-bit Microsoft Windows. Before downloading, check the available version numbers, and feature sets. (Some executables don't support using the SCP protocol over SSH, while other binary releases may not have SSL support enabled.)

This software may be similar to the older WGet, but has some advantages. Licensing may be more flexible, and the libcurl code may be more useful for embedding in other projects. The cURL software also supports a number of protocols. People unfamiliar with either product may gain more benefit by learning how to use cURL, when cURL provides the desired functionality.

Command line parameters may be commonly desired. For instance, a command line parameter is needed to have cURL output to a file, instead of just showing the file's contents on standard output. This is discussed in Daniel Stenberg's (cURL author) blog post on cURL default behavior, where cURL's author notes, “I released curl for the first time on March 20, 1998: the call was made. The default was set. I will not change a default and hurt millions of users.” (Emphasis in original material that is being quoted.)

This guide has not yet completed documentation of cURL. After reading the manual page, some of the following options seemed to look fairly interesting:

curl -LOvRJC - URL

Although the space and hyphen/dash should come after the letter C, some parameters can be re-arranged. So instead of the “lover see” abbreviation, the options can be arranged “clover” style, as follows:

curl -C - -L -O -v -R -J URL

Or, to specify an output filename:

curl -C - -L -v -R -o outputfile URL

“ -C - ” is for resuming (continuing). The second hyphen in that sequence specifies to just continue at the last byte, rather than at a specified byte number. If the file already exists, it will be resumed if possible. If the file already exists (fully downloaded), and the server does not support resuming, then curl exits with error code 33.

The above command is identical to:

curl --continue-at - --location --verbose --remote-time --output outputfile URL

Or, instead of “--output outputfile”, you could use “--remote-name” (a.k.a. “-O”), and possibly also “--remote-header-name” (a.k.a. “-J”, an option specific to using HTTP(S)).

Here is a description of some options that may be useful for many simple transfers. Some of these are included in the above samples.

  • “ -o outputFile ” outputs to the specified local file. Without this or -O, the default is to output directly to the terminal/console's standard output stream. That may be redirected without problem, although redirecting all output could also redirect error messages.
  • -O (or --remote-name) specifies that output should be stored to a filename that is specified by the web server. Note that this relies on the web server providing a filename, which web servers don't always do. For instance, at the time of this writing, visiting http://toogam.com did not result in the web server providing a filename, although visiting http://toogam.com/index.htm did. (So, different results occurred even though the actual web page downloaded would be identical either way.) If the web server does not provide a filename, curl will output an error message, “curl: Remote file name has no length!” (Then curl outputs “curl: try 'curl --help' or 'curl --manual' for more information”, and blank line, and then curl exits, possibly using exit/return/error code of 23.)
  • --compressed requests that the server send a compressed copy of the document (when the server supports that), and curl then decompresses the data before outputting/saving it.
  • -k (or the longer name, --insecure) allows for SSL connections and transfers even if they are deemed to be “insecure”. (Maybe that will help when a self-signed cert is used?) Using this may not be a preferred process for correctly handling HTTPS security, but the full process of adding a certificate can be more work than what is justified for some tasks.
    • When a person creates a cURL executable file, that person can choose to support SSL or not. If the curl executable file does not support SSL, then using HTTPS will not be possible at all. When downloading the curl program, most users will benefit by seeking to get a copy that has SSL support enabled (unless there is a compelling reason not to, like trying to absolutely minimize disk space). If a cURL executable does support SSL, then running “curl -V” should mention SSL in the list of supported Features.
    • The preferred route is to use a *.crt file, so that HTTPS can be successfully verified. For example, cURL 7.54.1 for 32-bit Microsoft Windows (provided by Viktor Szakats) is a binary package that included a curl-ca-bundle.crt file. Nicely, the curl program can operate even if this file is missing, although HTTPS support may be impacted if the file is unavailable. Using --cacert filename.crt (or perhaps --capath somedir) may help if the filename isn't found, or -k may be useful if that file doesn't seem to be available.

      Many cURL distributions may not come with this file. The reason for not including it is that the people from the cURL project don't plan to try to be the ones who keep the file updated. The file can be extracted from Mozilla's project. See: cURL FAQ: “Why don't you update ca-bundle.crt?”

    • Another tidbit of documentation from cURL's website: cURL FAQ: “What certificates do I need when I use SSL?”
  • -L allows for rather automated redirections. This is recommended for ease of use, but might cause unexpected results if a web page utilizes malicious code.
  • -J can specify to use a local filename that matches a filename that the remote end provides. So, for example, if you ran a command that looks something like “ curl -JOL http://example.net/somedir/get-file.php?=installer.exe ”, the effect of the J switch may be to have a local file named “installer.exe” rather than “get-file.php”.
  • Note: Sometimes using -L and -J just doesn't seem to do the trick. This seems to often be the case when going to a URL that doesn't have a filename specified (and simply asks the web server to provide a default file for a directory). This may be resolvable by specifying a local filename using “-o filename.htm” (or by specifying a different remote URL, e.g. “https://mozilla.org/index.html” instead of just “https://mozilla.org”).
  • -R queries the remote server for timestamp information, so that this information is not lost.
  • -v causes a more verbose display. There are other options, too: “ --progress-bar ” (abbreviatable as “ -# ”, although that may be a challenging abbreviation to use with some command line shells), “ --trace outputFilename ” or “ --trace-ascii outputFilename ” or “ --trace-time ” (added in cURL 7.14.0) or “ -i ” (a.k.a. “ --include ”). Or, for even less output than normal, “ -s ” (a.k.a. “ --silent ”).
  • --help provides an overview of command line parameters. The --manual option provides more details about the options, and has a lot of content that looks identical to the man page. It seems, however, that using --manual may provide a bit more information at the end, beyond just what is seen in the manual page. After the “SEE ALSO” section, which is also found in cURL's man page, there is a “LATEST VERSION” section (pointing to http://curl.haxx.se) and some other sections.
[#curlsftp]: Using SFTP in cURL

First, the version of cURL will need the feature. Run “ curl -V ” which, in a modern release of cURL, will show supported protocols. Look for the protocol named “sftp” (and, probably right before that in the list, “scp”). When downloading versions of cURL, this is often described as having “SSH” support (which is different than just having “SSL” support).

Then, how to upload via SFTP/SCP is noted by Daniel Stenberg's post on uploading with SCP and cURL: SCP and SSH File Transfer Protocol (“SFTP”). TheLinuxMen.Blogspot.com: Uploading files to FTP/SFTP using CURL shows using -k (which disables certificate checking), which may be helpful if the server does not have a certificate in the local system's certificate store (or a file specifiable using --cacert).

To list files:

curl -u username sftp://example.org/

(The trailing slash specifies to curl that a directory is specified, so curl should just try to show a file listing, rather than downloading a file)

To look at a home directory:

curl -u username sftp://example.org/~/

To specify a file to upload:

curl -u username -T filename sftp://example.org/~/./

Note: If trying to write to a directory, you do want to end with a slash (or simply specify a fuller path, including the desired filename to output). Otherwise, the transfer will fail. Presumably curl is trying to actually write to a file with the specified name, which will of course fail. This is different behavior than, say, PuTTY's pscp, which may figure out that the destination is a directory and then try to write to a file (using the same filename as the source) in that directory. The curl program doesn't seem to attempt that logic, and so the result is a failed transfer. (The solution to that failure is to simply include a slash at the end.)
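
Downloading a single file works similarly. This is a hedged sketch (the host name and file name are placeholders, and an SSH-enabled build of cURL is assumed); the -O option saves the file locally under its remote name:

curl -u username -O sftp://example.org/~/filename.txt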

Grabbing multiple files

The curl program, which is executable from the command line, does not support retrieving a bunch of file listings and then using that information to gather a bunch of files. (cURL FAQ: Recursive fetching recommends using external software such as a PERL script.) cURL's lack of support for mirroring has been accepted, as recursively obtaining files was just not a primary focus of cURL's intended design. (cURL's lack of mirroring support is also covered earlier in that cURL FAQ document: cURL FAQ: “What is curl not?”.) The topic of support for mirroring, and specifically even the topic of WGet having more support for mirroring than cURL, is mentioned by Page comparing cURL and WGet, by Daniel Stenberg, cURL author.

Obtaining files

Here are some ways to grab all of the files in a remote system's directory.

Using Unix to have curl grab remote files

The following code will work on many Unix systems to help obtain a bunch of files. Note, however, that this does not obtain subdirectories.

This does seem to be implying the use of /bin/sh or something compatible, so users of csh and similar software might want to put this in a script. This example uses external programs: curl (unsurprisingly), sed, and grep.

First, this code will just output the curl commands that will be run later, because this command is run at the start:

export ACTTOTAKE=echo

Then, be sure to customize the following URL. In this example, the variable does not end with a slash.

export REMOTESITE=http://example.com/dirname
for FILE in $(curl -s -C - -L -R ${REMOTESITE} |
 grep href |
 sed 's/.*href="//' |
 sed 's/".*//' |
 grep -v ^.\$ |
 grep -v ^..\$ |
grep -v /\$ ) ; do
   ${ACTTOTAKE} curl -C - -L -R -O -v "${REMOTESITE}/${FILE}"
done

Now, to actually obtain the files, first do this:

unset ACTTOTAKE

and then re-run everything except for the first line, which initially set that variable.

Finally, clean up:

unset REMOTESITE

Credit: This was created by modifying about half of the code from patrix's answer to “Thi G.”'s “Ask Different” question, which provided not only the basic concept but also the trickiest portion of the code (which was using sed).

Using Microsoft Windows to have curl grab remote files
A batch file is provided by iKiWiXz's answer to Yuck's StackOverflow.com question.
[#wget]: WGet

WGet may have more mirroring (recursively downloading) options than the newer cURL. WGet may also typically figure out the desired output filename. The default options are a bit nicer than cURL's, resulting in less incentive to commonly include extra command line parameters. So, WGet can be handy, and was among the most simple and useful software of its kind before the release of the newer cURL.
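
As a hedged example of that mirroring support (the URL is a placeholder; -m is WGet's --mirror option, and -np/--no-parent keeps the download from climbing above the starting directory):

wget -m -np http://example.com/dirname/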

However, WGet does have some drawbacks. It may have less flexible licensing, and sometimes WGet creates a file whose name includes parts of the URL that come after the remote filename.

To save the filetime that is stored on the web server, use the -N parameter. (In Unix, perhaps only when saving to a FAT32 drive, it was found to be helpful to use sudo to have permission to make the changes.)

An example command would be to use:

wget -N ftp://example.com/file.ext

There are multiple versions available for the Microsoft Windows platform, and not all of them are equally easy to get to work. For some details, see TOOGAM's Software Archive: information about WGet.

[#tnftpwww]: tnftp (default ftp client)

The tnftp software supports both FTP and also HTTP. Some operating systems come bundled with this software, and so this software may be what's used when running the ftp command. The free(code) web page for tnftp states, “tnftp is the default FTP client found in FreeBSD, MacOS X, NetBSD, and SuSE Linux.” Based on the OpenBSD manual page for ftp, it looks like the following will work for HTTP:

ftp -v http://server/filename

... and, based on the OpenBSD manual page for ftp, it looks like the following will work for HTTPS without doing certificate checking. That will provide HTTPS encryption, but skipping the certificate checking means that the program does not perform authentication of the website. (So, somebody listening to traffic won't see a copy of the traffic, although an MITM attack could effectively insert malicious data.)

ftp -v -S dont https://server/filename

Note: at least one version of a manual page had documented the syntax to look like this bad example:

ftp -v http:// server/filename

Note: the documented space between the double slashes (after the protocol) and the host name seems weird; it is probably just a typo in the documentation. Some brief testing has indicated that the space is not permitted (and certainly is not required).

BITS

The following command was apparently available by Windows 7, but apparently also has been deprecated (even if the BITS service remains).

bitsadmin /transfer customJobName /DOWNLOAD /PRIORITY low http://server/dirname/filename C:\dir\filename

Specifying a local filename without a path will not work. (Specifying the full path to the local file will work.)

Specifying “ /PRIORITY FOREGROUND ” may make a huge difference in speed (making the transfer go fairly quickly, rather than crawling along at kilobytes over many seconds).
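
For example (a sketch mirroring the command above; the job name, URL, and local path are placeholders):

bitsadmin /transfer customJobName /DOWNLOAD /PRIORITY FOREGROUND http://server/dirname/filename C:\dir\filename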

BITS HTTP requirements

Utility Spotlight: Scripting Trouble-Free Downloads with BITS (Michael Murgolo, TechNet Magazine) mentions support since XP (for BITS) and where BitsAdmin.exe can be obtained for those older operating systems. Also, that article notes, about that program, “The Windows XP SP2 Support Tools version has a known issue that does not allow it to be used correctly with the Windows Script Host WshShell.Exec method to capture StdOut.”

There's also support by PowerShell, which is what Microsoft is recommending over BitsAdmin.exe these days.

powershell
import-module bitstransfer
Start-BitsTransfer http://server/dirname/filename C:\dir\filename
.Net Framework

MSDN: WebClient shows various methods, including JScript, C++, and others. See also the PowerShell section for an example of using this (using “DownloadFile”)

PowerShell
DownloadFile

For a version that uses .NET, see: Thomas Jespersen's answer to Robert Massa's SuperUser.com question on HTTP from a command line, and Andrew Scagnelli's answer to Robert Massa's SuperUser.com question on HTTP from a command line. Some preliminary research indicates this seems to be part of .NET (MSDN: WebClient.Download.)

powershell -command "(new-object System.Net.WebClient).DownloadFile( 'http://example.com/somedir/filename.txt' , 'output.txt' ) "

Note that there are other ways to use DownloadFile without using PowerShell.

BITS

This is discussed in the section about BITS.

PowerShell 3.0

This doesn't work with the PowerShell that comes with Windows 7: running $PSVersionTable from within PowerShell will show PSVersion is 2.0.

Based on Janus Troelsen's comment to an answer to Robert Massa's SuperUser.com question on HTTP from a command line and some further research (TechNet: Invoke-WebRequest)

powershell -command "& { Invoke-WebRequest http://server/dirname/filename -OutFile filename } "

StackOverflow : Checking URL may describe some useful techniques for PowerShell 3+.
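
As one hedged example of such a check (the URL is a placeholder, and PowerShell 3.0 or newer is assumed), a HEAD request can report the HTTP status code without downloading the whole document:

powershell -command "(Invoke-WebRequest -Uri http://server/dirname/filename -Method Head).StatusCode"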

Warrick

Warrick has a page hosted by Old Dominion University's Computer Science Department. Warrick receives a recommendation by a very notable website with Wayback Machine @ Archive.org FAQ #26: “How can I get a copy of the pages on my Web site? If my site got hacked or damaged, could I get a backup from the Archive?”

Other options

TOOGAM's Software Archive: page about Web Browsers.

Editors can also be effectively used to see what pages look like. Some software may be able to provide an experience that is quite similar to what is found with web browsers that don't edit pages. SeaMonkey, the offspring of Netscape Communicator, is such an example. It uses some of the same code as Iceweasel and Mozilla Firefox, which are offspring of Netscape Navigator. So, if people like those web browsers, and have any interest in dabbling with editing, then a solution like SeaMonkey may be worth checking out.

TOOGAM's Software Archive: Web page editors lists some options (currently without providing any real discussion about those options).

Web servers

In general, the two main protocols that web servers are designed to support have been HTTPS and HTTP. Generally HTTP is a bit easier to set up, and then adding support for HTTPS involves modifying the software that is successfully providing HTTP. So, start with supporting HTTP.

(The following is an older paragraph. Perhaps it should be just cleaned up, or merged elsewhere, or removed?) Info (to be) included: Redundancy, serving a web page, serving multiple sites (placing each site in its own area: for info about different types of sites or site features, see Network Features section about that: or, perhaps an even more specific URL may be available?), server-side content (CGI and support for languages), authentication (using passwords specific to the web site: to use network-based authentication with the server may involve info from a separate section)

Web Transferring protocols

This section has information about some protocols and/or how to implement those protocols.

HTTP
Protocol reference detail(s)
HTTP versions

HTTP 0.9 document is titled “The Original HTTP as defined in 1991” (referring to the year 1991 A.D.) and describes “a subset of the full HTTP protocol, and is known as HTTP 0.9.”

HTTP/1.0 standardized things further. RFC 1945: HTTP/1.0, W3C info on HTTP 1. However, the protocol did not support name-based “virtual website” hosting, which helped condemn the standard into obscurity. This basically means that a unique IPv4 address was needed for every different website. In the decades that followed, Internet Service Providers routinely placed multiple websites on a single web server that utilized just a single public IPv4 address. Therefore, HTTP/1.1 ended up being required for many unencrypted websites. (Websites encrypted with HTTPS did still commonly utilize a unique IPv4 address per website, as discussed in the section about name-based “virtual website” hosting.)

Another nice feature that HTTP/1.1 supported, which was not supported by HTTP/1.0, was having multiple files transferred over a single TCP connection, reducing required overhead for many websites.

[#codhtsta]: HTTP Status response codes

IANA's registry of HTTP status codes, RFC 2616 (HTTP/1.1) section 10: Status Code Definitions

Additional reference(s): MS KB 943891: HTTP status code in IIS 7, 7.5, and 8 documents some more codes. Codes that are 400 or larger may be followed by a decimal point and a number (resulting in a lengthier status code like 404.1 or 404.10, which are different from each other).

As a comparison, HTTP is not the only protocol that supports response code numbers. For example, RFC 959 (“FTP”) pages 37 and 38 document something similar for FTP, and SMTP server responses start with a number. However, HTTP is a bit unusual in that the numbers are famously shown to end users (probably most famously numbers 404 and then 302 (though 303 or 307 should be used instead of 302, per Wikipedia's List of HTTP status codes), though hopefully (most desirably, whether or not this is realistic) 304 or 200 should be the most common).
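
As a quick way to see just the numeric status code that a server returns, here is a hedged Unix-style sketch using cURL's --write-out variable (the URL is a placeholder):

curl -s -o /dev/null -w "%{http_code}\n" http://example.com/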

[#vrthstnm]: Name-based “virtual website” hosting

(Most commonly, the quotation marks and the word “website” are not included in that phrase.)

See: Wikipedia's article section on name-based “virtual website” hosting.

Allows multiple websites to use the same IP address. When an HTTP/1.1 client requested an unencrypted file transmission, the web browser would specify which website contained the requested data. Therefore, the web server could provide content based on which website was being requested. Since HTTP/1.1's HOST header was supported by even early versions of popular web browsers (by Netscape Navigator 2.0 and Microsoft Internet Explorer 3.0, as quoted below), the vast majority of end users have always had full support for the necessary HTTP/1.1 by the time that servers started to support name-based virtual website hosting.
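
To illustrate what that looks like on the wire, here is a minimal, hand-written HTTP/1.1 request (the host name and path are placeholders); the Host: line is what lets the server choose which virtual website's content to return:

GET /index.html HTTP/1.1
Host: site.example
Connection: close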

Browser support for name-based virtual website hosting
Browser support for name-based virtual website hosting using HTTP/1.1

The following has been blatantly copied (with full permission) from the text shown by http://TOOGAM.com when it is accessed using HTTP/1.0:

The domain-name hosting works well with most modern browsers:

"Microsoft Internet Explorer 3.0, Netscape Navigator 2.0, and later versions of both browsers support the use of host header names; earlier versions of the two browsers do not."

--Microsoft (Initially found from a now-defunct Microsoft URL at /windows2000/en/server/IIs/htm/core/iinmres.htm under http://www.microsoft.com. The old URL had a ® sign right after the word Microsoft. The page since redirected to /technet/prodtechnol/windows2000serv/default.mspx but now that page is no longer found by their web server. The current quote may be found from another URL where the page called another Microsoft web page about Supporting Host Header Names might still be working.)

Wikipedia Article on Netscape Navigator lists release dates of Mosaic Netscape 0.9 on October 13, 1994, and Netscape Navigator 2.0 in April 1996 (although versions of Wikipedia's Netscape Navigator page prior to the middle of April 2007 had listed the release date as September 18, 1995).

"Microsoft originally released Internet Explorer 1.0 in August 1995 with the Internet Jumpstart Kit in Microsoft Plus! for Windows 95." "Version 2.0 was also released for the Macintosh and Windows 3.1 in April 1996." "Internet Explorer 3.0 was released free of charge in August 1996 by bundling it with Windows 95 OSR2."
--Wikipedia Article on IE (quoted text pre-dating the second half of August 2007)

As can be seen from the linked to information above, this issue affects the paid-for versions of Microsoft Internet Explorer released during the first year that Microsoft Internet Explorer was released, or versions of Netscape's Navigator released in the first 18 months (or 11 months, depending on which quoted release date is more accurate). It also shows that the major popular browsers released over a decade ago tend to support the required HOST header of HTTP. (Newer web browser software, like Mozilla Firefox, also tend to support the HTTP's HOST header by the first publicly released "release version" with a version number of at least 1.0.)

Name based virtual web hosting support when using HTTPS

Each website required its own IP address until “Server Name Indication” support started being utilized. Alternative approaches to support HTTPS, such as using an alternative TCP port number or relying on the “Subject Alternative Name” (“SAN”)/“Unified Communications certificate” (“UCC”) feature used by some certificates, had drawbacks and so were not exceedingly popular. Namely, such approaches were not considered to be as easy for the many end users who use the web browser client software.

Server Name Indication (“SNI”) extensions to SSL/TLS

Wikipedia's page for “Server Name Indication” states, “As of November 2012, the only major user bases whose browsers do not support SNI appear to be users of Internet Explorer 8 or below on Windows XP and versions of Java before 1.7 on any operating system.”

This is supported by Microsoft Internet Explorer 7 and newer (on Windows Vista and newer), although MSDN Blog about SNI support notes, “IE relies on SChannel for the implementation of all of its HTTPS protocols. SChannel is an operating system component, and it was only updated with support for TLS extension on Windows Vista and later.” (That hyperlink was in the original text being quoted.) It seems likely that the story for support by Windows Server 2003 is the same as the story for support by Windows XP.

Modern web servers do support SNI. IIS Beta release notes indicate SNI support. IIS 8 came with Windows Server 2012 and Windows 8. (Windows Server 2008 R2 and Windows 7 came with IIS 7.5.)

RFC 3546 section 3.1 introduced the feature into the series of RFC documents. That RFC was marked as obsoleted by newer RFCs. (RFC 6066 section 3: Server Name Indication is newer.)
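
One hedged way to check whether a server returns the expected certificate for a given name is OpenSSL's s_client program with its -servername option, which sends SNI (the host name here is a placeholder); the certificate that comes back can then be inspected:

openssl s_client -connect site.example:443 -servername site.example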

Having a web server run different websites on different port numbers

This does technically work very well, except that people referring to a website will need to specify the TCP port number as part of the URL (e.g. https://site.example:8443/ instead of just https://site.example/). That is simply considered to be too complicated for people who want the absolute simplest experience for end users (who use the web browser client software).

Certificate features: “Subject Alternative Name” (“SAN”) field in a Unified Communications certificate (“UCC”)

This mostly worked, although is a pricier option for whoever is running the web server.

Implementation: Getting the web server working
Overview
Content details

Know what type of content is going to be served. The easiest option to set up is “static files”. However, many elaborate websites have additional requirements. For instance: can users submit data (e.g., to a search engine)? Can users modify the pages (e.g. forum posts, wiki updates)? How is the information stored (e.g. simple static files sitting on the operating system, dynamic content using executable files sitting on the operating system, read-and-write access to a database)?

Choosing and installing server software

Choose what software is going to be installed. A key consideration may be to make sure that the software does support the desired type of content. (If the web server does not directly support the content, that may not be a real problem if there is an add-on for the web server that will then allow it to support the desired type of content.)

Potentially another consideration may be whether the operating system being used comes with any web server software. If so, using that software may be more convenient. If other software is chosen instead, then the bundled software will likely continue to exist even though it is being ignored.

If the desired software isn't installed, then install the software.

Data location

Know where the data is going to be stored. This was touched on briefly when mentioning how the data is stored. However, it will be good to be a bit more specific. If the data is stored on filesystems, then at what location? If the data is going to be stored in databases, then which databases?

Having users store data in their home directory (e.g. under /home/) may require that the web server has access to that section of the filesystem. Security may be tighter if the web server does not have access to that section of the filesystem. This would require that both users and the web server software have access to a common area of the filesystem. Users may be oblivious to the location of the data if there is a symlink in their directory (e.g. called html/ or htdocs/) that points to the desired location, as sketched below. (This isn't necessarily saying that users couldn't figure out that the symlink to a folder is really a symlink. This is simply saying that users would not be required to notice that it is a symlink, and even people who know better could act as if they were oblivious to the knowledge about the symlink actually being a symlink and not a folder.)
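
As a hedged illustration of that symlink approach (all of these paths are hypothetical placeholders), the administrator might keep the real data under the web server's area and give the user a conveniently named link:

mkdir -p /var/www/htdocs/users/exampleuser
ln -s /var/www/htdocs/users/exampleuser /home/exampleuser/html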

Popularity

Google Online Security Blog entry on web server and malware showed Apache at 66%, Microsoft IIS at 23%, and nginx at 4%, while other web servers combined to make up the remaining 7%. That study was from 2007. Wikipedia's article for Microsoft Internet Information Server: “Usage” section notes that “IIS dropped from the second most popular position at the end of 2011” as nginx climbed to become the second most popular web server. (This may have been based on the Netcraft November 2012 report, which showed nginx's 11.78% beating Microsoft IIS's 11.53% of active sites, a lead for nginx that increased over the following month. However, some other figures still had Microsoft IIS ahead of nginx by a margin of less than 5% (in November). So, who was leading at that exact moment depended on what was being measured.)

Implementations-Specific Info for Webserver Software
[#thttpd]: thttpd: “tiny/turbo/throttling” HTTP server

The software's name is thttpd. The product's page shows a title of “tiny/turbo/throttling HTTP server”. Wikipedia's article on thttpd indicates all those are valid, as the article says the first letter “in thttpd stands for variously tiny, turbo, or throttling.”

This is a web server that is very simple to use. For at least non-dynamic content, it is pretty easy to have thttpd support multiple (virtual) host names and IPv4 addresses. Supporting IPv6 also works if host names are used. However, supporting web clients that specify an IPv6 address does not mix really well with virtual hostname support. (This limitation is discussed further, later.) At least some support for dynamic content is provided (although, at the time of this writing, this guide does not go into extensive details about that).

Wikipedia's article on thttpd indicates the project has become dormant, with newer development being done under a fork named sthttpd.

Although the server may be simple, one requirement to take care of is to make sure that the filesystem permissions meet the requirements. Make sure that the content does not have the Unix-style execute bit set in the filesystem. (Permissions of -rw-rw-r-- may be good if the user trusts other group members, or permissions of -rw-r--r-- may be good if the group (e.g. if the group is “users”) may contain some individuals that should not be trusted to update the file.) Details about making adjustments are in the section about filesystem attributes.
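
For example, to end up with -rw-r--r-- permissions on a content file (the path here is just a placeholder):

chmod 644 /var/www/htdocs/index.html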

If the content is in /var/www/htdocs/, then run:

thttpd -r -M 3600 -d /var/www/htdocs/

Okay, convinced that this works? Great. Now, complicate things a bit. Stop the instance of thttpd (using details from the sections about seeing what is running and adjusting what is running as needed).

For example, the site being used is site.example and uses a public IPv4 address of 203.0.113.21. (Note that site.example may be a valid name with public DNS or a private DNS domain.)

The first thing to do is to establish where the data for this web server will be at. For this example, we will have the data be at /var/www/htdocs/ (which is a common location). Next, we want to have a subdirectory named after the site address that the visitor will put in the web browser. For instance, /var/www/htdocs/site.example/ will store the data when someone goes to site.example in the web browser.

At this point, the data layout is not yet configured, but the web server may be started, this time with virtual host support enabled (the -v option):

thttpd -r -M 3600 -d /var/www/htdocs/ -v

For more elaborate usage:

thttpd -r -d /var/www/htdocs/ -u username -M 3600
Content on the site
Simple IPv4 usage

If the needed content hasn't yet been placed, now it is time to do so:

echo Simple text which is not HTML but which browsers accept | tee -a /var/www/htdocs/site.example/index.html

Calomel.org's guide to thttpd notes that permissions of 644 may be needed.

Supporting multiple sites
Supporting multiple IPv4 sites

Is a single IPv4 site working as intended? Great, now expand that a bit: in the document root, add symlinks (named after the other host name and the IP address that visitors might type) that point at the existing site's directory:

cd /var/www/htdocs/
ln -s site.example/ www.site.example
ln -s site.example/ 203.0.113.21
Supporting multiple IPv6 sites

Now, here is the nasty part. Let's say this server uses 2001:db8::21 as an IPv6 address. Here is how to support that:

cd /var/www/htdocs/
ln -s site.example/ '[2001'

Whoa! That can't be right, can it? That example/documentation looks cut off, right? Isn't it?

The sad truth is, that is what works. An entire IPv6 /16 is dedicated to just this one site. Basically, it appears that this software doesn't really provide decent support for virtual hosts being visited by IPv6. At least, not in the version that was used to create this documentation (thttpd/2.25b 29dec2003 which is the package distributed with OpenBSD 5.0). The web server does in fact work fine, as intended, with IPv6 when a DNS name is used (and probably works fine if DNS-based “virtual host” support is not being used with the program's “ -v” option), but this software (which is over 5 years old) simply won't be able to support a lot of IPv6 addresses.

However, how many users are going to be using IPv6 addresses? Perhaps, at the time of this writing, not many (yet). For the few who do, an (untested) option may be to use symbolic links (unless thttpd dislikes the .. reference, which may be the case: the manual did mention something about detecting and disallowing references to parent directories) so that end users can follow a hyperlink and get to the requested content.

Possible additional reference: Steve Kemp's Blog: IPv6 and thttpd

HTTPS
thttpd FAQ: Howto discusses HTTPS. In short, HTTPS is not directly supported by the main release of thttpd, but there are options to be able to accomplish the task.
Nginx
Name note

Meant to be pronounced as “Engine X”. (The software's name is often pronounced quite differently, such as “Engeenks”, by those unfamiliar with the preferred pronunciation.) Calomel.org's Relayd guide seems to recommend “Nginx, pronounced "Engine X"”, and compares this software product to Lighttpd and thttpd and Apache.

Other info

See: Information about Nginx.

lighttpd

The name is a portmanteau of “light” and “httpd”. Calomel's guide to using Lighttpd says “the author” “oddly” calls this program “lighty” (which would naturally sound like “ly-tee” when pronounced by native English speakers).

Calomel's guide to using Lighttpd says, “Lighttpd powers several popular Web 2.0 sites like YouTube, wikipedia and meebo.”

Wikipedia's article on Lighttpd: “Limitations” section (Wikipedia's article on Lighttpd: January 2016, “Limitations” section) discusses a limit that looks rather like a bug. bug info

[#apache]: Apache

This guide was written using the enhanced Apache 1.3 included with OpenBSD. (It might be true that some of these details differ for Apache 2.)

IPv6 compatibility statement

One of the first things that was noticed is that going to the website via a DNS name worked, and going to the website via an IPv4 address worked, but going to the website via an IPv6 address did not work. The reason might be described by Apache 2 Documentation about Binding: section about Special IPv6 Considerations. (OpenBSD was being used. It seems from reading that text that non-BSD operating systems might have different defaults, and so may often not require this step.) A solution was to add the following lines to the config file:

Listen *:80
Listen [fd00:0:0:1::21]:80

Note: An assumption in the above example is that the “Listen *:80” line is uncommented. That may be true for a default installation, but may not be true in some other cases. If the “Listen *:80” line is commented out, it may remain commented out. Adding a non-commented out “Listen [fd00:0:0:1::21]:80” may still be useful.

Then re-load the configuration file with:

sudo apachectl graceful
sudo apachectl graceful

Yep, it is true: the recommendation was to run this twice. Sometimes the software would seem to not run when asked to restart like this. Then running it a second time would result in a complaint that it wasn't running, and would fix the problem of it not running.

The web browser being used seemed to like to cache results from when things didn't work. (In Firefox, Shift-Ctrl-R instead of Ctrl-R caused a more thorough reload, so then things worked.)

For more details about using Apache, see the Apache setup guide.

[#msiis]: Internet Information Services (“IIS”)

Although the term “services” may suggest a more extensive offering, this software by Microsoft is basically a web server. (The “services” comes from the idea that web servers often provide multiple types of services, such as providing documents over HTTP, secured documents using HTTPS, possibly an FTP server, and any other sorts of features (like streaming media) that might be delivered over the basic web transfer protocols.)

Logs

This is being written by memory and may need to be verified on a live system...

Main log files

The most frequently referred to logs are IIS's main log files.

Log file directory

Very likely, if the web content is stored in %SystemDrive%\Inetpub (e.g. C:\Inetpub ), then look under %SystemDrive%\Inetpub\Logfiles\.

However, in case that is not where the logs are, here is the process to actually figure out where the logs are:

To find out where the logs really are, go to the IIS Manager (locatable in Administrative Tools, which will be in the Control Panel and may also just be on the Start Menu). In IIS7, there may be multiple icons, one called IIS Manager and one called IIS6 Manager. The one called IIS Manager is probably the one wanted. Select a website. Then, to see the location of the folder that the logs go under:

  • In IIS7, in center of the window (not in the left frame), locate the icon called “Logging”.
  • In versions earlier than IIS7, logging may be under the site's properties. Access the context/shortcut/“right-click” menu of a site, and choose properties. (Yet another step may be needed here...)
W3SVC* Subfolders

Under the main Logfiles\ directory/folder may be a subdirectory for each separate IIS site. Each may start with the characters W3SVC and then be followed by a number of decimal digits (exactly how many digits varies from system to system; perhaps 10-16 or so). At least in some configurations, such as a Small Business Server and/or a server using Microsoft SharePoint, there may be several IIS sites even in a default installation where nobody has tried to add many additional sites. Most of the W3SVC* subfolders may have old modification times, being more than a few days old.

To determine which directory corresponds to a specific site:

In IIS7, check the properties for the site and look for the ID. That shows the number which comes after the W3SVC characters.

HTTP errors (when using IIS)

Errors may also be recorded by HTTP.SYS. To check this out, find the HTTPERR subdirectory. It is located under the directory where Windows was installed to (generally C:\Windows\ in Windows Server 2003 and newer, with C:\WinNT\ being common with some older Microsoft Windows NT-based operating systems); the usual location is %SystemRoot%\System32\LogFiles\HTTPERR\.

[#httpdios]: Cisco IOS: using a built-in web server

This section discusses the web server that may be built in as part of Cisco IOS devices. This section assumes some familiarity with using Cisco devices, including interacting with the device's CLI (“command line interface”) to work directly with Cisco IOS commands. For tutorials that cover such topics, see: Cisco Introduction, Cisco Equipment, Cisco IOS basic usage guide

Note: This guide is currently designed to show how to set up both an HTTP server and also an HTTPS server. For setting up the HTTPS server, this guide may currently be assuming a rather specific approach, which is that the Cisco IOS device generates its own keys.

One thing that is required is to get some SSL keys. (This may need some further testing, but this can be done the same as generating SSH keys, following the instructions at creating SSH keys in IOS. Or, a router might simply do this rather automatically when enabling HTTPS?) For now, those are the steps that this guide expects to be done.

Additionally, user accounts should be created. See: Cisco IOS basic usage guide: section about local “user database” (which covers, in more detail, the act of making a user for an IOS device). This guide does not repeat the details in those sections, but that does need to be taken care of, so follow those guides to get that done.

Once that is taken care of, the following will enable both the HTTPS server and the HTTP server, including enabling authentication to use the local database:

ip http server
ip http secure-server
Authentication

The content that comes with a device will likely require authentication. To authenticate using a username and a password:

ip http authentication local

Rather than “local”, other authentication methods may include “enable”, or AAA (presumably by specifying a method list?). Using AAA is an approach that can offer more flexibility, but may require a bit more typing to initially set up. Details about using AAA, to see if that may be desired, might be seen in the information related to Cisco's “CCNA Security” certification.

(That's it for now.... No further details about why all of those steps are needed, or what happens if a certain step is not performed, like trying to use HTTPS but not HTTP. Such details may be added later; for that matter, once some review is done to verify just how much of these details are specific to HTTPS, some of these details may be moved to a section about HTTPS. Meanwhile, steps that ought to work are shown here. There you are.)

The “ip http server” line is not needed if only HTTPS is desired.

[#https]: HTTPS

Supporting HTTPS basically involves getting the web server to use a certificate.

Supporting multiple HTTPS sites

There are two ways to do this. One is to use the information described by RFC 6066: “Transport Layer Security (TLS) Extensions”, Section 3: “Server Name Indication” (“SNI”) (or a predecessor document/standard: RFC 4366: “Transport Layer Security (TLS) Extensions”, Section 3.1: “Server Name Indication” (“SNI”), which seems to have been a widely-cited version as SNI started gaining support, although there was also an even earlier RFC 3546: “Transport Layer Security (TLS) Extensions”, Section 3.1: “Server Name Indication” (“SNI”)). The advantage to this method is that it will not require an extra IP address for each certificate. The disadvantage is just that this is a newer option, and so some older software may not support it. A widely-cited example involves browsers, including Microsoft Internet Explorer, that rely on the operating system's SSL code, when that operating system SSL code is the code that comes with Windows XP. According to this widely-cited example, an older web browser (Microsoft Internet Explorer 7) that is bundled with Windows Vista will work, but MS IE 8 on Windows XP won't work. However, Daniel Lange's article on SNI references Gentoo developer Tobias Scheerbaum writing some related text in German, and Daniel says that Tobias “states that SP3 for Windows XP enables IE6 to send the SNI (SP2 is not sufficient).” So, Windows ME and older (including Win2K) probably do not support this. The OpenBSD team's bundled version of Apache 1.3 does not support this. Many other platforms, though, are likely to support this.

The other way, which is even more widely supported, is to use a SAN/UCC certificate.

Browser support may be tested using https://sni.velox.ch

Much of the following may be redundant with the newer section about using a certificate.

Certificate requirements

The certificate might be able to support multiple host names. Traditionally this has been supported using “wildcard” functionality. A newer style called UCC may also support multiple host names. In many cases, each certificate is designed to be used by just one organization. Consequently, web host providers (commonly Internet Service Providers (“ISPs”) but also other independent specialists) generally have at least one certificate for each customer. (Since all supported host names have some sort of a reference (perhaps a wildcard) in a certificate, it would generally be considered extremely amateur for a UCC cert to show host names from multiple unrelated organizations.)

One certificate is used for any specific socket's TCP port that the web server is listening to. Each socket consists of a network address (such as an IP address) and a TCP port, so additional certificates may be used by using a different network (e.g. IP) address and/or by selecting a different TCP port number. It is very possible to use a certificate on one or multiple socket TCP ports, but only one single certificate will be associated with a specific TCP port responding to HTTPS traffic for a specific network address (such as an IP address).

Note that the IP address being listened to may be a private/internal IP address. However, trying to proxy HTTPS traffic may be challenging as the proxy itself may be treated similar to an unauthorized rogue “man in the middle” (“MITM”)-style attack. The only way for a device to listen to HTTPS traffic is if that device is also actively using the certificate.

DSA keys are no longer recommended. (The OpenBSD 5.2 Manual Page for SSL: BUGS section had stated, “The world needs more DSA capable SSL and SSH services.” That advice is now considered old. OpenSSH.com: Legacy Options says, “OpenSSH 7.0 and greater” ... “disable the ssh-dss (DSA) public key algorithm. It too is weak and we recommend against its use.” This is also noted in a section about Certificate Communications. For more information about DSA, see: Certificate Communications, section titled “About DSA and DSS”.) Although that advice is now old, even the same OpenBSD 5.2 manual page that wished for more DSA support, in the OpenBSD Manual Page for SSL: “Generating RSA server certificates for web servers” section, says, “To support https transactions in httpd(8) you will need to generate an RSA certificate.” The reason is noted in the OpenBSD Manual Page for SSL: section about history, which notes that typically web browser code “libraries do not implement any non-RSA cipher and keying combination.” Reasons why are described in that section. Additional documentation about this is seen in OpenSSL.org FAQ #8: “Why can't I make an SSL connection to a server using a DSA certificate?”, which states, “The client may not support connections to DSA servers”. Specifically, “most web browsers (including Netscape and MSIE) only support connections to servers supporting RSA cipher suites.” Perhaps greater support is forthcoming: Mozilla Developer Documentation about HTML 5: the “keygen” element refers to support for DSA, EC (elliptic curve), and RSA (which is a default). However, although that documentation mentions this keygen element being supported by Firefox 1.0 (Gecko “1.7 or earlier”), Chrome 1.0, Opera 3.0, and Safari 1.2, it looks like Microsoft Internet Explorer may not support this.

More recently, there have been some concerns noted with DSA. One such concern is that DSA's effectiveness is highly reliant on unpredictable entropy. Sudden Death Entropy Failures mentions this. The comment about needing more DSA has been removed from the SSL man pages (in OpenBSD 5.3). Bruce Schneier on NSA breaking encryption states, “The math is good, but” specific code implementations may be a different story, and “code has been subverted.” (Commentary follows up on this.)

Getting/generating the certificate

Software to perform this may come with software that serves web pages. It appears there is no need to use the software that comes with the web server, if other trusted software is preferred.

Note: In (some versions of IIS, perhaps versions 6 and earlier?), there may be a trick to renewing publicly-recognized certificates. Submitting a certificate for signing by an outside organization may take a while (hours or days). After using IIS(6?) to get the certificate request file, the website may be left in an unusable state. It is not desirable to have the website fail to properly function during that time. (Perhaps the site content works but it just displays a certificate error?) Here is the trick: Use another website. Create a new, temporary website that uses the same certificate. This can be done, and then the temporary website may be used to generate the certificate request file. This process will cause the temporary website to stop functioning with that certificate. Meanwhile, the website which is actually important can continue to function with the certificate. (This paragraph's content will move into one of the following sections once it is determined which stage of the process breaks the website. The content will likely also provide some IIS-specific instructions at/about the same time.)

Keys made with an old implementation of SSL used by Debian

Creating a certificate request file

The act of creating a signed certificate generally is done by using a “certificate request” file, so the first step is to create a “certificate request” file.

Using openssl genrsa
See: OpenBSD FAQ 10: section on supporting Secure HTTP (OpenBSD FAQ 10.7).
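
As a hedged sketch of what that can look like with the openssl command (the file names and key size here are placeholder choices, and the second command prompts interactively for the certificate's subject fields):

openssl genrsa -out server.key 2048
openssl req -new -key server.key -out server.csr
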
Signing the certificate

Signing the certificate is an action done by whatever organization has the key that will be used to sign the certificate.
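
If no outside certificate authority is going to be involved (e.g. for internal testing), the request can simply be signed with the same key that created it. This is a hedged sketch re-using the placeholder file names from the previous example:

openssl x509 -req -days 365 -in server.csr -signkey server.key -out server.crt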

Deploying/using a certificate

For information, see details about setting up a web server.

Additional reading which may possibly be of interest: OpenBSD Man page for ssl, OpenBSD Manual page for the web server (Apache httpd), OpenBSD FAQ about startup process.

SSL Intercept

Some discussions on this:

[#spdy]: SPDY

Google has suggested SPDY as a protocol for people to switch to, from using HTTP 1.1.

  • http://chromium.org/spdy/spdy-whitepaper
  • http://dev.chromium.org/spdy
  • Apache module: Apache module
  • Wikipedia's article on SPDY
  • http://www.webpronews.com/google-spdy-gaining-adoption-2012-01
  • http://hothardware.com/News/Googles-SPDY-Incorporated-Into-NextGen-HTML-Company-Offers-TCP-Enhancements/

(Another possibility: QUIC.)

Additional actions
Controlling access to content
Password-protecting content

Currently, there is some information on the Apache guide.

[#robostxt]: Using a /robots.txt file

Robots.txt: the robotstxt.org website calls this “the robots.txt Robots Exclusion Standard”. The Standard for Robot Exclusion, hosted on the same web site, calls this “A Standard for Robot Exclusion”. Wayback Machine @ Archive.org FAQ #14 (“Some sites are not available because of robots.txt or other exclusions. What does that mean?”) refers to this as “The Standard for Robot Exclusion (SRE)”, while Wikipedia's article for “Robot Exclusion Standard” refers to this by the former name, as well as more names: “the Robots Exclusion Protocol or robots.txt protocol”.

Alexa.com documentation for webmasters notes, “The SRE was developed by Martijn Koster at Webcrawler to allow content providers to control how robots behave on their sites.”
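
As a small illustrative example (the directory name is a hypothetical placeholder), a /robots.txt file placed at the web root that asks all compliant robots to avoid one directory could look like this:

User-agent: *
Disallow: /private/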

Supporting dynamic content

First, some basic definitions. A “dynamic” web page may refer to any web page that visibly changes after it is loaded. However, that usage refers to client-side dynamic technologies, such as JavaScript. Another usage of the term “dynamic” describes a “dynamic” web page as any URL where the server produces custom content (instead of identical content every time the page is loaded). That requires some support by the web server.

This section is mainly about a web server's ability to support the related technologies. For information about using such functionality, the section on Web Data may be more relevant.

Server-side code (e.g. CGI and/or langauges such as PERL, PHP, Ruby, ASP(.NET), etc.)
Common Gateway Interface (“CGI”)

In a nutshell, CGI is a standard that involves having the web server run a program, and then redirecting all of that program's “standard output” to the web server. The web server checks that the output looks valid, and sends most (or all?) of the output directly to the web browser. So, the output of a CGI program might very often be HTML (although there are other options that may even be common, such as a file containing graphical data or, probably less commonly, a file meant to be downloaded and saved, such as an archive file). This essentially means that there aren't substantial theoretical limits on what bytes may be generated from CGI, just as there aren't substantial limits on what kinds of bytes a program can generate and output.

RFC 3875: “The Common Gateway Interface (CGI) Version 1.1” (page 24) notes that a CGI program “MUST return a Content-Type header field.” There may be other lines that are part of a “message-header”. After the message-header (which includes the “Content-Type” header), a blank line is absolutely required (as noted by RFC 3875: “The Common Gateway Interface (CGI) Version 1.1” section 6.2: “Response Types”). Some brief testing has indicated that a web server may reject a CGI program's output (and claim there is a server configuration error) if the output doesn't start properly. A simple header that works for a text file is the phrase “Content-type: ”, followed by a valid Content type (such as “text/plain”). Then after that line, the next line can be the blank line.

Because of the required header, CGI programs tend to be custom-made. (They may be some sort of scripting language that outputs the needed header and then runs another program.)
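
As a hedged sketch of such a script (a /bin/sh script; where the web server expects to find CGI programs, and whether it passes QUERY_STRING, depends on the server's configuration):

#!/bin/sh
# Required CGI header line, then the mandatory blank line, then the document body
echo "Content-type: text/plain"
echo ""
echo "Hello from CGI."
echo "QUERY_STRING was: ${QUERY_STRING}"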

A program that will be useful for using with CGI will likely perform a task, output any needed data, and then exit, non-interactively. Also, many web servers run sandboxed (e.g. in a chroot), so that web requests don't involve using code or data that is located outside of an expected area. Due to such restrictions, executable code used with CGI will generally need to have all supporting code located within the confined area that the web server expects to find code in. Often, ways to do this are to use simple scripts in a language supported by the web server (or, perhaps more likely, an add-on to the web server), using a “statically compiled” executable, or copying in all dynamically linked code. (In Unix, a list of dynamically linked code modules may often be shown by using “ ldd executableName ”.)

Some information may be provided to the CGI program using environment variables. RFC 3875 (CGI 1.1) Section 4.1: “Request Meta-Variables” (and the following sub-sections) document some example variables, such as QUERY_STRING. Apache Module mod_cgi documentation, which has (at least historically) been the most popular implementation for CGI, seems to indicate that only some of these variables are implemented, and even those may be unavailable/unset in some cases. So, plan to preview/test/verify before assuming that any specific environment variable gets passed as hoped for.

See also: W3C CGI, Wikipedia's page on “Common Gateway Interface”

Common implementations may involve having CGI use certain “script” languages, but any file that can be executed (including a binary executable file) could be used with CGI (and that probably is done fairly commonly).

[#cgispeed]: CGI Speed-ups
FastCGI

The project's web page describes the project as a “simple”, “open standard” with free application libraries (able to be used by programmers for C(++), Java, PERL, and Tcl) and available “modules for popular” web servers.

Speedy CGI

This may be PERL specific. HowToForge.com guide for SpeedyCGI/PersistentPerl (on an old version of Debian called Etch) notes that SpeedyCGI is “also known as PersistentPerl”. SpeedyCGI Page on SourceForge references the web page address for SpeedyCGI Manual. CPAN page about SpeedyCGI

(Recommended by Open Webmail)

Simple Common Gateway Interface

Wikipedia's article for “Simple Common Gateway Interface” notes that, compared to alternatives, this “is designed to be easier to implement.”

Common languages/standards used for generating dynamic web content

These languages may have some common attributes, such as being fairly powerful while not requiring pre-compiled code. To check whether more details are available here about creating content in any of the programming languages listed in this section, see coding.

PERL

See CGISpeed.

PHP
...
Python
Ruby (on Rails)
...
ASP(X)(.NET)
...
Web Data