skipfish/doc/dictionaries.txt

This directory contains four alternative, hand-picked Skipfish dictionaries.

PLEASE READ THESE INSTRUCTIONS CAREFULLY BEFORE PICKING ONE. This is *critical*
to getting decent results in your scans.

------------------------
Key command-line options
------------------------

Skipfish automatically builds and maintain dictionaries based on the URLs
and HTML content encountered while crawling the site. These dictionaries
are extremely useful for subsequent scans of the same target, or for
future assessments of other platforms belonging to the same customer.

Exactly one read-write dictionary needs to be specified for every scan. To
create a blank one and feed it to skipfish, you can use this syntax:

$ touch new_dict.wl
$ ./skipfish -W new_dict.wl [...other options...]

In addition, it is almost always beneficial to seed the scanner with any
number of supplemental, read-only dictionaries with common keywords useful
for brute-force attacks against any website. Supplemental dictionaries can
be loaded using the following syntax:

$ ./skipfish -S supplemental_dict1.wl -S supplemental_dict2.wl \
   -W new_dict.wl [...other options...]

Note that the -W option should be used with care. The target file will be
modified at the end of the scan. The -S dictionaries, on the other hand, are
purely read-only.

If you don't want to create a dictionary and store discovered keywords, you
can use -W- (an alias for -W /dev/null).

You can and should share read-only dictionaries across unrelated scans, but
a separate read-write dictionary should be used for scans of unrelated targets.
Otherwise, undesirable interference may occur.

With this out of the way, let's quickly review the options that may be used
to fine-tune various aspects of dictionary handling:

  -L      - do not automatically learn new keywords based on site content.

            The scanner will still use keywords found in the specified
            dictionaries, if any; but will not go beyond that set.

            This option should not be normally used in most scanning
            modes; if supplied, the scanner will not be able to discover
            and leverage technology-specific terms and file extensions
            unique to the architecture of the targeted site.

  -G num  - change jar size for keyword candidates.

            Up to <num> candidates are randomly selected from site
            content, and periodically retried during brute-force checks;
            when one of them results in a unique non-404 response, it is
            promoted to the dictionary proper. Unsuccessful candidates are
            gradually replaced with new picks, and then discarded at the
            end of the scan. The default jar size is 256.

  -R num  - purge all entries in the read-write that had no non-404 hits for
            the last <num> scans.

            This option prevents dictionary creep in repeated assessments,
            but needs to be used with care: it will permanently nuke a
            part of the dictionary.

  -Y      - inhibit full ${filename}.${extension} brute-force.

            In this mode, the scanner will only brute-force one component
            at a time, trying all possible keywords without any extension,
            and then trying to append extensions to any otherwise discovered
            content.

            This greatly improves scan times, but reduces coverage. Scan modes
            2 and 3 shown in the next section make use of this flag.

--------------
Scanning modes
--------------

The basic dictionary-dependent modes you should be aware of (in order of the
associated request cost):

1) Orderly crawl with no DirBuster-like brute-force at all. In this mode, the
   scanner will not discover non-linked resources such as /admin,
   /index.php.old, etc:

   $ ./skipfish -W- -L [...other options...]

   This mode is very fast, but *NOT* recommended for general use because
   the lack of dictionary bruteforcing will limited the coverage. Use
   only where absolutely necessary.

2) Orderly scan with minimal extension brute-force. In this mode, the scanner
   will not discover resources such as /admin, but will discover cases such as
   /index.php.old (once index.php itself is spotted during an orderly crawl):

   $ touch new_dict.wl
   $ ./skipfish -S dictionaries/extensions-only.wl -W new_dict.wl -Y \
       [...other options...]

   This method is only slightly more request-intensive than #1, and therefore,
   is a marginally better alternative in cases where time is of essence. It's
   still not recommended for most uses. The cost is about 100 requests per
   fuzzed location.

3) Directory OR extension brute-force only. In this mode, the scanner will only
   try fuzzing the file name, or the extension, at any given time - but will
   not try every possible ${filename}.${extension} pair from the dictionary.

   $ touch new_dict.wl
   $ ./skipfish -S dictionaries/complete.wl -W new_dict.wl -Y \
     [...other options...]

   This method has a cost of about 2,000 requests per fuzzed location, and is
   recommended for rapid assessments, especially when working with slow
   servers or very large services.

4) Normal dictionary fuzzing. In this mode, every ${filename}.${extension}
   pair will be attempted. This mode is significantly slower, but offers
   superior coverage, and should be your starting point.

   $ touch new_dict.wl
   $ ./skipfish -S dictionaries/XXX.wl -W new_dict.wl [...other options...]

   Replace XXX with:

     minimal        - recommended starter dictionary, mostly focusing on
                      backup and source files, about 60,000 requests
                      per fuzzed location.

     medium         - more thorough dictionary, focusing on common
                      frameworks, about 140,000 requests.

     complete       - all-inclusive dictionary, over 210,000 requests.

   Normal fuzzing mode is recommended when doing thorough assessments of
   reasonably responsive servers; but it may be prohibitively expensive
   when dealing with very large or very slow sites.

----------------------------
More about dictionary design
----------------------------

Each dictionary may consist of a number of extensions, and a number of
"regular" keywords. Extensions are considered just a special subset of the
keyword list.

You can create custom dictionaries, conforming to this format:

type hits total_age last_age keyword

...where 'type' is either 'e' or 'w' (extension or keyword), followed by a
qualifier (explained below); 'hits' is the total number of times this keyword
resulted in a non-404 hit in all previous scans; 'total_age' is the number of scan
cycles this word is in the dictionary; 'last_age' is the number of scan cycles
since the last 'hit'; and 'keyword' is the actual keyword.

Qualifiers alter the meaning of an entry in the following way:

  wg - generic keyword that is not associated with any specific server-side
       technology. Examples include 'backup', 'accounting', or 'logs'. These
       will be indiscriminately combined with every known extension (e.g.,
       'backup.php') during the fuzzing process.

  ws - technology-specific keyword that are unlikely to have a random
       extension; for example, with 'cgi-bin', testing for 'cgi-bin.php' is
       usually a waste of time. Keywords tagged this way will be combined only
       with a small set of technology-agnostic extensions - e.g., 'cgi-bin.old'.

       NOTE: Technology-specific keywords that in the real world, are always
       paired with a single, specific extension, should be combined with said
       extension in the 'ws' entry itself, rather than trying to accommodate
       them with 'wg' rules. For example, 'MANIFEST.MF' is OK.

  eg - generic extension that is not specific to any well-defined technology,
       or may pop-up in administrator- or developer-created auxiliary content.
       Examples include 'bak', 'old', 'txt', or 'log'.

  es - technology-specific extension, such as 'php', or 'cgi', that are
       unlikely to spontaneously accompany random 'ws' keywords.

Skipfish leverages this distinction by only trying the following brute-force
combinations:

  /some/path/wg_keyword ('index')
  /some/path/ws_keyword ('cgi-bin')
  /some/path/wg_extension ('old')
  /some/path/ws_extension ('php')

  /some/path/wg_keyword.wg_extension ('index.old')
  /some/path/wg_keyword.ws_extension ('index.php')

  /some/path/ws_keyword.ws_extension ('cgi-bin.old')

To decide between 'wg' and 'ws', consider if you are likely to ever encounter
files such as ${this_word}.php or ${this_word}.class. If not, tag the keyword
as 'ws'.

Similarly, to decide between 'eg' and 'es', think about the possibility of
encountering cgi-bin.${this_ext} or zencart.${this_ext}. If it seems unlikely,
choose 'es'.

For your convenience, all legacy keywords and extensions, as well as any entries
detected automatically, will be stored in the dictionary with a '?' qualifier.
This is equivalent to 'g', and is meant to assist the user in reviewing and
triaging any automatically acquired dictionary data.

Other notes about dictionaries:

  - Do not duplicate extensions as keywords - if you already have 'html' as an
    'e' entry, there is no need to also create a 'w' one.

  - There must be no empty or malformed lines, or comments, in the wordlist
    file. Extension keywords must have no leading dot (e.g., 'exe', not '.exe'),
    and all keywords should be NOT url-encoded (e.g., 'Program Files', not
    'Program%20Files'). No keyword should exceed 64 characters.

  - Any valuable dictionary can be tagged with an optional '#ro' line at the
    beginning. This prevents it from being loaded in read-write mode.

  - Tread carefully; poor wordlists are one of the reasons why some web security
    scanners perform worse than expected. You will almost always be better off
    narrowing down or selectively extending the supplied set (and possibly
    contributing back your changes upstream!), than importing a giant wordlist
    scored elsewhere.