Web as Corpus

This site was shut down 18-23 July 2010 because the web corpus databases exceeded the hosting provider's limits. I am redistributing the tables to remain in compliance. Sorry for the inconvenience!
Any suggestions for a hosting provider whose limits are more generous?

Web Concordancer details below; feedback welcome
alternate site if response is slow
Search the Web directly for concordances of words and phrases in 34 different languages. 
This new release (last update: 24 May 2010) adds support for selecting which documents to include in the zip file, preselecting documents based on document metrics, combining all text files into a single document for import into kfNgram or a concordancer, and converting from UTF-8 into more widely supported encodings. If it does not work properly for your language, please let me know.
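As a rough illustration of the combining and conversion steps, here is a minimal Python sketch. It is not the concordancer's own code: the function names, the blank-line separator, and the choice of Windows-1252 as the target encoding are all assumptions made for the example.

```python
def combine_texts(texts, separator="\n\n"):
    """Join several documents into one string, separated by blank lines.

    Hypothetical helper: the separator is an assumption, not the site's format.
    """
    return separator.join(t.strip() for t in texts)


def to_encoding(text, target_encoding="cp1252"):
    """Encode text in a legacy encoding.

    Characters the target encoding cannot represent are replaced with "?".
    """
    return text.encode(target_encoding, errors="replace")


# Two short "documents", as they might look after decoding from UTF-8.
docs = ["caf\u00e9 au lait\n", "na\u00efve approach\n"]
combined = combine_texts(docs)
print(to_encoding(combined))
```

Replacing unencodable characters silently loses information, so for languages outside the target encoding's repertoire it is usually safer to stay with UTF-8.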
Web Corpus
English-language corpora compiled from the Web in 2006 and 2007.
2007  still under development, currently 3,123,996 types and 518,129,710 tokens; target size at least 1,000,000,000 tokens; will be part-of-speech tagged.
2006  97,198,272 tokens and 950,087 types; 1- to 6-grams; wildcard searchable; the original texts and URLs are no longer available due to a hard drive failure.
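For readers unfamiliar with the terms, the sketch below shows how tokens, types, and 1- to 6-grams relate, and what a wildcard search over n-grams might look like. It is a toy illustration only: the tokenizer, the function names, and the sample sentence are assumptions, and the corpus itself was not necessarily built this way.

```python
import re
from collections import Counter
from fnmatch import fnmatchcase


def tokenize(text):
    # Assumption: lowercase the text and treat runs of letters, digits,
    # and apostrophes as tokens. Real corpus tokenization is more involved.
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(tokens, max_n=6):
    """Return {n: Counter of n-grams} for n = 1..max_n."""
    counts = {}
    for n in range(1, max_n + 1):
        counts[n] = Counter(
            " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts


def wildcard_search(counter, pattern):
    """Return the n-grams matching a shell-style wildcard pattern, sorted."""
    return sorted(g for g in counter if fnmatchcase(g, pattern))


sample = "the cat sat on the mat and the cat slept"
tokens = tokenize(sample)
counts = ngram_counts(tokens)

print(len(tokens), "tokens;", len(counts[1]), "types")  # 10 tokens; 7 types
print(counts[2]["the cat"])                             # 2
print(wildcard_search(counts[2], "the *"))              # ['the cat', 'the mat']
```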
Search these Web Corpora
Count Matching Webpages
Count how many hits Bing and Yahoo! report for a word or phrase, expressed both as an absolute number and as the number of matches per million webpages. Multiple search terms can be entered and queried at the same time, and numbers can be either formatted for easier reading or left unformatted for copying and pasting into a spreadsheet or database. As you can see by comparing results from these two search engines, such counts must be interpreted with extreme caution! Bing numbers per million pages are generally smaller than those from Yahoo!, probably due to an over-optimistic estimate of the total number of pages in Bing's database.
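The per-million figure is a simple normalization of the raw hit count by the engine's total page count. The sketch below shows the arithmetic and why an inflated total lowers the result. The function names and all numbers are hypothetical, chosen only for illustration.

```python
def hits_per_million(hits, total_pages):
    """Normalize a raw hit count to matches per million indexed pages."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return hits * 1_000_000 / total_pages


def format_count(value, formatted=True):
    """Thousands separators for reading, plain digits for pasting elsewhere."""
    return f"{value:,.2f}" if formatted else f"{value:.2f}"


# Hypothetical example: the same raw hit count against two different
# estimates of total index size. Doubling the estimate halves the figure.
same_hits = 2_500_000
print(format_count(hits_per_million(same_hits, 10_000_000_000)))  # 250.00
print(format_count(hits_per_million(same_hits, 20_000_000_000)))  # 125.00
```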
Latest Changes
Wiki detailing additions and tweaks to this site
Web as Corpus Wiki
Wiki with links to web as corpus events, sites and code
Find Search Terms
Search by wildcard in various databases for single-word English search terms (e.g. morphological variants) for pasting into the Advanced Query field

Related papers

Ready-made frequency lists from the Web

English

Web Corpus 2006 – 100 or more HTML
HTML version of list of 30,524 types occurring 100 or more times in this corpus
Web Corpus 2006 – 100 or more TAB
Tab-separated text version of list of 30,524 types occurring 100 or more times
Web Corpus 2006 – 10 or more TAB
Tab-separated text version of list of 104,675 types occurring 10 or more times

Dutch & Afrikaans

Major search engines do not distinguish between Dutch and Afrikaans: they do not provide for searching only for pages in Afrikaans, and searches for pages in Dutch usually return some pages in Afrikaans as well. National domains (.nl, .be / .za) are only a rough guide to location. International domains like .com, .net, .biz, .info etc. provide no clue to the source. These lists were compiled to test various algorithms to distinguish Afrikaans from Dutch pages.
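One simple family of such algorithms counts high-frequency words that are common in one language and rare or absent in the other. The sketch below is only an illustration of that idea, not the algorithm tested on this site: the marker sets are tiny hand-picked assumptions, and a real classifier would draw on frequency lists like those below.

```python
import re

# Assumed marker words (hand-picked for this example, not from the site's lists).
AFRIKAANS_MARKERS = {"nie", "ek", "vir", "hulle", "baie"}
DUTCH_MARKERS = {"niet", "ik", "de", "zijn", "wij"}


def guess_language(text):
    """Return "afrikaans", "dutch", or "undecided" based on marker counts."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    afrikaans_score = sum(w in AFRIKAANS_MARKERS for w in words)
    dutch_score = sum(w in DUTCH_MARKERS for w in words)
    if afrikaans_score > dutch_score:
        return "afrikaans"
    if dutch_score > afrikaans_score:
        return "dutch"
    return "undecided"


print(guess_language("Ek het nie die boek gelees nie."))  # afrikaans
print(guess_language("Ik heb het boek niet gelezen."))    # dutch
```

Short or mixed-language pages will often come out as "undecided" with marker sets this small, which is one reason larger frequency lists are useful for testing.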

Dutch Web Corpus 2006 – 1-grams
HTML version of list of 102,770 types occurring in a pilot corpus of 1,605,346 tokens (6.4 MB)
Afrikaans Web Corpus 2006 – 1-grams
HTML version of list of 62,785 types occurring in a pilot corpus of 1,263,509 tokens (3.9 MB)

Site developed by Bill Fletcher, whose other free resources (the BNC-based online database "Phrases in English", with proxy site PhrasesInEnglish.org if the first link fails; the web concordancer KWiCFinder; and the n-gram extractor kfNgram) are already familiar to the corpus linguistics community.

Bill's legacy site miniappolis.com is hosted on the same server.  It will no longer be updated, but to prevent link rot it will remain on life support for the foreseeable future.

Please help support this site by acquiring the innovative multilingual Visual Thesaurus.
 
WebAsCorpus.org only receives credit if you sign up via this link.

http://webascorpus.org launched 7 February 2007, updated 24 May 2010
Background: driftwood, Dares Beach, Maryland – original image