Web as Corpus Links
A collaborative repository for data, software, and links to Web as Corpus sites, set up by Stefan Evert and co-administered by various members of the WaC community
Annual Web as Corpus Workshops
- WaC 6: 6th Web as Corpus Workshop, in association with NAACL-HLT in Los Angeles, 5-6 June 2010
Workshop site
- WaC 5: 5th Web as Corpus Workshop, as part of SEPLN 09, Donostia / San Sebastián, Spain, 7 September 2009
Workshop site | Proceedings
- WaC 4: 4th Web as Corpus Workshop – Can we beat Google?, as part of LREC 2008, Marrakech, Morocco, 1 June 2008
Conference site | Proceedings
- WaC 3 / CLEANEVAL, Louvain-la-Neuve, Belgium, 15-16 September 2007
Conference site | Proceedings | CLEANEVAL Summary Report
- WaC 1, Corpus Linguistics conference, Birmingham, UK, July 2005
Conference site
Web as Corpus sites
Groups
- ACL SIGWAC Special Interest Group of the Association for Computational Linguistics (ACL) on Web as Corpus, organizer of the Web as Corpus workshop series
Other Wikis
- Web Genre Wiki (new Nov 07)
- WaCky Project wiki (inactive)
Web as Corpus concordancers
- KWiCFinder desktop Web concordancer
- Linguist's Search Engine search with parser (temporarily? offline)
- WebAsCorpus.org Web Concordancer (34 languages)
Web Corpora Online (direct query)
- Leeds collection of Internet corpora (English, Chinese, Finnish, French, German, Italian, Japanese, Polish, Portuguese, Russian, Spanish)
- WebAsCorpus.org (English, limited Dutch & Afrikaans)
ESL Sites based on Google's Web 1T Corpus
- Linggle wildcard search for collocates and examples based on Google 1T 2-grams
- FLAX Web Phrases, described in:
Wu, S., Witten, I. H. & Franken, M. (2010). Utilizing lexical data from a web-derived corpus to expand productive collocation knowledge. ReCALL, 22(1), 83–102.
Links to other modules
Other WaC Projects
- Corpus building for minority languages Kevin Scannell's Web crawling software and site exploiting the diversity of the Web for over 400 under-resourced languages.
WaC-related Tools / Software
- Jaguar extracts specialized corpora from the web and analyzes various lexical statistics; runs and saves corpora on developer's server
- GrosMoteur Web concordancer; supports either querying Yahoo! or crawling the Web; cross-platform (Python)
Publications
- Baroni, Marco and Bernardini, Silvia (eds.) 2006. WaCky! Working papers on the Web as Corpus. Bologna: GEDIT. incorporates papers from WaC 1 2005
- Gatto, Maristella 2009. From Body to Web. An Introduction to the Web as Corpus. Roma - Bari: Laterza University Press Online.
pre-publication light version (6.5 MB, no registration required) | definitive version (51 MB, requires registration)
Search Engine Links
- AltSearchEngines.com reviews specialized and non-English SEs
- multilingual-search.com discusses issues and developments in non-English search
- abondance.com tracks the European search market from a French perspective