Web as Corpus Resources

APIs

APIs (Application Programming Interfaces) allow developers to access online services programmatically, from applications that run on a web server or on a local machine. While they are useful, they are subject to change or elimination. This list includes notes on APIs I have come across and at least looked at.

The best site for finding APIs and receiving notification when new ones are released is http://www.programmableweb.com.

80legs
Web crawler and text processor distributed over 50,000 PCs that contribute their idle time (the SETI@home model). Some built-in text processing capability (e.g. strip HTML, match regular expressions to return only matching pages or text), with support for custom code in Java and .NET. Fee-based, but very low cost: $2.00 per million pages crawled and $0.03 per CPU hour. Claims to crawl 2 billion pages a day. Still in beta. (Nov 09)
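To get a feel for what those rates mean for a corpus-building job, here is a rough cost sketch. The two prices are the ones quoted above (Nov 09); the function name and the example job size are my own assumptions, and actual billing may include charges not listed here.

```python
# Rates quoted in the note above (Nov 09); they may have changed since.
PRICE_PER_MILLION_PAGES = 2.00  # US dollars
PRICE_PER_CPU_HOUR = 0.03       # US dollars


def estimate_crawl_cost(pages, cpu_hours):
    """Return the estimated cost in dollars for a crawl.

    pages      -- number of pages crawled
    cpu_hours  -- CPU hours spent on custom text processing
    """
    crawl_cost = pages / 1_000_000 * PRICE_PER_MILLION_PAGES
    processing_cost = cpu_hours * PRICE_PER_CPU_HOUR
    return crawl_cost + processing_cost


if __name__ == "__main__":
    # Hypothetical job: 10 million pages and 100 CPU hours of processing.
    print(f"${estimate_crawl_cost(10_000_000, 100):.2f}")
```

For the hypothetical job in the example, that is $20 for crawling plus $3 for processing.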
AlchemyAPI
offers a number of useful services (list copied from their webpage): named entity extraction, text categorization (very basic), language detection (claims about 90 languages recognized), keyword / term extraction, web page cleaning (= boilerplate removal; works fine for European languages, less consistent results with e.g. Chinese), structured data / content scraping. Straightforward API with examples in various programming languages. Your program sends a URL to the REST endpoint of one of their services, and it returns what you asked for. Alternatively, you can post the data directly. While support is weak for non-European languages, full support for Russian is a pleasant surprise.
“Use the full range of AlchemyAPI services completely free of cost! This includes both commercial and non-commercial use! Make up to 30,000 API calls a day. Higher limits available to approved educational institutions and non-profit groups.”
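The "send a URL to a REST endpoint" pattern looks roughly like the sketch below. It only builds the request URL and makes no network call. The base address, the service name `GetLanguage`, and the parameter names `apikey` and `url` are placeholders I made up for illustration, not the provider's documented API; check their documentation for the real values.

```python
from urllib.parse import urlencode

# Hypothetical base address; replace with the endpoint from the provider's docs.
BASE = "https://api.example.com/calls"


def build_request_url(service, page_url, api_key):
    """Return a GET request URL asking `service` to process `page_url`."""
    # `apikey` and `url` are assumed parameter names, used here for illustration.
    query = urlencode({"apikey": api_key, "url": page_url})
    return f"{BASE}/{service}?{query}"


if __name__ == "__main__":
    print(build_request_url("GetLanguage", "http://example.org/page.html", "YOUR_KEY"))
```

The page URL is percent-encoded by `urlencode`, so it can safely contain its own query string. To post text directly instead, the same parameters would go in the request body rather than the query string.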
University of Western Australia
announced 23 June 2010 by Wilson Wong on the Corpora List
“We have made available a list of web services for accessing text mining and NLP tools implemented at our research group (http://ontology.csse.uwa.edu.au) such as boilerplate removal (known as HERCULES), semantic similarity/relatedness measures (i.e. Normalised Web Distance, n-Degree of Wikipedia), noun phrase chunking, triple extraction, noisy text cleaning (known as ISSAC), simple term extraction, and access to our multi-domain, 300 million token text corpora (which are continuously growing). Please write to wilson@csse.uwa.edu.au to obtain a free developer key.”
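For readers unfamiliar with the Normalised Web Distance mentioned in the announcement, here is a minimal sketch using the standard formula published by Cilibrasi and Vitányi (also known as Normalised Google Distance). I am assuming the UWA service computes something along these lines; the announcement does not say. The function name and the counts in the example are made up.

```python
import math


def normalised_web_distance(fx, fy, fxy, n):
    """Normalised Web Distance between two terms x and y.

    fx, fy -- number of pages containing x, and containing y
    fxy    -- number of pages containing both x and y
    n      -- total number of pages indexed
    Lower values mean the terms are more closely related.
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # No occurrences or no co-occurrence: treat as maximally distant.
        return float("inf")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


if __name__ == "__main__":
    # Made-up page counts, for illustration only.
    print(normalised_web_distance(1000, 100, 10, 1_000_000))
```

Two terms that always occur together get a distance of 0; the value grows as co-occurrence becomes rarer relative to the individual frequencies.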

Datasets

From Marco Baroni on the Corpora List, 19 Dec 09
We are happy to announce that you can download two new resources from the site of WaCky (Web as Corpus kool ynitiative):
http://wacky.sslmit.unibo.it/

  1. pukWaC: the ukWaC corpus, a 2-billion-word Web-derived corpus of English, now enriched with a full dependency parse (POS-tagging and lemmatization done with the TreeTagger, parsing done with the MaltParser);
  2. WaCkypedia: a full 2009 English Wikipedia dump (about 800 million tokens), POS-tagged, lemmatized and dependency parsed with the same tools used for pukWaC.
  • Alessandro Lenci (University of Pisa)
  • Silvia Bernardini, Adriano Ferraresi, Eros Zanchetta (University of Bologna)
  • Marco Baroni (University of Trento)

Links to various online text repositories, datasets etc
http://www.diggingintodata.org/Home/Repositories/tabid/167/Default.aspx

