Wikipedia & Project Gutenberg Ngram databases

By Jay McKinnon, published on February 16, 2012

I thought I’d follow up ngram.sh: a script for extracting Google Ngram data with data sources for Wikipedia and Project Gutenberg ngrams. The ngram.sh script can easily be modified to extract your keywords from these databases too.

Number of publications offered by Project Gutenberg, 1994-2008.

Wikipedia Ngram data

Title: Tagged and Cleaned Wikipedia (TC Wikipedia) and its Ngram
Author: Javier Artiles & Satoshi Sekine at NYU’s Proteus Project.
Source: Wikipedia [18:12, June 8, 2008 version]
Ngrams: 1gram (31MB); 2gram (447MB); 3gram (1.9GB); 4gram (4.3GB); 5gram (7.1GB); 6gram (10GB); 7gram (13GB)
Also: List of headwords, Infobox data, and sentences tagged by NLP (Natural Language Processing) tools.
Code: None provided.

Number of articles in the English Wikipedia, 2001-2012.

Project Gutenberg Ngram data

Title: N-gram data from Project Gutenberg
Author: Prashanth Ellina
Source: Project Gutenberg [n.d. probably 2008]
Ngrams: 2gram & 3gram (624MB)
Also: three tarballs (5.3GB each) of the “complete” text database.
Note: The source post is from May 2008, and Project Gutenberg is always growing, so why not get fresh data from the source? See the Project Gutenberg Mirroring How-To.
Code: Yes, step-by-step instructions.
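
If you do pull fresh text from a mirror, counting 2grams and 3grams yourself is not much work. The following is a minimal Python sketch of that idea, not the method from the post linked above; the tokenizer rule (lowercase words, digits and apostrophes) and the function names are my own assumptions for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (an assumed, very simple rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n shifted copies of the token list yields each n-token window.
    return Counter(zip(*(tokens[i:] for i in range(n))))


if __name__ == "__main__":
    sample = "the cat sat on the mat and the cat slept"
    tokens = tokenize(sample)
    for gram, count in count_ngrams(tokens, 2).most_common(3):
        print(" ".join(gram), count)
```

For a whole mirror you would feed the files through one at a time and add the counters together, since the full text will not fit comfortably in memory on a small shell account.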


ngram.sh: a script for extracting Google Ngram data

By Jay McKinnon, published on February 13, 2012

What is ngram.sh?

The Google Ngram Viewer is a database browser used to chart the relative frequency of words or phrases. The data source is the Google Books database (sort of) and the graphic engine is Google Charts. It’s cool. It’s pretty. It’s hard to use for academic work because it doesn’t easily give up the raw data.

This page explains how to run a script — ngram.sh — on a *NIX shell account and extract your keywords into spreadsheet-ready files, without buying a terabyte hard drive.
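
To give a feel for the idea (this is not the ngram.sh script itself), here is a minimal Python sketch that pulls a few keywords out of n-gram rows and writes a spreadsheet-ready CSV. It assumes tab-separated rows of the form ngram, year, match_count, followed by any further columns; the function names and sample rows are made up for illustration.

```python
import csv
import io


def extract_keywords(lines, keywords):
    """Collect {keyword: {year: match_count}} from tab-separated rows."""
    wanted = set(keywords)
    series = {k: {} for k in keywords}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip short rows and rows for n-grams we did not ask for.
        if len(fields) < 3 or fields[0] not in wanted:
            continue
        ngram, year, match_count = fields[0], int(fields[1]), int(fields[2])
        series[ngram][year] = series[ngram].get(year, 0) + match_count
    return series


def write_csv(series, out):
    """Write one row per year and one column per keyword."""
    keywords = list(series)
    years = sorted({y for counts in series.values() for y in counts})
    writer = csv.writer(out)
    writer.writerow(["year"] + keywords)
    for year in years:
        writer.writerow([year] + [series[k].get(year, 0) for k in keywords])


if __name__ == "__main__":
    # Hypothetical sample rows, not real Google Ngram counts.
    sample = [
        "telegraph\t1850\t120\t100\t40\n",
        "telephone\t1900\t300\t250\t90\n",
        "telegraph\t1900\t80\t70\t30\n",
        "television\t1950\t500\t400\t150\n",
    ]
    buf = io.StringIO()
    write_csv(extract_keywords(sample, ["telegraph", "telephone"]), buf)
    print(buf.getvalue())
```

Because the rows are read one at a time, the same loop can be fed from a streamed download instead of a file on disk. Note that these are raw match counts; charting relative frequency would also need the total counts per year.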

Ngrams: telegraph, telephone, television & Internet (1850-2000)

Ngram frequency graph of four communication technologies — telegraph, telephone, television and Internet — from 1850 to 2000.



How to make a Facebook page host content from your website

By Jay McKinnon, published on August 7, 2011

This is a short “How To” document for making a Facebook page and inserting an iFrame from your website into that page. For more information see the Apps on Facebook.com documentation.

Make a page

Go to the very bottom of this page and click “Create a Page”.

Choose a page type, category and name, agree to the TOS, and click “Get started”.

Click “Edit Page”, then “Manage Permissions” (2nd from the top on the left), and make sure Page visibility is set to “Only admins can see this page”. Take note of the “Default Landing Tab”: you can change it to decide what your visitors see first (i.e. your app).

Now click “Manage Admins” (below the orange flag on the left) and add me.

Make an App

Go to the Developer page and (read more…)

Categories: Facebook, Geek Out, Public

Host your own real-time Twitter wall

By Jay McKinnon, published on August 6, 2011

This article describes the use of a simple AJAX-powered webpage to display a (near) real-time feed from Twitter. If you found this page by following a link to a specific hash tag (e.g. #Jan25 for the Egyptian Revolution), it is likely that the page was removed after the tag fell into disuse.

Short Title: #bloxtalk project

Examples: one page like #SOTU, or multiple pages like #egypt and #algeria, or mixed with other elements, like #bloxtalk

Inspired by comments in Diogenes2008’s diary (Jan 25 2011), I rolled out a simple AJAX-powered live feed of the #FOK Twitter hash tag. Having kinda botched the first version, I gave it some more thought and came up with some innovations to promote real-time, cross-platform communication for web events (live blogs, tweetathons, etc). It’s off-the-shelf JavaScript with vanilla HTML, blended together to put conversations from multiple platforms on the same page.

This is an easy how-to guide, offered in the hope that future live blog events can be made more dynamic and live tweet events more accessible to those unfamiliar with Twitter.

Contents

  1. Twitter is…
  2. Why would you do this?
  3. Requirements
  4. Examples in use
  5. The code
  6. Limitations & lessons learned
  7. Hax & back-channels

Originally published 26 January 2011, to DailyKos as “#bloxtalk: roll your own twitter+IRC wall”

How to Aaaarg.org

By Jay McKinnon, published on August 5, 2011

Aaaaarg.org is a human-generated index of pirated academic works popular among graduate students. This document is an introduction to how to use the system.

Q: How the hell does “Aaaarg.org” work? I can’t seem to figure it out.

A: First, you should probably check the Arg Dot Org Facebook page to see if those are the kind of titles you’re interested in, and interested in talking about. If that stuff boils your crayfish, you’ll need an account at Aaaaarg.org. The old URLs got nuked by lawyers for academic publishers, which should tell you two things:

  1. Some part of what Aaaarg.org facilitates is probably illegal in your jurisdiction, and
  2. The workflow is going to be weird enough to protect the website from liability.

Aaaaarg.org is a human-generated index of pirated academic works. Usually, someone requests a text here or here or here (or in a new thread). Someone else looks in their own library and, if they find it, scans it and uploads it somewhere (like iFile.it, Megaupload.com, Easy-Share.com or something similar). They then create a library record for that file here. If you subscribe/follow an “issue” you can publish a library record to that pool, and if you download and re-upload a file to a new server you can add an external link.

But all in all, it’s a gong show. Disorganized, chaotic, filled with nothing you’re looking for and lots of shit you wouldn’t have thought to look for. But sometimes, it can deliver something you need and can’t find. …Does that *kinda* explain the workflow? :)

P.S. Aaaarg.org can also be found on Twitter.
