SubDB
N-Gram-Based Text Categorization

By now, you might have heard that Google has deprecated its Language API:

The Google Translate API has been officially deprecated as of May 26, 2011. Due to the substantial economic burden caused by extensive abuse, the number of requests you may make per day will be limited and the API will be shut off completely on December 1, 2011.

More on: Google Code Blog and Google Translate API Documentation

As you know, we used the Google Translate API for language detection. No more! We are now using an implementation of N-Gram-Based Text Categorization to do this vital work on our own servers.
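For the curious, here is a minimal sketch of the idea, following the rank-order scheme from the Cavnar and Trenkle paper of the same name. This is not our production code: the function names, the `max_n` and `top_k` defaults and the underscore padding are assumptions made for illustration. Each language gets a profile (its most frequent character n-grams, ranked), and a text is assigned to the language whose profile is closest by the "out-of-place" distance.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the top_k most frequent character n-grams (sizes 1..max_n), most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscores mark word boundaries (illustrative choice)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile cost a fixed maximum."""
    lang_rank = {gram: rank for rank, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_rank[gram]) if gram in lang_rank else max_penalty
        for rank, gram in enumerate(doc_profile)
    )


def detect_language(text, lang_profiles):
    """Return the key of the language profile with the smallest out-of-place distance."""
    doc_profile = ngram_profile(text)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc_profile, lang_profiles[lang]),
    )
```

In practice, each entry in `lang_profiles` would be built once with `ngram_profile` from a large sample of text in that language, then reused for every upload.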

This means faster uploads and better language detection: faster, because we no longer have to call Google's servers during the upload process, and better, because we can now probe more chunks of data, analyze the results and improve the system to satisfy our specific needs.
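As a rough illustration of what probing several chunks could look like, the sketch below splits a text into pieces, runs a detector on each and takes a majority vote. The names `split_chunks` and `vote_language`, the `chunk_size` default and the `detect` callable are all hypothetical; this shows one plausible approach, not how our servers actually do it.

```python
from collections import Counter


def split_chunks(text, chunk_size=200):
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def vote_language(text, detect, chunk_size=200):
    """Run detect() on each non-blank chunk; return (winner, share of chunks that agreed)."""
    chunks = [c for c in split_chunks(text, chunk_size) if c.strip()]
    if not chunks:
        return None, 0.0
    votes = Counter(detect(chunk) for chunk in chunks)
    winner, count = votes.most_common(1)[0]
    return winner, count / len(chunks)
```

Here `detect` is any function that maps a string to a language label. The returned share gives a crude confidence figure: a low value suggests the chunks disagree and the result deserves a second look.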

To make this possible, we now support only a small subset of languages: Dutch, English, French, Italian, Polish, Portuguese (Brazil), Romanian, Spanish, Swedish and Turkish. We plan to add more languages in the future, but for now we’ve chosen quality over quantity.

Thanks to @edufelipedev for the suggestion of the algorithm and @wilkerlucio for helping with the tests.

Cheers.
