It’s been almost a year since I last maintained the wiki.debian.org.tw web site. Since I joined my current company, I have spent all my time dealing with routine jobs every single day. I don’t even want to touch my laptop at home after I finish work each day.

Lately, wiki.debian.org.tw has become more unstable. Over the last couple of weeks, people have often seen ‘Service is not available’ pages. One reason is that the disk was full; the other is that there are too many spam articles.

Finally, I spent a few hours on the site this weekend. The first thing I did was upgrade the server and the MediaWiki software. Frankly speaking, it was not hard at all, since the wiki is installed in a vserver based on Debian. All I needed to do was run ‘aptitude dist-upgrade’ to upgrade the distribution from sarge to etch. Then I synced the MediaWiki source tree from 1.7.1 to 1.11.1. That was also very easy, since MediaWiki provides an upgrade script that checks and modifies the database schema.

The real problem is the thousands of spam articles. I had not handled the spam problem for a long time, and most of the wiki moderators do not check for spam frequently, so the spammers could easily post a lot of articles without supervision. Even if the moderators visited the wiki often, it would still be impossible to delete the spam through the web interface, because there are too many spammers.

Anyhow, the result is that I ended up with thousands of spam articles in the database. Most of them are advertisements for venereal disease treatments; they help you deal with syphilis, gonorrhea and herpes. I almost want to rename the wiki to ‘SafeSexpedia’, since it has become such an informative knowledge base.

Still, I cannot stand the spam situation. The first two things I did: I installed the reCAPTCHA MediaWiki extension, so people need to pass a CAPTCHA when they try to register an account, and I enabled $wgEmailConfirmToEdit, which allows only accounts with a confirmed email address to edit pages. These two measures should be good enough to stop new spammers. However, the real problem is the spam articles already in the database.

In order to clean up the database, I checked several extensions such as Nuke. However, I found they are not convenient for cleaning up thousands of spam articles, so I decided to use the APIs. The good thing is that there are two scripts in the mediawiki/maintenance folder, cleanupSpam.php and removeUnusedAccounts.php. cleanupSpam seems to fit my requirements: it takes a URL as its argument, finds all the articles which contain that URL, and removes it.

However, I don’t want to go through the articles one by one looking for URLs. Since most of the spammers on wiki.debian.org.tw are from China, most of them use email addresses at 163.com. The easiest way for me is to clean up all the accounts from 163.com and all of the articles posted by these accounts. Of course, I cannot just delete these articles, because spammers can modify any article they want. If I did, I might remove some important articles that were merely modified by spammers.

So I need a script that finds the accounts with a particular email or nickname, and then finds all of the articles modified by those accounts. For each article:

  • If the account is not the latest editor, we ignore the article, because someone might already have fixed the content manually.
  • If the account is the latest editor, the article was created by the account, and it has only a single version, we simply delete it.
  • If the account is the latest editor and there are earlier versions, we find the last version edited by a valid account and restore the article to that version. That way we get the right content for the article, from before the spammer put the links into it.
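
The three rules above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the actual PHP script: the `Revision` class, the `decide` function and the action names are hypothetical. The last fallback (every revision was written by a spam account) is not covered by the rules above, so deleting in that case is my assumption here.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Revision:
    rev_id: int   # hypothetical revision identifier
    editor: str   # account name that saved this revision


def decide(revisions: List[Revision], spam_accounts: Set[str]) -> Tuple[str, Optional[int]]:
    """Return (action, revision_id) for one article.

    `revisions` is ordered oldest first and must not be empty.
    """
    latest = revisions[-1]
    # Rule 1: someone else edited after the spammer, so leave the article alone.
    if latest.editor not in spam_accounts:
        return ("ignore", None)
    # Rule 2: a single-revision article whose only editor is a spammer.
    if len(revisions) == 1:
        return ("delete", None)
    # Rule 3: walk backwards to the newest revision by a non-spam account.
    for rev in reversed(revisions[:-1]):
        if rev.editor not in spam_accounts:
            return ("revert", rev.rev_id)
    # Assumption, not stated in the rules: every revision is spam, so delete.
    return ("delete", None)
```

Keeping the decision separate from the database calls makes it easy to print the planned action for every article first, which is how a preview mode can work.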

I wrote another script based on the maintenance samples; thanks to their developers. With the script, I deleted hundreds of accounts and more than two thousand articles in a few hours. If you are interested in the script, you can download it from here. Put it in your mediawiki/maintenance folder. The usage is very simple:

USAGE: php removeSpamAccountsAndPost.php [--delete] email

It takes only one parameter; you can find the articles by nickname or email. My database is MySQL, so you can use ‘%’ as a wildcard, since the parameter is matched with a LIKE statement.

php removeSpamAccountsAndPost.php chihchun
php removeSpamAccountsAndPost.php chihchun%
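
To show how the ‘%’ wildcard behaves in a LIKE match, here is a small self-contained sketch. It uses Python with an in-memory SQLite database instead of MySQL, and the `accounts` table, its columns and the sample rows are made up for the demonstration; the real script’s query is not shown in this post.

```python
import sqlite3

# Hypothetical table and rows, only for demonstrating LIKE patterns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [
        ("chihchun", "chihchun@example.org"),
        ("chihchun2", "someone@example.net"),
        ("alice", "alice@example.com"),
    ],
)


def find_accounts(pattern):
    """Return account names whose name or email matches the LIKE pattern."""
    rows = conn.execute(
        "SELECT name FROM accounts"
        " WHERE name LIKE ? OR email LIKE ? ORDER BY name",
        (pattern, pattern),
    )
    return [name for (name,) in rows]
```

Without a wildcard the pattern has to match the whole value, so `find_accounts("chihchun")` finds only that one account, while `find_accounts("chihchun%")` also finds `chihchun2`. A pattern such as `"%@example.com"` selects every account at one mail domain.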

By default, the script only gives you a list for preview. If you are sure that these accounts and articles should be deleted, add ‘--delete’ to make the script REALLY DELETE THE ACCOUNTS AND ARTICLES for you.

php removeSpamAccountsAndPost.php --delete chihchun

If you have ever read my blog entry about setting up Debian.org.tw, you probably already know that I love to put a reverse proxy in front of my web servers. This approach solves the problem of sharing a single IP address among multiple vservers, and it also provides a web cache, which reduces the server load.

Since the proxy server (Squid) passes the HTTP session to the real web servers, one problem is that my web servers always see a single source IP address, which is the proxy’s IP address. Even though the proxy server still puts the client’s IP in the ‘X-Forwarded-For’ HTTP header, it’s still painful to retrieve the correct IP address from the header in every web application.

Thanks to Thomas Eibner, who wrote the reverse proxy add forward module for Apache. The module simply checks whether the request comes from the proxy server; if it does, it replaces the remote address with the client IP taken from the ‘X-Forwarded-For’ header, and copies ‘X-Forwarded-Host’ or ‘X-Host’ into the ‘Host’ header. So you don’t need to worry about the wrong IP address, and you can track HTTP requests more easily.
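
For applications that cannot rely on such a module, the same idea can be sketched in Python. This is not the module’s code, only a minimal illustration; `client_ip`, its parameters and the plain dictionary of headers are hypothetical names for this example.

```python
from typing import Mapping, Set


def client_ip(remote_addr: str, headers: Mapping[str, str], trusted_proxies: Set[str]) -> str:
    """Return the address the request should be attributed to."""
    # Only trust the header when the connection really comes from our proxy;
    # otherwise any client could spoof its address.
    if remote_addr not in trusted_proxies:
        return remote_addr
    forwarded = headers.get("X-Forwarded-For", "")
    # The header may hold a comma-separated chain of addresses; the last
    # entry is the one appended by the proxy closest to this server.
    hops = [hop.strip() for hop in forwarded.split(",") if hop.strip()]
    return hops[-1] if hops else remote_addr
```

The check against `trusted_proxies` is the important part: a request that arrives directly from the internet keeps its own source address, no matter what it claims in the header.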

The Debian package was packaged by Piotr Roszatycki, but it’s still the old 0.5 version. Since 0.6 is out, I filed a bug report to remind him. For my etch servers, I back-ported the package with the latest version. You can download it from my personal repository.

BTW, Piotr Roszatycki, who is also the maintainer of yada, uses yada for libapache2-mod-rpaf. After reading yada’s script file ‘debian/packages’, I really felt like I had gone back to my ‘good’ old days with RPM/specs. :p

Mobile barcodes, especially QRCode, have become more and more popular in Taiwan in the last couple of years. More and more devices have a QRCode reader built in, and we are seeing more and more 2D barcodes in print media.

I mean it. Every day I take the MRT to the office, and I always see billboards with QRCodes on them, and they are also in many of the free MRT newspapers.


About two weeks ago, I went to Tokyo on a business trip. As usual, I did some research on the unfamiliar city. Every time you first arrive in a new city, you need to find a place to stay, and then you can plan the spots to visit.

I found Google My Maps to be quite a nice tool for making travel plans, since you can easily locate sightseeing spots and hotels with Google Maps’ geocoding function, take notes, and share them with others.


My study on hotels in Shibuya.


If you are using Debian or Ubuntu and installed the default Gnome desktop, your system probably already has Gnome Volume Manager.

Gnome Volume Manager is a daemon which listens for hardware events (HAL events) and runs user-configurable commands. Basically, it does autorun and automount for hot-plugged devices. In plain English, it runs the multimedia player or CD burning software when you put a CD into the CD-ROM drive.
