It’s been almost a year since I last maintained the wiki.debian.org.tw web site. Since I joined my current company, I have spent all my time dealing with routine jobs every single day, and I don’t even want to touch my laptop at home after I finish work.

Lately, wiki.debian.org.tw has become more and more unstable. People have often seen `Service is not available' pages in the last couple of weeks. One reason is that the disk is full; the other is that there are too many spam articles.

Finally, I spent a few hours on the site this weekend. The first thing I did was upgrade the server and the MediaWiki software. Frankly speaking, it was not hard at all, since the wiki is installed in a Debian-based vserver. All I needed to do was run `aptitude dist-upgrade' to move the distribution from sarge to etch, and then sync the MediaWiki source tree from 1.7.1 to 1.11.1. That part was also very easy, since MediaWiki provides an upgrade script that checks and migrates the database schema.
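For the record, the whole upgrade boils down to a couple of commands like these (a sketch only: the wiki path is a placeholder, and update.php is the schema upgrade script that ships in MediaWiki's maintenance folder):

# Upgrade the vserver from sarge to etch
# (after pointing /etc/apt/sources.list at etch).
aptitude update && aptitude dist-upgrade

# After syncing the MediaWiki tree to 1.11.1, let the upgrade
# script check and migrate the database schema.
cd /var/www/wiki/maintenance && php update.php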

The real problem is the thousands of spam articles. I had not handled the spam problem for a long time, and most of the wiki moderators do not check for spam frequently, so the spammers could easily post a lot of articles without supervision. Even though the moderators visit the wiki often, it is still impossible to delete the spam through the web interface; there are simply too many spammers.

Anyhow, the result is that I have thousands of spam articles in the database. Most of them are advertisements for venereal disease treatments; they offer to help you deal with syphilis, gonorrhea and herpes. I almost want to rename the wiki `SafeSexpedia', since it has become such an informative knowledge base.

Still, I cannot stand the spamming situation. The first two things I did were to install the reCAPTCHA MediaWiki extension, so people have to pass a CAPTCHA when they register an account, and to enable $wgEmailConfirmToEdit, which only allows accounts with a confirmed email address to edit pages. These two approaches should be good enough to stop new spammers. However, the real problem is the spam articles already in the database.
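The corresponding LocalSettings.php changes look roughly like this (a sketch: the extension path and the key variable names are assumptions based on the reCAPTCHA extension's documentation, so check them against your copy; only $wgEmailConfirmToEdit is a core MediaWiki setting):

# Append to LocalSettings.php from the wiki root; the keys are
# placeholders for the ones issued by recaptcha.net.
cat >> LocalSettings.php <<'EOF'
require_once( "$IP/extensions/recaptcha/ReCaptcha.php" );
$recaptcha_public_key  = 'your-public-key';
$recaptcha_private_key = 'your-private-key';
$wgEmailConfirmToEdit  = true; // only confirmed-email accounts may edit
EOF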

In order to clean up the database, I checked several extensions, like Nuke. However, I found they are not convenient for cleaning up thousands of spam articles, so I decided to script it with the maintenance tools instead. The good thing is that there are two scripts in the mediawiki/maintenance folder, cleanupSpam.php and removeUnusedAccounts.php. cleanupSpam.php seemed to fit my requirements: it takes a URL as an argument, finds all the articles that contain the URL, and removes them.
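For example, a run against a single spam domain looks roughly like this (the hostname is a placeholder):

# Clean up every page that links to the given spam host.
cd maintenance && php cleanupSpam.php spam-pills.example.com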

However, I did not want to check the articles one by one looking for URLs. Since most of the spammers on wiki.debian.org.tw are from China, and most of them use email addresses at 163.com, the easiest way for me is to clean up all the accounts from 163.com together with all the articles posted by those accounts. Of course, I cannot just delete these articles: a spammer can modify any article he wants, so I might end up removing important articles that were merely edited by spammers.

So I need a script that finds the accounts with a particular email address or nickname, and then finds all of the articles modified by those accounts (a query sketch follows the list below). For each article:

  • If the account is not the latest editor, we ignore the article, because someone might already have fixed the content manually.
  • If the account is the latest editor, the article was created by the account, and it has only a single revision, then we simply delete it.
  • If the account is the latest editor and there are earlier revisions, we find the last revision edited by a valid account and restore the article to that revision. That way the article gets back the content it had before the spammer put the links into it.
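In SQL terms, the selection step looks roughly like this (a sketch against the MediaWiki 1.11 schema; the database name and the LIKE pattern are placeholders):

# List pages whose latest revision was made by a 163.com account.
mysql wikidb <<'EOF'
SELECT p.page_id, p.page_title
FROM user u
JOIN revision r ON r.rev_user = u.user_id
JOIN page     p ON p.page_latest = r.rev_id
WHERE u.user_email LIKE '%@163.com';
EOF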

I created another script based on the maintenance samples; thanks to those developers. With the script, I deleted hundreds of accounts and more than two thousand articles in a few hours. If you are interested in the script, you can download it from here. Put it in your mediawiki/maintenance folder. The usage is very simple:

USAGE: php removeSpamAccountsAndPost.php [--delete] email

It takes only one parameter; you can find the articles by nickname or email. My database is MySQL, so you can use `%' as a wildcard for the LIKE statement:

php removeSpamAccountsAndPost.php chihchun
php removeSpamAccountsAndPost.php chihchun%

By default the script only gives you a list for preview. If you are sure that these accounts and articles should be deleted, add `--delete' to let the script REALLY DELETE THE ACCOUNTS AND ARTICLES for you:

php removeSpamAccountsAndPost.php --delete chihchun

In the middle of last year, I set up SmokePing to monitor the network latency of several services at a certain company. The reason I chose SmokePing is that it supports several protocols, so I can monitor DNS, the SSH daemon, RADIUS, web, SMTP and so on all at once. SmokePing's architecture is also quite modular; with small changes to a few Perl scripts, it quickly satisfied my needs.
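For what it's worth, the relevant pieces of /etc/smokeping/config look roughly like this (a sketch only; DNS and EchoPingSmtp are standard SmokePing 2.x probes, and the hostnames are placeholders):

*** Probes ***
+ DNS
binary = /usr/bin/dig
lookup = debian.org.tw

+ EchoPingSmtp
binary = /usr/bin/echoping

*** Targets ***
+ services
menu = Services
title = Service latency

++ dns
probe = DNS
menu = DNS
title = DNS lookup latency
host = ns.example.net

++ smtp
probe = EchoPingSmtp
menu = SMTP
title = SMTP latency
host = mail.example.net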

But since the services are already being monitored around the clock, notification by email alone is not really timely enough. So I decided to build an SMS notification feature; I looked at a few SMS service providers and settled on PChome's cheap one-dollar SMS service. Thanks to SnowFLY (飄然似雪) for writing the Net-SMS-PChome Perl module, which saved me a lot of work. It is also why I am now often woken up by text messages in the middle of the night.

However, the version on CPAN is from 2006 and is not quite compatible with the current PChome web pages, so it needs some small modifications.

Here is my little script for `incremental' dumps of svn revision trees. The script simply checks every svn repository located under /home/svn and saves each revision into /home/backup.

#!/bin/bash

# Dump every repository under /home/svn incrementally, one gzipped
# file per revision, skipping revisions that were already dumped.
for dir in /home/svn/* ; do
    name=$(basename "${dir}")
    version=$(svnlook youngest "${dir}")
    for ((r=1; r<=version; r++)) ; do
        if [ ! -f "/home/backup/${name}-${r}.gz" ] ; then
            svnadmin dump "${dir}" -r "${r}" --incremental | \
                gzip -9 > "/home/backup/${name}-${r}.gz"
        fi
    done
done
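To restore, the per-revision dumps can simply be loaded back in order (a sketch; the repository name and revision count are placeholders):

# Recreate a repository from its incremental dumps.
svnadmin create /tmp/restored
for ((r=1; r<=1234; r++)) ; do
    gunzip -c "/home/backup/myrepo-${r}.gz" | svnadmin load /tmp/restored
done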

Have fun! This is a tip.

If you ever read my blog entry about setting up Debian.org.tw, you probably already know that I love to put a reverse proxy in front of my web servers. This approach solves the problem of sharing a single IP address among multiple vservers, and it also provides a web cache that reduces the server load.

Since the proxy server (Squid) passes the HTTP session on to the real web servers, one problem is that my web servers always see a single source IP address: the proxy's. Even though the proxy puts the client's IP in the `X-Forwarded-For' HTTP header, it is still painful to retrieve the correct address from that header in every web application.

Thanks to Thomas Eibner, who wrote the reverse proxy add forward (rpaf) module for Apache. The module simply checks whether a request comes from the proxy server; if it does, it takes the client address from `X-Forwarded-For' and the hostname from `X-Forwarded-Host' or `X-Host' and puts them back into the request as the remote address and `Host' header. So you don't need to worry about wrong IP addresses any more, and you can track HTTP requests much more easily.
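On Debian, enabling it takes only a few lines (a sketch; the config file location and the proxy IPs are placeholders for your own setup):

# Enable mod_rpaf and tell it which addresses the Squid box uses.
a2enmod rpaf
cat > /etc/apache2/conf.d/rpaf.conf <<'EOF'
RPAFenable On
RPAFsethostname On
RPAFproxy_ips 127.0.0.1 192.0.2.1
EOF
apache2ctl graceful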

The Debian package is maintained by Piotr Roszatycki, but it is still the old 0.5 version. Since 0.6 is out, I filed a bug report to remind him. For my etch servers, I back-ported the package to the latest version; you can download it from my personal repository.
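The backport itself is routine (a sketch; the unpacked directory name depends on the package version):

# Fetch the newer source package (with a sid deb-src line in
# sources.list) and rebuild it on etch.
apt-get source libapache2-mod-rpaf
cd mod-rpaf-0.6
dpkg-buildpackage -rfakeroot -us -uc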

BTW, Piotr Roszatycki uses yada for libapache2-mod-rpaf; he is also the maintainer of yada. After reading yada's script file `debian/packages', I really felt like I was back in my `good' old days with RPM specs. :p

Image Source: ScriptumLibre

The Free Software Foundation recently launched a protest campaign asking everyone to refuse to buy Trend Micro's antivirus software. The main reason is that Trend Micro accused Barracuda of violating its patent No. 5,623,600 because Barracuda's products use ClamAV; this is Eva Chen's (陳怡樺) `catching viruses in the air' patent. The patent was filed in 1995 and granted in April 1997, and it mainly covers implementing an SMTP proxy and an FTP proxy that scan traffic for viruses.