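Recently, in order to use a C library from a web page, I wrote a PHP PECL extension from scratch. Along the way I naturally did some research, which I am writing down here as a reference for others.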

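First, you should understand some prerequisites for PECL development, such as the basics of the Zend SAPI, the life cycle of each module, memory management, and how to retrieve ZVAL parameters. For this background, besides the now rather dated Zend API – Hacking the Core of PHP, Sara Golemon is one of the more prolific authors, having even published a book titled "Extending and Embedding PHP" and given a talk on PHP Extension Writing at ZendCon 2008 last year.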

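However, finishing a PECL extension hardly requires reading a whole book. In 2005 Sara Golemon also published several detailed technical articles on Zend Developer Zone that cover the most important points; reading them and following along once is roughly enough to become familiar with the basic techniques. I did not find detailed documentation for the Zend SAPI, so I also had to dig through the header files under /usr/include/php5/Zend and read the code of other core extensions. With that, you can pick up most of what is needed to complete an extension.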

The classic articles to read are Sara Golemon's series on Zend Developer Zone.

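If the English is hard to swallow, you can refer to Huang Shiqiang's Simplified Chinese translation. Huang Shiqiang also built a C-based PHP framework, Kiss (project page), which is quite interesting. Other frameworks and template engines designed with performance in mind include Blitz.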

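After reading the articles and grasping the basic concepts, you can start writing code. First you need a development environment, including build files, a directory structure, and a packaging mechanism. You can of course copy someone else's and modify it, or follow the PHP Manual and use ext_skel to generate a blank project. I used CodeGen_PECL, which is somewhat more full-featured. Once it is installed through PEAR, you can use pecl-gen to generate a blank project.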

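PHP extensions use phpize by default to generate the autotools scripts automatically. For a developer, adding compiler flags or source files only requires editing config.m4, without touching the full set of autotools scripts, which is quite convenient. config.m4 is processed by the m4 macro processor; I will skip the syntax here, since documentation on it is easy to find.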

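That is roughly what you need to know to write an extension. As an addendum, one thing the articles above do not cover is the Reflection API in PHP5: you can add arginfo to each function, which makes it easy for tools to extract the API automatically. For how to do this, see Christian Weiske's Hacking PHP5.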

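Also worth mentioning is unit testing; writing test cases is a good habit for maintaining software quality. If you use phpize, it should generate run-tests.php and the related test tooling for you. You can write automated test cases following its conventions and run make test after each build to make sure everything works. On IBM developerWorks there is an article, 淺析 PHP 官方自動化測試方法 (a brief analysis of PHP's official automated testing approach), that is worth reading.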

One last note: the library I wrapped this time is written in C. If you want to integrate an object-oriented C++ library, see the article Wrapping C++ Classes in a PHP Extension.

It has been almost a year since I last maintained the wiki.debian.org.tw web site. Since joining my current company, I have spent all my time dealing with routine work every single day. After finishing work each day, I don't even want to touch my laptop at home.

Lately, wiki.debian.org.tw has become more unstable. People have frequently seen "Service is not available" pages over the last couple of weeks. One reason is that the disk is full; the other is that there are too many spam articles.

Finally, I spent a few hours on the site this weekend. The first thing I did was upgrade the server and the MediaWiki software. Frankly, it was not hard at all, since the wiki is installed in a Debian-based vserver. All I needed to do was run "aptitude dist-upgrade" to upgrade the distribution from sarge to etch. Then I synced the MediaWiki source tree from 1.7.1 to 1.11.1. That was also very easy, since MediaWiki provides an upgrade script that checks and modifies the database schema.

The real problem is the thousands of spam articles. I had not dealt with the spam for a long time, and most of the wiki moderators do not check for spam frequently, so spammers could easily post a lot of articles without supervision. Even if the moderators visited the wiki often, it would still be impossible to delete all the spam through the web interface, because there are too many spammers.

Anyhow, the result is that I ended up with thousands of spam articles in the database. Most of them are advertisements for venereal disease treatments, offering help with syphilis, gonorrhea and herpes. I almost wanted to rename the wiki to "SafeSexpedia", since it had become such an informative knowledge base.

Still, I could not stand the spam situation. The first two things I did were to install the reCAPTCHA MediaWiki Extension, so people need to pass a CAPTCHA when they try to register an account, and to enable $wgEmailConfirmToEdit, which allows only accounts with a confirmed email address to edit pages. These two measures should be good enough to stop new spammers. However, the real problem is the spam articles already in the database.

To clean up the database, I checked several extensions such as Nuke, but found them inconvenient for cleaning up thousands of spam articles, so I decided to use the APIs. The good thing is that there are two scripts in the mediawiki/maintenance folder, cleanupSpam.php and removeUnusedAccounts.php. cleanupSpam seems to fit my requirements: it takes a URL as its argument, finds all the articles that contain the URL, and removes it.

However, I don't want to go through the articles one by one looking for URLs. Since most of the spammers on wiki.debian.org.tw are from China, most of them use email addresses at 163.com. The easiest way for me is to clean up all the accounts from 163.com and all of the articles posted by these accounts. Of course, I cannot simply delete every article these accounts touched, because spammers can modify any article they want. Doing so might remove important articles that were merely modified by spammers.

So I need a script that finds the accounts with a particular email or nickname, and then finds all of the articles modified by each account. For each article:

  • If the account is not the latest editor, we ignore the article, because someone might already have fixed the content manually.
  • If the account is the latest editor, the article was created by the account, and it has only a single version, we simply delete it.
  • If the account is the latest editor and there are earlier versions, we find the last version edited by a valid account and restore the article to that version. That gives us the right content for the article, from before the spammer put links into it.
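
The three rules above can be sketched as a small pure function. This is only an illustration in Python, not the actual PHP maintenance script: the Revision type, the decide function and the action names are hypothetical, and the fallback for an article whose every version came from spam accounts is an assumption the rules do not spell out.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Revision:
    rev_id: int   # hypothetical revision identifier
    editor: str   # account name that saved this revision


def decide(revisions: List[Revision],
           spam_accounts: Set[str]) -> Tuple[str, Optional[int]]:
    """Return (action, revision_id) for one article.

    `revisions` is the article history ordered from oldest to newest.
    """
    if not revisions:
        return ("ignore", None)

    latest = revisions[-1]
    # Rule 1: a non-spam account edited last, so the page may already be fixed.
    if latest.editor not in spam_accounts:
        return ("ignore", None)

    # Rule 2: the spam account is the latest editor and the only version is
    # its own, so the page was created by the spammer and can be deleted.
    if len(revisions) == 1:
        return ("delete", None)

    # Rule 3: walk backwards to the newest version saved by a valid account.
    for rev in reversed(revisions[:-1]):
        if rev.editor not in spam_accounts:
            return ("restore", rev.rev_id)

    # Assumption: if every version came from spam accounts, there is nothing
    # valid to restore, so the article is treated like a spam-only page.
    return ("delete", None)
```

For example, a history of alice, then two edits by a spam account, yields ("restore", <alice's revision id>), while a page whose last editor is a regular user is left untouched.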

I created another script based on the maintenance samples; thanks to their developers. With the script, I deleted hundreds of accounts and more than two thousand articles in a few hours. If you are interested in the script, you can download it from here. Put it in your mediawiki/maintenance folder. The usage is very simple:

USAGE: php removeSpamAccountsAndPost.php [--delete] email

It takes only one parameter; you can find the articles by nickname or email. My database is MySQL, so you can use '%' as a wildcard in the LIKE statement.

php removeSpamAccountsAndPost.php chihchun
php removeSpamAccountsAndPost.php chihchun%
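
To illustrate how such a pattern behaves, here is a minimal Python sketch that uses an in-memory SQLite database as a stand-in for MySQL. The accounts table, its columns and the sample rows are hypothetical and are not the MediaWiki schema; the point is only that a pattern without '%' matches the exact value, while '%' matches any run of characters.

```python
import sqlite3


def demo_db() -> sqlite3.Connection:
    """Build a throwaway in-memory table with a few made-up accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (name TEXT, email TEXT)")
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)",
        [
            ("chihchun", "chihchun@example.org"),
            ("chihchun2", "someone@163.com"),
            ("alice", "alice@example.org"),
            ("seller", "seller@163.com"),
        ],
    )
    return conn


def find_accounts(conn: sqlite3.Connection, pattern: str):
    """Return accounts whose name or email matches the LIKE pattern."""
    cur = conn.execute(
        "SELECT name, email FROM accounts "
        "WHERE name LIKE ? OR email LIKE ? ORDER BY name",
        (pattern, pattern),  # bound parameters, so the pattern is never spliced into the SQL
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = demo_db()
    print(find_accounts(conn, "chihchun"))    # exact nickname only
    print(find_accounts(conn, "chihchun%"))   # nicknames or emails starting with chihchun
    print(find_accounts(conn, "%@163.com"))   # every account with a 163.com address
```

Passing the pattern as a bound parameter rather than concatenating it into the query keeps the wildcard behaviour while avoiding SQL injection through the command-line argument.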

By default the script only gives you a list for preview. If you are sure that these accounts and articles should be deleted, add "--delete" to make the script REALLY DELETE THE ACCOUNTS AND ARTICLES for you.

php removeSpamAccountsAndPost.php --delete chihchun
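
The preview-by-default behaviour is a common dry-run pattern. A minimal Python sketch of it follows; the function names and output strings are hypothetical and do not reproduce the PHP script, and the list of matched accounts is passed in directly instead of being read from a database.

```python
import argparse
from typing import List


def parse_args(argv: List[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="remove_spam_sketch")
    # Destructive work happens only when --delete is given explicitly.
    parser.add_argument("--delete", action="store_true",
                        help="really delete instead of previewing")
    parser.add_argument("pattern", help="nickname or email pattern")
    return parser.parse_args(argv)


def run(argv: List[str], matched_accounts: List[str]) -> List[str]:
    """Return the lines the tool would print for the given arguments."""
    args = parse_args(argv)
    lines = []
    for name in matched_accounts:
        if args.delete:
            # A real tool would perform the deletion here.
            lines.append(f"DELETED {name}")
        else:
            lines.append(f"would delete {name}")
    return lines


if __name__ == "__main__":
    # Preview first, then repeat with --delete once the list looks right.
    print(run(["chihchun%"], ["chihchun", "chihchun2"]))
    print(run(["--delete", "chihchun%"], ["chihchun", "chihchun2"]))
```

Making the safe mode the default means a mistyped pattern costs nothing: the worst outcome of forgetting the flag is an unwanted preview.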