Tag: data collection

My First Application of FOI in Hong Kong

It was news to me that anyone can request almost any data from the Hong Kong government, for legitimate reasons, under the Code on Access to Information.

The Code is Hong Kong's response to the notion of FOI (Freedom of Information), which calls for citizens' free access to government information, so that transparency in government can be ensured and citizens' rights protected.

According to Wikipedia, by 2006 nearly 70 countries had passed related legislation. Among these laws are the United States' FOIA (Freedom of Information Act) and, of course, Hong Kong's Code on Access to Information.

Even with the Code in place, a practical question remains: will government officers fulfil their duty and actually reply to every single data request? I decided to give it a try on Accessinfo.hk.

Accessinfo.hk is a website positioned as a platform for citizens to post information requests to the authorities and receive their replies. It was initiated by a group of Open Data activists, including Guy Freeman, who is currently a data scientist at HK01. The website publishes every question and answer for everyone to see and, at the same time, monitors the process. It is a localization of the Alaveteli system ( http://alaveteli.org/ ) for Hong Kong; Alaveteli's sister site WhatDoTheyKnow ( https://www.whatdotheyknow.com/ ) had already seen wide use in the United Kingdom.

Screenshot: accessinfo.hk

Continue reading “My First Application of FOI in Hong Kong”

Network Packet Analysis: Retrieving Raw Fusion Table Data from Google Maps

Maps drawn with Google Maps are everywhere online. If you want to run a secondary analysis on the points of interest (POI) in such a map, you need the structured data behind it. If the map was drawn with Google Fusion Tables, you can find the Fusion Table's ID by capturing network packets and then assemble the URL of the raw data. This article is a contribution from fellow student Lam Man Kit and is shared for technical exchange only. Data journalists using the technique should mind the copyright of the original data, and local researchers should likewise follow fair-use principles. The article takes FactWire's data report 「分析182個領展停車場月租收費 9成2貴過房委會同區 最大差距達1.18倍」 (an analysis of monthly rates at 182 Link REIT car parks, finding 92% of them more expensive than Housing Authority car parks in the same district, with the largest gap at 1.18 times) as its example.

Figure: finding the Fusion Table ID through network packet capture
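
Once the table ID has been captured from the traffic, the raw data can be fetched directly. Below is a minimal sketch, assuming the classic Fusion Tables CSV export endpoint; the table ID and output filename are hypothetical placeholders, not values from the original report.

    # Hypothetical table ID, standing in for the one captured from the traffic
    TABLE_ID="1aBcDeFgHiJkLmNoP"

    # Assemble the raw-data URL and save the table as CSV
    # (assumes the classic Fusion Tables export endpoint; o=csv selects CSV output)
    curl -L "https://www.google.com/fusiontables/exporttable?query=select+*+from+${TABLE_ID}&o=csv" -o carparks.csv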

Continue reading “Network Packet Analysis: Retrieving Raw Fusion Table Data from Google Maps”

The Simplest wget Crawler: One Command to Help Investigative Reporters

Writing crawlers has become an essential skill for data journalists. Although online services such as ScrapingHub, Morph, and ParseHub can scrape web pages to some extent without code, in many cases you still need to write the crawling logic by hand. Crawler writing has two parts: crawling and extraction. "Crawling" means starting from one page, finding the links it contains, visiting them one by one, and repeating the process until you have harvested the pages you need. This is similar to how people browse the web, following one lead to the next. "Extraction" is the process of pulling useful information out of a page, converting the "semi-structured" web page into a "structured" data table.

This article introduces the simplest crawler of all, which needs only one command: wget -r
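
For context, here is a slightly fuller version of the same idea with a few commonly used wget options added; the target URL is a placeholder, and the particular flag choices are illustrative rather than from the original post.

    # Recursively fetch a site, politely (the URL is a placeholder):
    #   -r             recurse: follow the links found in each page
    #   -l 3           limit the recursion to 3 levels deep
    #   --no-parent    never ascend above the starting directory
    #   -w 1           wait one second between requests
    #   --random-wait  randomize the wait to lighten the server load
    wget -r -l 3 --no-parent -w 1 --random-wait https://example.com/reports/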

Continue reading “The Simplest wget Crawler: One Command to Help Investigative Reporters”