Some Scraping Targets and Ideas

10 Saturday Mar 2018

Tags

COMM7780/JOUR7280, data collection, Scraping

This is a casual post to dump some target sites for scraping, along with a few project ideas. These messages were first sent to the COMM7780/JOUR7280 WeChat group. Although we have explored only part of these possibilities this semester, the list is good for future reference. We can bounce ideas around in the comments below and enrich this list.

Top Targets: Movie, Shopping and News

Let’s first have a look at what the students care about, based on their HW2 submissions:

Scraping Targets from HW2 submission – COMM7780/JOUR7280


Workshop recap: How Does HKBU Library Preserve Vintage Documents Using OCR?

04 Sunday Mar 2018

Posted by Erin Chan in Event, Tutorial


Tags

data collection, Digitalization, library, OCR, Scan

Technology has changed the way we do research and our reading habits since the Internet became a popular platform for releasing news and information. Documents and publications from the era before the Internet remain invaluable, especially for referencing and learning history. Yet these resources are black and white and read all over, which does not fit today’s mode of information processing. To learn how to digitise these old documents, four students from Baptist University (BU) attended an Optical Character Recognition (OCR) workshop on the technique and the software involved.

Using an OCR machine to preserve the information in old documents.


Data News of the Week | What can we, the 20-year-old, do to change the world?

23 Friday Feb 2018

Posted by jessiepyt in Resources, Tutorial

≈ 1 Comment

Tags

data, data analysis, data cases, data collection, Data Journalism, Datamining, DNW, Heatmap, Strava

Nathan Ruser, a 20-year-old Australian National University student who is majoring in international security and has a keen interest in cartography, discovered that a fitness app had revealed the locations of secret military sites in Syria and elsewhere. He posted about this on Twitter and did not expect much response.

But the news ricocheted across the internet. Security experts said the Strava app’s “heat map” could be used by hostile entities to glean valuable intelligence. The Pentagon said it was reviewing the situation.

How did he find the news?

“Whoever thought that operational security could be wrecked by a Fitbit?” Mr. Ruser said in an interview with The New York Times from Thailand, where he is spending part of the Australian summer break.

When he looked over Syria on Strava’s map — which is based on location data from millions of users, including military personnel, who share their exercise activity — the area “lit up with those U.S. bases,” he said.

Before publicly sharing his findings over the weekend, he discussed them in a private chat group on Twitter, made up of people interested in intelligence and security issues. “I know about two-thirds of what I know about the world from the group chats,” he said.


My First Application of FOI in Hong Kong

13 Saturday Jan 2018

Posted by graceli618 in Resources, Tutorial

≈ 1 Comment

Tags

data collection, GIJC17, open data, open government

It was news to me that anyone can acquire almost any data from the HK government for legitimate reasons under the Code on Access to Information.

This code is a response to the notion of “FOI” (For Our Information; Freedom Of Information), which calls for citizens’ free access to government information so that the transparency of government management can be ensured and citizen rights can be protected.

According to Wikipedia, as of 2006, nearly 70 countries had passed related legislation. Among these laws are the USA’s FOIA (Freedom of Information Act) and, of course, Hong Kong’s Code on Access to Information.

Despite the code being in place, a practical question remains: will government officers fulfil their duty and reply to every single data request? So I decided to give it a try on Accessinfo.hk.

Accessinfo.hk is a website positioned as a platform for citizens to post their information requests to authorities and receive feedback. It was initiated by a group of open data activists, including Guy Freeman, who is currently a data scientist at HK01. The website publishes every question and answer for everyone to see and, at the same time, monitors the process. Before the Alaveteli system ( http://alaveteli.org/ ) was localized for Hong Kong as Accessinfo.hk, its sister site WhatDoTheyKnow ( https://www.whatdotheyknow.com/ ) had already seen wide application in the United Kingdom.


Screenshot: accessinfo.hk


Network Packet Analysis: Getting the Raw Fusion Table Data Behind a Google Maps Map

26 Sunday Nov 2017

Posted by Pili Hu in Tool


Tags

data collection, Network Analysis, Scraping

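Maps drawn with Google Maps are common online. If you want to do secondary analysis of the points of interest (POI) on such a map, you need the structured data behind it. If the map was drawn with Google Fusion Table, you can capture the network traffic to find the Fusion Table ID and then assemble the address of the original data. This post is a contribution from a student, Lam Man Kit, and is for technical exchange only. Data journalists who use this approach need to pay attention to the copyright of the original data, and local researchers also need to follow the fair use principle. This post uses FactWire's data story "Analysis of monthly fees at 182 Link REIT car parks: 92% cost more than Housing Authority car parks in the same district, with a maximum gap of 1.18 times" (分析182個領展停車場月租收費 9成2貴過房委會同區最大差距達1.18倍) as an example.

Figure: Finding the Fusion Table ID by capturing and analysing network traffic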
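The idea of reading an ID off captured traffic can be sketched in Python. This is a minimal sketch, not the method used in the original post: it assumes the traffic was exported from the browser's developer tools as a HAR file, and the keyword, the query parameter name (`docid` in the usage note) and the base URL are all hypothetical placeholders that you would replace with whatever you actually see in the capture.

```python
import json
from typing import List
from urllib.parse import parse_qs, urlencode, urlparse


def load_har(path: str) -> dict:
    """Load a HAR capture (JSON) exported from browser developer tools."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def find_param_values(har: dict, keyword: str, param: str) -> List[str]:
    """Return unique values of `param` from request URLs containing `keyword`.

    Both `keyword` and `param` are assumptions supplied by the caller;
    inspect the captured requests to decide what to look for.
    """
    found: List[str] = []
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url", "")
        if keyword not in url:
            continue
        for value in parse_qs(urlparse(url).query).get(param, []):
            if value not in found:
                found.append(value)
    return found


def build_source_url(base: str, param: str, table_id: str) -> str:
    """Assemble a candidate address for the original data from an ID.

    `base` is a placeholder; the real endpoint is not stated here.
    """
    return base + "?" + urlencode({param: table_id})
```

For example, `find_param_values(load_har("capture.har"), "tiles", "docid")` would list every distinct `docid` value seen in requests whose URL contains `tiles`; each value could then be passed to `build_source_url` together with the base address observed in the capture.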


wget, the Simplest Crawler: A One-Line Command to Help Investigative Journalists

06 Monday Nov 2017

Posted by Bobo Wei in Resources, Tool


Tags

crawler, 爬蟲, data collection, scraper, wget

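Writing crawlers has become an essential skill for data journalists. Although online services such as ScrapingHub, Morph and ParseHub can, to some extent, scrape web pages without code, in many cases you still need to write the crawler logic by hand. Writing a crawler has two parts: the first is to crawl, the second is to extract. To "crawl" is to start from one web page, find the links it contains, visit them one by one, and repeat the process until you have collected the pages you need. This is similar to how people browse the web, a bit like "following the vine to find the melon". To "extract" is to pull the useful information out of the web pages, converting "semi-structured" pages into "structured" data tables.

This post introduces the simplest crawler, which needs only a one-line command: wget -r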
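The two parts, crawl and extract, can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions rather than a production crawler: the `fetch` argument is a hypothetical callable that returns the HTML of a URL, the crawl stays on URLs that begin with the start address, and the extract step assumes the data of interest sits in plain HTML table cells.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


class TableParser(HTMLParser):
    """Collect the text of <td>/<th> cells, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def crawl(start: str, fetch: Callable[[str], str], max_pages: int = 50) -> Dict[str, str]:
    """Crawl breadth-first from `start`; return a {url: html} mapping."""
    seen = {start}
    queue = deque([start])
    pages: Dict[str, str] = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            nxt = urljoin(url, href)  # resolve relative links
            if nxt.startswith(start) and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return pages


def extract_rows(html: str) -> List[List[str]]:
    """Extract: turn semi-structured HTML tables into rows of cell text."""
    parser = TableParser()
    parser.feed(html)
    return parser.rows
```

For real pages, `fetch` could be something like `lambda u: urllib.request.urlopen(u).read().decode("utf-8", "replace")`. Passing it in as a parameter keeps the crawl logic testable without network access, and `max_pages` caps the crawl so that it cannot run away on a large site.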
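In practice, `wget -r` is usually combined with a few other options. The sketch below assembles such a command in Python so that it can be inspected before it is run. The particular combination (depth limit, no ascent to parent directories, a pause between requests, an output folder) is an assumption about sensible defaults and is not taken from the original post. The function name and default values are hypothetical, while the flags themselves are standard GNU wget options.

```python
import shlex
import subprocess
from typing import List


def wget_mirror_command(url: str, depth: int = 2, wait_seconds: int = 1,
                        out_dir: str = "mirror") -> List[str]:
    """Build an argument list for a polite recursive wget download."""
    return [
        "wget",
        "--recursive",                    # same as -r: follow links
        f"--level={depth}",               # limit recursion depth
        "--no-parent",                    # never ascend above the start directory
        f"--wait={wait_seconds}",         # pause between requests
        f"--directory-prefix={out_dir}",  # save files under this folder
        url,
    ]


def run_wget(cmd: List[str]) -> None:
    """Run the command; raises CalledProcessError if wget exits with an error."""
    subprocess.run(cmd, check=True)


cmd = wget_mirror_command("https://example.org/docs/")
print(shlex.join(cmd))  # show the equivalent one-line shell command
```

Calling `run_wget(cmd)` would start the download, provided wget is installed and on the PATH.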



The Data & News Society

~ news/numbers; stats/stories

Tag Archives: data collection
