Category: Tutorial

Inside Douban’s Top 250, a door pries into the world of audiences in mainland China

Summary: In this article, we crawl and analyze the top 250 films as rated by Douban users, find their preferences for specific directors, genres and regions, and trace trends in movies from different regions, particularly America and Hong Kong. We also pay special attention to the rise and fall of Hong Kong film production, adding some background information for a better understanding.

Douban top 250

In general, the best ways to gauge a movie's popularity are box office and ratings. But box office is a biased measure, since many movies are blocked in China and ticket sales are affected by several other factors. It is therefore more objective to examine ratings.

Douban Movie is a famous Chinese film rating site, with millions of users watching, rating, and commenting on movies every day. Given the large number of Douban users and the relatively objective measurement, data from the Top 250 list is suitable for our analysis.
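The crawl itself can be sketched in a few lines. The Top 250 list is paginated at `https://movie.douban.com/top250` (25 films per page, stepping a `start` parameter); the HTML fragment below is a simplified stand-in for the real markup, so treat the pattern as illustrative only:

```python
import re

# Simplified stand-in for one list item on a Top 250 page;
# the real Douban markup differs, so this pattern is illustrative.
sample_html = '''
<div class="item">
  <span class="title">The Shawshank Redemption</span>
  <span class="rating_num">9.7</span>
</div>
'''

def parse_films(html):
    """Extract (title, rating) pairs from one listing page."""
    titles = re.findall(r'<span class="title">([^<]+)</span>', html)
    ratings = re.findall(r'<span class="rating_num">([\d.]+)</span>', html)
    return list(zip(titles, ratings))

print(parse_films(sample_html))  # [('The Shawshank Redemption', '9.7')]
```

In the real crawl, each page is fetched in turn (`?start=0`, `?start=25`, …) and the parsed rows are appended to one table before analysis.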

Continue reading “Inside Douban’s Top 250, a door pries into the world of audiences in mainland China”

Data News of the Week: A World Defended and Invaded by Data – a Technical News Story Covering NSA Files Leak

Do you consider your personal information well protected? You use different passwords for different accounts, keep your social network activity private, even fake your profile on social media. Your efforts are probably in vain.

Everything you have done is under surveillance, and your life pattern can be figured out by people thousands of miles away from you. The government knows that you called Daisy three times in twenty-four hours, one of them after midnight. You use Google Maps in Central, Hong Kong at 2pm and your route is also recorded. You may wonder: I am just nobody. Why would anyone bother to analyze my data? In fact, you are somebody. The "three degrees of separation" rule suggests that if you have 190 friends on Facebook, then after "three hops", the network you can reach is bigger than the population of Colorado.
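The arithmetic behind that claim is straightforward. Assuming every account has 190 friends and, as a rough upper bound, that no friend circles overlap:

```python
# Rough upper bound on a three-hop Facebook network, assuming every
# account has 190 friends and no two friend circles overlap.
friends = 190
three_hops = friends ** 3
print(f"{three_hops:,}")  # 6,859,000 -- more than Colorado's ~5.7 million residents
```

Real reach is smaller because friend lists overlap heavily, but even a fraction of 6.9 million dwarfs any one person's direct circle.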

Continue reading “Data News of the Week: A World Defended and Invaded by Data – a Technical News Story Covering NSA Files Leak”

Workshop recap: How Does HKBU Library Preserve Vintage Documents Using OCR?

Technology has changed the way we research and our reading habits now that the Internet has become the main platform for releasing news and information. Documents and publications from the pre-Internet era are still invaluable, especially for referencing and studying history. Yet these resources survive only on printed paper, which does not fit today's mode of information processing. To digitise these old documents, four students from Hong Kong Baptist University (HKBU) learned about the techniques and software involved in an Optical Character Recognition (OCR) workshop.

Using an OCR machine to preserve information in old documents.

Continue reading “Workshop recap: How Does HKBU Library Preserve Vintage Documents Using OCR?”

Hong Kong Midnight Dining Guide

Summary: Hong Kong is a commercial hub in Asia that never sleeps. Numerous restaurants feed workaholics working overtime and party animals indulging in pounding music and alcohol at midnight. Searching OpenRice, the most popular dining guide website in Hong Kong, we find 942 restaurants in Hong Kong still open after 11:30pm. We crawl information on the 250 most popular of them to paint an overall picture of Hong Kong's midnight dining scene.

We try to figure out four points below:

  • Where to hunt midnight food in Hong Kong?
  • How much do you need to pay for a meal at midnight?
  • What kinds of food are provided at midnight?
  • What kinds of restaurants can you choose at midnight?

After that, we make recommendations on midnight restaurants based on our analysis results.
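Once the 250 records are crawled, the four questions above reduce to simple frequency counts. A minimal sketch, assuming one record per restaurant (the field names `district`, `price` and `cuisine` are our own, not OpenRice's):

```python
from collections import Counter

# Toy records standing in for the 250 crawled restaurants;
# the field names and values here are illustrative.
restaurants = [
    {"district": "Mong Kok", "price": "$51-100", "cuisine": "Cantonese"},
    {"district": "Mong Kok", "price": "$101-200", "cuisine": "Japanese"},
    {"district": "Central", "price": "$201-400", "cuisine": "Western"},
]

# Where to hunt midnight food: restaurants per district.
by_district = Counter(r["district"] for r in restaurants)
# How much to pay: restaurants per price band.
by_price = Counter(r["price"] for r in restaurants)
# What kinds of food: restaurants per cuisine.
by_cuisine = Counter(r["cuisine"] for r in restaurants)

print(by_district.most_common(1))  # [('Mong Kok', 2)]
```

The same counts, drawn as bar charts, answer where, how much, and what kind at a glance.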

Continue reading “Hong Kong Midnight Dining Guide”

Earthquakes in Southeast Asia in 50 years

Summary: We used an API (Application Programming Interface) to extract data from the USGS database in order to analyze the last 50 years and estimate the frequency of earthquakes in Southeast Asia. With the help of Python, the extracted data was exported to a CSV file and categorized by parameters such as country, magnitude and year.


An application programming interface (API) is commonly used to extract data from a remote web server. In layman's terms, an API is used to retrieve data or information from another program. Several websites, such as Facebook, the USGS, Twitter and Reddit, offer web-based APIs for retrieving information or data.

To retrieve data, we send requests to the host web server we want to extract data from, tweaking parameters such as the URL to connect to the server. Different websites have different request formats, which can easily be found on the host's website.

In our module, we will extract data on the earthquakes that hit Southeast Asia in the last 50 years from the USGS web server using its API.
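The request boils down to one URL against the USGS FDSN event service. A minimal sketch; the bounding box is our own rough rectangle around Southeast Asia, not an official definition:

```python
from urllib.parse import urlencode

# Query parameters for the USGS FDSN event service. The latitude/longitude
# bounding box is our own rough rectangle around Southeast Asia.
params = {
    "format": "geojson",
    "starttime": "1968-01-01",
    "endtime": "2018-01-01",
    "minmagnitude": 5,
    "minlatitude": -11, "maxlatitude": 29,
    "minlongitude": 92, "maxlongitude": 142,
}
url = "https://earthquake.usgs.gov/fdsnws/event/1/query?" + urlencode(params)
print(url)
# The response is then fetched with e.g. urllib.request.urlopen(url).read()
```

Adjusting `starttime`, `endtime` and `minmagnitude` lets you re-run the same query for any period or strength threshold.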

One of the most frequent natural disasters on Earth is the earthquake. A sudden release of energy from the Earth's lithosphere generates seismic waves, which lead to sudden shaking of the Earth's surface. This natural disaster has caused the deaths of millions of people around the world.

The strength of an earthquake is measured on the Richter magnitude scale, or simply its "magnitude", which in practice ranges from about 1 to 10.

Southeast Asian countries are among the regions of the world most prone to earthquakes. To find the trend in the region, we extracted 50 years of data from the USGS using its API and converted the records into a CSV file with Python, for a comprehensive understanding of the earthquake situation in Southeast Asia.
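The service answers in GeoJSON; converting that to CSV needs only the standard library. The two features below are made-up stand-ins for the real response, which carries the same `features[].properties` shape:

```python
import csv
import io

# Simplified stand-in for the GeoJSON the USGS service returns;
# each feature's properties carry the fields we keep.
geojson = {
    "features": [
        {"properties": {"mag": 6.1, "place": "Mindanao, Philippines", "time": 1514764800000}},
        {"properties": {"mag": 5.4, "place": "Sumatra, Indonesia", "time": 1514851200000}},
    ]
}

# Write one CSV row per earthquake feature.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["magnitude", "place", "time_ms"])
for feat in geojson["features"]:
    p = feat["properties"]
    writer.writerow([p["mag"], p["place"], p["time"]])

print(buf.getvalue())
```

In the real pipeline, `buf` is replaced by `open("earthquakes.csv", "w", newline="")`, and the country and year columns are derived from `place` and `time` before writing.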

Continue reading “Earthquakes in Southeast Asia in 50 years”

Using Big Data to Figure Out How Fair China Daily News is

Summary: Unfair and imbalanced news stories mislead readers, hiding and even distorting truths, thus decreasing the credibility of the media and creating more 'news victims'. A qualified news organization must get its reporting as close to the facts as possible. This time we take China Daily as an example and analyze whether its news is fair or not.

We decided to rely on data to quantify this requirement, so we use Python as the most effective way to gauge the fairness of news.

Background: Difficulty to Reach Absolute Objectivity

According to the Cambridge online dictionary, objectivity means “not influenced by personal opinion or feeling.” For a long time in journalism, objectivity meant writing a story without putting any personal opinion into it.

Over the last several years, many journalists stopped using “objectivity” in favor of the word “fairness.” Complete objectivity, they reasoned, is impossible. Fairness is more possible. Fairness means that you tell a story in ways that are fair to all sides once all the available information is considered.

Telling a story fairly is more difficult than it sounds. Reporters try to put colorful images and descriptions into their stories. For fresh reporters, especially those working in a second language, it can sometimes be difficult to distinguish between colorful description and editorializing. Some words have a feeling or connotation to them that is hard to recognize. Some English words have "loaded" or "double" meanings that are extremely positive or negative. Writers should be aware of the positive or negative meanings of a word and how its use affects an article. Also, as human beings, we all have feelings and opinions about the events and issues around us; it is sometimes difficult to conceal those feelings, especially if we feel strongly about something. These feelings sometimes come through in the words we choose for our stories.

This is where TextBlob, a Python module, comes in: it is designed to point out human subjectivity in news text.
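TextBlob scores subjectivity on a scale from 0 (objective) to 1 (subjective) via `TextBlob(text).sentiment.subjectivity`, using a built-in lexicon of opinionated words. A toy sketch of the same idea, with a hand-made word list far smaller than TextBlob's real lexicon:

```python
# Toy illustration of lexicon-based subjectivity scoring: the fraction
# of words found in a hand-made list of opinionated words. TextBlob's
# real lexicon is far larger and weights each entry individually.
SUBJECTIVE_WORDS = {"great", "terrible", "beautiful", "awful", "amazing", "horrible"}

def subjectivity(text):
    """Return the share of words in `text` that are opinionated."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,!?") in SUBJECTIVE_WORDS)
    return hits / len(words)

print(subjectivity("the ceremony was held on monday"))      # 0.0
print(subjectivity("what a great and beautiful ceremony"))  # about 0.33 (2 of 6 words)
```

Running a scorer like this over a batch of China Daily articles gives one subjectivity number per story, which can then be compared across topics or outlets.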

Continue reading “Using Big Data to Figure Out How Fair China Daily News is”

Data News of the Week | Gender Pay Gap: Why and How?

Professor Jordan Peterson has been the center of attention in the last few weeks for participating in a number of debates about the gender wage gap. Unlike the feminists calling for a reduction of salary discrimination, he believes the gender wage gap is an explainable consequence of multiple social factors rather than a problem caused by discrimination. Is he right? After all, why is there a gender wage gap? Looking into three recently published reports that analyze gender-wage-gap statistics (Why Is There a Gender Wage Gap – Our World in Data; Six Key Facts About the Gender Pay Gap – Our World in Data; Gender Pay Gap: the Day Women Start Working For Free – Washington Post) will give us a thorough understanding of the current gender wage gap.

Continue reading “Data News of the Week | Gender Pay Gap: Why and How?”

Data News of the Week | What can we, the 20-year-old, do to change the world?

Nathan Ruser, a 20-year-old Australian National University student majoring in international security with a keen interest in cartography, discovered that a fitness app had revealed the locations of secret military sites in Syria and elsewhere. He posted about this on Twitter, not expecting much response.

But the news ricocheted across the internet. Security experts said the Strava app's "heat map" could be used by hostile entities to glean valuable intelligence. The Pentagon said it was reviewing the situation.

How did he find the news?

“Whoever thought that operational security could be wrecked by a Fitbit?” Mr. Ruser said in an interview with The New York Times from Thailand, where he is spending part of the Australian summer break.

When he looked over Syria on Strava’s map — which is based on location data from millions of users, including military personnel, who share their exercise activity — the area “lit up with those U.S. bases,” he said.

Before publicly sharing his findings over the weekend, he discussed them in a private chat group on Twitter, made up of people interested in intelligence and security issues. “I know about two-thirds of what I know about the world from the group chats,” he said.

Continue reading “Data News of the Week | What can we, the 20-year-old, do to change the world?”

Lightning News from Public Data Sets

It is time to break down the broad concept of "data journalism". When talking about the combination of data and news, we usually refer to two processes, sometimes conducted in an integrated manner. One process is to discover news points from datasets. The datasets can provide a lead for further investigation. The final product does not necessarily reflect the use of data; it may look the same as normal news products composed mainly of interviews and photos. This is called "data mining" in the science domain. The other process is to present news points using data. This is where all kinds of charts and interactive/immersive presentations come in. This is called "data visualisation" in the science domain.

Let’s focus on the "data mining" part in this article: discovering news from datasets, or more precisely, discovering a news lead from datasets. The further development of the entire news story may take much more effort, combining traditional and modern methods. For easier discussion, we treat "news" in its general form: something the audience does not know before reading, a.k.a. something that "appears new". It could be the status update of a current affair, or "new knowledge" to the readers (probably "common knowledge" to experts, which we don't want to waste time debating).

As advocated by the "Road to Jan": the most profound theory takes the simplest form. As a first step, we try not to use programming, or even sophisticated spreadsheet skills. One can readily find some "news" with a bit of "nose for news"; being computer literate is good enough. In this article, we will demo a few news points mined by our undergraduate students from the Hong Kong government data portal. It took around 20 minutes in the second class of a data journalism course. We start with a public dataset from the portal, check out the data tables and eyeball whether there is anything interesting. The process is so quick that we would like to give it a brand name: Lightning News. One can sharpen his or her news sense and data sense by doing this as a daily exercise.

Continue reading “Lightning News from Public Data Sets”

Lunar New Year Learning Materials Gift Pack! Data Journalism Learning Tools

At the start of the Lunar New Year, we would like to recommend some useful online courses, materials and tools for learning data journalism to everyone keen to study. May the new year bring progress in your studies, and may everything go smoothly! (Parts of this post are reposted, with some edits; click to see more: the learning-materials list.)

  • Course series: from finding data and learning to interpret it, to data visualisation and telling stories with data.

1. Doing Journalism with Data: First Steps, Skills and Tools (offered on the LEARNO.NET platform, five lessons in total; register directly and study for free)

2. Data Exploration and Storytelling (taught by data experts Alberto Cairo & Heather Krause)

Continue reading “Lunar New Year Learning Materials Gift Pack! Data Journalism Learning Tools”