Curate datasets for fun, and profit

Cover photo source: raul kalvo

The term “Curator” was traditionally used in the context of museums, libraries, galleries and art exhibitions. It generally refers to a person who creatively plans and carefully organises resources to maximise their utility for the audience. What a curator does to get the job done is thus to “curate”.

TL;DR readers, here are the sites that are worth your visit:

The classical role of journalism is to organise and filter information. Good journalism does not have to be entirely original. The sharp sense that helps piece together fragmented information is often where the value is added. This process is very much like “curation”, regardless of the title, be it journalist or editor. In the data journalism world, the curation of datasets becomes a more prominent component. Practitioners have reached the consensus that a single dataset is usually not enough for purely “data-driven” reporting. One usually needs to enrich the dataset with other sources, such as expert interviews, auxiliary datasets and research reports. Sometimes people find that no existing dataset can answer the questions they are tackling, and they end up assembling their own dataset from different sources. Both scenarios occur in data journalism production, and both are problem oriented, which is a key feature of journalism and also of research.

What if we do it the other way round? That is, we are not sure what questions can be answered, or what new problems can be discovered, before we get the entire dataset. We just have a hunch that it is going to be useful, eventually, in some way. The process is content oriented and is thus “curation”. People with a really good hunch can be good data curators. The fact is, many curated datasets end up unused, or not used enough to justify the effort spent curating them.

Journalism usually adopts the research process rather than the curation process. Here are the key differences:

  • Research: It is problem oriented. It involves a series of “reverse searches” for data and evidence. One starts from the problem and approaches the raw data iteration by iteration. Research starts from “what we want”.
  • Curate: It is content oriented. One encounters multiple “interesting” datasets, reformats them, and joins them into larger ones. Relevance and coherence are key considerations. After several stages, some problems or insights may emerge from the structured dataset. Curation starts from “what we have”.
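As a rough illustration of the “reformat them, and join them” step in the curation process, here is a minimal sketch in plain Python. The two datasets, their field names and all the numbers are made up for illustration; the point is only that the join key usually needs to be normalised before scattered datasets can be combined at the record level.

```python
# Hypothetical datasets for illustration; names and numbers are made up.
population = [
    {"District": " Central and Western ", "Population": "250000"},
    {"District": "Wan Chai", "Population": "150000"},
]
libraries = [
    {"district": "central and western", "libraries": 3},
    {"district": "wan chai", "libraries": 2},
]


def normalise(name):
    """Reformat a district name into a comparable join key."""
    return " ".join(name.lower().split())


def join_on_district(pop_rows, lib_rows):
    """Inner-join the two datasets on the normalised district name."""
    lib_by_key = {normalise(r["district"]): r for r in lib_rows}
    joined = []
    for row in pop_rows:
        key = normalise(row["District"])
        if key in lib_by_key:
            joined.append({
                "district": key,
                "population": int(row["Population"]),  # reformat text to number
                "libraries": lib_by_key[key]["libraries"],
            })
    return joined


curated = join_on_district(population, libraries)
for record in curated:
    print(record)
```

In practice the reformatting step tends to take far more effort than the join itself, since real sources rarely agree on spelling, units or identifiers.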

The two processes are not completely independent.

Curation helps research. Academic researchers often make unexpected discoveries while observing curated datasets. The problem may not exist at the beginning of the research, or it may not be positioned or phrased exactly as it is in the final research output. Good researchers always take note of new evidence and new datasets. At some point, new ideas may spark from the right combination of past clues.

Screenshot: more than 10,000 scattered datasets on Old DataHub

Research helps curation. If you check what is available, there are numerous data sources, including numerous open access ones. People have had the goodwill to collect them in one place so that ideas might spark. One classic example is the Old DataHub, created by the maintainer of CKAN. CKAN is a widely adopted open data portal software, and its users include the Hong Kong government. If you go to the Old DataHub, you will soon be overwhelmed by the number of available datasets. Those datasets come in heterogeneous formats and lack a unified schema, which makes them hard to comprehend and explore. So here comes the first challenge: curation is more than collection. A collection of datasets is not much more than a collection of pointers. A good example of the latter is on GitHub: the Awesome Public Datasets list. A categorisation by topic makes it easier to find the wanted datasets. Drawing relationships between scattered datasets is the first step but far from enough. If curators bear some research questions in mind, they can combine the datasets with a purpose. The combination is at the content level, i.e. the data entries, not at the “listing datasets” level.

Screenshot: Tree structure of datasets on

Adding some research flavour to curation, there comes Enigma, which was branded as “the largest collection of structured public datasets”. Note that “structured” is the keyword here. Given the structured nature, it can easily support data slicing via a RESTful API. It also organises datasets into a tree structure, so it is convenient for users to ask related questions. For example, you may come across the “Lobbyist Details of 2012” by searching for the name of a particular person. Then you get the idea of studying the change in lobbying behaviour over the past 5 years. Going up one level in the tree structure, you can find the same dataset for other years. The number of datasets available on Enigma has increased slowly since its inception. They pick certain topics and develop the datasets under those topics in depth. This is the right way from the user’s perspective: people do not care where the data come from, as long as they answer certain, not necessarily many, questions in depth. The usual situation for curated datasets is that people know they have potential but few can unleash it, namely by sensing research-worthy or newsworthy problems from the datasets alone. Interestingly, Enigma recently categorised some datasets as “Newsworthy”, which is a hint for data journalists.

Screenshot: searching “china gdp” on Statista

Statista moves one step further, going from raw data to aggregated data, or statistics. If you look closely at what Enigma offers, you will soon find that most of the datasets come in the form of a “list of records/tuples”. Technically, a record or a tuple is a data point. A standard flattened dataset, sometimes called a “long table” by data journalists, is the most powerful form for multidimensional exploratory analysis. Passing it through a pivot table, one can easily get key statistics, known as a cross-tab and sometimes called a “wide table”. Interestingly, traditional journalists favour cross-tabs because the numbers can be readily cited in reports. Data scientists favour flattened tables because they can leverage (show off) all kinds of data mining tools. Data journalists are in between, depending on their background and how many data processing techniques they have mastered. Regardless of the occupation, statistics are always downstream. So the question becomes: why not curate statistics directly instead of raw data? Statista took that approach. It systematically collected many public datasets, sliced them into problem-oriented pieces, and aggregated them into a huge number of small “datasets”. Yes, statistics usually come in smaller tables, which are also referred to as “datasets” most of the time. Those smaller datasets usually have from fewer than 10 up to a few dozen data points. One can easily plot all the data points on basic charts: bar, pie, line and scatter. Many statistics on Statista come with bar charts. Users can download the dataset or embed the charts. The datasets provided in the free version are these small, aggregated ones, so it is hard to conduct further data analysis with them. Nevertheless, it is a convenient tool for gathering some quick facts and ideas. I usually start my initial research on Statista and then jump to other sources for more detailed datasets.
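To make the “long table” versus “wide table” distinction concrete, here is a minimal sketch of the pivot step in plain Python. The records, field names and numbers are hypothetical, and summing is just one assumed choice of aggregation; a spreadsheet pivot table or a data analysis library would do the same job.

```python
from collections import defaultdict

# Hypothetical "long table": one record per observation (made-up numbers).
long_table = [
    {"district": "A", "year": 2011, "population": 100},
    {"district": "A", "year": 2016, "population": 120},
    {"district": "B", "year": 2011, "population": 80},
    {"district": "B", "year": 2016, "population": 90},
]


def crosstab(records, row_key, col_key, value_key):
    """Aggregate long-format records into a wide table (sum per cell)."""
    table = defaultdict(lambda: defaultdict(int))
    for rec in records:
        table[rec[row_key]][rec[col_key]] += rec[value_key]
    # Convert nested defaultdicts into plain dicts for easier inspection.
    return {row: dict(cols) for row, cols in table.items()}


wide_table = crosstab(long_table, "district", "year", "population")
for district, by_year in sorted(wide_table.items()):
    print(district, by_year)
```

The long table keeps every dimension available for further slicing, while each cell of the wide table is a number that can be quoted directly, which matches the preferences of data scientists and traditional journalists described above.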

Good news: HKBU Library has a subscription to Statista.

Screenshot: Data USA’s map visualisation on different indicators

Curated datasets accompanied by curated charts can bring another level of value to readers. That is the approach adopted by Data USA, a joint effort of Deloitte, Datawheel and MIT Media Lab. Previously, we saw Enigma, which provides structured multi-dimensional datasets in raw format. People need to learn some data processing and data visualisation techniques to turn those datasets into useful charts. We also saw Statista, at the other extreme, which aggregates the data into small cross-tabs and systematically transforms them into basic charts, nearly all of which are bar charts. Data USA deliberately took the approach of “curated charts”. At first sight, one will be amazed by the snappy collection of different charts: bars, lines, maps, treemaps and more. It is easy to tell that there is a huge multi-dimensional dataset in the backend. Instead of giving an API to slice this dataset directly like Enigma, Data USA gives curated charts, which are examples of how the dataset can be used and, more importantly, which parts are relevant. Users can download the dataset behind a chart directly, or they can put the chart into a shopping cart. The amazing part is that once the user finishes shopping, Data USA can calculate the relevant slices and combine them into a single CSV. This enables the workflow of quick exploration -> correlating small datasets/slicing the big dataset -> in-depth data analysis. Two other similar examples are Data Africa for Africa and Data Viva for Brazil, all created by the data startup Datawheel.
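The “shopping cart” idea can be sketched as selecting a few indicator columns from one big table and writing them out as a single CSV. This is only an illustration of the concept, not how Data USA actually implements it: the backend table, the indicator names, the cart contents and all values below are hypothetical.

```python
import csv
import io

# Hypothetical multi-dimensional backend table (made-up values).
backend = [
    {"state": "NY", "year": 2015, "median_income": 60000, "population": 19, "median_age": 38},
    {"state": "CA", "year": 2015, "median_income": 64000, "population": 39, "median_age": 36},
]

# Each cart item names the indicator behind one chart (hypothetical names).
cart = ["median_income", "population"]


def cart_to_csv(rows, indicators, keys=("state", "year")):
    """Slice the selected indicator columns and combine them into one CSV string."""
    fieldnames = list(keys) + list(indicators)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        extrasaction="ignore",  # drop columns that are not in the cart
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


combined_csv = cart_to_csv(backend, cart)
print(combined_csv)
```

Because every slice shares the same key columns (here `state` and `year`), the combined file is immediately ready for the in-depth analysis stage of the workflow.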

Screenshot: Our World in Data

Given datasets and charts, the next level is stories. That is the project from Oxford University, Our World in Data. An ordinary user soon feels too overwhelmed to discover stories after browsing Data USA for a few minutes, even though everyone agrees it is a rich gold mine. Our World in Data not only offers open data and open charts to download and cite, but also combines them into stories. All the raw materials come with stories. In some sense, one can treat it as a scientist-initiated, data-centric blog or, in our current language, curated data stories. Now the question is, how far is it from data journalism? Journalism is about stories, and most of the blog posts on Our World in Data qualify. In fact, it is not much different from the Atlas made by Quartz. Quartz is a renowned new media outlet which makes good use of data and charts. Atlas is the collection of those “cooking materials”. More importantly, every piece on Atlas is associated with an article from Quartz. That solves the problem for most beginners: the data is there but they do not know what kind of questions they can ask. The end result is the same but the origins differ: Quartz started with stories, then released its open source charting tool, then built the CMS/search engine to collect data, charts and articles.

There is a full spectrum of possible models, from pure curation to pure research.

Regardless of the model, data curation is time consuming. One may try to curate a dataset for fun, as we did for the 2011 HK Census data three years ago. Most of the time, people struggle to find a business model that can justify the data curation process. The key is to avoid the situation where curated datasets end up with no use at all. Atlas, Our World in Data and Data USA picked a donation-based model, whether the donation comes from the organisation itself, government, research institutes or commercial firms. Statista adopts a freemium model, namely giving away easy-to-harvest public data for free but selling premium data at a good price. Enigma provides a customised solution stack for harvesting and integrating data. There are two key features we can learn from the successful models:

  • Quantity and coverage. Statista has a large number of data sources and covers most industries, so most users can find some value and some users are willing to pay. The rationale of the freemium model is very much like a supermarket: you sell some items at a profit and some at a loss; overall, you make a profit.
  • Technology. Enigma curated many public datasets for free. Although they cannot sell those datasets, the technology stack designed for this problem can also be applied elsewhere, and that is the business value.

If you have time, I strongly suggest curating some datasets for fun. It is good exercise for students entering this area. If you are still interested in this process after some initial trials, think of a business model first. Otherwise it is highly likely to end up as a half-baked project, of which we have already seen more than 10 instances in Hong Kong: all initiated by talented people, curated with heart, and abandoned half-baked, mainly due to no adoption and a lack of time commitment.


Author/ Pili Hu

