Discovering GeoDB 5. Measuring size and cost

Guess what time it is! It’s GeoDB time, same bat-time, same bat-channel (when you are a huge Batman fan you just can’t pass on this reference :P). In this new post we will be looking at how much storing data actually costs. SPOILER ALERT: if you have never wondered about this, prepare to be shocked, because it’s quite a bit of money. Saddle up and get ready to ride this journey with us as we analyse how best to store millions of gigabytes of location information. This post originally appeared on Medium.


medium 5-1

“There are three kinds of lies: lies, damned lies, and statistics”
Attributed to British Prime Minister Benjamin Disraeli

Another week, another post. We continue with the unveiling of GeoDB, but first, let’s recap what we’ve seen so far.

  • In [2], The power of place, we’ve analyzed the great value of private location data.
  • In [3], Game theory, we’ve reviewed how in a free and competitive market, the price will be dictated by both sellers and buyers.
  • In [4], Blockchain 101, we’ve summarized the pillars of blockchain technology and we’ve indicated how we aspire to use it to work with private location data.
  • In [5], Modular Blockchain Architectures, we’ve explained our concept of a modular blockchain architecture and why we believe that the interconnection between blockchains, and therefore this type of architecture, will be predominant in the coming years.

Everything said so far is nothing but romanticism: i) huge amounts of money are being invested in the field of big data, ii) locations are very valuable for big data analysis, iii) users should have control over their private information and should receive adequate compensation for providing it, iv) blockchain technology allows us to guarantee the immutability of the information and therefore improve the quality of the data for the analysis, and many other beautiful ideas.

But there is a question that any critical reader should ask: does all this make any sense as a whole? That is, is it reasonable to propose a big data scenario with private location data? And, in that case, is it possible to use current big data technology to guarantee immutability or other properties of the data?

It’s commonly said that on paper everything is possible. With little effort, we can use any data to support any idea. As Benjamin Disraeli said, statistics can be a kind of lie [1].

medium 5-2

In this post we’re going to show you the results of some studies that we’ve carried out about the estimated size of our big data and about the cost of storing this information using blockchain technology.

We know that some bias in the results presented here is inevitable, due to i) our experience in the private location field and ii) the novelty of the blockchain paradigm, but we’ll do our best to maintain an objective position that allows you to draw your own conclusions.

How big is our big data

After reading the previous posts, someone may doubt whether this is indeed a big data scenario or only one with a lot of data. Let’s take a closer look.

Suppose that an app uses an SDK provided by GeoDB, allowing its users to transfer their locations in exchange for an economic reward. In this app, each position (located pin point) will be composed of seven 8-byte fields, or 56 bytes per position: i) latitude, ii) longitude, iii) timestamp and iv) four other fields to store data about the state of the mobile network or the device.

The app runs in the background and captures a position every 3 minutes for a total of 16 hours a day (assuming users turn off their phones at night). With this configuration, each user generates 320 locations per day (16 * 60/3), that is, 17.920 bytes or 17,5 KiloBytes (KB).

After processing the positions to eliminate outliers, let’s suppose that the size is reduced by 10%, so the size of the user’s locations will be 15,75 KB. Some semantic data about the user, such as mobile model, age range or gender, is then added to this set of locations. We assume this adds 0,5 KB in total, which leaves a final value of 16,25 KB per user per day in this example.

Assuming that the above functionality is used by 1.000 users of the app, each day the app will generate 15,87 MegaBytes (MB) of location data. In addition, we estimate that the SDK will generate an additional 5% of information to store the statistical and semantic information necessary to carry out big data analysis. So, the final size will be 16,66 MB per day.
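The estimate above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic described in this section, not GeoDB code: the constant and function names are ours, and the values are the assumptions stated in the text (seven 8-byte fields per position, one position every 3 minutes for 16 hours, a 10% reduction after removing outliers, 0,5 KB of semantic data per user and 5% of SDK overhead).

```python
# Assumptions taken from the text; all names are illustrative.
BYTES_PER_FIELD = 8
FIELDS_PER_POSITION = 7
MINUTES_BETWEEN_POSITIONS = 3
ACTIVE_HOURS_PER_DAY = 16
OUTLIER_REDUCTION = 0.10      # share of data dropped when removing outliers
SEMANTIC_KB_PER_USER = 0.5    # mobile model, age range, gender...
SDK_OVERHEAD = 0.05           # extra statistical and semantic information


def daily_kb_per_user() -> float:
    """Size in KB of the data one user generates in one day."""
    positions = ACTIVE_HOURS_PER_DAY * 60 // MINUTES_BETWEEN_POSITIONS
    raw_kb = positions * FIELDS_PER_POSITION * BYTES_PER_FIELD / 1024
    return raw_kb * (1 - OUTLIER_REDUCTION) + SEMANTIC_KB_PER_USER


def daily_mb(users: int) -> float:
    """Size in MB generated per day by `users` users, SDK overhead included."""
    return users * daily_kb_per_user() / 1024 * (1 + SDK_OVERHEAD)


if __name__ == "__main__":
    print(f"{daily_kb_per_user():.2f} KB per user per day")
    print(f"{daily_mb(1000):.2f} MB per day for 1000 users")
```

With these assumptions, `daily_kb_per_user()` evaluates to 16,25 KB and `daily_mb(1000)` to roughly 16,66 MB. Changing any constant (for instance, one position every minute) shows how sensitive the final size is to the capture configuration.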

In the long term, for every 1.000 users, 5,94 GigaBytes (GB) of data would be generated in a year, 29,7 GB in 5 years and 124,73 GB in 21 years. But 1.000 users is a toy sample and, you know what? The current (Q2 2018) size of the Bitcoin blockchain after eight years is 169,12 GB [6]. Assuming a more plausible number of users, in the range of 1.000.000 to 100.000.000, the values would be the following:

1M users after:

  • 1 year: 5.939,3 GB
  • 5 years: 29.696,52 GB
  • 21 years: 124.725,4 GB

100M users after:

  • 1 year: 593.930,48 GB
  • 5 years: 2.969.652,41 GB
  • 21 years: 12.472.540,14 GB
medium 5-6

So that you can draw your own conclusions: 12.472.540,14 GB is 12.180,21 TeraBytes (TB) or 11,89 PetaBytes (PB), and according to a study carried out by Amazon in 2012 [7], a big data infrastructure of this size would cost $,12 per year.
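For readers who want to check the projections, here is a minimal sketch of the calculation. It assumes the 16,25 KB per user per day and the 5% SDK overhead estimated earlier, a 365-day year (our assumption, which is what the figures in this post appear to use) and binary units (1 GB = 1024 MB). The function names are illustrative.

```python
KB_PER_USER_PER_DAY = 16.25   # estimate from this post
SDK_OVERHEAD = 0.05           # estimate from this post
DAYS_PER_YEAR = 365           # assumption: leap years are ignored


def projected_gb(users: int, years: int) -> float:
    """Total data in GB generated by `users` users over `years` years."""
    daily_kb = users * KB_PER_USER_PER_DAY * (1 + SDK_OVERHEAD)
    return daily_kb * DAYS_PER_YEAR * years / 1024 ** 2


def humanize(gb: float) -> str:
    """Express a size given in GB in the largest fitting unit up to PB."""
    size = gb
    for unit in ("GB", "TB"):
        if size < 1024:
            return f"{size:,.2f} {unit}"
        size /= 1024
    return f"{size:,.2f} PB"


if __name__ == "__main__":
    for users in (1_000, 1_000_000, 100_000_000):
        for years in (1, 5, 21):
            total = projected_gb(users, years)
            print(f"{users:>11,} users, {years:>2} years: {humanize(total)}")
```

Note that the output uses the English number format (comma for thousands, point for decimals), so 100M users over 21 years is printed as `11.89 PB`.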

The cost of storing in blockchain

We propose an infrastructure for the storage and commercialization of data under a big data paradigm using blockchain so that users, as information generators, and entities, as customers seeking to obtain large volumes of high quality data, can obtain benefits.

But how much does it cost to store a single GB in a public blockchain? A huge amount of money. Let’s see why.

Taking as reference points the first week of July 2017 and the first week of July 2018, the costs of persisting a GB of information in the blockchains [8] of Bitcoin (BTC), Ethereum (ETH) and Stellar (XLM), compared [9] with the cost of storing the same information on traditional Hard Drives (HD) or on the newer and more efficient Solid State Drives (SSD), are:

Cost per GB in July 2017

  • BTC: 22.766.250,000$
  • ETH: 4.672.500,000$
  • XLM: 3.166.229,000$
  • HD: 0,025$
  • SSD: 1,010$

Cost per GB in July 2018

  • BTC: 57.909.998,000$
  • ETH: 7.716.975,000$
  • XLM: 31.662.297,000$
  • HD: 0,023$
  • SSD: 0,750$

The following graph clearly reflects the cost and the current trend.

medium 5-4

Let’s put this data in context.

A current smartphone like the iPhone X has 64 GB [10] of storage in its basic model. Storing 64 GB of data in the Bitcoin blockchain would cost 3.706.239.872$, slightly less than the annual GDP [11] of a country of 7.000.000 inhabitants like Sierra Leone [12].
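The smartphone example is a simple multiplication, and the same calculation can be repeated for any size and any storage medium. The sketch below uses the July 2018 prices per GB listed in this post; the dictionary and function names are ours, and we assume the price scales linearly with size, which is a simplification.

```python
# July 2018 cost per GB in USD, as listed in this post.
USD_PER_GB_JULY_2018 = {
    "BTC": 57_909_998.0,
    "ETH": 7_716_975.0,
    "XLM": 31_662_297.0,
    "HD": 0.023,
    "SSD": 0.75,
}


def storage_cost_usd(gigabytes: float, medium: str) -> float:
    """Cost of storing `gigabytes` GB on `medium`, assuming linear pricing."""
    return gigabytes * USD_PER_GB_JULY_2018[medium]


if __name__ == "__main__":
    for medium in USD_PER_GB_JULY_2018:
        print(f"64 GB on {medium}: {storage_cost_usd(64, medium):,.2f} USD")
```

For 64 GB this gives 3.706.239.872$ on Bitcoin against less than 1,50$ on a hard drive, a difference of more than nine orders of magnitude.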

In view of this, is it reasonable to store big data using blockchain technologies? To answer in the affirmative, we only need to think about how blockchain works.

medium 5-5

If we consider that, we can follow a hybrid proposal in which we use:

  1. Blockchain technology to store the data, and
  2. Popular blockchains to store the summaries of the blocks, a minimal part of the information, to guarantee the immutability and authenticity of the data with the same level of security as if all the data were stored in these blockchains.

Following this approach and using a blockchain in which the cost of the transactions is used to cover the cost of the necessary disk space for (1) and a popular blockchain like Ethereum for (2), it would be possible to store a lot of information for very little money.
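To give an intuition of why the hybrid approach is cheap, here is a minimal sketch. It rests on assumptions of ours that go beyond what this post specifies: the summary of a block is taken to be its SHA-256 digest (32 bytes), the data itself is priced at the July 2018 hard drive cost, the digest is priced at the July 2018 Ethereum cost per GB, and prices scale linearly with size. None of this describes the actual GeoDB implementation.

```python
import hashlib

# July 2018 prices per GB quoted in this post (USD).
ETH_USD_PER_GB = 7_716_975.0
HD_USD_PER_GB = 0.023
BYTES_PER_GB = 1024 ** 3


def block_summary(block: bytes) -> bytes:
    """Summary of a data block. Assumption: a SHA-256 digest (32 bytes)."""
    return hashlib.sha256(block).digest()


def hybrid_cost_usd(block: bytes) -> float:
    """Data priced as disk space, only the summary priced as Ethereum storage."""
    data_cost = len(block) / BYTES_PER_GB * HD_USD_PER_GB
    anchor_cost = len(block_summary(block)) / BYTES_PER_GB * ETH_USD_PER_GB
    return data_cost + anchor_cost


def full_onchain_cost_usd(block: bytes) -> float:
    """Cost of writing the whole block to Ethereum, for comparison."""
    return len(block) / BYTES_PER_GB * ETH_USD_PER_GB


def is_untampered(block: bytes, anchored_summary: bytes) -> bool:
    """Anyone can recompute the summary and compare it with the anchored one."""
    return block_summary(block) == anchored_summary


if __name__ == "__main__":
    block = bytes(1024 * 1024)  # 1 MB of dummy location data
    anchored = block_summary(block)
    print(f"hybrid:   {hybrid_cost_usd(block):.4f} USD")
    print(f"on-chain: {full_onchain_cost_usd(block):.2f} USD")
    print("untampered:", is_untampered(block, anchored))
```

The summary has a fixed size regardless of the size of the block, so under these assumptions the cost of the popular blockchain stays constant per block while any change in the data makes the recomputed summary differ from the anchored one.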

medium 5-6
medium 5-7

Obviously, the previous costs should not be considered as the final costs since in a real infrastructure there are other costs such as electricity, interconnection or redundancy. However, we think that they clearly reflect that the cost of blockchain for storage is not an issue for our proposal if we follow a similar approach.

This hybrid approach may sound very good on paper, but you may be wondering if an interconnection between several blockchains will not end up leading to other costs. Like so many other things in life, everything depends on the point of reference, or in our case, on how we carry out the interconnection. Our next entry will be about it.

Until then, have a nice time.