BigQuery is Google Cloud’s enterprise data warehouse for analytics: encrypted, durable and highly available.
Above all, it is simple: BigQuery abstracts away the underlying hardware, database and configuration details.
I have been working with GCP, and BigQuery in particular, for the last year. In this blog post I will try to cover my impressions at a high level, touching on ease of use, responsiveness, pricing and limitations.
Availability and ease of use
It was really refreshing when I started working with Google’s BigQuery that it follows the web philosophy established by the G Suite office tools: do everything from the browser.
And you really can. You can use BigQuery from your office computer, home computer or tablet (even a cell phone!) without a single install. You do not need to think about VPN tunnels or remote desktops; it is all easily accessible by default. Security and identity management are handled by Google, which has a very positive consequence:
No custom username or password!
Remember the days when your database password had its own lifecycle (especially when it came to rotating passwords)? Eventually they would end up in a passwords.txt file or a password service (you used the latter, of course!), since you had no chance of remembering a dozen passwords that all changed out of sync.
The fact that you can use BigQuery in any browser will certainly alter the way you work. If you need help from a coworker, you can just bring the query up on her workstation and continue working on the problem there.
Google’s cloud platform has certainly done something right when it comes to availability and ease of use.
Starting with BigQuery
You can find BigQuery in the Google Cloud console. You need a Google account to use it.
BigQuery requires a basic knowledge of SQL. You should know your SELECTs and JOINs, and, of course, your data. Grasping what to join, how to join it, and what to count is a big topic in itself and beyond the scope of this post.
BigQuery is also a column-oriented database: it is optimised for wide tables and encourages data duplication. If you have a typical star schema, BigQuery would prefer that you not join the tables, but instead duplicate the data into the fact table (denormalization). This is something to consider when you design your data models. This article goes into more depth on the topic.
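As an illustration of that denormalized style (the table and column names here are hypothetical), BigQuery lets you fold a dimension and a detail table into the fact table as nested, repeated fields, so no join is needed at query time:

```sql
-- Hypothetical denormalized sales table: each order row embeds its
-- customer attributes as a STRUCT and its line items as a repeated
-- ARRAY<STRUCT>, instead of living in separate dimension tables.
CREATE TABLE `project.dataset.sales` (
  order_id STRING,
  order_date DATE,
  customer STRUCT<id STRING, name STRING, region STRING>,
  items ARRAY<STRUCT<sku STRING, quantity INT64, price NUMERIC>>
);

-- Query the nested data directly; UNNEST replaces the join.
SELECT
  customer.region,
  SUM(item.quantity * item.price) AS revenue
FROM `project.dataset.sales`, UNNEST(items) AS item
GROUP BY customer.region;
```

The trade-off is the classic one: storage is duplicated, but reads touch only the columns they need and never pay for a join.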
Currently BigQuery ships with two SQL dialects: legacy and standard. Personally I tend to fall back to legacy SQL, just because the syntax feels more familiar to me. (I probably should move to standard SQL, since it is the recommended dialect.)
One thing you might find unusual is that you do not specify indexes or foreign keys. At all. Behind the scenes, BigQuery will just use more computers to finish the task. This is all managed, so you don’t have to think about it. The downside is that key integrity is something BigQuery will not help you with; you’ll have to manage this in your ingress layer (ETL).
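Since BigQuery will not enforce uniqueness for you, one common workaround is a periodic sanity query in the ETL layer. A minimal sketch, assuming a hypothetical table with `order_id` as its intended primary key:

```sql
-- Detect violations of the intended "primary key": any key that
-- appears more than once is a duplicate the ETL let through.
SELECT
  order_id,
  COUNT(*) AS copies
FROM `project.dataset.sales`
GROUP BY order_id
HAVING COUNT(*) > 1;
```

An empty result means the load was clean; any rows returned should fail the pipeline before downstream reports pick them up.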
Using the BigQuery web console
The SQL web console is pretty forgiving when it comes to errors and it also has autocomplete and handy hints to make your life easier. It has actually been so good that I haven’t considered installing a custom developer console.
As you can see from the query below,
SELECT YEAR, POPULATION, REGION FROM `project.dataset.table_*` WHERE _TABLE_SUFFIX BETWEEN '1900' AND '9999'
you can query multiple tables in a single query (like doing a UNION ALL). This makes it easy to indirectly partition your tables on some key and then query them efficiently.
All your queries are stored in your history, along with statistics on the query and the full result. This is particularly handy when you need to go back and investigate a query you have been running. An important note is that query results are cached by default for 24 hours. If your data is mutating, you need to remember to turn off the cache in the options menu!
There is a caveat that you should be aware of:
BigQuery has very limited support for INSERT/UPDATE operations. In fact, it encourages you to design your way around them, by capping the number of such operations you can perform on each table per day.
This means you should avoid doing updates on a per-row basis, and instead batch your changes into a single set-based statement.
That said, there are no rollbacks either, so use this sparingly.
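To make that concrete, here is a hedged sketch of the set-based style (the tables and columns are hypothetical):

```sql
-- Row-at-a-time (avoid): one statement per row quickly burns
-- through the daily DML quota.
-- UPDATE `project.dataset.orders` SET status = 'shipped' WHERE order_id = 'A-1';
-- UPDATE `project.dataset.orders` SET status = 'shipped' WHERE order_id = 'A-2';

-- Set-based (preferred): a single statement covers every affected row.
UPDATE `project.dataset.orders` o
SET status = 'shipped'
WHERE o.order_id IN (
  SELECT order_id FROM `project.dataset.shipments`
);
```

One statement, one quota hit, and every row it touches either commits or fails together, which matters given that there are no rollbacks.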
On my last non-cloud project, one of the most frustrating tasks was running analytical queries that would join and analyze billions of rows. Sometimes these queries would run for hours or days. This didn’t play well with the policy that VMs should shut down after user inactivity, causing massive DB resources to be hogged by a query whose result no one would ever see. If the query was restarted before a DBA could terminate the zombie, everything would slow down because the database (Oracle Exadata) would be busy shuffling data in and out of memory.
BigQuery, on the other hand, is designed to be responsive on big data, and it shines brighter the more data it gets. It is extremely scalable and will (almost) always give you a response in seconds, even for queries over massive data volumes. And even if your browser should hang, the query will finish on the server side and the result will be available in your history for 24 hours! You do not have to worry about a query eating all your DB resources.
BigQuery ships with a simple visualization tool that comes for free: Data Studio.
It connects easily to the database (in fact, it can connect to several data sources) and can give you basic charts and graphs. It is of course web based, so there are no installs, and it supports collaborative work. After the data source is connected, you can create custom fields to combine fields, and so on, though it cannot go beyond what you could do in SQL. There are a few limitations: you cannot write custom queries, join tables, export PDFs or interact with it via an API, and the components are somewhat limited compared to the BI tools you have to pay for. It will probably get the job done in PoC phases, or if you don’t have complex business logic in the visualization layer (which you shouldn’t!). You should definitely try it out!
BigQuery pricing is not as straightforward as you might expect; it can feel somewhat unpredictable.
However, BigQuery is pretty cheap when it comes to storage, and you pay only for the disk you use and the queries you run. A free tier makes most small datasets effectively free. If you don’t do much data exploration, but only run a few reports, this turns out to be surprisingly cheap. A customer I worked for had around 0.5 TB of data and only generates a few reports; in the end, the bill shows less than $10 a month(!) for BigQuery. Other customers I have been in touch with also confirm that it is extremely cheap compared to the traditional license cost of enterprise-scale databases. Google could improve the pricing model by offering more predictable pricing for the customers who want it, as AWS Redshift does.
In conclusion, I’m very impressed by how easy this platform is to use. There is virtually no cost in trying BigQuery out, in either time or money. It is definitely worth a try if you feel your database or data warehouse is underperforming, if you spend too much money on licenses, or if you are a startup and want an analytical platform that can grow with you.
BigQuery is an analytical platform. Don’t try to use it as a backend for a service that needs millisecond response times; you would be using the wrong tech and will be disappointed.