[dba-Tech] Baidu does really big data

Arthur Fuller fuller.artful at gmail.com
Fri Aug 26 10:43:23 CDT 2016


The Chinese equivalent of Google's search engine, Baidu
<http://www.nextplatform.com/2016/08/24/baidu-takes-fpga-approach-accelerating-big-sql/?utm_source=dbweekly&utm_medium=email>
sits
on an exabyte of data and processes ~100 petabytes of data per day, updates
~10 billion web pages per day, and handles ~100 petabytes of log updates
per day. This is achieved using Field Programmable Gate Arrays (FPGAs)
<https://en.wikipedia.org/wiki/Field-programmable_gate_array>. There's a
bit of an explanation at the link, but I confess that I'm way out of my
depth here.

To date, the largest database I've ever worked on had an initial footprint
of 1 petabyte, with an anticipated growth of 1 TB per year. It was a
medical database called OLIS (Ontario Labs Information System), and it
housed all the data from all the medical labs in Ontario; that includes
everything from X-rays to blood and stool samples, and so on. In addition,
it houses all the data about all the physicians and patients in the
province, and also employs PITA (Point in Time Architecture), described in
an article I wrote for Simple-Talk
<https://www.simple-talk.com/sql/database-administration/database-design-a-point-in-time-architecture/>.
A fundamental truth about standard relational databases is that they
destroy information. Every Delete obviously destroys information, but so
does every Update. Certain applications demand the ability to recreate the
situation as it existed one or two or ten years ago. That is the purpose of
PITA, a design in which no rows are ever physically deleted or updated;
rather, the original rows are copied. Every table contains a pair of
columns, EffectiveDate and EndDate; rows with a Null in EndDate are the
current rows. By adding one more table, call
it DateRange, having exactly one row with columns EffectiveDate and
EndDate, one can Join this table to existing queries so that they are all
automatically scoped by this date range.
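
For anyone who wants to see the idea in motion, here is a minimal sketch in
Python with an in-memory SQLite database. It is not the OLIS implementation.
The Patient table, its City column, and the function names are hypothetical,
invented for illustration. I am also assuming that the one write permitted
against an existing row is stamping its EndDate when a newer version
supersedes it; every other change becomes a new row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Patient (              -- hypothetical table, for illustration
        PatientID     INTEGER NOT NULL,
        City          TEXT    NOT NULL,
        EffectiveDate TEXT    NOT NULL, -- ISO dates, so string comparison orders them
        EndDate       TEXT              -- NULL marks the current row
    );
    CREATE TABLE DateRange (            -- always holds exactly one row
        EffectiveDate TEXT NOT NULL,
        EndDate       TEXT NOT NULL
    );
    INSERT INTO DateRange VALUES ('0001-01-01', '9999-12-31');
""")

def pit_update(patient_id, city, as_of):
    """Logical update: close the current row (if any), then add a new version."""
    conn.execute(
        "UPDATE Patient SET EndDate = ? WHERE PatientID = ? AND EndDate IS NULL",
        (as_of, patient_id))
    conn.execute(
        "INSERT INTO Patient VALUES (?, ?, ?, NULL)", (patient_id, city, as_of))

def pit_delete(patient_id, as_of):
    """Logical delete: stamp EndDate; the row itself stays in the table."""
    conn.execute(
        "UPDATE Patient SET EndDate = ? WHERE PatientID = ? AND EndDate IS NULL",
        (as_of, patient_id))

def set_range(start, end):
    """Point the single DateRange row at the period of interest."""
    conn.execute("UPDATE DateRange SET EffectiveDate = ?, EndDate = ?", (start, end))

def scoped_cities():
    """An ordinary query, scoped by joining to DateRange."""
    return conn.execute("""
        SELECT p.PatientID, p.City
        FROM Patient p
        JOIN DateRange r
          ON p.EffectiveDate <= r.EndDate
         AND (p.EndDate IS NULL OR p.EndDate > r.EffectiveDate)
        ORDER BY p.EffectiveDate
    """).fetchall()

pit_update(1, "Toronto", "2010-01-01")
pit_update(1, "Ottawa", "2014-06-01")   # the Toronto row is closed, not overwritten

set_range("2012-01-01", "2012-01-01")   # the world as it stood on 2012-01-01
print(scoped_cities())                  # [(1, 'Toronto')]

set_range("2016-01-01", "2016-01-01")
print(scoped_cities())                  # [(1, 'Ottawa')]
```

Setting both DateRange columns to the same date asks for a single point in
time; a wider range returns every version that was in force at any moment
within it.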

Because of the many privacy issues that come with medical information, the
physical implementation of OLIS was distributed across 8 server-clusters.
We database developers could not even see the data returned from our
queries -- just a bunch of asterisks to indicate successful retrieval. This
might sound overly paranoid, but consider that some developer might want to
check whether the person s/he is dating has ever had an AIDS test, and what
the result was.

The size of the OLIS database pales in comparison to what Baidu handles
daily. The upside is, Baidu has hardware I can only dream of, beginning at
the chip level and extending outward from there.

-- 
Arthur

