[dba-Tech] Baidu does really big data

Fri Aug 26 23:10:51 CDT 2016

Very interesting. 

I worked with some Oracle databases for our provincial government. They were never that large, but they did follow the rule of never deleting or updating a record without a backup. In these cases a unique transaction number was added to the changed record. The backup file held all the particulars of the transaction, and when searching for all the related records a reciprocal call was made within the backup file. In the days when I worked on this database, files were limited by the size of the hard drives. Big drives were astronomically expensive to buy and maintain.
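A minimal sketch of that backup-before-change scheme, using Python's built-in sqlite3 rather than Oracle. The table names (person, person_backup), the columns, and the helpers update_city and history are all hypothetical, made up for illustration; the real system's layout is not described here. The idea shown is only this: before a record changes, its old values go into a backup table under a unique transaction number, and the live record keeps that number so the chain can be followed back.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a live table and its backup table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE person (
            id       INTEGER PRIMARY KEY,
            name     TEXT NOT NULL,
            city     TEXT NOT NULL,
            last_txn INTEGER              -- transaction number of the latest change
        );
        CREATE TABLE person_backup (
            txn       INTEGER PRIMARY KEY AUTOINCREMENT,  -- unique transaction number
            person_id INTEGER NOT NULL,
            name      TEXT NOT NULL,
            city      TEXT NOT NULL,
            prev_txn  INTEGER             -- earlier backup row for this person, if any
        );
    """)
    return conn


def update_city(conn, person_id, new_city):
    """Back up the current row, then apply the change, as one transaction."""
    with conn:
        name, city, last_txn = conn.execute(
            "SELECT name, city, last_txn FROM person WHERE id = ?", (person_id,)
        ).fetchone()
        cur = conn.execute(
            "INSERT INTO person_backup (person_id, name, city, prev_txn) "
            "VALUES (?, ?, ?, ?)",
            (person_id, name, city, last_txn),
        )
        txn = cur.lastrowid
        conn.execute(
            "UPDATE person SET city = ?, last_txn = ? WHERE id = ?",
            (new_city, txn, person_id),
        )
    return txn


def history(conn, person_id):
    """Follow the transaction numbers back through the backup table."""
    (txn,) = conn.execute(
        "SELECT last_txn FROM person WHERE id = ?", (person_id,)
    ).fetchone()
    cities = []
    while txn is not None:
        city, txn = conn.execute(
            "SELECT city, prev_txn FROM person_backup WHERE txn = ?", (txn,)
        ).fetchone()
        cities.append(city)
    return cities
```

Calling update_city twice on the same person and then history returns the two earlier cities, newest first, while the live table only ever holds the current one.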

Nowadays there are no such size limits. Single drives are huge...4 to 6 TB, but that is just the start. There have always been RAID arrays of various configurations, but because of their physical configuration requirements their performance tops out quickly, and even with redundancy there is always the danger of data loss.

The new databases all sit on top of massive systems that connect together as a single drive. The software is Linux, Hadoop and ZFS, and anyone so inclined could build and connect all their systems by a similar method. Most companies just use cloud suppliers, as the tech can be formidable.

The huge Chinese databases are just built on top of Kylin (an Ubuntu derivative) using the same tech. Thousands of drives with one file system...hundreds of blades are added daily. It truly boggles the mind. The big thing is that data loss is functionally non-existent. Another staggering concept.

If I were starting in big databases today, everything I learned about relational databases would be just a subsystem that might assemble the data gathered by these giant NoSQL back ends.

Aside: It is interesting to note that the 64-bit version of MySQL/MariaDB can access 256 TB of data and 2,147,483,647 records, and can run on top of Linux and a ZFS file system of equal or greater size...a theoretical maximum, of course. I guess that's more than the MS Access MDB can access? ;-)
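For what it's worth, both of those figures are powers of two in disguise, which a few lines of Python can confirm. The sketch below only checks the arithmetic. Reading 256 TB as the reach of a 6-byte (48-bit) offset, and 2 GB as the Access MDB file size limit, are my assumptions about where the numbers come from, and the function name max_addressable_bytes is made up for illustration.

```python
TIB = 2 ** 40  # one tebibyte in bytes
GIB = 2 ** 30  # one gibibyte in bytes


def max_addressable_bytes(pointer_bytes):
    """Bytes reachable with an unsigned offset of the given width."""
    return 2 ** (8 * pointer_bytes)


# 256 TB (binary units) is exactly what a 6-byte offset can address.
mysql_limit = max_addressable_bytes(6)
print(mysql_limit == 256 * TIB)

# 2,147,483,647 is the largest signed 32-bit integer.
print(2_147_483_647 == 2 ** 31 - 1)

# Assumption: an Access MDB file tops out at 2 GB.
access_limit = 2 * GIB
print(mysql_limit // access_limit)  # how many times larger
```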

Jim

----- Original Message -----
From: "Arthur Fuller" <fuller.artful at gmail.com>
To: "Discussion of Hardware and Software issues" <dba-tech at databaseadvisors.com>
Sent: Friday, August 26, 2016 8:43:23 AM
Subject: [dba-Tech] Baidu does really big data

The Chinese equivalent of Google's search engine, Baidu
<http://www.nextplatform.com/2016/08/24/baidu-takes-fpga-approach-accelerating-big-sql/?utm_source=dbweekly&utm_medium=email>
sits on an exabyte of data and processes ~100 petabytes of data per day,
updates ~10 billion web pages per day, and handles ~100 petabytes of log
updates per day. This is achieved using Field Programmable Gate Arrays (FPGAs)
<https://en.wikipedia.org/wiki/Field-programmable_gate_array>. There's a
bit of an explanation at the link, but I confess that I'm way out of my
depth here.

To date, the largest database I've ever worked on had an initial footprint
of 1 petabyte, with an anticipated growth of 1 TB per year. It was a
medical database called OLIS (Ontario Labs Information System), and it
housed all the data from all the medical labs in Ontario; that includes
everything from X-rays to blood and stool samples, and so on. In addition,
it houses all the data about all the physicians and patients in the
province, and also employs PITA (Point in Time Architecture, described in
an article I wrote for Simple-Talk
<https://www.simple-talk.com/sql/database-administration/database-design-a-point-in-time-architecture/>).
A fundamental truth about standard relational databases is that they
destroy information. Every Delete obviously destroys information, but so
does every Update. Certain applications demand the ability to recreate the
situation as it existed one or two or ten years ago. That is the purpose of
PITA, in which design no rows are ever physically deleted or updated;
rather, the original rows are copied and every table contains a pair of
columns, EffectiveDate and EndDate; some rows contain a Null in EndDate,
indicating that they are the current rows. By adding one more table, call
it DateRange, having exactly one row with columns EffectiveDate and
EndDate, one can Join this table to existing queries so that they are all
automatically scoped by this date-range.
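
A minimal sketch of the PITA idea, using Python's built-in sqlite3 so it can be run anywhere. The Patient table, its Physician column, and the helpers change_physician and patients_as_of are hypothetical, invented for illustration; only the EffectiveDate and EndDate columns and the one-row DateRange table come from the description above. One further assumption: the sketch stamps EndDate on the superseded row when a new version is inserted, which is the only in-place write; the data columns themselves are never overwritten.

```python
import sqlite3


def make_db():
    """Create an in-memory database with one PITA-style table and DateRange."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Patient (
            PatientID     INTEGER NOT NULL,
            Physician     TEXT NOT NULL,
            EffectiveDate TEXT NOT NULL,   -- ISO dates sort correctly as text
            EndDate       TEXT             -- NULL marks the current row
        );
        CREATE TABLE DateRange (           -- always exactly one row
            EffectiveDate TEXT NOT NULL,
            EndDate       TEXT NOT NULL
        );
        INSERT INTO DateRange VALUES ('9999-12-31', '9999-12-31');
    """)
    return conn


def change_physician(conn, patient_id, physician, on_date):
    """Logical update: close the current row, then insert the new version."""
    with conn:
        conn.execute(
            "UPDATE Patient SET EndDate = ? "
            "WHERE PatientID = ? AND EndDate IS NULL",
            (on_date, patient_id),
        )
        conn.execute(
            "INSERT INTO Patient VALUES (?, ?, ?, NULL)",
            (patient_id, physician, on_date),
        )


def patients_as_of(conn, as_of):
    """Point DateRange at one date and run a query scoped by the join."""
    with conn:
        conn.execute(
            "UPDATE DateRange SET EffectiveDate = ?, EndDate = ?", (as_of, as_of)
        )
    return conn.execute("""
        SELECT p.PatientID, p.Physician
        FROM Patient AS p
        JOIN DateRange AS d
          ON p.EffectiveDate <= d.EndDate
         AND (p.EndDate IS NULL OR p.EndDate > d.EffectiveDate)
        ORDER BY p.PatientID
    """).fetchall()
```

After recording a patient under one physician in 2010 and a change in 2014, patients_as_of(conn, '2012-03-15') returns the earlier physician and patients_as_of(conn, '2016-08-26') the later one, without the query text itself changing.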

Due to the many issues that come with medical information, the
physical implementation of OLIS was distributed across 8 server-clusters.
We database developers could not even see the data returned from our
queries -- just a bunch of asterisks to indicate successful retrieval. This
might sound overly paranoid, but consider that some developer might want to
check whether the person s/he is dating has ever had an AIDS test, and what
was the result.

The size of the OLIS database pales in comparison to what Baidu handles
daily. The upside is, Baidu has hardware I can only dream of, beginning at
the chip level and extending outward from there.

-- 
Arthur
_______________________________________________
dba-Tech mailing list
dba-Tech at databaseadvisors.com
http://databaseadvisors.com/mailman/listinfo/dba-tech
Website: http://www.databaseadvisors.com