BA371 — NoSQL databases
- Recall: (Why) do we have to convert data back and forth between XML/JSON and relational?
Can we not do it all in XML/JSON? Can we create XML or JSON databases?
- Both XML databases and
JSON databases exist:
- Querying XML databases: XQuery (W3C standard);
- Querying JSON databases: no standard yet.
- More JSON than XML databases / services are dropping their XML services in favor of JSON:
- JSON is 'light weight.'
- JSON maps nicely to OOP.
- Also, relational does not work well for certain types of data; e.g.,
- Full text searching —> e.g., Apache Lucene (included in ElasticSearch).
- —> Document databases; e.g.,
- Geographic data —> Geographic Information Systems (GIS); e.g., ArcGIS,
- Network data (social or other) —>
network and graph databases; e.g., Neo4j and others.
- RR theorette: For discrete phenomena a relational representation always exists, but in some cases or domains it is not very practical or impractically expensive.
- Example I: full text searching in TeachEngineering slow in MySQL —> switched to Apache Lucene.
- Example II: relational model not very efficient for spatial data —> GIS.
- Example IIa: relational model not great for storing network data —> network databases store networds as nodes (vertices) and links (edges) in a network (topological).
- Note that many commercial relational database products have (nonrelational?) additions to handle these nonrelational
or quasi-relational data; e.g.,
Oracle's Spatial & Graph option or Oracle's NoSQL database
- Problem: in many texts; e.g., Coronel & Morris (2018) and Cotrupe (2018), these data are characterized as 'unstructured.' This,
however, is confusing (if not misleading) as much of the so-called 'unstructured' data is actually highly structured. It is just not easily or
efficiently represented in a relational structure.
- The 'Big Data' factor (Ch. 14 pp. 658-664):
- Many 'big data' initiatives are distributed; i.e., data are spread over multiple (many) devices at multiple locations.
- Distributed is possible with relational but becomes expensive quickly.
- (Industry-strength) relational databases provide consistency (shared state): i.e., all
nodes in a distributed system see the same data at the same time; i.e., guarantees that each read reflects the
- ACID: Atomicity, Consistency, Isolation, Durability
- Atomicity: each transaction is treated as a single entity which either succeeds or fails completely.
- Consistency: a transaction can only bring the database from one valid state to another (aka 'shared state').
- Isolation: result of concurrent execution of transactions is the same as sequential execution of those transactions.
- Durability: once committed a transaction remains committed.
- However, for many applications availability (eventual consistency) rather than (absolute) consistency is
all that is needed: e.g.,
- Business Analytics: in finding any trend or pattern, a single transaction that took place a second ago is not really relevant (if it is, it bucks the trend!).
- If a new YouTube video comes online, it may be fine if the rest of the world only sees it a few seconds/minutes later.
- If a new TeachEngineering document comes online, it is fine if the rest of the world only sees it 15 minutes later.
- When a bookstore in Florida offers a discount copy of Grapes of Wrath, it may be fine if in one part of the country this entry is
seen a few seconds (or minutes) earlier than in other parts.
- Public stock tickers/quotes have delayed pricing data.
- —> BASE: Basically Available, Soft state, Eventually consistent.
- In addition:
- Scaling up (large amounts of data) and scaling out (geographically distributed data) on RDBMS
quickly becomes very complex and/or very expensive with RDBMS (hardware and licenses).
- 'Availability' is possible without (very expensive) RDBMS —> NoSQL databases:
- Data distribution is (typically) done by sharding: 'table data are spread over
multiple nodes and even multiple databases; e.g., West-coast customers (rows in the customers table) reside
on a different node than East-coast customers (also part of the customers table).
- Data are often replicated across several nodes —> addition or removal (or loss!!) of nodes results in redistribution
of data and responsibilities over all nodes.
- With NoSQL there are no primary or foreign keys, no joins, no relational algebra and no normalization —> data
duplication (no just across nodes!) is common:
- TeachEngineering 1.0 (relational): each educational standard is only represented once.
- TeachEngineering 2.0 (NoSQL-JSON document store): 1,121 duplications for the ten most used standards.
- Most do not provide RDBMS-like transaction control; e.g., no locking, no concurrency control. ...which may be OK in
- NoSQL databases come in different types:
- Key-value stores/databases: some data are 'keyed' to a key value.
- Document (mostly JSON) stores/databases: data stored as JSON strings.
- Column-oriented databases: store data by table column rather than by row.
- A word on Hadoop (pp.669-671): map-reduce
with replicated data storage (originally developed at Yahoo!; now an Apache project).
- Much of (business) big data needs a platform which allows for computing on large
amounts of spatially distributed, multi-format data stored on inexpensive hardware.
- RDBMS-way can be prohibitively expensive while much of its power might not be needed.
- Some (or much) of the data are not, and do not need to be in relational form.
- Furthermore: if the data are distributed over multiple platforms, multiple locations and multiple formats, we
might want to compute differently:
- Traditional way: collect all the data from the distributed servers, then compute centrally.
- An alternative way: maximize computing where the data are (locality-aware computing) and do
only centrally what MUST be done centrally:
- MapReduce: first extract and filter data at the
location where they reside (map); then aggregate centrally (reduce):
- Must be !very careful! E.g., mean(mean(1, 2, 3, 4, 5), mean(10, 20)) ≠ mean(1, 2, 3, 4, 5, 10, 20)
- Hadoop is a low-level tool —> requires application engineering —> several Hadoop application tools
have been developed and have 'joined' the Hadoop ecosystem; e.g.,
Apache HBase and
- Note that NoSQL databases have no 3NF; no joins —> lots of data duplication.
- No integrity constraints (primary/foreign keys).
- No standard database interaction language (such as SQL) exists for these databases.
- Overall message: The database world is on the move. Relational is still mostly king, but not in all places.
"Everybody's talking 'bout the new kid in town."