The practical transformation of raw data into
actionable information. BI typically involves reporting, online
analytical processing, data and text mining, complex event
processing, business performance management, and analytics
(both predictive and prescriptive). For example, a BI suite might
combine ETL from many data stores with dashboards, reports, and
analytics of a business's data, such as sales by region over time.
Players include Jaspersoft, SAP BusinessObjects, IBM Cognos, Oracle's
BI Platform, MicroStrategy, and, for business planning, Anaplan.
Cassandra
Apache Cassandra is a high-performance, non-relational (NoSQL)
wide-column database that is often deployed on distributed clusters.
CDN
Content Delivery Network, such as the CDN of Windows Azure or
Amazon CloudFront. CDNs are often run by telcos, such as AT&T, to
deliver video efficiently across the internet.
cluster computing
Typically a set of inexpensive computers operating as if they were a
single super-computer through LAN-based distributed computing
orchestrated by a middleware layer with a "master".
column-oriented database
Column-oriented storage layouts are well-suited for OLAP / data
warehouses, which typically involve a small number of highly complex
queries over terabytes of data. The alternative is row-oriented storage,
which is well-suited for OLTP interactions.
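The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration under assumed data: the records, field names, and the `to_columns` helper are hypothetical, and a real column store adds compression and on-disk structures that are not shown here.

```python
# Hypothetical sales records stored row by row (illustrative values only).
rows = [
    {"customer": "Ann", "product": "Widget", "year": 2002, "amount": 120.0},
    {"customer": "Bob", "product": "Gadget", "year": 2002, "amount": 80.0},
    {"customer": "Ann", "product": "Gadget", "year": 2003, "amount": 200.0},
]


def to_columns(records):
    """Rearrange a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# OLTP-style access: read every field of one record (a single row).
second_order = rows[1]

# OLAP-style access: aggregate one field across all records (a single column).
total_amount = sum(columns["amount"])
```

In the column layout, the aggregate touches only the `amount` values, which is why such layouts suit queries that scan a few fields over very many records.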
CrossTab Report
A report that aids in the analysis of data by
combining the summaries of multiple variables. For example, it might
reveal that most of the purchasers of a certain product live in a
particular region or belong to a certain age group. See also
pivot table.
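As a rough sketch of the idea, the following Python counts purchases by region and age group. The purchase data and the `crosstab` helper are assumptions made for illustration, not part of any reporting product.

```python
from collections import Counter

# Hypothetical purchases of one product: (region, age group) per purchaser.
purchases = [
    ("West", "18-34"), ("West", "18-34"), ("West", "35-54"),
    ("East", "18-34"), ("East", "55+"),
]


def crosstab(pairs):
    """Count how often each (row value, column value) combination occurs."""
    counts = Counter(pairs)
    row_keys = sorted({row for row, _ in pairs})
    col_keys = sorted({col for _, col in pairs})
    # Counter returns 0 for combinations that never occur.
    return {row: {col: counts[(row, col)] for col in col_keys} for row in row_keys}


table = crosstab(purchases)
```

Reading `table["West"]` shows at a glance which age groups dominate purchases in that region.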
CRM
Customer Relationship
Management. A database for CRM is typically read-optimized (rather than
write-optimized).
CRUD
create, read, update, delete
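The four operations might look as follows against a relational table. This sketch uses Python's built-in sqlite3 module with an in-memory database; the `customer` table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")

# Create: insert a new row.
conn.execute("INSERT INTO customer (id, name) VALUES (?, ?)", (1, "Ann"))

# Read: fetch the row back.
name_after_create = conn.execute(
    "SELECT name FROM customer WHERE id = ?", (1,)
).fetchone()[0]

# Update: change a field of the existing row.
conn.execute("UPDATE customer SET name = ? WHERE id = ?", ("Anne", 1))
name_after_update = conn.execute(
    "SELECT name FROM customer WHERE id = ?", (1,)
).fetchone()[0]

# Delete: remove the row.
conn.execute("DELETE FROM customer WHERE id = ?", (1,))
remaining = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]

conn.close()
```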
Derby
Apache Derby is an OLTP database.
Domain
A metadata layer that can be used for an ad hoc report. For example,
a domain might represent a join between two tables.
ETL
Extract, transform, load. ETL is typically a step in creating a
unified data warehouse from multiple operational data stores, such as
Marketing, Sales, and Enterprise Resource Planning (ERP).
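A toy version of the three ETL steps, assuming two operational stores that export CSV text with different conventions. The store contents, column names, and the `extract`, `transform`, and `load` helpers are all hypothetical; a real warehouse load would target a database rather than a Python list.

```python
import csv
import io

# Hypothetical exports from two operational data stores.
SALES_CSV = "customer,amount_usd\nann,120.50\nbob,80.00\n"
MARKETING_CSV = "Customer Name;Campaign\nAnn;Spring\nBob;Fall\n"


def extract(text, delimiter):
    """Extract: parse one store's export into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def transform(sales_rows, marketing_rows):
    """Transform: unify key casing, convert types, and join the two sources."""
    campaigns = {r["Customer Name"].lower(): r["Campaign"] for r in marketing_rows}
    unified = []
    for row in sales_rows:
        customer = row["customer"].lower()
        unified.append({
            "customer": customer,
            "amount_usd": float(row["amount_usd"]),
            "campaign": campaigns.get(customer),
        })
    return unified


def load(records, warehouse):
    """Load: append the unified records to the warehouse table."""
    warehouse.extend(records)
    return len(records)


warehouse = []  # stand-in for a data warehouse table
loaded = load(
    transform(extract(SALES_CSV, ","), extract(MARKETING_CSV, ";")),
    warehouse,
)
```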
Flume
Apache Flume is a distributed system for collecting, aggregating,
and moving large amounts of log data from multiple sources to a central
data store. Flume can be used with Apache Avro, a framework for remote
procedure calls (RPCs) and serialization that originated in the Apache
Hadoop project and uses JSON for defining data types and protocols.
grid computing
Loosely coupled, distributed computers that perform a task. Less
centralized than cluster computing, but more centralized than
peer-to-peer computing.
Hadoop
Apache Hadoop is fundamental to Big Data because it provides a way
for commodity hardware in fault-tolerant, distributed clusters to store
and process vast amounts of data. Hadoop involves the combination of the
HDFS and MapReduce. To facilitate MapReduce projects, Hadoop users can
control query execution using the "Pig Latin" language of the Apache Pig
project. Apache HBase is a non-relational database for sparse data that
MapReduce processes.
Hadoop-Hive data source
A big data source for fast writes. Queries typically run overnight.
The query language is HiveQL.
HDFS
Hadoop Distributed File System, which holds the data that MapReduce
processes.
Hibernate
An object-relational mapping (ORM) framework that handles the
impedance mismatch between Java classes and RDBMS tables.
IMDB
In-Memory Database, which is faster than a database that has to seek
data on a disk.
JDBC
Java Database Connectivity. JDBC enables a Java
application to connect to a data source. The API consists of two
packages: java.sql and javax.sql. The data source can be a relational
database management system (RDBMS) or a non-relational, ODBC-aware data
source. The database driver takes the form of a .jar file. For
connecting to the database, Oracle recommends using the javax.sql
package, which provides the javax.sql.DataSource interface. See
http://docs.oracle.com/javase/8/docs/api/javax/sql/DataSource.html.
The java.sql package enables queries with the Statement.execute method
and processing of the "hits" with an implementation of the ResultSet
interface; the executeQuery method returns a ResultSet object. JDBC
supports four types of drivers, with Type 4 being platform-independent
pure Java and using the protocol of the database itself.
Jersey
An open-source framework for developing web
services that uses Jackson to serialize/deserialize POJOs to JSON.
Jackson
Jackson is the default provider of serialization/deserialization
for Jersey.
JNDI
For production scenarios, it is best to use
JDBC with the Java Naming and Directory Interface (JNDI),
a directory service API which enables a web server and servlet container,
such as Apache Tomcat, to manage a "pool" of already-created connections
for improved performance. JNDI also supports distributed transactions
involving multiple data sources. See
http://docs.oracle.com/javase/8/docs/api/javax/sql/DataSource.html.
JSON
JavaScript Object Notation. Represents serialized objects as
attribute-value pairs. It is the most popular format for responses to a
RESTful API call because it describes serialized objects, such as
plain old Java objects (POJOs), with less overhead than the XML
typically used with SOAP APIs.
In a pair such as "firstName": "John", the colon is the delimiter that
indicates that the value of firstName is John. Square brackets enclose
an array, such as the value of phoneNumbers.
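The JSON document that this entry describes does not appear in the text, so the following is a hypothetical one that matches the description. Only `firstName`, `John`, and `phoneNumbers` come from the entry; the phone entries and their fields are assumptions. The sketch parses the document with Python's standard json module.

```python
import json

# Hypothetical document; only firstName, John, and phoneNumbers are from the text.
document = """
{
  "firstName": "John",
  "phoneNumbers": [
    {"type": "home", "number": "555-0100"},
    {"type": "mobile", "number": "555-0101"}
  ]
}
"""

person = json.loads(document)        # deserialize text into dicts and lists
first_name = person["firstName"]     # value to the right of the colon
phone_types = [phone["type"] for phone in person["phoneNumbers"]]  # array items
round_trip = json.dumps(person, sort_keys=True)  # serialize back to text
```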
KPI
Key Performance Indicator. For example,
customer loyalty, net sales, mean time between failure, graduation rate,
or national unemployment.
MapReduce
A scalable, fault-tolerant framework for
processing terabytes of data (Big Data). The Map function performs a
partitioned task. The Reduce function summarizes the results of the
partitioned tasks. The fault tolerance is useful if a large cluster of
computers needs a long time to process the work. One use case: Google's
indexing of the entire World Wide Web.
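The Map and Reduce roles can be sketched on a single machine with a word count, the customary teaching example. The function names and input strings are hypothetical, and this sketch omits what a real framework such as Hadoop provides: distribution across nodes, scheduling, and recovery from failures.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input partition."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group the emitted values by key so each key can be reduced on its own."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: summarize all values emitted for one key."""
    return key, sum(values)


def map_reduce(documents):
    mapped = []
    for document in documents:  # in a cluster, each partition maps in parallel
        mapped.extend(map_phase(document))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = map_reduce(["big data big clusters", "data on clusters"])
```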
Mashboard
Dashboard with external content, such as a news
feed.
Maven
A build automation tool for building Java
projects, running JUnit tests, generating documentation, and packaging
the build as a .jar file. The pom.xml file configures the Project Object
Model (POM), which specifies things like the version of Java to use and
whether to run a web server, and which can include dependencies like
Hibernate. Apache Maven has overtaken Apache Ant in popularity because
Maven allows the organization to use plug-ins with minimal
configuration, tends to impose a shared convention for build tasks, and
can be run from within an IDE.
MDX
Multidimensional Expressions (MDX) is a query language for OLAP
databases, much like SQL is a query language for relational databases.
The XML wrapper is called mdXML. It is also a calculation language, with
syntax similar to spreadsheet formulas. For example:
SELECT
{ [Measures].[Store Sales] } ON COLUMNS,
{ [Date].[2002], [Date].[2003] } ON ROWS
FROM Sales
WHERE ( [Store].[USA].[CA] )
Measure
For reports, something similar to a field that
represents an expression, such as the average freight.
MongoDB
A popular NoSQL database system. It manages JSON-like documents and
scales horizontally through sharding.
OAuth
An open standard
for authorization that allows web users to grant third-party web
sites access using, for example, their Google account, without exposing
the account password to the third party.
OLAP
Online Analytical Processing. One wide table with many columns:
customer, product, year, order. This is optimal for reading (fast
access) of summarized information (year or quarter). Efficient for big
reads. Denormalized: the customer name repeats for each order. See
http://db.lcs.mit.edu/projects/cstore/vldb.pdf
OLTP
Online transaction processing system, typically with tables like
Customer, Order, Product. Insert, Update, Delete. Getting business
intelligence requires table JOINs. Efficient for small, frequent write
operations because a single operation writes all the fields of a row
(tuple). Compare to OLAP.
OpenStack
OSGi
Open Services Gateway initiative. A Java-based framework for "bundles"
of functionality, delivered in .jar files, to work together as
components. Used in the Business Intelligence and Reporting Tools (BIRT)
open source reporting engine and in the plug-in architecture for
Eclipse, the Confluence wiki, and the Jira bug tracker.
pivot table
A table of summaries. A pivot table
summarizes a two-dimensional spreadsheet (columns and rows) along a
desired third dimension, such as which Person sold which Product in
which Region. Insofar as Excel supports pivot tables, Excel is a tool
for analytics.
POJO
Plain Old Java Object
POM
Project Object
Model represented by the pom.xml of Maven
REST
Representational State Transfer. Works with the
Web's HTTP protocol "methods" to enable lightweight CRUD: POST
(create), GET (read), HEAD (read metadata), PATCH (update), DELETE
(delete). HTTP "methods" are invoked through URIs. Supports the
representation of object state through JSON (as well as XML).
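The mapping from HTTP methods to CRUD can be sketched with an in-memory dispatcher. Everything here is hypothetical: the `handle` function, the `resources` store, and the URIs stand in for a real web framework and database, and the sketch simplifies REST conventions (for example, POST usually targets a collection URI).

```python
# In-memory stand-in for the server's resource store (hypothetical).
resources = {}


def handle(method, uri, body=None):
    """Apply an HTTP-style method to a URI and return (status code, payload)."""
    if method == "POST":                  # create
        resources[uri] = dict(body or {})
        return 201, dict(resources[uri])
    if uri not in resources:
        return 404, None
    if method == "GET":                   # read
        return 200, dict(resources[uri])
    if method == "HEAD":                  # read metadata only, no payload
        return 200, None
    if method == "PATCH":                 # partial update
        resources[uri].update(body or {})
        return 200, dict(resources[uri])
    if method == "DELETE":                # delete
        del resources[uri]
        return 204, None
    return 405, None                      # method not supported


created = handle("POST", "/customers/1", {"name": "Ann"})
updated = handle("PATCH", "/customers/1", {"region": "West"})
fetched = handle("GET", "/customers/1")
deleted = handle("DELETE", "/customers/1")
missing = handle("GET", "/customers/1")
```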
shard
A shard is a horizontal partition of a database table. For example,
in the Customer table, the rows can be sharded by geographic region.
Performance improves if each shard has a relatively small index.
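A minimal sketch of sharding by region, with a separate small index per shard. The customer rows and the `shard_by`, `build_index`, and `find_customer` helpers are hypothetical; in a real deployment each shard would live on its own database server.

```python
from collections import defaultdict

# Hypothetical rows of the Customer table.
customers = [
    {"id": 1, "name": "Ann", "region": "West"},
    {"id": 2, "name": "Bob", "region": "East"},
    {"id": 3, "name": "Cho", "region": "West"},
]


def shard_by(rows, key):
    """Partition rows horizontally: each shard holds the rows for one key value."""
    shards = defaultdict(list)
    for row in rows:
        shards[row[key]].append(row)
    return dict(shards)


def build_index(shard):
    """Build a per-shard index on id; it covers only that shard's rows."""
    return {row["id"]: row for row in shard}


shards = shard_by(customers, "region")
indexes = {region: build_index(rows) for region, rows in shards.items()}


def find_customer(region, customer_id):
    """Route the lookup to one shard and search only its small index."""
    return indexes[region].get(customer_id)
```

A lookup such as `find_customer("West", 3)` consults only the West index, which holds two entries instead of the three in the whole table.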
Spotfire
Tibco Spotfire is a data visualization tool and also the name of its
analytics platform.
Spring
A framework that
supports JDBC and can be an alternative to EJB.
Thrift
Apache Thrift is an interface definition language
that is used to define and create services for a wide variety of
programming and scripting languages.