Archive

Archive for the ‘Accumulo’ Category

Big Data with Apache Accumulo Preserving Security with Open Source

March 13, 2014 Leave a comment

Episode 19 of the podcast was a talk with Adam Fuchs.

Adam talked about Apache Accumulo which is a system built for doing random i/o with peta bytes of data.

Distributing the computation to the data with cell level security is where Accumulo really shines.

Accumulo provides a richer data model than simple key-value stores, but is not a fully relational database. Data is represented as key-value pairs, where the key and value are comprised of the following elements:

Key Value
Row ID Column Timestamp
Family Qualifier Visibility

All elements of the Key and the Value are represented as byte arrays except for Timestamp, which is a Long. Accumulo sorts keys by element and lexicographically in ascending order. Timestamps are sorted in descending order so that later versions of the same Key appear first in a sequential scan. Tables consist of a set of sorted key-value pairs.

Accumulo stores data in tables, which are partitioned into tablets. Tablets are partitioned on row boundaries so that all of the columns and values for a particular row are found together within the same tablet. The Master assigns Tablets to one TabletServer at a time. This enables row-level transactions to take place without using distributed locking or some other complicated synchronization mechanism. As clients insert and query data, and as machines are added and removed from the cluster, the Master migrates tablets to ensure they remain available and that the ingest and query load is balanced across the cluster.

images/data_distribution.png
images/failure_handling.png
Subscribe to the Podcast and here all of what Adam had to say.
You can get started using Apache Accumulo with our development environment https://github.com/stealthly/hdp-accumulo
/*******************************************
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 Twitter: @allthingshadoop
********************************************/

Big Data Open Source Security

May 25, 2013 1 comment

In security there has never (IMHO) been enough open source solutions and Bruce Schneier has written about this several times in the past, and there’s no need to rewrite the arguments again.

Now with “NoSQL” and “Big Data” Open Source trends in the market place Security finally has an intersection… a union if I may where new solutions to solve problems that have plagued our society can finally begin to arrise (and have already in many cases). Fraud, Malware, Phishing, Spam, etc all can be tackled now with new Security solutions because of Big Data and Open Source.

At the front lines of this is Apache Accumulo which is a Big Data, Open Source and Secure NoSQL Database that runs on top of Apache Hadoop. It was originally developed by the United States National Security Agency and submitted to the Apache Foundation as Open Source in 2011 with 3 years of development and production operation already having occurred.

Accumulo extends the BigTable data model to implement a security mechanism known as cell-level security. Every key-value pair has its own security label, stored under the column visibility element of the key, which is used to determine whether a given user meets the security requirements to read the value. This enables data of various security levels to be stored within the same row, and users of varying degrees of access to query the same table, while preserving data confidentiality.

SECURITY LABEL EXPRESSIONS

When mutations are applied, users can specify a security label for each value. This is done as the Mutation is created by passing a ColumnVisibility object to the put() method:

Text rowID = new Text("row1");
Text colFam = new Text("myColFam");
Text colQual = new Text("myColQual");
ColumnVisibility colVis = new ColumnVisibility("public");
long timestamp = System.currentTimeMillis();

Value value = new Value("myValue");

Mutation mutation = new Mutation(rowID);
mutation.put(colFam, colQual, colVis, timestamp, value);

SECURITY LABEL EXPRESSION SYNTAX

Security labels consist of a set of user-defined tokens that are required to read the value the label is associated with. The set of tokens required can be specified using syntax that supports logical AND and OR combinations of tokens, as well as nesting groups of tokens together.

For example, suppose within our organization we want to label our data values with security labels defined in terms of user roles. We might have tokens such as:

admin
audit
system
These can be specified alone or combined using logical operators:

// Users must have admin privileges:
admin

// Users must have admin and audit privileges
admin&audit

// Users with either admin or audit privileges
admin|audit

// Users must have audit and one or both of admin or system
(admin|system)&audit

When both | and & operators are used, parentheses must be used to specify precedence of the operators.

AUTHORIZATION

When clients attempt to read data from Accumulo, any security labels present are examined against the set of authorizations passed by the client code when the Scanner or BatchScanner are created. If the authorizations are determined to be insufficient to satisfy the security label, the value is suppressed from the set of results sent back to the client.

Authorizations are specified as a comma-separated list of tokens the user possesses:

// user possess both admin and system level access
Authorization auths = new Authorization("admin","system");

Scanner s = connector.createScanner("table", auths);

USER AUTHORIZATIONS

Each accumulo user has a set of associated security labels. To manipulate these in the shell use the setuaths and getauths commands. These may also be modified using the java security operations API.

When a user creates a scanner a set of Authorizations is passed. If the authorizations passed to the scanner are not a subset of the users authorizations, then an exception will be thrown.

To prevent users from writing data they can not read, add the visibility constraint to a table. Use the -evc option in the createtable shell command to enable this constraint. For existing tables use the following shell command to enable the visibility constraint. Ensure the constraint number does not conflict with any existing constraints.

config -t table -s table.constraint.1=org.apache.accumulo.core.security.VisibilityConstraint

Any user with the alter table permission can add or remove this constraint. This constraint is not applied to bulk imported data, if this a concern then disable the bulk import permission.

SECURE AUTHORIZATIONS HANDLING

For applications serving many users, it is not expected that an accumulo user will be created for each application user. In this case an accumulo user with all authorizations needed by any of the applications users must be created. To service queries, the application should create a scanner with the application users authorizations. These authorizations could be obtained from a trusted 3rd party.

Often production systems will integrate with Public-Key Infrastructure (PKI) and designate client code within the query layer to negotiate with PKI servers in order to authenticate users and retrieve their authorization tokens (credentials). This requires users to specify only the information necessary to authenticate themselves to the system. Once user identity is established, their credentials can be accessed by the client code and passed to Accumulo outside of the reach of the user.

QUERY SERVICES LAYER

Since the primary method of interaction with Accumulo is through the Java API, production environments often call for the implementation of a Query layer. This can be done using web services in containers such as Apache Tomcat, but is not a requirement. The Query Services Layer provides a mechanism for providing a platform on which user facing applications can be built. This allows the application designers to isolate potentially complex query logic, and enables a convenient point at which to perform essential security functions.

Several production environments choose to implement authentication at this layer, where users identifiers are used to retrieve their access credentials which are then cached within the query layer and presented to Accumulo through the Authorizations mechanism.

Typically, the query services layer sits between Accumulo and user workstations.

Apache Accumulo version 1.5 just came out for download with docs

New software as a service solutions will start to spring up into the market as will new out of the box open source solutions. Whether we are trying to prevent health care fraud, protect individuals from identify theft or corporations from intrusion all without comprimsing the (C)onfidentiality, (I)ntegrity and the (A)vailability of the data and distributes systems.

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/