Archive
Hadoop isn’t dead but you might be doing it wrong!
I haven't blogged (or podcasted, for that matter) in a while. There are lots of different reasons for that, and I am always happy to chat and grab tea if folks are interested, but after attending this year's HIMSS conference I just couldn't hold it in anymore.
I went to HIMSS so excited: it was supposed to be the year of Big Data! Everything was about transformation and interoperability and OMGZ the excitement.
The first keynote Monday evening was OFF THE HOOK http://www.himssconference.org/education/sessions/keynote-speakers. The rest of the time two of my colleagues and I were at the expo. It is basically CES for Healthcare (if you don't know what CES is, then think DEFCON for Healthcare… or something). It's big.
But where was the Big Data?
Not really anywhere… There were three recognizable "big data" companies, and one of them was in a booth as a partner for cloud services. It was weird. What happened?
One of the engineers from Cerner has a lightning talk at the Kafka Summit. Go Cerner!!
Didn’t everyone get the memo? We need to help reduce costs of patient care!
Here are two ways to help reduce costs of patient care!
- (Paraphrasing Michael Dell from his keynote) Innovation funding for Healthcare IT will come from optimizing your data center resources.
- (This one is from me but inspired by Bruce Schneier) Through Open Source we can enable better systems by sharing in the R&D costs and also make them more secure.
Totally agree with #1; I have seen it first hand, with people saving 82% of their data center bill, and they are not even using spot (or, as they call them, "preemptible") instances yet. Amazing!
As for #2, you have to realize that different people are good at different things. One person can write anything, but sometimes 2 or 3 or 45 of them can write it better… or at least make sure the tests always keep passing and evolving properly, along with the stewardship and everything else that goes with it.
Besides all of that, the conference was great. There were a lot of companies and people I recognized and bumped into and it was great to catch up.
I was also really REALLY excited to see how far physician signatures and form signing have (finally) come in healthcare, removing all that paper. Fax is almost dead, but there are still a couple of companies kicking.
One last thing: the cyber security part of the expo was also disappointing. I know it was during the RSA Conference, but Healthcare needs good solutions too. There was a decent set of solutions there, not bad and in some cases legit and well known (thanks for showing up!), but the "pavilion" was downstairs in the back left corner. Maybe if HIMSS coincided with Strata it would have been different; hard to say.
There was (at least) one tweet about it https://twitter.com/elodinadotnet/status/705176912964923393; not sure if there were more.
So, Big Data, Healthcare, Security, OH MY! I am in!
I will be talking more about problems and solutions around the open source, interoperable, XML-based FHIR standard in Healthcare (removing the need to integrate and make HL7 systems interoperable) in New York City on 03/29/2016 http://www.meetup.com/Apache-Mesos-NYC-Meetup/events/229503464/ and getting into real-time stream processing on Mesos.
I will also be conducting a training on SMACK Stack 1.0 (Streaming Mesos Analytics Cassandra Kafka) using telephone systems and APIs to generate stream events and drive interactions with different systems from them. Yes, I bought phones, and yes, you get to keep yours.
What has attracted me (for almost 2 years now) to running Hadoop systems and ecosystem components on Mesos is the ease it brings for the developers, systems engineers, data scientists, analysts and the users of the software systems that run (often as a service) those components. There is a lot to research and read on this; I would:
1) scour my blog
2) read this https://www.oreilly.com/ideas/a-tale-of-two-clusters-mesos-and-yarn
3) and this http://blog.cloudera.com/blog/2015/08/how-to-run-apache-mesos-on-cdh/
4) do your own thing
Hadoop! Mesos!
~ Joestein
p.s. If you have something good to say about Hadoop and want to talk about it, something gripping and good that gets back to the history and the continued efforts, let me know. Thanks!
Ideas and goals behind the Go Kafka Client
I think a bunch of folks have heard already that B.D.O.S.S. was working on a new Apache Kafka client for Go. The Go Kafka Client was open sourced last Friday. Today we are starting the release of Minotaur, which is our lab environment for Apache Zookeeper, Apache Mesos, Apache Cassandra, Apache Kafka, Apache Hadoop and our new Go Kafka Client.
To get started using the consumer client, check out our example code and its property file.
Ideas and goals behind the Go Kafka Client:
1) Partition Ownership
2) Fetch Management
3) Work Management
4) Offset Management
func main() {
    config, consumerIdPattern, topic, numConsumers, graphiteConnect, graphiteFlushInterval := resolveConfig()
    startMetrics(graphiteConnect, graphiteFlushInterval)

    // shut everything down cleanly on Ctrl-C
    ctrlc := make(chan os.Signal, 1)
    signal.Notify(ctrlc, os.Interrupt)

    consumers := make([]*kafkaClient.Consumer, numConsumers)
    for i := 0; i < numConsumers; i++ {
        consumers[i] = startNewConsumer(*config, topic, consumerIdPattern, i)
        time.Sleep(10 * time.Second)
    }

    <-ctrlc
    fmt.Println("Shutdown triggered, closing all alive consumers")
    for _, consumer := range consumers {
        <-consumer.Close()
    }
    fmt.Println("Successfully shut down all consumers")
}

// startMetrics wires the metrics registry to Graphite.
func startMetrics(graphiteConnect string, graphiteFlushInterval time.Duration) {
    addr, err := net.ResolveTCPAddr("tcp", graphiteConnect)
    if err != nil {
        panic(err)
    }
    go metrics.GraphiteWithConfig(metrics.GraphiteConfig{
        Addr:          addr,
        Registry:      metrics.DefaultRegistry,
        FlushInterval: graphiteFlushInterval,
        DurationUnit:  time.Second,
        Prefix:        "metrics",
        Percentiles:   []float64{0.5, 0.75, 0.95, 0.99, 0.999},
    })
}

// startNewConsumer configures strategy and failure callbacks, then starts a static consumer.
func startNewConsumer(config kafkaClient.ConsumerConfig, topic string, consumerIdPattern string, consumerIndex int) *kafkaClient.Consumer {
    config.Consumerid = fmt.Sprintf(consumerIdPattern, consumerIndex)
    config.Strategy = GetStrategy(config.Consumerid)
    config.WorkerFailureCallback = FailedCallback
    config.WorkerFailedAttemptCallback = FailedAttemptCallback

    consumer := kafkaClient.NewConsumer(&config)
    topics := map[string]int{topic: config.NumConsumerFetchers}
    go func() {
        consumer.StartStatic(topics)
    }()
    return consumer
}

// GetStrategy returns the per-message work function and tracks a consume-rate meter.
func GetStrategy(consumerId string) func(*kafkaClient.Worker, *kafkaClient.Message, kafkaClient.TaskId) kafkaClient.WorkerResult {
    consumeRate := metrics.NewRegisteredMeter(fmt.Sprintf("%s-ConsumeRate", consumerId), metrics.DefaultRegistry)
    return func(_ *kafkaClient.Worker, msg *kafkaClient.Message, id kafkaClient.TaskId) kafkaClient.WorkerResult {
        kafkaClient.Tracef("main", "Got a message: %s", string(msg.Value))
        consumeRate.Mark(1)
        return kafkaClient.NewSuccessfulResult(id)
    }
}

func FailedCallback(wm *kafkaClient.WorkerManager) kafkaClient.FailedDecision {
    kafkaClient.Info("main", "Failed callback")
    return kafkaClient.DoNotCommitOffsetAndStop
}

func FailedAttemptCallback(task *kafkaClient.Task, result kafkaClient.WorkerResult) kafkaClient.FailedDecision {
    kafkaClient.Info("main", "Failed attempt")
    return kafkaClient.CommitOffsetAndContinue
}
Plans moving forward with the Go Kafka Client:
Ideas and goals behind Minotaur:
Plans moving forward with Minotaur:
XML to Avro Conversion
We all know what XML is, right? Just in case not, no problem, here is what it is all about.
<root>
  <node>5</node>
</root>
Now, what the computer really needs is the number five and some context around it. In XML you (human and computer) can see how it represents the context around five. Now let's say instead you have a business XML document like FpML.
<FpML xmlns="http://www.fpml.org/2007/FpML-4-4"
      xmlns:fpml="http://www.fpml.org/2007/FpML-4-4"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      version="4-4"
      xsi:schemaLocation="http://www.fpml.org/2007/FpML-4-4 ../fpml-main-4-4.xsd http://www.w3.org/2000/09/xmldsig# ../xmldsig-core-schema.xsd"
      xsi:type="RequestTradeConfirmation">
  <!-- start of distinct -->
  <strike>
    <strikePrice>32.00</strikePrice>
  </strike>
  <numberOfOptions>150000</numberOfOptions>
  <optionEntitlement>1.00</optionEntitlement>
  <equityPremium>
    <payerPartyReference href="party2"/>
    <receiverPartyReference href="party1"/>
    <paymentAmount>
      <currency>EUR</currency>
      <amount>405000</amount>
    </paymentAmount>
    <paymentDate>
      <unadjustedDate>2001-07-17Z</unadjustedDate>
      <dateAdjustments>
        <businessDayConvention>NONE</businessDayConvention>
      </dateAdjustments>
    </paymentDate>
    <pricePerOption>
      <currency>EUR</currency>
      <amount>2.70</amount>
    </pricePerOption>
  </equityPremium>
  </equityOption>
  <calculationAgent>
    <calculationAgentPartyReference href="party1"/>
  </calculationAgent>
  <documentation>
    <masterAgreement>
      <masterAgreementType>ISDA2002</masterAgreementType>
    </masterAgreement>
    <contractualDefinitions>ISDA2002Equity</contractualDefinitions>
    <!-- populate credit support document with correct value -->
    <creditSupportDocument>TODO</creditSupportDocument>
  </documentation>
  <governingLaw>GBEN</governingLaw>
  </trade>
  <party id="party1">
    <partyId>Party A</partyId>
  </party>
  <party id="party2">
    <partyId>Party B</partyId>
  </party>
</FpML>
That is a lot of extra, unnecessary data points. Now let's look at this using Apache Avro.
With Avro, the context and the values are separated. This means the schema/structure of what the information is does not get stored or streamed over and over and over and over (and over) again.
The Avro schema is hashed, so the data structure only holds the value; the computer understands the fingerprint (the hash) of the schema and can retrieve the schema using the fingerprint.
0x d7a8fbb307d7809469ca9abcb0082e4f8d5651e46d3cdb762d02d0bf37c9e592
This type of implementation is pretty typical in the data space.
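To make that concrete, here is a minimal sketch using the Avro Java API (the tiny "Node" record schema below is just a hypothetical stand-in for whatever your XML maps to, not something taken from the converter mentioned later): the schema is parsed once and fingerprinted, and only the value itself goes into the encoded bytes.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.SchemaNormalization;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroFingerprintExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema standing in for a converted XML element.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Node\",\"fields\":[{\"name\":\"value\",\"type\":\"int\"}]}");

        // The fingerprint identifies the schema, so the schema itself never travels with each record.
        long fingerprint = SchemaNormalization.parsingFingerprint64(schema);
        System.out.printf("schema fingerprint: %x%n", fingerprint);

        // Only the value (the number five) is encoded: no tags, no element names.
        GenericRecord record = new GenericData.Record(schema);
        record.put("value", 5);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        System.out.println("encoded payload: " + out.size() + " byte(s)"); // a single byte for the value 5
    }
}

Compare that single byte (plus a fingerprint the computer already knows) against the tag soup above, and the range quoted below starts to make sense.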
When you do this you can reduce your data by between 20% and 80%. When I tell folks this they immediately ask, "why such a large range?" The answer is that not every XML document is created the same. But that is exactly the problem: you are duplicating the information the computer needs to understand the data. XML is nice for humans to read, sure… but it is not optimized for the computer.
Here is a converter we are working on, https://github.com/stealthly/xml-avro, to help get folks off of XML and onto lower cost, open source systems. This allows you to keep parts of your systems (specifically the domain business code) using the XML without having to change them (risk mitigation), while storing and streaming the data with less overhead (optimizing budget).
/*******************************************
Joe Stein
Founder, Principal Consultant
Big Data Open Source Security LLC
http://www.stealth.ly
Twitter: @allthingshadoop
********************************************/
Big Data Open Source Security
In security there have never (IMHO) been enough open source solutions; Bruce Schneier has written about this several times in the past, and there's no need to rewrite the arguments again.
Now, with the "NoSQL" and "Big Data" Open Source trends in the marketplace, Security finally has an intersection… a union, if I may, where new solutions to problems that have plagued our society can finally begin to arise (and already have in many cases). Fraud, malware, phishing, spam, etc. can all be tackled now with new Security solutions because of Big Data and Open Source.
At the front lines of this is Apache Accumulo, a Big Data, Open Source and secure NoSQL database that runs on top of Apache Hadoop. It was originally developed by the United States National Security Agency and submitted to the Apache Foundation as Open Source in 2011, with 3 years of development and production operation already having occurred.
Accumulo extends the BigTable data model to implement a security mechanism known as cell-level security. Every key-value pair has its own security label, stored under the column visibility element of the key, which is used to determine whether a given user meets the security requirements to read the value. This enables data of various security levels to be stored within the same row, and users of varying degrees of access to query the same table, while preserving data confidentiality.
SECURITY LABEL EXPRESSIONS
When mutations are applied, users can specify a security label for each value. This is done as the Mutation is created by passing a ColumnVisibility object to the put() method:
Text rowID = new Text("row1");
Text colFam = new Text("myColFam");
Text colQual = new Text("myColQual");
ColumnVisibility colVis = new ColumnVisibility("public"); // security label for this value
long timestamp = System.currentTimeMillis();
Value value = new Value("myValue".getBytes());            // Value wraps a byte[]
Mutation mutation = new Mutation(rowID);
mutation.put(colFam, colQual, colVis, timestamp, value);
SECURITY LABEL EXPRESSION SYNTAX
Security labels consist of a set of user-defined tokens that are required to read the value the label is associated with. The set of tokens required can be specified using syntax that supports logical AND and OR combinations of tokens, as well as nesting groups of tokens together.
For example, suppose within our organization we want to label our data values with security labels defined in terms of user roles. We might have tokens such as:
admin
audit
system
These can be specified alone or combined using logical operators:
// Users must have admin privileges:
admin
// Users must have admin and audit privileges
admin&audit
// Users with either admin or audit privileges
admin|audit
// Users must have audit and one or both of admin or system
(admin|system)&audit
When both | and & operators are used, parentheses must be used to specify precedence of the operators.
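Tying that back to the Mutation snippet above, a value can be written so that reading it requires audit plus one of admin or system. This is just a sketch with placeholder row and column names, reusing the same Text, ColumnVisibility, Mutation and Value classes:

// write a value readable only with audit AND (admin OR system)
ColumnVisibility restricted = new ColumnVisibility("(admin|system)&audit");
Mutation m = new Mutation(new Text("row2"));
m.put(new Text("myColFam"), new Text("myColQual"), restricted, System.currentTimeMillis(),
      new Value("sensitiveValue".getBytes()));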
AUTHORIZATION
When clients attempt to read data from Accumulo, any security labels present are examined against the set of authorizations passed by the client code when the Scanner or BatchScanner are created. If the authorizations are determined to be insufficient to satisfy the security label, the value is suppressed from the set of results sent back to the client.
Authorizations are specified as a comma-separated list of tokens the user possesses:
// user possesses both admin and system level access
Authorizations auths = new Authorizations("admin", "system");
Scanner s = connector.createScanner("table", auths);
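To make the suppression behavior concrete, here is a minimal sketch (assuming an already-constructed Connector and an existing table named "table", neither of which is set up here): any entry whose security label is not satisfied by these tokens simply never appears in the iteration.

import java.util.Map.Entry;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class ScanExample {
    // Scan "table" with admin and system authorizations; values whose labels
    // these tokens do not satisfy are silently omitted from the results.
    static void printVisibleEntries(Connector connector) throws TableNotFoundException {
        Authorizations auths = new Authorizations("admin", "system");
        Scanner scanner = connector.createScanner("table", auths);
        for (Entry<Key, Value> entry : scanner) {
            System.out.println(entry.getKey().getRow() + " -> " + new String(entry.getValue().get()));
        }
    }
}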
USER AUTHORIZATIONS
Each Accumulo user has a set of associated security labels. To manipulate these in the shell, use the setauths and getauths commands. These may also be modified using the Java security operations API.
When a user creates a scanner, a set of Authorizations is passed. If the authorizations passed to the scanner are not a subset of the user's authorizations, then an exception will be thrown.
To prevent users from writing data they cannot read, add the visibility constraint to a table. Use the -evc option in the createtable shell command to enable this constraint. For existing tables, use the following shell command to enable the visibility constraint. Ensure the constraint number does not conflict with any existing constraints.
config -t table -s table.constraint.1=org.apache.accumulo.core.security.VisibilityConstraint
Any user with the alter table permission can add or remove this constraint. This constraint is not applied to bulk imported data; if this is a concern, then disable the bulk import permission.
SECURE AUTHORIZATIONS HANDLING
For applications serving many users, it is not expected that an Accumulo user will be created for each application user. In this case an Accumulo user with all authorizations needed by any of the application's users must be created. To service queries, the application should create a scanner with the application user's authorizations. These authorizations could be obtained from a trusted 3rd party.
Often production systems will integrate with Public-Key Infrastructure (PKI) and designate client code within the query layer to negotiate with PKI servers in order to authenticate users and retrieve their authorization tokens (credentials). This requires users to specify only the information necessary to authenticate themselves to the system. Once user identity is established, their credentials can be accessed by the client code and passed to Accumulo outside of the reach of the user.
QUERY SERVICES LAYER
Since the primary method of interaction with Accumulo is through the Java API, production environments often call for the implementation of a Query layer. This can be done using web services in containers such as Apache Tomcat, but is not a requirement. The Query Services Layer provides a mechanism for providing a platform on which user facing applications can be built. This allows the application designers to isolate potentially complex query logic, and enables a convenient point at which to perform essential security functions.
Several production environments choose to implement authentication at this layer, where users' identifiers are used to retrieve their access credentials, which are then cached within the query layer and presented to Accumulo through the Authorizations mechanism.
Typically, the query services layer sits between Accumulo and user workstations.
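As a rough illustration of that pattern, and only as a sketch (the AuthorizationService interface below is a hypothetical stand-in for the PKI or trusted 3rd party lookup, not an Accumulo API), the query layer holds a service-account Connector with a superset of tokens but scans with exactly the end user's tokens:

import java.util.List;
import java.util.Map.Entry;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class QueryService {
    // Hypothetical trusted 3rd party / PKI lookup of a user's authorization tokens.
    interface AuthorizationService {
        List<String> tokensFor(String userId);
    }

    private final Connector connector;            // service account able to read on behalf of all users
    private final AuthorizationService authService;

    public QueryService(Connector connector, AuthorizationService authService) {
        this.connector = connector;
        this.authService = authService;
    }

    // Run a scan on behalf of an already-authenticated application user,
    // passing only that user's tokens so cell-level security still applies.
    public void scanFor(String userId, String table) throws TableNotFoundException {
        List<String> tokens = authService.tokensFor(userId);
        Authorizations auths = new Authorizations(tokens.toArray(new String[0]));
        Scanner scanner = connector.createScanner(table, auths);
        for (Entry<Key, Value> entry : scanner) {
            // hand results back to the user-facing application from here
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}

The design point is that the service account never scans with its full set of tokens on a user's behalf; it always narrows down to the credentials retrieved for that specific user.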
Apache Accumulo version 1.5 just came out for download, with docs.
New software-as-a-service solutions will start to spring up in the market, as will new out-of-the-box open source solutions, whether we are trying to prevent health care fraud, protect individuals from identity theft or protect corporations from intrusion, all without compromising the (C)onfidentiality, (I)ntegrity and (A)vailability of the data and distributed systems.
/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/
Cloudera, Yahoo and the Apache Hadoop Community Security Branch Release Update
In the wake of Yahoo! having announced that they would discontinue their Hadoop distribution and focus their efforts on Apache Hadoop http://yhoo.it/i9Ww8W, the landscape has become tumultuous.
Yahoo! engineers have spent their time and effort contributing back to the Apache Hadoop security branch (a branch of 0.20) and have proposed release candidates.
Currently being voted on and discussed is "Release candidate 0.20.203.0-rc1". If you are following the VOTE and the DISCUSSION, then maybe you are like me and it just cannot be done without a bowl of popcorn before opening the emails. It is getting heated in a good and constructive kind of way; see http://mail-archives.apache.org/mod_mbox/hadoop-general/201105.mbox/thread. There are already more emails in 5 days of May than there were in all of April. Woot!
My take? Has it become Cloudera vs. Yahoo!, and will Apache Hadoop releases become fragmented because of it? Well, it is kind of like that already. 0.21 is the latest, and can anyone who is not a committer quickly know or find out the difference between that and the other release branches? It is esoteric 😦 0.22 is right around the corner too, which is a release from trunk.
Let's take HBase (a Hadoop project) as an example. Do you know which HDFS releases can support HBase in production without losing data? If you do, then maybe you don't realize that many people still don't even know about the branch. Now that CDH3 is out you can use that (thanks Cloudera!); otherwise it is highly recommended not to run HBase in production unless you use the append branch http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20-append/ of 0.20, which makes you miss out on other changes in trunk releases…
__ eyes crossing inwards and sideways with what branch does what and when the trunk release has everything __
Hadoop is becoming à la carte: which features and fixes can I live without, given what I really need to deploy… or requiring companies to hire a committer… or a bunch of folks who do nothing but Hadoop day in and day out (sounds like Oracle, ahhhhhh)… or going with the Cloudera Distribution (which is what I do, and I don't look back). The barrier to entry feels like it has increased over the last year. However, stepping back from that, the system overall has had a lot of improvements! A lot of great work by a lot of dedicated folks putting in their time and effort towards making Hadoop (in whatever form the elephant stampedes through its data) a reality.
Big shops that have teams of "Hadoop Engineers" (Yahoo, Facebook, eBay, LinkedIn, etc.) with contributors and/or committers on those teams should not see much impact, because ultimately they are able to roll their own releases for whatever they need/want in production and just support them themselves. Not all are so endowed.
Now, all of that having been said, I write this because the discussion is REALLY good and has a lot of folks (including those from Yahoo! and Cloudera) bringing up pain points and proposing some great solutions that hopefully will contribute to the continued growth and success of the Apache Hadoop community http://hadoop.apache.org/. Still, if you want to run it in your company (and don't have a committer on staff), then go download CDH3 http://www.cloudera.com; it will get you going with the latest and greatest of all the releases, branches, etc. Great documentation too!
/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/