Getting started with Apache Spark in CDH 5.x is easy using this simple example. Apache Spark is a general-purpose, cluster computing f...
Wednesday, August 22, 2018
Apache Hive on Apache Spark: Motivations and Design Principles
Two of the most vibrant communities in the Apache Hadoop ecosystem are now working together to bring users a Hive-on-Spark option that...
How-to: Build Advanced Time-Series Pipelines in Apache Crunch
Learn how creating dataflow pipelines for time-series analysis is a lot easier with Apache Crunch. In a previous blog post, I describe...
Bayesian Machine Learning on Apache Spark
Markov Chain Monte Carlo methods are another example of useful statistical computation for Big Data that is capably enabled by Apache ...
Building Lambda Architecture with Spark Streaming
The versatility of Apache Spark's API for both batch/ETL and streaming workloads brings the promise of lambda architecture to the ...
Getting Started with Big Data Architecture
What does a "Big Data engineer" do, and what does "Big Data architecture" look like? In this post, you'll get ...
Apache Kafka for Beginners
When used in the right way and for the right use case, Kafka has unique attributes that make it a highly attractive option for data in...
Calculating CVA with Apache Spark
Thanks to Matthew Dixon, principal consultant at Quiota LLC and Professor of Analytics at the University of San Francisco, and Mohamma...
How-to: Translate from MapReduce to Apache Spark (Part 2)
The conclusion to this series covers Combiner-like aggregation functionality, counters, partitioning, and serialization. Apache Spark ...
Working with Apache Spark: Or, How I Learned to Stop Worrying and Love the Shuffle
Our thanks to Ilya Ganelin, Senior Data Engineer at Capital One Labs, for the guest post below about his hard-earned lessons from usin...
Deploying Apache Kafka: A Practical FAQ
This post contains answers to common questions about deploying and configuring Apache Kafka as part of a Cloudera-powered enterprise d...
Ibis on Impala: Python at Scale for Data Science
This new Cloudera Labs project promises to deliver the great Python user experience and ecosystem at Hadoop scale. Across the user com...
How Apache Spark, Scala, and Functional Programming Made Hard Problems Easy at Barclays
Thanks to Barclays employees Sam Savage, VP Data Science, and Harry Powell, Head of Advanced Analytics, for the guest post below about...
Interactive Analytics on Dynamic Big Data in Python using Kudu, Impala, and Ibis
The following post was originally published in the Ibis project blog. (Ibis is a data analysis framework incubating in Cloudera Labs t...
Time Series for Spark: 0.2.0 Released
The 0.2.0 release of the spark-ts package includes a fleshed-out Java API, among other things. The spark-ts library, which wa...
Progress Report: Bringing Erasure Coding to Apache Hadoop
Get an update on the progress of the effort to bring erasure coding to HDFS, including a report about fresh performance benchmark test...
How-to: Build a Real-Time Search System using StreamSets, Apache Kafka, and Cloudera Search
Thanks to Jonathan Natkins, a field engineer from StreamSets, for the guest post below about using StreamSets Data Collector—open sour...
Making Python on Apache Hadoop Easier with Anaconda and CDH
Enabling Python development on CDH clusters (for PySpark, for example) is now much easier thanks to new integration with Continuum Ana...
Introducing Apache Arrow: A Fast, Interoperable In-Memory Columnar Data Structure Standard
Engineers from across the Apache Hadoop community are collaborating to establish Arrow as a de-facto standard for columnar in-memory p...
Time Series for Spark Joins Cloudera Labs
Bringing Time Series for Spark into Cloudera Labs is a reflection of its potential future usefulness in more use cases. Time is more...
Building a Data Science Portfolio: Storytelling with Data (Part 2: Data Exploration)
The following post (Part 2 of two parts) by Vik Paruchuri, founder of data science learning platform Dataquest, offers some detailed a...
Securing Apache Spark Shuffle using Apache Commons Crypto
Learn how the performance advantages of the Crypto cryptographic library will provide an upgrade for Spark shuffle encryption over the...
Apache Kudu and Apache Impala (Incubating): The Integration Roadmap
Impala users can expect new performance and usability benefits via improved integration with Kudu. It's been nearly one year since...
Introducing sparklyr, an R Interface for Apache Spark
Earlier this week, RStudio announced sparklyr, a new package that provides an interface between R and Apache Spark. We republish RStud...
Resource Management for Apache Impala (incubating)
Apache Impala (incubating) includes several features that allow you to restrict or allocate resources so as to maximize stability and ...
How-to: Deploy a Secure Enterprise Data Hub on Microsoft Azure – Part 1
Learn how to use Cloudera Director, Microsoft Active Directory (AD DS, AD CS, AD DNS), SAMBA, and SSSD to deploy a secure EDH cluster ...
How-to: Fuzzy Name Indexing in Apache Hadoop with Rosette and Cloudera Search
In this guide, learn how to use Cloudera Search with Basis Technology's Rosette® to perform fuzzy name searches in multiple langua...
How-to: Deploy a Secure Enterprise Data Hub on Microsoft Azure – Part 2
In Part 1 of the blog, we covered all the prerequisites needed to deploy a CDH cluster on the Microsoft Azure cloud platform. In Part ...
How to secure ‘Internet exposed’ Apache Hadoop
You may have heard of the recent (and ongoing) hacks targeting open source database solutions like MongoDB and Apache Hadoop. From wha...
Hardening Apache ZooKeeper Security: SASL Quorum Peer Mutual Authentication and Authorization
Background: Apache ZooKeeper is a core infrastructure component in the Apache Hadoop stack and is also widely used by many companies for se...
Up and running with Apache Spark on Apache Kudu
After the GA of Apache Kudu in Cloudera CDH 5.10, we take a look at the Apache Spark on Kudu integration, share code snippets, and exp...
Working with UDFs in Apache Spark
User-defined functions (UDFs) are a key feature of most SQL environments to extend the system's built-in functionality. UDFs allow...
Analyzing US flight data on Amazon S3 with sparklyr and Apache Spark 2.0
We posted several blog posts about sparklyr (introduction, automation), which enables you to analyze big data leveraging Apache Spark ...
How-to: Log Analytics with Solr, Spark, OpenTSDB and Grafana
Organizations analyze logs for a variety of reasons. Some typical use cases include predicting server failures, analyzing customer beh...
Blacklisting in Apache Spark
At Cloudera, we're always working to provide our customers and the Apache Spark community with the most robust, most reliable soft...
Deep Learning Frameworks on CDH and Cloudera Data Science Workbench
The emergence of "Big Data" has made machine learning much easier because the key burden of statistical estimation—generaliz...
Apache Impala Leads Traditional Analytic Database
Unmodified TPC-DS-based performance benchmarks show Impala's leadership compared to a traditional analytic database (Greenplum), es...