Online Free Computer Tutorials.: August 2018

Wednesday, August 22, 2018

How-to: Run a Simple Apache Spark App in CDH 5

Getting started with Apache Spark in CDH 5.x is easy using this simple example. Apache Spark is a general-purpose, cluster computing f...

Apache Hive on Apache Spark: Motivations and Design Principles

Two of the most vibrant communities in the Apache Hadoop ecosystem are now working together to bring users a Hive-on-Spark option that...

How-to: Build Advanced Time-Series Pipelines in Apache Crunch

Learn how creating dataflow pipelines for time-series analysis is a lot easier with Apache Crunch. In a previous blog post, I describe...

Bayesian Machine Learning on Apache Spark

Markov Chain Monte Carlo methods are another example of useful statistical computation for Big Data that is capably enabled by Apache ...

Building Lambda Architecture with Spark Streaming

The versatility of Apache Spark's API for both batch/ETL and streaming workloads brings the promise of lambda architecture to the ...

Getting Started with Big Data Architecture

What does a "Big Data engineer" do, and what does "Big Data architecture" look like? In this post, you'll get ...

Apache Kafka for Beginners

When used in the right way and for the right use case, Kafka has unique attributes that make it a highly attractive option for data in...

Calculating CVA with Apache Spark

Thanks to Matthew Dixon, principal consultant at Quiota LLC and Professor of Analytics at the University of San Francisco, and Mohamma...

How-to: Translate from MapReduce to Apache Spark (Part 2)

The conclusion to this series covers Combiner-like aggregation functionality, counters, partitioning, and serialization. Apache Spark ...

Working with Apache Spark: Or, How I Learned to Stop Worrying and Love the Shuffle

Our thanks to Ilya Ganelin, Senior Data Engineer at Capital One Labs, for the guest post below about his hard-earned lessons from usin...

Deploying Apache Kafka: A Practical FAQ

This post contains answers to common questions about deploying and configuring Apache Kafka as part of a Cloudera-powered enterprise d...

Ibis on Impala: Python at Scale for Data Science

This new Cloudera Labs project promises to deliver the great Python user experience and ecosystem at Hadoop scale. Across the user com...

How Apache Spark, Scala, and Functional Programming Made Hard Problems Easy at Barclays

Thanks to Barclays employees Sam Savage, VP Data Science, and Harry Powell, Head of Advanced Analytics, for the guest post below about...

Interactive Analytics on Dynamic Big Data in Python using Kudu, Impala, and Ibis

The following post was originally published in the Ibis project blog. (Ibis is a data analysis framework incubating in Cloudera Labs t...

Time Series for Spark: 0.2.0 Released

The 0.2.0 release of the spark-ts package includes a fleshed-out Java API, among other things. The spark-ts library, which wa...

Progress Report: Bringing Erasure Coding to Apache Hadoop

Get an update on the progress of the effort to bring erasure coding to HDFS, including a report about fresh performance benchmark test...

How-to: Build a Real-Time Search System using StreamSets, Apache Kafka, and Cloudera Search

Thanks to Jonathan Natkins, a field engineer from StreamSets, for the guest post below about using StreamSets Data Collector—open sour...

Making Python on Apache Hadoop Easier with Anaconda and CDH

Enabling Python development on CDH clusters (for PySpark, for example) is now much easier thanks to new integration with Continuum Ana...

Introducing Apache Arrow: A Fast, Interoperable In-Memory Columnar Data Structure Standard

Engineers from across the Apache Hadoop community are collaborating to establish Arrow as a de-facto standard for columnar in-memory p...

Time Series for Spark Joins Cloudera Labs

Bringing Time Series for Spark into Cloudera Labs reflects its potential future usefulness in more use cases. Time is more...

Building a Data Science Portfolio: Storytelling with Data (Part 2: Data Exploration)

The following post (Part 2 of two parts) by Vik Paruchuri, founder of data science learning platform Dataquest, offers some detailed a...

Securing Apache Spark Shuffle using Apache Commons Crypto

Learn how the performance advantages of the Crypto cryptographic library will provide an upgrade for Spark shuffle encryption over the...

Apache Kudu and Apache Impala (Incubating): The Integration Roadmap

Impala users can expect new performance and usability benefits via improved integration with Kudu. It's been nearly one year since...

Introducing sparklyr, an R Interface for Apache Spark

Earlier this week, RStudio announced sparklyr, a new package that provides an interface between R and Apache Spark. We republish RStud...

Resource Management for Apache Impala (incubating)

Apache Impala (incubating) includes several features that allow you to restrict or allocate resources so as to maximize stability and ...

How-to: Deploy a Secure Enterprise Data Hub on Microsoft Azure – Part 1

Learn how to use Cloudera Director, Microsoft Active Directory (AD DS, AD CS, AD DNS), SAMBA, and SSSD to deploy a secure EDH cluster ...

How-to: Fuzzy Name Indexing in Apache Hadoop with Rosette and Cloudera Search

In this guide, learn how to use Cloudera Search with Basis Technology's Rosette® to perform fuzzy name searches in multiple langua...

How-to: Deploy a Secure Enterprise Data Hub on Microsoft Azure – Part 2

In Part 1 of the blog, we covered all the prerequisites needed to deploy a CDH cluster on the Microsoft Azure cloud platform. In Part ...

How to secure ‘Internet exposed’ Apache Hadoop

You may have heard of the recent (and ongoing) hacks targeting open source database solutions like MongoDB and Apache Hadoop. From wha...

Hardening Apache ZooKeeper Security: SASL Quorum Peer Mutual Authentication and Authorization

Background: Apache ZooKeeper is a core infrastructure component in the Apache Hadoop stack and is also widely used by many companies for se...

Up and running with Apache Spark on Apache Kudu

After the GA of Apache Kudu in Cloudera CDH 5.10, we take a look at the Apache Spark on Kudu integration, share code snippets, and exp...

Working with UDFs in Apache Spark

User-defined functions (UDFs) are a key feature of most SQL environments to extend the system's built-in functionality. UDFs allow...

Analyzing US flight data on Amazon S3 with sparklyr and Apache Spark 2.0

We posted several blog posts about sparklyr (introduction, automation), which enables you to analyze big data leveraging Apache Spark ...

How-to: Log Analytics with Solr, Spark, OpenTSDB and Grafana

Organizations analyze logs for a variety of reasons. Some typical use cases include predicting server failures, analyzing customer beh...

Blacklisting in Apache Spark

At Cloudera, we're always working to provide our customers and the Apache Spark community with the most robust, most reliable soft...

Deep Learning Frameworks on CDH and Cloudera Data Science Workbench

The emergence of "Big Data" has made machine learning much easier because the key burden of statistical estimation—generaliz...

Apache Impala Leads Traditional Analytic Database

Unmodified TPC-DS-based performance benchmarks show Impala's leadership compared to a traditional analytic database (Greenplum), es...
