Confluent

The Confluent Platform is a stream data platform that leverages the proven capabilities of Apache Kafka to make all of your data available as realtime streams. It enables you to easily organize large amounts of data into a well-managed, unified stream data platform for your organization, which can be used for a wide variety of realtime as well as batch operational and analytical purposes. With the Confluent Platform, you can spend more time deriving business value from your data and less time figuring out how to move data around your organization.

Company: Confluent Language: Scala
Languages supported: Java Scala Go C C++ Python

Hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Language: Java
Languages supported: Java

Oozie

Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java MapReduce, Streaming MapReduce, Pig, Hive, Sqoop, and DistCp) as well as system-specific jobs (such as Java programs and shell scripts). It is a scalable, reliable, and extensible system.

Language: Java
Languages supported: XML
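The core idea of a Workflow job being a DAG of actions can be illustrated with a topological ordering: an action runs only after every action it depends on has finished. The action names below are invented for illustration; Oozie itself declares these dependencies in XML.

```python
# Hypothetical sketch of ordering a DAG of workflow actions: an action
# may run only once all the actions it depends on have completed.
from graphlib import TopologicalSorter

# Edges read "action: set of actions it depends on".
workflow = {
    "import": set(),          # e.g. a Sqoop import
    "clean":  {"import"},     # e.g. a Pig script
    "load":   {"clean"},      # e.g. a Hive load
    "report": {"clean"},
}

order = list(TopologicalSorter(workflow).static_order())
```

Because the graph is acyclic, a valid order always exists; the scheduler is also free to run "load" and "report" in parallel since neither depends on the other.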

Cassandra

Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages. Cassandra's data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.

Company: Datastax Language: Java
Languages supported: Cql
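The multi-datacenter replication the description highlights can be pictured as placing each key's replicas on consecutive nodes of a hash ring, independently per datacenter. The datacenter names, node names, and replication factor below are made up for illustration; this is the placement idea only, not Cassandra's actual implementation.

```python
# Sketch of per-datacenter replica placement on a hash ring.
import hashlib

DATACENTERS = {"dc1": ["n1", "n2", "n3"], "dc2": ["m1", "m2", "m3"]}
REPLICATION_FACTOR = 2   # replicas kept in each datacenter

def replicas(key):
    # Hash the key to a token, then walk the ring of nodes in each DC.
    token = int(hashlib.md5(key.encode()).hexdigest(), 16)
    placement = {}
    for dc, nodes in DATACENTERS.items():
        start = token % len(nodes)
        placement[dc] = [nodes[(start + i) % len(nodes)]
                         for i in range(REPLICATION_FACTOR)]
    return placement

where = replicas("user:42")
```

Keeping a full replica set in every datacenter is what lets local reads stay low-latency and lets the cluster survive the loss of an entire region.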

Crossdata

Crossdata is a distributed framework and data layer that unifies interaction with batch and streaming sources, supporting multiple datastore technologies thanks to its generic architecture and a custom SQL-like language with support for streaming queries. Supporting multiple architectures imposes two main challenges: how to normalize access to the datastores, and how to cope with datastore limitations. To access multiple technologies, Crossdata defines a common unifying interface containing the set of operations that a datastore may support. New connectors can easily be added to increase its connectivity capabilities. Two types of connectors are defined: native and Spark-based. Native connectors are faster for simple operations, while Spark-based connectors offer a larger set of functionality. The Crossdata planner decides which connector will be used for any request based on its characteristics. It offers a shell, Java and REST APIs, and an ODBC driver for BI tools.

Company: Stratio Language: Java
Languages supported: Java Scala SQL REST

Ambari

Apache Ambari is a completely open operational framework for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari includes an intuitive collection of operator tools and a set of APIs that mask the complexity of Hadoop, simplifying the operation of clusters. With hundreds of years of combined experience, Hortonworks, along with members of the Hadoop community, has answered the call to deliver the key services required for enterprise Hadoop.

Language: Java

Presto DB

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.

Company: Facebook Language: Java
Languages supported: SQL

Datacube

The purpose of a data cube is to store aggregate information about large numbers of data points. The data cube stores aggregate information about interesting subsets of the input data points. For example, if you're writing a web server log analyzer, your input points could be log lines, and you might be interested in keeping a count for each browser type, each browser version, OS type, OS version, and other attributes. You might also be interested in counts for a particular combination of (browserType,browserVersion,osType), (browserType,browserVersion,osType,osVersion), etc. It's a challenge to quickly add and change counters without wasting time writing database code and reprocessing old data into new counters. A data cube helps you keep these counts. You declare what you want to count, and the data cube maintains all the counters as you supply new data points.

Company: Urban Airship Language: Java
Languages supported: Java
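The web-server-log example above can be made concrete with a small sketch: you declare the attribute combinations ("rollups") you care about, then feed in data points and let the cube maintain one counter per combination. The attribute names mirror the description; the code is an illustration of the idea, not Datacube's API.

```python
# Sketch of a data cube: declared rollups, one counter per combination.
from collections import Counter

ROLLUPS = [
    ("browser_type",),
    ("browser_type", "browser_version", "os_type"),
]

cube = {rollup: Counter() for rollup in ROLLUPS}

def add_point(point):
    # One input point increments a counter in every declared rollup.
    for rollup in ROLLUPS:
        key = tuple(point[attr] for attr in rollup)
        cube[rollup][key] += 1

add_point({"browser_type": "Firefox", "browser_version": "38", "os_type": "Linux"})
add_point({"browser_type": "Firefox", "browser_version": "37", "os_type": "OSX"})
```

Adding a new rollup in a real system is the hard part the description alludes to: old data points must be reprocessed to backfill the new counters, which is the work Datacube aims to manage for you.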

Tachyon

Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory speed across cluster frameworks, such as Spark and MapReduce. It achieves high performance by leveraging lineage information and using memory aggressively. Tachyon caches working set files in memory, thereby avoiding going to disk to load datasets that are frequently read. This enables different jobs/queries and frameworks to access cached files at memory speed. Tachyon is Hadoop compatible: existing Spark and MapReduce programs can run on top of it without any code change. The project is open source (Apache License 2.0) and is deployed at multiple companies. It has more than 40 contributors from over 15 institutions, including Yahoo, Intel, and Red Hat. The project is the storage layer of the Berkeley Data Analytics Stack (BDAS) and also part of the Fedora distribution.

Language: Java
Languages supported: Java

Spark-jobserver

Spark-jobserver provides a RESTful interface for submitting and managing Apache Spark jobs, jars, and job contexts. This repo contains the complete Spark job server project, including unit tests and deploy scripts. It was originally started at Ooyala.

Company: Ooyala Language: Scala
Languages supported: Scala

Oryx 2

Oryx 2 is a realization of the lambda architecture built on Apache Spark and Apache Kafka, but with specialization for real-time large scale machine learning. It is a framework for building applications, but also includes packaged, end-to-end applications for collaborative filtering, classification, regression and clustering.

Company: Cloudera Language: Java
Languages supported: Java Scala

Amazon DynamoDB

Amazon DynamoDB is a fast and flexible NoSQL database service for all applications that need consistent, single-digit millisecond latency at any scale. It is a fully managed database and supports both document and key-value data models. Its flexible data model and reliable performance make it a great fit for mobile, web, gaming, ad-tech, IoT, and many other applications.

Company: Amazon
Languages supported: Java NodeJS Perl Ruby Python

Flume

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

Language: Java
Languages supported:
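Flume's "simple and flexible architecture based on streaming data flows" is a source-to-channel-to-sink pipeline: a source produces events, a channel buffers them (this is where the tunable reliability lives), and a sink drains them to a destination. The component names below are illustrative, not Flume's API.

```python
# Sketch of Flume's source -> channel -> sink data-flow model.
from collections import deque

channel = deque()   # buffers events between source and sink
delivered = []      # stands in for the destination (e.g. HDFS)

def source(events):
    # A source ingests events and hands them to the channel.
    for event in events:
        channel.append(event)

def sink():
    # A sink drains the channel; an event leaves the channel only
    # once it has been handed to the destination.
    while channel:
        delivered.append(channel.popleft())

source(["log line 1", "log line 2", "log line 3"])
sink()
```

In real Flume the channel can be memory-backed (fast, lossy on crash) or file-backed (durable), which is the knob behind "tunable reliability mechanisms".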

Spark

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop's two-stage disk-based MapReduce paradigm, Spark's in-memory primitives provide performance up to 100 times faster for certain applications. The Spark engine runs in a variety of environments, from cloud services to Hadoop or Mesos clusters. It is used to perform ETL, interactive queries (SQL), advanced analytics (e.g. machine learning) and streaming over large datasets in a wide range of data stores (e.g. HDFS, Cassandra, HBase, S3). Spark supports a variety of popular development languages including Java, Python and Scala.

Language: Scala
Languages supported: Scala Java Python SQL
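The "in-memory primitives" the description refers to are RDD-style operations: transformations such as map and filter build a pipeline, and an action such as reduce runs it. The toy class below mimics the shape of that API in one process; real Spark partitions the data and executes the same pipeline across a cluster.

```python
# An in-memory sketch of an RDD-style API (illustrative, not PySpark).
from functools import reduce as fold

class MiniRDD:
    def __init__(self, data):
        self.data = list(data)

    def map(self, f):                 # transformation
        return MiniRDD(f(x) for x in self.data)

    def filter(self, pred):           # transformation
        return MiniRDD(x for x in self.data if pred(x))

    def reduce(self, f):              # action: actually produces a value
        return fold(f, self.data)

total = (MiniRDD(range(10))
         .filter(lambda x: x % 2 == 0)
         .map(lambda x: x * x)
         .reduce(lambda a, b: a + b))
# 0 + 4 + 16 + 36 + 64 = 120
```

Keeping intermediate results in memory between these chained steps, instead of writing to disk after each one as two-stage MapReduce does, is where Spark's speedup comes from.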

Thrift

The Apache Thrift software framework, for scalable cross-language services development, combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi, and other languages.

Language: C++
Languages supported:

Hive

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Language: Java
Languages supported: HiveQL
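"Projecting structure onto" data means the raw files stay unstructured in storage and a declared schema is applied only at query time (schema-on-read). The sketch below shows that idea in miniature; the field names and sample rows are invented, and this is not Hive's implementation.

```python
# Schema-on-read in miniature: raw lines plus a schema applied at query time.
raw = ["2015-05-01,GET,/index,200", "2015-05-01,POST,/login,500"]
schema = ["date", "method", "path", "status"]

def table(lines):
    # The "projection": parse each raw line into a row only when asked.
    return [dict(zip(schema, line.split(","))) for line in lines]

# Rough equivalent of: SELECT path FROM logs WHERE status = '500'
errors = [row["path"] for row in table(raw) if row["status"] == "500"]
```

Hive compiles queries like this HiveQL equivalent into MapReduce (or Tez) jobs over the raw files, so the same data can be reinterpreted under a different schema without rewriting it.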

Mesos

Apache Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively. Mesos is built using the same principles as the Linux kernel, only at a different level of abstraction. The Mesos kernel runs on every machine and provides applications (e.g., Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and scheduling across entire datacenter and cloud environments.

Language: C++

Storm

Apache Storm is a free and open source distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm is simple, can be used with any programming language, and is a lot of fun to use! Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. Storm is fast: a benchmark clocked it at over a million tuples processed per second per node. It is scalable, fault-tolerant, guarantees your data will be processed, and is easy to set up and operate.

Language: Clojure
Languages supported: Java Python Ruby Perl Javascript
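Storm topologies are built from spouts (sources of an unbounded tuple stream) and bolts (stream transformations and aggregations). The generator-based sketch below shows that shape for the classic word count; the function names are illustrative, not Storm's actual API.

```python
# Toy spout/bolt topology: spout -> split bolt -> counting bolt.
from collections import Counter

def sentence_spout():
    # A spout emits a stream of tuples (here, sentences).
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield sentence

def split_bolt(stream):
    # A bolt transforms the stream: one sentence in, many words out.
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    # A terminal bolt aggregates the stream.
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts

word_counts = count_bolt(split_bolt(sentence_spout()))
```

In Storm each spout and bolt runs as many parallel tasks across the cluster, and tuples are acked back to the spout, which is how the "guarantees your data will be processed" claim is implemented.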

Ignite

Apache Ignite In-Memory Data Fabric is a high-performance, integrated, and distributed in-memory platform for computing and transacting on large-scale data sets in real time, orders of magnitude faster than possible with traditional disk-based or flash technologies. It is designed to deliver uncompromised performance for a wide set of in-memory computing use cases, from high-performance computing to the industry's most advanced data grid and streaming technologies.

Language: Java
Languages supported: Java

Drill

Apache Drill is a distributed MPP query layer that supports SQL and alternative query languages against NoSQL and Hadoop data storage systems. It was inspired in part by Google's Dremel.

Language: Java
Languages supported: SQL

Neo4j

Neo4j is a graph database. It is a high-performance graph store with all the features expected of a mature and robust database, like a friendly query language and ACID transactions. The programmer works with a flexible network structure of nodes and relationships rather than static tables, yet enjoys all the benefits of an enterprise-quality database. For many applications, Neo4j offers orders of magnitude performance benefits compared to relational DBs.

Company: Neo4j Language: Java
Languages supported: Java Cypher
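The node-and-relationship model can be sketched by walking typed edges instead of joining tables. The tiny query below mimics the shape of a Cypher pattern like (a)-[:FRIEND]->(b)-[:FRIEND]->(c); the node names and relationship types are invented, and this is not Neo4j's API.

```python
# A graph as (source, relationship_type, destination) triples.
relationships = [
    ("alice", "FRIEND", "bob"),
    ("bob",   "FRIEND", "carol"),
    ("alice", "WORKS_WITH", "carol"),
]

def neighbors(node, rel_type):
    # Follow only edges of the requested type out of `node`.
    return [dst for src, rel, dst in relationships
            if src == node and rel == rel_type]

# Friends-of-friends of alice: expand FRIEND twice.
fof = {n2 for n1 in neighbors("alice", "FRIEND")
           for n2 in neighbors(n1, "FRIEND")}
```

The performance claim in the description comes from this access pattern: each hop is a pointer traversal from a node to its neighbors, so the cost of the query grows with the subgraph visited, not with the total size of the data as a table join would.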

Kylin

Apache Kylin is an open source Distributed Analytics Engine from eBay Inc. that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets.

Company: eBay Language: Java, Javascript
Languages supported: SQL

Flink

Apache Flink is an in-memory, scalable data processing framework. Like Spark's, its exposed API resembles a functional collection API: your data is essentially sequences and lists of any kind of data, and you have methods like map, groupBy, or reduce to perform computations on these data sets, which are then automatically scaled out to the cluster. The main difference from Spark is that Flink takes an approach known from databases to optimize queries in several ways.

Language: Java
Languages supported: Scala Java

GlusterFS

GlusterFS is an open source, distributed file system capable of scaling to several petabytes (actually, 72 brontobytes!) and handling thousands of clients. GlusterFS clusters together storage building blocks over Infiniband RDMA or TCP/IP interconnect, aggregating disk and memory resources and managing data in a single global namespace. GlusterFS is based on a stackable user space design and can deliver exceptional performance for diverse workloads.

Company: Red Hat Language: C

Zookeeper

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented, there is a lot of work that goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.

Language: Java

Stratio-Cassandra

Stratio Cassandra is a fork of Apache Cassandra in which the index functionality has been extended to provide near-real-time search comparable to Elasticsearch or Solr, including full-text search capabilities and free multivariable search. This is achieved through an Apache Lucene-based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. Stratio Cassandra is one of the core modules on which Stratio's Big Data platform (SDS) is based.

Company: Stratio Language: Java
Languages supported: Cql Lucene

Sqoop

Apache Sqoop is a tool designed for efficiently transferring data between structured, semi-structured, and unstructured data sources. Relational databases are examples of structured data sources, with a well-defined schema for the data they store. Cassandra and HBase are examples of semi-structured data sources, and HDFS is an example of an unstructured data source that Sqoop can support.

Language: Java

DL4J

Deeplearning4j is the first commercial-grade, open-source, distributed deep-learning library written for Java and Scala. Integrated with Hadoop and Spark, DL4J is designed to be used in business environments, rather than as a research tool. It aims to be cutting-edge and plug-and-play, favoring convention over configuration, which allows fast prototyping for non-researchers.

Language: Java
Languages supported: Java Scala

Thunder

Thunder is a library for analyzing large-scale spatial and temporal neural data. It's fast to run, easy to extend, and designed for interactivity. It is built on Spark, a new framework for cluster computing. Thunder includes utilities for loading and saving different formats, classes for working with distributed spatial and temporal data, and modular functions for time series analysis, factorization, and model fitting. Analyses can easily be scripted or combined. It is written against Spark's Python API (Pyspark), making use of scipy, numpy, and scikit-learn.

Language: Python
Languages supported: Python

Hyperdex

HyperDex is a next generation key-value and document store with a wide array of features. HyperDex's key features -- namely its rich API, strong consistency, fault tolerance, support for MongoDB API, and ease of use -- provide strong guarantees to applications that are not matched by other NoSQL systems.

Language: C++
Languages supported: C Python Java Go Node Ruby

Pulsar

Pulsar is an open-source, real-time analytics platform and stream processing framework. It can be used to collect and process user and business events in real time, providing key insights and enabling systems to react to user activities within seconds. In addition to real-time sessionization and multi-dimensional metrics aggregation over time windows, Pulsar uses a SQL-like event processing language to offer custom stream creation through data enrichment, mutation, and filtering. Pulsar scales to a million events per second with high availability. It can be easily integrated with metrics stores like Cassandra and Druid.

Company: eBay Language: Java
Languages supported: Java SQL-Like

Samza

Apache Samza is a distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.
Simple API: Unlike most low-level messaging system APIs, Samza provides a very simple callback-based "process message" API comparable to MapReduce.
Managed state: Samza manages snapshotting and restoration of a stream processor's state. When the processor is restarted, Samza restores its state to a consistent snapshot. Samza is built to handle large amounts of state (many gigabytes per partition).
Fault tolerance: Whenever a machine in the cluster fails, Samza works with YARN to transparently migrate your tasks to another machine.
Durability: Samza uses Kafka to guarantee that messages are processed in the order they were written to a partition, and that no messages are ever lost.
Scalability: Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, replayable, fault-tolerant streams. YARN provides a distributed environment for Samza containers to run in.
Pluggable: Though Samza works out of the box with Kafka and YARN, it provides a pluggable API that lets you run Samza with other messaging systems and execution environments.
Processor isolation: Samza works with Apache YARN, which supports Hadoop's security model, and resource isolation through Linux cgroups.

Company: LinkedIn Language: Scala
Languages supported: Java Scala
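The callback-based "process message" API with managed state can be sketched as a task object: the framework calls process() once per message, and the task's local state survives failures via snapshots. The class and method names below are illustrative, not Samza's actual API.

```python
# Sketch of a stateful stream task with a process() callback.
class PageViewCounter:
    def __init__(self):
        self.state = {}            # managed, snapshottable local state

    def process(self, message):
        # Called by the framework once per incoming message.
        page = message["page"]
        self.state[page] = self.state.get(page, 0) + 1

    def snapshot(self):
        # What the framework would checkpoint for restart recovery.
        return dict(self.state)

task = PageViewCounter()
for msg in [{"page": "/home"}, {"page": "/home"}, {"page": "/about"}]:
    task.process(msg)
checkpoint = task.snapshot()
```

On restart after a failure, the framework would restore the task from the last checkpoint and replay the partition from the corresponding offset, which is why ordered, replayable Kafka streams matter to the design.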

Redis

Redis is an open source, BSD licensed, advanced key-value cache and store. It is often referred to as a data structure server since keys can contain strings, hashes, lists, sets, sorted sets, bitmaps and hyperloglogs.

Language: C
Languages supported: Java Erlang Go Perl Python Scala
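The "data structure server" idea means a key maps not just to a string but to a structure like a list or sorted set, with commands that operate on it in place. The sketch below mimics a few such commands in plain Python; it is not the Redis protocol or client API.

```python
# Toy key -> data-structure store with list and sorted-set commands.
store = {}

def rpush(key, *values):            # append to a list (like RPUSH)
    store.setdefault(key, []).extend(values)

def zadd(key, score, member):       # add to a sorted set (like ZADD)
    store.setdefault(key, {})[member] = score

def zrange(key, start, stop):       # members ordered by score (like ZRANGE)
    zset = store.get(key, {})
    members = sorted(zset, key=zset.get)
    return members[start:stop + 1]

rpush("queue", "a", "b")
zadd("scores", 3, "carol")
zadd("scores", 1, "alice")
top = zrange("scores", 0, 1)
```

The point of server-side structures is that operations like "pop from this list" or "give me the top-N by score" are single commands, atomic on the server, instead of read-modify-write cycles in the client.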

HBase

Apache HBase is the Hadoop database, a distributed, scalable, big data store. Use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows by millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.

Language: Java

Impala

Impala is a fully integrated, state-of-the-art analytic database architected specifically to leverage the flexibility and scalability strengths of Hadoop - combining the familiar SQL support and multi-user performance of a traditional analytic database with the rock-solid foundation of open source Apache Hadoop and the production-grade security and management extensions of Cloudera Enterprise. Impala is an integrated part of a Cloudera enterprise data hub - taking advantage of the unified resource management, metadata, security, and system administration across multiple frameworks. It is the de facto standard for open source interactive business intelligence and data discovery - certified with the leading BI and analytic application vendors and supported broadly across the Hadoop ecosystem.

Company: Cloudera Language: C
Languages supported: Sql

Elastic Search

Elasticsearch is a flexible and powerful open source, distributed, real-time search and analytics engine. Architected from the ground up for use in distributed environments where reliability and scalability are must-haves, Elasticsearch gives you the ability to move easily beyond simple full-text search. Through its robust set of APIs and query DSLs, plus clients for the most popular programming languages, Elasticsearch delivers on the near-limitless promises of search technology.

Company: Elasticsearch Language: Java
Languages supported: Java Scala Python Go R Ruby

Aerospike

Aerospike is a distributed, scalable NoSQL database. It is architected with three key objectives: to create a high-performance, scalable platform that meets the needs of today's web-scale applications; to provide the robustness and reliability (i.e., ACID) expected from traditional databases; and to provide operational efficiency (minimal manual involvement).

Company: Aerospike Language: C
Languages supported: Cql

Streaming CEP Engine

Streaming CEP engine is a Complex Event Processing platform built on Spark Streaming. It is the result of combining the power of Spark Streaming as a continuous computing framework and Siddhi CEP engine as complex event processing engine.

Company: Stratio Language: Scala
Languages supported: SQL

Falcon

Apache Falcon is a data governance engine that defines, schedules, and monitors data management policies. Falcon allows Hadoop administrators to centrally define their data pipelines, and then Falcon uses those definitions to auto-generate workflows in Apache Oozie.

Company: Hortonworks Language: Java

Pig

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties: ease of programming, optimization opportunities, and extensibility.

Language: Java
Languages supported: Pig Latin

Hue

Hue is a set of web applications that enable you to interact with a Hadoop cluster. Hue applications let you browse HDFS and jobs, manage a Hive metastore, run Hive, Cloudera Impala queries and Pig scripts, browse HBase, export data with Sqoop, submit MapReduce programs, build custom search engines with Solr, and schedule repetitive workflows with Oozie.

Company: Cloudera Language: Python

NiFi

Apache NiFi is an easy to use, powerful, and reliable system to process and distribute data. Apache NiFi was made for dataflow. It supports highly configurable directed graphs of data routing, transformation, and system mediation logic. Some of its key features include: Web-based user interface, Highly configurable, Data Provenance, Designed for extension and Secure.

Language: Java
Languages supported:

Kafka

Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of co-ordinated consumers. Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.

Language: Scala
Languages supported: Java Scala Go C C++ Python
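The "distributed commit log" model can be sketched directly: messages for a topic are appended to numbered partitions, and each consumer tracks its own offset per partition, so replay is just rereading from an older offset. The code below shows that model in one process; it is not the Kafka client API.

```python
# Sketch of a partitioned, append-only log with consumer offsets.
NUM_PARTITIONS = 2
log = [[] for _ in range(NUM_PARTITIONS)]   # one append-only list per partition

def produce(key, message):
    # Same key -> same partition, so per-key ordering is preserved.
    partition = hash(key) % NUM_PARTITIONS
    log[partition].append(message)
    return partition

def consume(partition, offset):
    # The consumer owns its offset; the broker just serves the log.
    return log[partition][offset:]

p = produce("user-42", "clicked")
produce("user-42", "purchased")
events = consume(p, 0)
```

Because the broker only appends and serves ranges, adding consumers or rewinding them costs the broker nothing extra, which is what makes the log a workable backbone for many independent downstream systems.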

Mahout

The Apache Mahout project's goal is to build a scalable machine learning library. Scalable to large data sets: the core algorithms for clustering, classification, and collaborative filtering are implemented on top of scalable, distributed systems, though contributions that run on a single machine are welcome as well. Scalable to support your business case: Mahout is distributed under a commercially friendly Apache Software license. Scalable community: the goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions on the project itself as well as on potential use cases; come to the mailing lists to find out more. Currently Mahout supports mainly three use cases: recommendation mining takes users' behavior and from that tries to find items users might like; clustering takes, e.g., text documents and groups them into groups of topically related documents; classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabeled documents to the (hopefully) correct category.

Language: Java
Languages supported: Java
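The recommendation-mining use case can be illustrated in a few lines: score items a user has not seen by how much the users who have them overlap with that user. The user histories below are invented, and this is a sketch of the idea only, not Mahout's API or one of its actual algorithms.

```python
# Toy collaborative filtering: score unseen items by user overlap.
from collections import Counter

histories = {
    "alice": {"milk", "bread", "eggs"},
    "bob":   {"milk", "bread"},
    "carol": {"bread", "eggs"},
}

def recommend(user):
    seen = histories[user]
    scores = Counter()
    for other, items in histories.items():
        if other == user:
            continue
        overlap = len(seen & items)        # similarity to the other user
        for item in items - seen:          # only items the user lacks
            scores[item] += overlap
    return [item for item, _ in scores.most_common()]

suggestions = recommend("bob")
```

Mahout's contribution is running this kind of computation (with better similarity measures) over millions of users by expressing it as distributed jobs rather than the in-memory loop shown here.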

Tez

Apache Tez is a generic data-processing pipeline engine envisioned as a low-level engine for higher abstractions such as Apache Hadoop Map-Reduce, Apache Pig, Apache Hive etc.

Company: Hortonworks Language: Java
Languages supported: Java

RabbitMQ

RabbitMQ is a messaging broker - an intermediary for messaging. It gives your applications a common platform to send and receive messages, and your messages a safe place to live until received.

Language: Erlang
Languages supported: Java .NET Erlang

Spindle

Spindle is a prototype Spark-based web analytics query engine designed around the requirements of production workloads. Spindle exposes query requests through a multi-threaded HTTP interface implemented with Spray. Queries are processed by loading data from Apache Parquet columnar storage format on the Hadoop distributed filesystem.

Company: Adobe Language: Javascript, Python, Scala
Languages supported:

Deep-spark

Deep is a thin integration layer between Apache Spark and several NoSQL datastores. It currently supports Apache Cassandra, MongoDB, Elasticsearch, Aerospike, HDFS, S3, and any database accessible through JDBC; support for several other datastores will be added in the near future.

Company: Stratio Language: Java
Languages supported: Java Scala

GearPump

GearPump is a lightweight real-time big data streaming engine. It is inspired by recent advances in the Akka framework and a desire to improve on existing streaming frameworks. The name GearPump is a reference to the engineering term `gear pump`, which is a super simple pump that consists of only two gears, but is very powerful at streaming water.

Company: Intel Language: Scala
Languages supported: Scala

Parquet

Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, as well as to projects outside it (such as Apache Spark), regardless of the choice of data processing framework, data model, or programming language.

Language: Java, C++

Cockroach

Cockroach is a distributed key/value datastore which supports ACID transactional semantics and versioned values as first-class features. The primary design goal is global consistency and survivability, hence the name. Cockroach aims to tolerate disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention. Cockroach nodes are symmetric; a design goal is one binary with minimal configuration and no required auxiliary services.

Language: Go
Languages supported: SQL
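"Versioned values as first-class features" is the multi-version (MVCC) idea: each write stores a (timestamp, value) pair, and a read at time t sees the latest version at or before t. The sketch below shows only that idea, with invented keys and timestamps; it is not Cockroach's implementation or API.

```python
# Sketch of a versioned key/value store: reads can look into the past.
versions = {}   # key -> list of (timestamp, value), appended in time order

def put(key, ts, value):
    versions.setdefault(key, []).append((ts, value))

def get(key, ts):
    # Latest version written at or before `ts` (assumes increasing
    # timestamps per key, as appended above).
    best = None
    for version_ts, value in versions.get(key, []):
        if version_ts <= ts:
            best = value
    return best

put("balance", ts=10, value=100)
put("balance", ts=20, value=80)
old = get("balance", ts=15)   # a reader "in the past" sees 100
new = get("balance", ts=25)   # a current reader sees 80
```

Keeping old versions around is what lets a transaction read a consistent snapshot while concurrent writers keep committing, without the two blocking each other.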

MongoDB

MongoDB (from humongous) is a document-oriented database. MongoDB is schemaless and uses JSON-like documents. It supports ad hoc queries, indexes, file storage, server-side JavaScript execution and complex aggregation queries, replication and sharding.
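The schemaless, JSON-like document model with ad hoc queries can be sketched as dicts filtered by a set of field conditions. The collection and field names below are invented, and the find() here only mimics the shape of a document query filter, not MongoDB's driver API.

```python
# Sketch of ad hoc queries over schemaless JSON-like documents.
users = [
    {"name": "alice", "age": 34, "tags": ["admin"]},
    {"name": "bob",   "age": 27},                    # no "tags" field: fine
    {"name": "carol", "age": 41, "tags": ["ops"]},
]

def find(collection, query):
    # A document matches if every queried field equals the given value;
    # documents may freely omit fields other documents have.
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in query.items())]

admins = find(users, {"tags": ["admin"]})
```

Because documents need not share a schema, adding a new field to new records requires no migration of old ones; queries simply don't match documents that lack the field.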

Influx DB

InfluxDB is a time series, metrics, and analytics database. It's written in Go and has no external dependencies; once you install it, there's nothing else to manage (such as Redis, ZooKeeper, or HBase). InfluxDB is targeted at use cases for DevOps, metrics, sensor data, and real-time analytics. It arose from its creators' need for such a database in several previous products they had built.

Language: Go
Languages supported: Go Node Ruby HTTP

Accumulo

Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, ZooKeeper, and Thrift. Apache Accumulo features a few novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process. Other notable improvements and features are described in the project documentation.

Language: Java
Languages supported: SQL

Atlas

Atlas was developed by Netflix to manage dimensional time series data for near real-time operational insight. Atlas features in-memory data storage, allowing it to gather and report very large numbers of metrics very quickly. Atlas captures operational intelligence: whereas business intelligence is data gathered for the purpose of analyzing trends over time, operational intelligence provides a picture of what is currently happening within a system. Atlas was built because the existing systems Netflix used for operational intelligence could not cope with the increase in metrics as it expanded its operations in the cloud. In 2011, Netflix was monitoring 2 million metrics related to its streaming systems; by 2014, it was at 1.2 billion metrics, and the numbers continue to rise. Atlas is designed to handle this large quantity of data and to scale with the hardware used to analyze and store it.

Company: Netflix Language: Scala
Languages supported: RPN expressions SQL

Druid

Druid is a distributed, column-oriented, real-time analytics data store that is commonly used to power exploratory dashboards in multi-tenant environments. Druid excels as a data warehousing solution for fast aggregate queries on petabyte sized data sets. Druid supports a variety of flexible filters, exact calculations, approximate algorithms, and other useful calculations. Druid can load both streaming and batch data and integrates with Samza, Kafka, Storm, and Hadoop among others.

Language: Java
Languages supported: Java SQL-Like