Big data analytics is network intensive because it runs on a cluster of nodes. Due to the high volumes of data exchanged between nodes in the IBM® BigInsights® cluster, network isolation is vital for the following reasons:
- To prevent sniffing attacks
- To reduce network congestion so that the corporate network is not affected by the big data cluster traffic
Isolating the cluster network also gives administrators greater flexibility in enforcing access to the cluster.
This IBM® Redbooks® Analytics Support web doc serves as a guide for system implementers who are creating a secure zone for an IBM BigInsights cluster by providing an example of current industry practices. This document applies to IBM BigInsights Version 4.2 and later.
IBM BigInsights cluster network deployment architecture
Figure 1 highlights one of the industry practices for network topology for IBM® BigInsights® clusters.
Figure 1. High-level IBM BigInsights network architecture
Users can access management nodes from the corporate network only after they authenticate with the corporate LDAP. Inbound traffic is encrypted and controlled by a firewall. Only the ports for cluster administration (Ambari), the reverse proxy (Knox), JDBC access (BigSQL and Hive), and SSH are open to users. Outbound traffic is not restricted. After users log in to a management node, they can connect via Secure Shell (SSH) to the data nodes on the private network. All inbound traffic from the corporate network to data nodes is blocked. Data nodes are connected only to the private network (the cluster data network).
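The inbound policy described above can be sketched as a small set of iptables rules. This is an illustration only, under several assumptions that are not stated in this document: that the firewall is implemented with iptables on the management node itself, that eth0 is the public interface, and that the services listen on their commonly used default ports (22 for SSH, 8080 for Ambari, 8443 for Knox, 10000 for Hive, 32051 for BigSQL). Verify the ports against your cluster configuration before using anything like this. The script only prints the rules so that they can be reviewed first.

```shell
#!/bin/sh
# Sketch only: prints iptables rules that mirror the inbound policy.
# PUBLIC_IF and ALLOWED_PORTS are illustrative names, not product settings.
PUBLIC_IF="${PUBLIC_IF:-eth0}"

# Assumed default ports: SSH, Ambari, Knox, Hive, BigSQL. Verify on your cluster.
ALLOWED_PORTS="22 8080 8443 10000 32051"

inbound_rules() {
    # Allow replies to connections that the management node initiated,
    # because outbound traffic is not restricted.
    echo "/sbin/iptables -A INPUT -i $PUBLIC_IF -m state --state RELATED,ESTABLISHED -j ACCEPT"
    # Open only the user-facing service ports.
    for port in $ALLOWED_PORTS; do
        echo "/sbin/iptables -A INPUT -i $PUBLIC_IF -p tcp --dport $port -j ACCEPT"
    done
    # Drop everything else that arrives on the public interface.
    echo "/sbin/iptables -A INPUT -i $PUBLIC_IF -j DROP"
}

# Dry run: print the rules for review instead of applying them.
inbound_rules
```

To apply the rules after review, the printed output can be piped to sh as root. Rule order matters: the final DROP must come after the ACCEPT rules.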
Management nodes in the cluster have two network interfaces:
- One of the interfaces is connected to the corporate network (also called a public network).
- The other interface is connected to the private network (sometimes referred to as a data network).
All data traffic between the management nodes and the data nodes is exchanged through the private network only. The cluster therefore has dedicated bandwidth, which improves performance, and its traffic is kept off the corporate network, which improves security.
Setting up the data nodes to access external data sources with Apache Sqoop and similar tools
One of the use cases for big data analytics is to offload data from the organization's relational database management system (RDBMS) to Apache Hadoop, either for archival or for running analytics at large scale. Tools such as Apache Sqoop and IBM Fluid Query are used to import data from external sources. These tools launch MapReduce jobs, which read the data from external sources in parallel.
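As an illustration of such an import, the following sketch assembles a Sqoop command line that reads one table with four parallel map tasks. The JDBC URL, user name, password file, table name, and target directory are all hypothetical placeholders. Because each map task runs on a data node and opens its own connection to the database, the data nodes need a network path to the external source.

```shell
#!/bin/sh
# Sketch only: every connection detail below is a hypothetical placeholder.
JDBC_URL="jdbc:db2://rdbms.example.com:50000/SALES"
DB_USER="dbuser"
PASSWORD_FILE="/user/biadmin/.db.password"
TABLE="ORDERS"
TARGET_DIR="/user/biadmin/archive/orders"
MAPPERS=4   # number of parallel map tasks; each opens its own database connection

# Build the sqoop import command line from the settings above.
sqoop_import_command() {
    echo "sqoop import" \
        "--connect $JDBC_URL" \
        "--username $DB_USER" \
        "--password-file $PASSWORD_FILE" \
        "--table $TABLE" \
        "--target-dir $TARGET_DIR" \
        "--num-mappers $MAPPERS"
}

# Dry run: print the command for review.
sqoop_import_command
```

With more than one mapper, Sqoop splits the work on the table's primary key unless a different column is named with --split-by.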
In this scenario, port forwarding must be enabled in the firewall so that the data nodes can read from external sources, with their traffic forwarded through the management nodes.
Run the following commands as root on the management nodes to enable port forwarding between data nodes and management nodes. Data nodes can then initiate communication to servers outside of the private network and receive data, but external servers cannot access the internal network.
echo 1 > /proc/sys/net/ipv4/ip_forward
/sbin/iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
/sbin/iptables -A FORWARD -i eth0 -o eth1 -m state --state RELATED,ESTABLISHED -j ACCEPT
/sbin/iptables -A FORWARD -i eth1 -o eth0 -j ACCEPT
These commands assume eth0 is the public interface. The eth1 traffic is routed outside of the network as though it were coming from the management node. Adjust the interface names (eth0 and eth1) to match your environment. After the data is completely imported, port forwarding can be turned off.
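One way to turn port forwarding off again is to reverse each setup command: write 0 to ip_forward and delete each iptables rule with -D in place of -A. The following is a minimal sketch under the same interface assumptions (eth0 public, eth1 private). The variable names PUBLIC_IF, PRIVATE_IF, and APPLY are illustrative only. By default the script prints the commands without running them.

```shell
#!/bin/sh
# Sketch only: undo the port-forwarding setup after the import has finished.
# PUBLIC_IF, PRIVATE_IF, and APPLY are illustrative names; the interfaces
# must match the ones used when the rules were added.
PUBLIC_IF="${PUBLIC_IF:-eth0}"
PRIVATE_IF="${PRIVATE_IF:-eth1}"

# Print the teardown commands. -D deletes the rule that -A appended,
# so each rule specification must match the original exactly.
teardown_commands() {
    cat <<EOF
echo 0 > /proc/sys/net/ipv4/ip_forward
/sbin/iptables -t nat -D POSTROUTING -o $PUBLIC_IF -j MASQUERADE
/sbin/iptables -D FORWARD -i $PUBLIC_IF -o $PRIVATE_IF -m state --state RELATED,ESTABLISHED -j ACCEPT
/sbin/iptables -D FORWARD -i $PRIVATE_IF -o $PUBLIC_IF -j ACCEPT
EOF
}

# Dry run by default. Set APPLY=1 and run as root to execute the commands.
if [ "${APPLY:-0}" = "1" ]; then
    teardown_commands | sh
else
    teardown_commands
fi
```

Changes made with echo and iptables in this way are not persistent across reboots unless they are saved separately, so a restart of the management node also returns it to its stored configuration.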
Related publications
For more information, see the following web page:
IBM BigInsights V4.2 documentation
https://ibm.biz/BdrnVB