Carlos Aguni

Highly motivated self-taught IT analyst. Always learning and ready to explore new skills. An eternal apprentice.


HDFS to S3

20 May 2022

Install HDFS

yum -y install java-1.8.0-openjdk java-1.8.0-openjdk-devel
curl -O https://dlcdn.apache.org/hadoop/common/hadoop-3.3.2/hadoop-3.3.2.tar.gz
tar xzvf hadoop-3.3.2.tar.gz 
export HADOOP_CLASSPATH=/root/hadoop-3.3.2/share/hadoop/tools/lib/*
export HADOOP_HOME="/root/hadoop-3.3.2"
export PATH="${HADOOP_HOME}/bin:$PATH"
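
A quick sanity check that the exports took effect; hadoop version should print the 3.3.2 build info:

java -version
hadoop version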

https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/SingleCluster.html

The hdfs CLI lives at <hadoop_home>/bin/hdfs; the daemon start/stop scripts are under <hadoop_home>/sbin/.

<hadoop_home>/etc/hadoop/core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

<hadoop_home>/etc/hadoop/hdfs-site.xml

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>
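
The start scripts use ssh to launch the daemons, so passwordless ssh to localhost has to work first. This setup is taken from the SingleCluster guide linked above:

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys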

  1. Format the filesystem:

hdfs namenode -format

  2. Start the NameNode and DataNode daemons: start-dfs.sh

    If you run the daemons as root, start-dfs.sh aborts because the HDFS daemon users are not defined (see the thread below). Export them, or add them to <hadoop_home>/etc/hadoop/hadoop-env.sh:

    https://stackoverflow.com/questions/48129029/hdfs-namenode-user-hdfs-datanode-user-hdfs-secondarynamenode-user-not-defined

     export HDFS_NAMENODE_USER="root"
     export HDFS_DATANODE_USER="root"
     export HDFS_SECONDARYNAMENODE_USER="root"
     export YARN_RESOURCEMANAGER_USER="root"
     export YARN_NODEMANAGER_USER="root"
    
  3. Browse the web interface for the NameNode; by default it is available at http://localhost:9870

  4. Make the HDFS directories required to execute MapReduce jobs:

hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/<username>

  5. Copy a file into HDFS:

hdfs dfs -put <file> /user/<username>/
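
Once the daemons are up, jps (bundled with the JDK) should list NameNode, DataNode and SecondaryNameNode, and the new directory should be visible:

jps
hdfs dfs -ls /user/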

Copy from HDFS to S3 without writing temp files

https://stackoverflow.com/questions/67673048/is-it-possible-to-write-directly-to-final-file-with-distcp

hadoop distcp -direct hdfs://nn1:8020/datasets/set1 s3a://bucket/datasets/set1
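
distcp also needs S3 credentials; the hadoop-aws and AWS SDK jars are already picked up through the HADOOP_CLASSPATH export from share/hadoop/tools/lib. A minimal sketch passing credentials inline via the standard fs.s3a properties, assuming the usual AWS environment variables are set (the namenode host and bucket name are placeholders):

hadoop distcp \
  -Dfs.s3a.access.key="$AWS_ACCESS_KEY_ID" \
  -Dfs.s3a.secret.key="$AWS_SECRET_ACCESS_KEY" \
  -direct \
  hdfs://nn1:8020/datasets/set1 s3a://bucket/datasets/set1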