This notebook was prepared by Donne Martin. Source and license info is on GitHub.

HDFSΒΆ

Running HDFS CommandsΒΆ

The hdfs command is the main entry point to the Hadoop Distributed File System CLI. Running it without arguments displays the available subcommands: dfs (file system shell), namenode, datanode, fsck (filesystem check), and others. HDFS is the default storage layer for Hadoop and Spark clusters, providing fault-tolerant distributed storage by replicating data blocks across multiple nodes (default replication factor of 3).

!hdfs
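As a back-of-the-envelope illustration (plain Python arithmetic, not an HDFS API) of what the default replication factor of 3 means for disk usage:

```python
# Illustrative arithmetic: with the default replication factor of 3
# (the dfs.replication setting), every byte written is stored three times.
REPLICATION = 3

logical_size = 10 * 1024**3           # a 10 GiB dataset
raw_disk_used = logical_size * REPLICATION

print(raw_disk_used / 1024**3)        # 30.0 (GiB of cluster disk consumed)
```

This is why capacity planning for HDFS clusters typically multiplies the expected data volume by the replication factor.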

HDFS File System Shell (FsShell)ΒΆ

hdfs dfs (also aliased as hadoop fs) opens the file system shell, which provides Unix-like commands (-ls, -put, -get, -cat, -mkdir, -rm) for interacting with HDFS. Running it without arguments displays the full list of supported operations. These commands work not only with HDFS but also with other Hadoop-compatible filesystems like S3 (via s3a:// prefix) and local filesystem (via file:// prefix).

!hdfs dfs

Listing the User’s Home DirectoryΒΆ

hdfs dfs -ls without a path argument lists the contents of the current user’s HDFS home directory (typically /user/<username>). The output is similar to Unix ls -l, showing permissions, owner, group, file size, modification date, and path. This is the first command you would run to explore what data is available in your HDFS workspace.

!hdfs dfs -ls

Listing the HDFS Root DirectoryΒΆ

hdfs dfs -ls / lists the contents of the HDFS root directory, showing all top-level directories such as /user, /tmp, /apps, and any other directories created by administrators or applications. This gives you an overview of the cluster’s directory structure and is useful for understanding how data is organized across different teams and applications.

!hdfs dfs -ls /

Uploading a Local File to HDFSΒΆ

hdfs dfs -put copies a file from the local filesystem to HDFS. The first argument is the local source path, and the second is the HDFS destination path (relative to your home directory). The file is split into blocks (default 128 MB), and each block is replicated across multiple DataNodes for fault tolerance. Use -put -f to overwrite an existing file.

!hdfs dfs -put file.txt file.txt
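The block-splitting arithmetic can be sketched in plain Python (an illustrative helper, not part of any HDFS API; `BLOCK_SIZE` mirrors the `dfs.blocksize` default):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # mirrors the dfs.blocksize default (128 MiB)

def block_count(file_size_bytes):
    """Illustrative helper: how many HDFS blocks a file of this size
    occupies (a file smaller than one block still occupies one block)."""
    return max(1, math.ceil(file_size_bytes / BLOCK_SIZE))

print(block_count(1))                  # 1
print(block_count(200 * 1024 * 1024))  # 2 (one full 128 MiB block + a 72 MiB block)
print(block_count(1024**3))            # 8
```

Note that the last block of a file only consumes as much disk as it actually contains, but each block still costs one metadata entry on the NameNode, which is why HDFS prefers a few large files over many small ones.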

Reading File Contents from HDFSΒΆ

hdfs dfs -cat prints the entire contents of an HDFS file to standard output, similar to the Unix cat command. This is useful for inspecting small files or piping output to other commands. For large files, use -cat with | head or | tail to view only a portion, as reading gigabytes of data to your terminal would be impractical.

!hdfs dfs -cat file.txt

Viewing the End of a FileΒΆ

Piping hdfs dfs -cat through tail -n 10 displays only the last 10 lines. This is essential for quickly checking the most recent entries in log files or verifying that data was written completely. Note that HDFS still reads the entire file and streams it through the pipe; the built-in hdfs dfs -tail command prints just the last kilobyte of a file directly. For truly efficient access to the end of very large files, consider storing data in a format that supports random access, such as Parquet.

!hdfs dfs -cat file.txt | tail -n 10
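The keep-only-the-last-N-lines idea can be sketched locally with Python's collections.deque; `tail_lines` and the simulated file below are illustrative, not part of any HDFS API:

```python
import io
from collections import deque

def tail_lines(stream, n=10):
    """Illustrative sketch of what `tail -n` does: stream through the
    input, keeping at most the last n lines in memory at any time."""
    return list(deque((line.rstrip("\n") for line in stream), maxlen=n))

# Simulate streaming a 100-line file.
fake_file = io.StringIO("\n".join(f"line {i}" for i in range(1, 101)))
last = tail_lines(fake_file, n=10)

print(last[0])   # line 91
print(last[-1])  # line 100
```

The deque's maxlen bound is what makes this memory-efficient: the whole file is read, but only ten lines are ever held at once, exactly like the pipe into tail.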

Viewing All Files in a DirectoryΒΆ

Using a wildcard (*) with -cat concatenates the contents of all files in the directory and pipes the output through less for paginated viewing. This is useful when data is split across multiple part files (a common pattern in Hadoop/Spark output). The less pager allows you to scroll through the output interactively.

!hdfs dfs -cat dir/* | less
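A local sketch of the same concatenate-all-part-files pattern, using a temporary directory with Hadoop-style part-00000 file names (the directory layout and contents here are made up for illustration):

```python
import glob
import os
import tempfile

# Fake a Hadoop-style output directory containing numbered part files.
outdir = tempfile.mkdtemp()
for i, chunk in enumerate(["alpha\n", "beta\n", "gamma\n"]):
    with open(os.path.join(outdir, f"part-{i:05d}"), "w") as f:
        f.write(chunk)

# Concatenate every part file in name order, as `-cat dir/*` would.
combined = ""
for path in sorted(glob.glob(os.path.join(outdir, "part-*"))):
    with open(path) as f:
        combined += f.read()

# combined now holds the contents of all three part files in order.
```

Sorting the matched paths matters: the zero-padded part numbers guarantee that lexicographic order matches the numeric order of the output partitions.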

Downloading a File from HDFSΒΆ

hdfs dfs -get copies a file from HDFS to the local filesystem, the reverse of -put. The first argument is the HDFS source path and the second is the local destination. This is how you pull processed results or trained models from the cluster to your local machine for analysis, visualization, or deployment.

!hdfs dfs -get file.txt file.txt

Creating a Directory on HDFSΒΆ

hdfs dfs -mkdir creates a new directory on HDFS. Add the -p flag to create parent directories recursively (like mkdir -p in Unix). Organizing data into well-structured directories is important for data management on shared clusters, as it helps teams find and manage their datasets and enables efficient partition-based data access patterns.

!hdfs dfs -mkdir dir
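The -p flag's create-parents-and-tolerate-existing behavior corresponds to os.makedirs with exist_ok=True in Python; a local sketch (the nested layout is made up for illustration):

```python
import os
import tempfile

base = tempfile.mkdtemp()
nested = os.path.join(base, "data", "2024", "01")  # made-up partition layout

# Like `hdfs dfs -mkdir -p`: create all missing parents in one call
# and don't fail if the directory already exists.
os.makedirs(nested, exist_ok=True)
os.makedirs(nested, exist_ok=True)  # second call is a no-op, not an error

print(os.path.isdir(nested))  # True
```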

Recursively Deleting a DirectoryΒΆ

hdfs dfs -rm -r deletes a directory and all of its contents from HDFS, including subdirectories. Use with caution – HDFS does not have a recycle bin by default (though some distributions enable a trash feature). The -r flag is required for directories; without it, -rm only works on individual files. This operation also frees up the disk space consumed by all replicas of the deleted blocks.

!hdfs dfs -rm -r dir
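Locally, the recursive-delete semantics match Python's shutil.rmtree; a sketch using a throwaway temporary directory:

```python
import os
import shutil
import tempfile

# Build a throwaway nested directory tree with a file in it.
base = tempfile.mkdtemp()
nested = os.path.join(base, "dir", "subdir")
os.makedirs(nested)
open(os.path.join(nested, "data.txt"), "w").close()

# Like `hdfs dfs -rm -r dir`: removes the directory, every subdirectory,
# and every file beneath it. There is no undo, so check the path first.
shutil.rmtree(os.path.join(base, "dir"))

print(os.path.exists(os.path.join(base, "dir")))  # False
```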

Referencing HDFS Files in SparkΒΆ

When passing file paths to Spark’s sc.textFile(), use the full HDFS URI format (hdfs://host:port/path) to read from HDFS explicitly. If no scheme is specified, Spark uses the default filesystem configured in core-site.xml (which is usually HDFS on a cluster). Paths without a leading / are relative to the user’s HDFS home directory. This integration is what makes Spark and HDFS a powerful combination for large-scale data processing.

data = sc.textFile("hdfs://hdfs-host:port/path/file.txt")
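The structure of such a URI can be inspected with Python's urllib.parse; hdfs-host and 8020 below are placeholder values (8020 is a common NameNode RPC port, but clusters vary):

```python
from urllib.parse import urlparse

# "hdfs-host" and 8020 are placeholders; substitute your NameNode's address.
uri = urlparse("hdfs://hdfs-host:8020/path/file.txt")

print(uri.scheme)  # hdfs            -> selects the filesystem implementation
print(uri.netloc)  # hdfs-host:8020  -> the NameNode to contact
print(uri.path)    # /path/file.txt  -> absolute path within HDFS
```

This is the same scheme/authority/path decomposition Hadoop performs when it resolves a path: the scheme picks the filesystem (hdfs, s3a, file), the authority names the service, and the path locates the file within it.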