If you are working with EMR or Hadoop, the following filesystem commands will come in handy.
List the contents of a directory
hadoop fs -ls folderPath
Example:
To list the contents of the folder /input/data on HDFS
hadoop fs -ls /input/data
To list all the files recursively in all subfolders
hadoop fs -ls -R /input/data
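If you only need a summary rather than a full listing, two related switches may also be handy here: -count prints the number of directories, files, and total bytes under a path, and -du -h prints per-entry sizes in human-readable units.
hadoop fs -count /input/data
hadoop fs -du -h /input/data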
To remove a file
hadoop fs -rm /input/data/file1.txt
To remove a folder and all of its contents
hadoop fs -rm -r -f /input/data
Note that, unlike rm on Linux, it does not accept the switches combined as ‘-rf’:
-rm: Illegal option -rf Usage: hadoop fs [generic options] -rm [-f] [-r|-R] [-skipTrash]...
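The usage message above also lists a -skipTrash option. When HDFS trash is enabled, deleted paths are normally moved to a trash directory first; adding -skipTrash removes the path immediately and frees the space right away.
hadoop fs -rm -r -f -skipTrash /input/data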
Copy a file from the Hadoop filesystem to the local filesystem
hadoop fs -copyToLocal /input/data/file1.txt /local/file1.txt
Copy a folder from the Hadoop filesystem to the local filesystem
hadoop fs -copyToLocal /input/data/ /local/
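As a shorter alternative, -get and -put do the same thing as -copyToLocal and -copyFromLocal (the latter is shown next), for example:
hadoop fs -get /input/data/file1.txt /local/file1.txt
hadoop fs -put /local/file2.txt /input/data/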
Copy a file from the local filesystem to the Hadoop filesystem
hadoop fs -copyFromLocal /local/file2.txt /input/data/
Some filesystem-related exceptions you might encounter when running jobs on EMR
FileAlreadyExistsException
Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory hdfs://10.88.119.14:9000/input/data already exists
It’s trying to create the output folder, but one with the same name already exists. This usually means the same job was run previously, so the old output needs to be cleaned up before rerunning. You can use the command hadoop fs -rm -r -f (see the example above) to delete the folder. Make sure to keep a copy of the old output in case you need it later.
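For example, assuming the conflicting output is the /input/data folder from the exception above, one way to keep a copy is to move it aside under an illustrative backup name before rerunning the job (a rename is cheap in HDFS, so this is faster than copying):
hadoop fs -mv /input/data /input/data_old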
Is there any way to open a file directly from the Hadoop cluster without copying it to the local filesystem, using vim or something similar?
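One option, although it is not quite vim: hadoop fs -cat streams a file to stdout, so you can pipe it through a pager or grep without pulling it down first, and hadoop fs -tail prints the last kilobyte of a file.
hadoop fs -cat /input/data/file1.txt | less
hadoop fs -tail /input/data/file1.txt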