7 Most Used Command Line Utilities in Data Science

Most data science development is done on Linux-based operating systems. There are many Linux command line shells, but Bash is the most widely used because it ships as the default on most Linux distributions. Bash comes with some very useful utilities for basic to advanced data inspection, manipulation, and processing tasks, so it is worth being familiar with them from a data science perspective.

1- cat
The name “cat” comes from “concatenate”. It is probably one of the most used Bash commands in data science. The cat command outputs file contents to standard output. At a more advanced level, data scientists use cat for concatenating files, appending one file to another, numbering file lines, and so on.
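A few typical invocations are shown below; the file names are illustrative.

  $ cat data.csv                        # print a file to standard output
  $ cat part1.csv part2.csv > all.csv   # concatenate two files into one
  $ cat new_rows.csv >> all.csv         # append a file to another
  $ cat -n data.csv                     # number the output lines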

2- wc
wc is used to calculate word counts, line counts, byte counts, and related information about text or data files. The default output of wc, run without options, is a single row containing, from left to right, the following fields (see the example after the list):

  • number of lines,
  • number of words,
  • number of bytes, and
  • file name(s).
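For example, assuming a text file named data.txt:

  $ wc data.txt      # lines, words, bytes, and the file name
  $ wc -l data.txt   # line count only, handy for counting records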

3- wget
Web scraping and downloading files from the Internet are quite common tasks in data science. The wget command is used to download files from remote locations on the Internet.
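A minimal example, using an illustrative URL:

  $ wget https://example.com/dataset.csv               # download to the current directory
  $ wget -O data.csv https://example.com/dataset.csv   # save under a chosen name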

4- head
The head command outputs the first few lines of a file to standard output, ten by default. The number of displayed lines can be set with the -n option. Bash environments also provide a similar command, tail, which outputs the last n lines of a file.
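For example, with an illustrative file:

  $ head data.csv        # first 10 lines (the default)
  $ head -n 5 data.csv   # first 5 lines
  $ tail -n 5 data.csv   # last 5 lines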

5- find
The find command is used to search the file system for particular files. Search criteria such as file name, type, and modification time can be specified with options.
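Two common searches, with illustrative paths and patterns:

  $ find . -name "*.csv"       # all CSV files under the current directory
  $ find . -type f -mtime -1   # regular files modified within the last day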

6- cut
The cut command is used for slicing sections out of each line of a file. The slices can be selected by byte position, character position, or delimited field. cut is particularly useful for extracting columnar data from CSV files.
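For example, on a comma-separated file (the file name and column numbers are illustrative):

  $ cut -d ',' -f 2 data.csv     # extract the second column
  $ cut -d ',' -f 1,3 data.csv   # extract the first and third columns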

7- uniq
uniq filters text to standard output by suppressing identical consecutive lines. Because it only compares adjacent lines, the input usually needs to be sorted before applying uniq. Options allow it to output only the duplicate lines (-d) or to prefix each line with its number of occurrences (-c).
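Typical pipelines, assuming a file named data.txt:

  $ sort data.txt | uniq      # drop duplicates (sort first, since uniq only compares neighbors)
  $ sort data.txt | uniq -c   # prefix each line with its occurrence count
  $ sort data.txt | uniq -d   # show only lines that appear more than once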
