Bash in VS Code

Bourne-Again SHell
Author

Daniel Enriquez-Vera

Published

December 11, 2024

1. Unix Facts

A. Basics and definitions

  • Most bioinformatics pipelines rely on tools (awk, bash, perl, etc.) that come from the UNIX world; for simple text processing they are often faster than R or Python.
  • Computer clusters usually run Linux (an open-source operating system).
  • GUI: Graphical User Interface (classic interface)
  • Shell: Command line interface (terminal)
  • On Windows: VirtualBox is an option
  • Flavours of Unix: Ubuntu (Linux), macOS

(simple) Shell << Python (advanced)

Steps for Linux installation in Windows:

  • Open "Turn Windows features on or off"
  • Check "Windows Subsystem for Linux"

Types:

  • Bourne shell (sh) - old stuff
  • Korn shell (ksh) - old stuff
  • Bourne-again shell (bash) - Linux
# = 1. Which shell are we using?
echo $0
# = 2. What shells have we available?
cat /etc/shells
# = 3. Where is bash located?
which bash
# = 4. Create an empty file
touch peru.sh
# = 5. To check my file permissions
ls -al
# = 6. System variables (usually uppercase)
echo $BASH
echo $BASH_VERSION
echo $HOME
echo $PWD

B. Linux file system

Tree file system: path starts with a forward slash /

  • Absolute: /starts/from/the/root
  • Relative: to the working directory:
    • . (current)
    • ~ (home); cd with no argument also goes home
    • .. (one level up), ../.. (two levels up)
    • cd - (go to the previous directory and print it)
    • ~- (previous directory)

Tip: To know where I am: pwd or readlink -f .

Linux Hierarchical structure

C. Command Syntax

command [options] [file or arguments]

  • prompt: [user]$
  • STDIN: Standard input or 0
  • STDOUT: Standard output or 1 (redirect to a file with >)
  • STDERR: Standard error or 2 (2> logfile for errors only, command &> logfile for both output and errors)
  • Basic commands: ls, cd, cp, mv and mkdir -p
  • Be careful with spaces, spelling and case; always use auto-completion (TAB key)
  • More information: man command, --help, whatis
  • Example: htlvlab\$ cat -n xd.txt
  • For intermediate file creation: program1 input.txt | tee intermediate-file.txt | program2 > results.txt
# ls [option] [file]
ls -a #hidden files
ls -t #sort by time; -l details; -h human-readable sizes
ls -l -a #long listing, including hidden files
ls -tlh
ls -l | grep -Ev '^d' #showing only files
ls -l | grep -E '^d' # only folders

# it is possible to combine
ls -la 

# use tree to see all the structure
tree -a

Practice

# Try the following
date
echo hello world! # redirect with > (overwrite) or >> (append)
pwd 
cd
man ls
cat file.txt # show the content; with several files, concatenate them

# remove - BE CAREFUL!
rm -rf folder
rmdir #for directories (empty)

Downloading

wget http://address

D. Wildcards

# * - any number of any characters
ls *.xlsx
# ? - exactly one character (in regular expressions, . plays this role)
ls ?asta 
# Character ranges and brace expansion
ls [a-z]*.fasta
ls {abc,bcd}.fasta

Practice

Usual project organization: {bin,data,results,manuscripts}

# Create a whole folder structure, avoid spaces
mkdir -p project_htlv/{raw_data,metadata,curated_data,output}

mkdir results-$(date +%F)

Wildcard list

[:alpha:] any upper/lowercase letter or [a-zA-Z]
[:digit:] any digit or [0-9]
[:alnum:] any letter or digit or [a-zA-Z0-9]
[:upper:] uppercase letter or [A-Z]
[:lower:] lowercase letter or [a-z]
[:blank:] space and tab
[:print:] printable characters
[:cntrl:] non-printable (control) characters
[:space:] white-space
[:xdigit:] hexadecimal digits
[:ascii:] ASCII characters
[:word:] alphanumeric characters including underscore (_)

'^It' #line starts with It
'[0-9]+' #line with numbers
[!a-d]* #negation (does not start with a-d)
[^A-Z] #match anything not in the set
'[[:punct:]]' #punctuation such as , or .
[abc] #a or b or c
(cat|dog) #cat or dog

Visualization

View or merge huge files: cat, zcat; also more, less, head and tail (use q to exit)

head -n 5 file.txt 
head -5 file.txt
# Example
x="Hola amiguitos!"
echo "$x" > hello.txt
cat hello.txt

## Exercise: using cat, concatenate a list of fasta files

E. System

Check jobs running

ps aux

#dynamic realtime view
top 

#kill a process
kill number
kill -9 number
kill -9 %<num>

#disk space
df -h

Clusters

#For clusters
myquota

#memory 
free -h

#who is logged
w
who -a
whoami

#uptime
uptime

#network
ifconfig

#User and passwd
adduser xxxid
adduser xxxide -g bioinf

Copy from the server

rsync
scp

F. Scripting

Every program finishes with an exit status, stored in $? (0: success, nonzero: error).

The .sh extension is used for bash script files.


echo """this is my first text,
If I write a haiku,
use triple marks""" > xd.txt; echo $?
echo "Hola, I speak Spanish"
cat -n xd.txt

Tips: Always avoid errors

  • Check that files exist.
  • Any error may cause your script to abort 😰.
  • Variable names should not start with a number.
  • Be careful with strings and spaces!

Shebang - a header to call bash

  • First line - our bash path
#!/bin/bash

All the BASH options! (Sanity check)

  • set -e: terminate the script if any command exits with a nonzero exit status.
  • set -u: abort instead of running a command that references an unset variable.
  • set -o pipefail: a pipeline returns a nonzero status if any program in it fails, not only the last one (with set -e, the script then stops).
# Why set -u matters: if VAR1 is unset, this expands to rm /NOTSET*.*
echo "rm $VAR1/NOTSET*.*"
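A minimal sketch of what set -o pipefail changes, assuming bash. The variable names status_default and status_pipefail are made up for this illustration. The pipeline false | true fails in its first command but succeeds in its last one:

```shell
# Without pipefail, a pipeline reports only the status of its last command.
status_default=0
false | true || status_default=$?

# With pipefail, any failing command makes the whole pipeline fail.
set -o pipefail
status_pipefail=0
false | true || status_pipefail=$?
set +o pipefail   # restore the default behaviour

echo "default: ${status_default}, pipefail: ${status_pipefail}"
```

The first status is 0 and the second is 1, which is why pipefail is usually combined with set -e in analysis scripts.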

My first script

  • First, create a file:
touch myscript.sh
  • Add a shebang:
#!/bin/bash
echo "I am writing this non-sense script for you - DEV" 
  • Then, make it executable:
chmod +x myscript.sh #execute permission (x) for everyone
chmod u+x myscript.sh #execute permission (x) only for the owner (u)
  • Finally: Execute it!
bash myscript.sh
# set variables
dirpath='PATH'
fasta='file'
echo ${dirpath} #always use ${}
variables=$(pwd) #no spaces around =; $( ) captures the output of a command
echo "hello ${NAME1}"
((NUM++)) #double parentheses for mathematical expressions

Nano

Control + o = save
Control + x = exit


2. Data management

A. File management

# File Size
du -sh file
du -ah file
# Copy
cp filea.txt directory/filea.txt
# Move
mv filea.txt directory/filea.txt

Compressed files

Check compressed files

zcat file.gz | less

# Check if is valid   
gzip -t file.gz

# Compress multiple files, to keep the original -k
gzip *.fa

# Decompress
gunzip file.gz
gunzip SRR*_1R2.fastq.gz

# Decompress and keep the original
gunzip -c file.gz > file

Tar

tar zxvf file.tar.gz
#z filter through gzip
#x extract
#v verbosely
#f the following is a filename
#c create an archive
#t list the contents

tar -tvf file.tar.gz # viewing without extraction

#create a tar
tar zcvf file.tar.gz file1 file2 file3 folder1

Zip files

zip file.zip file1 file2 #create the archive from file1 and file2
unzip file.zip

tar.bz2

tar -jxvf file.tar.bz2

#to create
tar -cvjSf file.tar.bz2 file1 file2 folder1 folder2

File integrity

Reason: files may be corrupted after downloading.


#integrity for data transfer checksum
md5sum *fa > MD5_checksum
#to check
md5sum -c MD5_checksum
#to compare them with the original file
diff -u file1 file2 | cat

B. File manipulation

cut, sort, uniq, wc, grep, paste

sort -n lengths.txt #sort numerically
head -n 5 lengths.txt #first 5 lines
wc *.fastq | sort -n | head -1

grep: Global regular expression print

Search inside the file for a string or pattern.

grep "Hola" file.txt | wc -l
grep -w #exact word
grep -c # count the number of matching lines
grep -n #line numbers
grep -v # lines that do not match
grep -E #extended regex: no need to escape + ? | ( ) { }; -Ev inverts the match
grep --color
grep -l #only the names of files with a match
grep -A 1 #also prints one line after each match
grep -o '[A-Z]*' # print only the matching part (here, runs of uppercase letters)
grep -i "a line" | cut -f 1-3,5 # case insensitive
paste - - - - < file |cut -f 1-2 | head -2 # merge every 4 lines into one row with 4 tab-separated columns
grep -E 'ab+c' # one or more of the preceding character: abbbbbbcd
grep -E 'ab*cd' # zero or more of the preceding character
grep -E 'ab{1,5}cd' #number of repetitions of the preceding character
'\<the\>' # word boundaries
grep -E '([0-9]+).*\1' #\1 matches again what the first group matched
'([0-9]+)_([0-9]+)_\1_\2' #both groups repeated
grep "[24]" #lines containing 2 or 4

### Practice
#fastq convert to fasta
paste - - - - <data/example.fastq |cut -f 1-2 | sed 's/^@/>/' | tr "\t" "\n"  | head -4

## More advanced: interleave paired-end reads, then split them again

paste <(paste - - - - < reads-1.fastq) \
      <(paste - - - - < reads-2.fastq) | \
      tr '\t' '\n' > reads-int.fastq
paste - - - - - - - - < reads-int.fastq \
    | tee >(cut -f 1-4 | tr '\t' '\n' > reads-1.fastq) \
    | cut -f 5-8 | tr '\t' '\n' > reads-2.fastq
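To see the paste - - - - trick in isolation, here is a small sketch that converts two made-up FASTQ records into FASTA. The function name fastq_to_fasta and the reads are hypothetical, and the sketch assumes every record has exactly four lines:

```shell
fastq_to_fasta() {
    # 1) put each 4-line record on one tab-separated row
    # 2) keep header and sequence, 3) turn @ into >, 4) back to two lines
    paste - - - - | cut -f 1-2 | sed 's/^@/>/' | tr '\t' '\n'
}

demo_fasta=$(printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' | fastq_to_fasta)
echo "$demo_fasta"
```

Because sed only looks at the start of each merged row, a quality string that begins with @ is not mistaken for a header.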

wc -l #lines 
wc -w #words 
wc -c #characters
# Translate 
echo "santhilal" | tr a-z A-Z
fastafile=xxxx.fa
grep -v "^>" $fastafile | grep --color -i "[^ATCG]"

echo "There are $(grep -c '^>' $fastafile) entries in my FASTA file."

# Examples, grep -w "...." = words of exactly 4 characters
cat H0.list | grep "^USP*" # Here '^' means 'line starts with'; P* means zero or more P
cat H0.list | grep "USP*"
cat H0.list | grep "\.[0-9]$" # $ means 'line ends with'

# real life examples
cat hg38_gene_annotation_mart_export.txt | grep -v "^#" | cut -f1 | sort | uniq | wc -l # This counts total number of unique genes in the annotation

cat hg38_gene_annotation_mart_export.txt | grep -v "^#" | cut -f7 | sort | uniq | wc -l # This counts total number of unique transcripts

cat hg38_gene_annotation_mart_export.txt | grep -v "^#" | grep -w "protein_coding" | cut -f1 | sort | uniq | wc -l # Counts only total number of protein coding genes

# Zcat to open .gz files

zcat Homo_sapiens.GRCh38.87_badStructure.table.gz | sed 's/gene_id "//' | sed 's/gene_name "//' | sed 's/gene_biotype "//' | sed 's/"//g' | sed '1i\Geneid\tGeneSymbol\tChromosome\tClass\tStrand\tLength' | sed 's/ //g' | head


# convert DNA to RNA
cat file.txt | tr T U
#tr -d 'AT' (delete)
tr '\t' ',' #tsv to csv
tr a-z A-Z
tr ACTG TGAC #complement

# convert to reverse complement
grep -o '[ACTG]*' junk.txt | tr ACGT TGCA | rev
grep '[CG]' -o file | wc -l #count C and G bases
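The reverse complement recipe can be wrapped in a small function. This is only a sketch: the name revcomp and the test sequence are invented here, and it assumes plain A/C/G/T input with one sequence per line:

```shell
revcomp() {
    # complement each base, then reverse the line
    tr ACGTacgt TGCAtgca | rev
}

echo "ATGCCG" | revcomp
```

For ATGCCG the complement is TACGGC, so the function prints CGGCAT.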

#find unique features
zcat $DATA_DIR/$GTF | grep -v '^#' | cut -f 3 | sort | uniq
#counts
zcat $DATA_DIR/$GTF | grep -v '^#' | cut -f 3 | sort | uniq -c
#first 3 exons
zcat $DATA_DIR/$GTF | grep -v '^#' | awk '$3 == "exon"' | head -3
#attributes
zcat $DATA_DIR/$GTF | grep -v '^#' | awk '$3 == "gene"' | cut -f 9 > genes.txt

find

# Find any files with "Linux" and ".Rmd" in the file names
# Other useful options: -iname, -maxdepth 1, -ctime +1, -mtime -1, -mmin -15
find . -type f -name "*Linux*.Rmd"

# Count the number of files
find . -type f | wc -l
find . -type f -size +10M
find . -maxdepth 1 -type f -size +1M

# Do something with the file or folder (d)
find . -type f -name "*fa*.pl" -exec ls -l {} +

cut

cut: choose columns from a tab-delimited file (spreadsheet): cut -f (column number to choose) file.txt

# cut extracts columns separated by a delimiter (-d)
cut -d " " -f 2 file.txt #column 2; also 1,3,5 or 1-5

uniq

Detects adjacent lines that are identical, so sort the input first.

  • uniq -c to count
  • uniq -w N uses only the first N characters to decide whether lines are equal
sort haiku.txt | uniq -c 

history # command history

# Example: first, everything step by step
ls ~ > file_list.txt
wc -l file_list.txt
rm file_list.txt

# piping the data from one command to another
ls ~ | wc -l

Piping

  • &&: Continue only if the first is completed successfully.
  • ||: Continue only if the first is completed unsuccessfully.

  • \ at the end of a line continues the command on the next line; | pipes the output to the next command.

# Follow the output as it is appended to a file
tail -f logfile

true && echo "works"
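A short sketch of && and || working together, assuming bash. The function name report_dir and the directory names are placeholders chosen for this example:

```shell
report_dir() {
    # && runs the next command only after success, || only after failure
    [ -d "$1" ] && echo "$1 exists" || echo "$1 is missing"
}

report_dir .
report_dir ./no_such_directory_here
```

This compact form is fine for one-liners; for anything longer, an if/else block is easier to read and does not run the || branch when the && command itself fails.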

##  Example
for file in ${@}
do 
    wc ${file}
done


for file in ../data/${@}
do
    output=$(basename ${file} .fastq)-wordcount.txt
    wc ${file} >> ../data/wordcount.txt
done

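mkdir dataseq

# download every accession listed, one per line, in a text file
# (accessions.txt is a placeholder name)
while read -r inputsra
do
    fasterq-dump -p --split-files "$inputsra" -O ./dataseq
done < accessions.txt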

3. Virtual Environment for Python

A virtual environment allows us to manage every project independently (avoiding version conflicts). Basically, two options:

  • Conda
  • Virtualenv or venv

Installation: (only the first time.)

  • venv: built-in tool
  • virtualenv: more advanced and efficient features.
# Tip: in a notebook cell, shell commands need a prefix (% or !)
!pip install virtualenv
!echo "Virtual environments allow every project to work with independent package versions"

A. Create a virtual environment

# Option A
# a ".venv" environment has been created (WOS)
!python -m venv .venv

# Option B
!virtualenv .venv 

# Option C
# Control + Shift + P, venv
# create environment, select interpreter

## To delete (just remove it after deactivate and select another interpreter)
!rmdir /s /q .venv

B. Activate venv

# Windows
# Select interpreter
# in the terminal .venv\Scripts\activate.ps1

## Mac
# source env_name/bin/activate

## deactivate
## deselect the interpreter

C. Install modules for Python

Now every module will be downloaded and installed separately from the global environment (a specific version can also be specified); avoid using ! or % at this step when typing directly in the terminal.

!pip install ipykernel pandas jupyter pyyaml

4. Variables

  • We use a dollar sign ($) to access a variable.
results_dir="results/"

echo $results_dir

# = To be more specific with beginning and end
echo ${results_dir}

# = Make it more robust with ""
echo "${results_dir}_abc/"

A. Variables and arguments

An editor: Yes, we can use any (nano, vim, emacs, etc).

echo '#!/bin/bash 
echo "script name: $0"
echo "first arg: $1"
echo "second arg: $2"
echo "third arg: $3" ' > args.sh

bash args.sh arg1 arg2 arg3
  • $0 = script name.
  • $1 = argument 1.
  • $# = number of arguments.
echo '#!/bin/bash 
if [ "$#" -lt 3 ] 
then
    echo "error: too few arguments, you provided $#, 3 required"
    echo "usage: script.sh arg1 arg2 arg3"
    exit 1
fi
echo "script name: $0"
echo "first arg: $1"
echo "second arg: $2"
echo "third arg: $3" ' > args.sh

Practice

bash args.sh arg1 arg2

#Passing arguments
echo $1 $2 $3

#script name
echo $0 

# = another form to pass; here index 0 is the first argument
args=("$@")
echo ${args[0]} ${args[1]} ${args[2]}  

#the same result
echo "$@"

#number of arguments
echo $# 
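The sketch below shows why "$@" is usually written inside double quotes: each argument stays intact even when it contains a space. The function name show_args and the file names are hypothetical; a function receives its own $1, $2, $# and $@ just like a script does:

```shell
show_args() {
    local args=("$@")   # copy all arguments into an array
    echo "count=$# first=${args[0]} last=${args[$#-1]}"
}

show_args "sample A.fastq" sample_B.fastq
```

With the quotes, "sample A.fastq" counts as one argument; with an unquoted $@ it would be split into two words.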

Practice

## using read
echo "Enter name:"
read name
echo "Entered name: $name"

## if multiple variables
echo "Enter names:"
read name1 name2 name3
echo "Names: $name1, $name2, $name3"

## to use the same line
read -p "username:" user_var
echo "username: $user_var"

#hide the typed input with -s (silent), combined with -p as -sp
read -p "username:" user_var
read -sp "password:" pass_var
echo
echo "username: $user_var"
echo "password: $pass_var"

Practice

This is how arrays work in bash:

# read the input words into an array with read -a
read -a names
echo "names: ${names[0]}, ${names[1]}"

# if no variable name is given, read stores the line in REPLY
read
echo "Name: $REPLY"

5. Conditionals

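An exit status of 0 means success (true); anything else means false or failure. Any command can be used directly as the condition.

if [command]
then
    [if-statement] #runs if the command is true (exit status 0)
elif [another command]
then
    [elif-statement] #optional
else
    [else-statement] #optional
fi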

Practice

  • If a pipeline is used as the condition together with set -o pipefail, any nonzero exit status in the pipeline makes the condition false and execution skips to the next block.
#!/bin/bash
if grep "pattern" some_file.txt > /dev/null 
    then
    # commands to run if "pattern" is found
    echo "found 'pattern' in 'some_file.txt'" 
fi
# We can also negate our program's exit status with !: 
if ! grep "pattern" some_file.txt > /dev/null
    then echo "did not find 'pattern' in 'some_file.txt'" 
fi
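A runnable variant of the same idea, as a sketch: grep -q replaces the redirection to /dev/null, and a temporary file stands in for some_file.txt. The function name has_pattern and the file content are invented for this example:

```shell
has_pattern() {
    # grep -q prints nothing; only its exit status is used by if
    if grep -q "$1" "$2"
    then
        echo "found"
    else
        echo "not found"
    fi
}

demo_file=$(mktemp)
printf '>seq1\nATGC\n' > "$demo_file"
has_pattern "ATG" "$demo_file"
has_pattern "TTT" "$demo_file"
rm -f "$demo_file"
```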

6. Expressions

  • An expression can be a string comparison, a numeric comparison, a file operator or a logical operator, and it is written as [ expression ]:

A. String Comparisons:

  • = compare if two strings are equal; if [ "$a" = "$b" ]
  • != compare if two strings are not equal; if [ "$a" != "$b" ]
  • -n evaluate if string length is > 0
  • -z evaluate if string length is = 0
  • == is equal to (a bash synonym of =); if [ "$a" == "$b" ]

B. Expressions for strings:

  • < is less than, in ASCII alphabetical order if [[ "$a" < "$b" ]]
  • > is greater than, in ASCII alphabetical order if [[ "$a" > "$b" ]]

Examples:

[ s1 = s2 ] # (true if s1 same as s2, else false)
[ s1 != s2 ] # (true if s1 not same as s2, else false)
[ s1 ]  # (true if s1 is not empty, else false)
[ -n s1 ] #   (true if s1 has a length greater than 0, else false)
[ -z s2 ] #  (true if s2 has a length of 0, otherwise false)
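The operators above can be combined as in the following sketch, assuming bash (the < comparison needs [[ ]]). The function name compare_strings and the codon strings are chosen only for illustration:

```shell
compare_strings() {
    local a="$1" b="$2"
    if [ -z "$a" ]; then
        echo "empty"            # first string has length 0
    elif [ "$a" = "$b" ]; then
        echo "equal"
    elif [[ "$a" < "$b" ]]; then
        echo "before"           # a sorts before b
    else
        echo "after"
    fi
}

compare_strings "ATG" "ATG"
compare_strings "ATG" "TAA"
compare_strings "" "TAA"
```

Quoting "$a" and "$b" keeps the test valid even when a variable is empty or contains spaces.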

C. Expressions for numbers

  • -eq compare if two numbers are equal
  • -ge compare if one number is >= to a number, (("$a" >= "$b"))
  • -le compare if one number is <= to a number, (("$a" <= "$b"))
  • -ne compare if two numbers are not equal
  • -gt compare if one number is > another number, (("$a" > "$b"))
  • -lt compare if one number is < another number, (("$a" < "$b"))

Practice

#Examples: 
[ n1 -eq n2 ]  #(true if n1 same as n2, else false)
[ n1 -ge n2 ]  #(true if n1 greater than or equal to n2, else false)
[ n1 -le n2 ]  #(true if n1 less than or equal to n2, else false)
[ n1 -ne n2 ]  #(true if n1 is not same as n2, else false)
[ n1 -gt n2 ]  #(true if n1 greater than n2, else false)
[ n1 -lt n2 ]  #(true if n1 less than n2, else false)
echo -e "Enter the name of the file: \c"
read file_name
if [ -e "$file_name" ] #exists?
then
    echo "$file_name found"
else
    echo "$file_name not found"
fi

D. Practice

echo -e "Enter the name of the file: \c"
read file_name
if [ -f "$file_name" ]
then
    if [ -w "$file_name" ]
    then
        echo "Type some text data, to quit press control + d"
        cat >> "$file_name" #append the typed text to the file
    else
        echo "$file_name is not writable"
    fi
else
    echo "$file_name does not exist"
fi

7. Basic operations in Bash

Operations: + - * / % (remainder after division). With expr, * must be escaped as \*.

num1=20
num2=5 #example value

echo $((num1+ num2))
echo $(expr $num1 + $num2)
echo $(($num1 + $num2))

# working with decimals: bc (or awk)
echo "1.5+2.3" | bc
echo "scale=2; 20.5/5" |bc
echo "scale=2; sqrt($num1)" | bc -l #-l loads the math library with more functions
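A compact sketch of the integer operators, with awk as the alternative for decimals mentioned above. The values 20 and 6 and all variable names are example choices, not part of the original notes:

```shell
num1=20
num2=6

sum=$((num1 + num2))                  # integer addition
remainder=$((num1 % num2))            # remainder after division
product=$(expr "$num1" \* "$num2")    # expr needs the * escaped
# Bash arithmetic is integer only, so awk handles the decimal division
ratio=$(awk -v a="$num1" -v b="$num2" 'BEGIN { printf "%.2f", a / b }')

echo "sum=$sum remainder=$remainder product=$product ratio=$ratio"
```

This prints sum=26 remainder=2 product=120 ratio=3.33; note that $((num1 / num2)) on its own would give 3.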

A. Basic operations

  • Tips: for floating numbers -> bc (basic calculator)

test command

  • Exits with either 0 or 1, supports standard comparison operators (independently)
test "ATG" = "ATG" ; echo "$?"
test "ATG" = "atg" ; echo "$?" 
test 3 -lt 1; echo "$?"
test 3 -le 3; echo "$?"

Logical operators

  • -d directory; -b block special file; -c character special file
  • -s file is not empty; -f regular file; -e file exists; -h symbolic link
  • -r readable?; -w writable?; -x executable?
test -d some_directory ; echo $? # is this a directory? 
test -f some_file.txt ; echo $? # is this a file?
test -r some_file.txt ; echo $? # is this file readable?
#test+if
if test -f some_file.txt
    then [...]
fi
if [ -f some_file.txt ]
    then [...]
fi

Operators

  • -a AND; -o OR; ! negation; () group
age=25
if [ "$age" -gt 18 ] && [ "$age" -lt 30 ]
then
    echo "valid age"
else
    echo "age not valid"
fi
### Alternatives
##  [ "$age" -gt 18 -a "$age" -lt 30 ]
##  [[ "$age" -gt 18 && "$age" -lt 30 ]]

Practice

  • Tip: Leave a space after [ and before ]
#!/bin/bash
set -e
set -u
set -o pipefail

if [ "$#" -ne 1 -o ! -r "$1" ]
  then echo "usage: script.sh file_in.txt"
  exit 1 
fi

B. Case patterns

LANG=C #set the locale so that the ranges [a-z] and [A-Z] are case sensitive
echo -e "Enter some xxxx: \c"
read value 

case $value in
    pattern 1)  #[a-z][A-Z][0-9]?(special), *unknown
        statement ;; #echo "this is the $value"
    pattern 2)
        statement ;;
        *)
        statement ;;
    ...
esac

More case patterns

#!/usr/bin/bash
echo -e "Enter a character: \c"
read value
#environment variable to set the language
LANG=C
case $value in
    [a-z] )
        echo "lower case" ;;
    [A-Z] )
        echo "upper case" ;;
    [0-9] )
        echo "digit" ;;
    ? )
        echo "special" ;;
    * )
        echo "unknown input" ;;
esac        

C. Arrays

A set of elements

#!/usr/bin/bash
os=("ubuntu" "windows" "kali") #elements are separated by spaces, not commas
os[3]="mac" #it will add mac to the array
unset os[2] #remove the third one
# gaps in the array are possible

echo "${os[@]}" #all elements printed
echo "${os[1]}" #prints windows (the second element)
echo "${!os[@]}" #prints the indices (0 1 3)
echo "${#os[@]}" #prints the length of the array
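A small sketch of how gaps behave when looping over an array, assuming bash. The array name os_list is used here only to keep the example separate from the one above:

```shell
os_list=("ubuntu" "windows" "kali")
os_list+=("mac")        # += appends at the next free index (3)
unset 'os_list[2]'      # removes kali and leaves a gap at index 2

indices="${!os_list[*]}"   # indices that are still in use
count="${#os_list[@]}"     # number of elements, not the highest index

# Looping over the indices skips the gap safely
for i in "${!os_list[@]}"
do
    echo "$i: ${os_list[$i]}"
done
```

After the unset, the indices are 0 1 3 and the count is 3, so a C-style loop from 0 to count-1 would miss "mac" and hit the empty slot.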

D. Loops

It's better to have a list of files.

Tip: avoid spaces, tabs, newlines, or special characters (*) in file names.

echo ${sample_files[@]}
echo ${sample_names[@]}

#text with 3 columns, and the third column is filenames
sample_files=($(cut -f 3 ../file.txt))
echo ${sample_files[@]}

#then use basename to remove extension
basename -s .fastq seqs/ABC001.fastq

E. While loops

n=1
while [ $n -le 10 ]
#other option (( n <= 10 ))
do
    echo "$n"
    n=$((n+1))
    #other option (( ++n  ))
    
done
  • Tip: force a pause
sleep 1 #pause for 1 second
gnome-terminal & #open a terminal
xterm & #open a terminal

Practice

#!/bin/bash

set -e
set -u
set -o pipefail

# specify the input samples file, where the third column is the path to each sample FASTQ file
sample_info=samples.txt

# create a Bash array from the third column of $sample_info
sample_files=($(cut -f 3 "$sample_info"))

for fastq_file in "${sample_files[@]}" 
do
    # strip .fastq from each file, and add suffix "-stats.txt" to create an output filename
    results_file="$(basename "$fastq_file" .fastq)-stats.txt"
    
    # run fastq_stat on a file, writing results to the filename we've created above
    fastq_stat "$fastq_file" > "stats/$results_file"
done

Practice

#!/bin/bash
set -e
set -u
set -o pipefail

# specify the input samples file, where the third column is the path to each sample FASTQ file
sample_info=samples.txt

# our reference
reference=zmays_AGPv3.20.fa

# create a Bash array from the first column, which are 
# sample names. Because there are duplicate sample names 
# (one for each read pair), we call uniq 

sample_names=($(cut -f 1 "$sample_info" | uniq))

for sample in "${sample_names[@]}"
do
    # create an output file from the sample name
    results_file="${sample}.sam"

    bwa mem "$reference" "${sample}_R1.fastq" "${sample}_R2.fastq" > "$results_file"
done

F. While read

while read p
do 
    echo $p
done < file.sh
#another option
cat file.sh | while read p
do 
    echo $p
done
#another option, for files with special characters
while IFS= read -r p
do
    echo "$p"
done < file.sh
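As a worked example of while IFS= read -r, the sketch below counts FASTA headers in a file. The function name count_sequences and the two-record file are made up for illustration:

```shell
count_sequences() {
    local n=0 line
    # IFS= keeps leading/trailing blanks, -r keeps backslashes literal
    while IFS= read -r line
    do
        case $line in
            ">"*) n=$((n + 1)) ;;   # header lines start with >
        esac
    done < "$1"
    echo "$n"
}

fasta_demo=$(mktemp)
printf '>seq1\nATGC\n>seq2\nGG CC\n' > "$fasta_demo"
count_sequences "$fasta_demo"
rm -f "$fasta_demo"
```

Feeding the loop with < file instead of cat file | keeps it in the current shell, so the counter n is still set after done.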

G. Until loop

#Runs as long as the condition is false
n=1
until [ $n -ge 10 ]
#also possible with until (( n >= 10 ))
do
    echo $n
    n=$(( n+1 ))
done

## For loop

for variable in 1 2 3 4 5 #list of values
for variable in {1..10..2} #{start..end..increment}
for variable in file1 file2 file4
for OUTPUT in $(Linux-command-here)
for (( EXP1; EXP2; EXP3 ))
for (( i=0; i<5; i++ ))
for command in ls pwd date

For loop

for item in * #everything inside the folder
do
    if [ -d "$item" ] #only directories
    then 
        echo "$item"
    fi
done

# Another example
# break gets out of the loop; continue skips to the next iteration
#!/bin/bash
for (( i=1 ; i<=10 ; i++ ))
do
    if [ $i -eq 3 -o $i -eq 6 ]
    then
        continue #skip 3 and 6
    fi
    echo "$i"
done

Select loop for easy menus

select varName in list #Presented with numbers
do 
    command1
    command2
    ..
    commandN
done


select name in mark john tom ben
do
    echo "$name selected"
done


select name in mark john tom ben
do
    case $name in
    mark)
        echo mark selected
        ;;
    john)
        echo john selected
        ;;
    *)
        echo "Error"
        ;;
    esac
done

H. Break and continue

break exits the loop prematurely; continue skips to the next iteration.

for (( i=1 ; i<=10 ; i++ ))
do
    if [ $i -gt 5 ]
    then 
        break #continue is also possible (to skip exceptions)
    fi
    echo "$i"
done
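Both keywords can appear in the same loop. In this sketch (the variable name printed is just an accumulator invented for the example), 3 is skipped with continue and the loop stops at the first value above 5:

```shell
printed=""
for (( i=1 ; i<=10 ; i++ ))
do
    if [ "$i" -eq 3 ]; then
        continue    # skip this value and go to the next iteration
    fi
    if [ "$i" -gt 5 ]; then
        break       # leave the loop completely
    fi
    printed+="$i "
done
echo "$printed"
```

The loop collects 1 2 4 5 and never reaches 6 to 10.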

8. Functions

function name(){
    Commands
}

#OR

name () {
    Commands
}

#Example
function Hello(){
    echo "Hello"
}

Practice

usage(){
    echo "you need to provide an argument : "
    echo "usage : $0 file_name"
}
is_file_exist() {
    local file="$1"
    [[ -f "$file" ]] && return 0 || return 1
} 

[[ $# -eq 0 ]] && { usage; exit 1; }

if is_file_exist "$1"
then
    echo "File found"
else
    echo "File not found"
fi

Local variables

function printx(){
    local name=$1 #local: the variable exists only inside this function
    echo "the name is $name"
}

Read-only

A read-only variable cannot be overwritten.

var=31
readonly var
## for functions
readonly -f printx

Signals and traps

trap can catch any signal except SIGKILL and SIGSTOP

trap "echo Exit signal is detected" SIGINT # or 2
echo "pid is $$"
while (( COUNT < 10 ))
do
    sleep  10
    (( COUNT ++ ))
done
exit 0
###############
kill -9 pid #SIGKILL (9) cannot be trapped
trap "echo Exit command is detected" 0 #0 (EXIT) runs when the script exits
echo "Hello world"
exit 0
###example
trap "rm -f $file && echo file deleted; exit" 0 2 15
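A common use of trap is removing temporary files no matter how the script ends. The following is a sketch under assumptions: the function name make_report is hypothetical, and its body is wrapped in ( ) so that it runs in a subshell and the EXIT trap fires when the function finishes rather than when the whole session ends:

```shell
make_report() (
    tmp_file=$(mktemp)
    # single quotes: $tmp_file is expanded when the trap runs
    trap 'rm -f "$tmp_file"' EXIT
    echo "intermediate data" > "$tmp_file"
    wc -l < "$tmp_file" > /dev/null   # stand-in for the real work
    echo "$tmp_file"                  # report which path was used
)

used_path=$(make_report)
if [ -e "$used_path" ]; then
    echo "temporary file still there"
else
    echo "temporary file cleaned up"
fi
```

In a standalone script the same trap line can be placed near the top, without the subshell, so the cleanup runs when the script exits.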

Debugging a script

bash -x script.sh

###in the shebang
#!/bin/bash -x
set -x
# ... the commands to trace (e.g. lines 1 to 10) go here ...
set +x

#### Globbing
#!/bin/bash
set -e
set -u
set -o pipefail
for file in *.fastq
do
    echo "$file: " $(bioawk -c fastx 'END {print NR}' $file)
done

Find

ls *.fq | process_fq #parsing ls output breaks on unusual file names
process_fq *.fq #limited by the maximum length of a command line

##### Find - find path expressions
find / | head 
find . -maxdepth 1 #only this directory


touch path/absolute/xbc{A,B,C}_R{1,2}.fastq
find path

##### Find expressions
find path/to/folder -name "filex*fastq" -type f
# d for directories
# f for files
find path/to/folder -type f \( -name "filex*fastq" -or -name "otrox*fastq" \)
#another alternative
find path/to/folder -type f "!" -name "filex*fastq" -and "!" -name "*temp"

## to remove these files
find path/to/folder -name "*-temp.fastq" -exec rm {} \;

XARGS

xargs reads data from stdin (input) and executes the command given as its argument one or more times, based on the input.

  • If no command is given, it uses echo.
  • It suits programs that take multiple arguments (rm, touch, mkdir).

find path/to/folder -name "*-temp.fastq" | xargs -n 1 rm
#other example
find zmays-snps/data/seqs -name "*-temp.fastq" > files-to-delete.txt
cat files-to-delete.txt
cat files-to-delete.txt | xargs rm
######
find . -name "*.fastq" | xargs basename -s ".fastq" | xargs -I{} fastq_stat --in {}.fastq --out ../summaries/{}.txt 
## Notice that the BSD xargs will only replace up to five instances of the string specified 
## by -I by default, unless more are set with -R.
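To try the -I{} pattern without any real data, the sketch below feeds two invented file names through xargs and replaces fastq_stat with echo, so it only prints what would be run. The names sampleA.fastq, sampleB.fastq and the variable planned are assumptions for this example:

```shell
planned=$(printf 'sampleA.fastq\nsampleB.fastq\n' \
    | xargs -I{} basename {} .fastq \
    | xargs -I{} echo "{}.fastq -> summaries/{}.txt")
echo "$planned"
```

Swapping echo for the real program is a cheap way to check the generated commands before launching a long run.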

XARGS

We can launch multiple background processes with Bash for loops by adding the ampersand (&) at the end of the command (e.g., for filename in *.fastq; do program "$filename" & done). xargs can also run commands simultaneously with the -P option.

find . -name "*.fastq" | xargs basename -s ".fastq" | xargs -P 6 -I{} fastq_stat --in {}.fastq --out ../summaries/{}.txt 

GNU parallel

find . -name "*.fastq" | parallel --max-procs=6 'program {/.} > {/.}-out.txt'

Make a list of files

ls -1 | sed 's/\.fastq\.gz//g' > ../PRJEB41550.txt

9. Bioinformatics software installation

bwa-mem2 and fastp >Bwa-mem2

#sudo root - be extremely careful, control + D to exit
sudo su

sudo apt-get update

# answer yes
sudo apt-get upgrade -y
sudo apt-get dist-upgrade -y
sudo apt-get autoremove -y
sudo apt-get autoclean -y


sudo apt-get install unzip
## Samtools
sudo apt-get upgrade
sudo apt-get update
sudo apt-get install -y libncurses-dev libbz2-dev liblzma-dev
sudo apt-get install libssl-dev
#### Option 2
sudo apt-get install libcurl4-openssl-dev
sudo apt-get install zlib1g-dev
#Steps: Install
sudo ...

export PATH=$PATH:/home/dev105/devapps/STAR/source

Installation

#Download samtools
wget  samtools bcftools htslib #LINK
tar xvjf samtoolsxxxx 
cd samtoolsxxx
./configure
make
sudo make install
##Export to the path
export PATH="$PATH:/home/labatl/devapps/bwa-mem2"
export PATH="$PATH:/home/labatl/devapps/htslib/htslib-1.20"
export PATH="$PATH:/home/labatl/devapps/htslib/bcftools-1.20"
export PATH="$PATH:/home/labatl/devapps/htslib/samtools-1.20"
# https://github.com/bwa-mem2/bwa-mem2
export PATH="$PATH:/home/labatl/devapps"
#Add it to
nano ~/.bashrc
## Source it
source ~/.bashrc

#Another option
nano ~/.bashrc
##Write at the end:
export PATH=$PATH:"filepath"

SED

sed filters and replaces text efficiently in a pipeline

sed 's/hello/world/' input.txt > output.txt
cat input.txt | sed 's/hello/world/' - > output.txt

  • -i: overwrite the same file
  • -n: suppress output and print only the lines specified with p: sed -n '45p' file.txt or sed -n '1p ; $p' one.txt two.txt three.txt
  • -s: treat the input files as separate instead of one continuous stream
  • -e or -f: specify the script explicitly: sed --expression='s/hello/world/' input.txt > output.txt or sed --file=myscript.sed input.txt > output.txt
  • -z: lines are separated by NUL characters instead of newlines
sed -n 1,2p Homo_sapiens.GRCh38.87_annotation.table
sed -n "1~2p" file.txt #-n do not print by default; extract only odd numbered lines
zcat $DATA_DIR/$FASTQ | sed -n '2~4p;' | head | tr -d '\n' #every 4th line starting at line 2 (the sequences)
sed -n '2000,2005p' #print lines 2000 to 2005
sed "s/[0-9][0-9]*\.[[:space:]]//g" file.txt # g global
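The three most common uses (substitute, print one line, strip a prefix) can be tried on a tiny made-up input. This is a sketch; the variable names and the three-line text are invented, and only options that work in both GNU and BSD sed are used:

```shell
sed_input=$(printf '1. hello\n2. hello hello\n3. bye\n')

# s/old/new/g replaces every match on each line
replaced=$(echo "$sed_input" | sed 's/hello/world/g')
# -n with p prints only the requested line
second_line=$(echo "$sed_input" | sed -n '2p')
# remove the leading "number. " from each line
stripped=$(echo "$sed_input" | sed 's/^[0-9][0-9]*\.[[:space:]]//')

echo "$replaced"
echo "$second_line"
echo "$stripped"
```

Without the trailing g only the first hello on each line would be replaced.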

Manual: https://www.gnu.org/software/sed/manual/sed.html
Extended REGEX: http://austingroupbugs.net/view.php?id=528

Loops

  • for is used for iterating over a fixed number of items (may be unknown at time of coding).
  • while is used for iterating until a certain condition is met
#### Loops
##### Examples of for
for file in ../data/${@}
do
    wc ${file} >> ../data/wordcount.txt
done

for file in ../data/${@}
do
    output=$(basename ${file} .fastq)-wordcount.txt 
    wc ${file} >> ../results/${output}
done

##### for loops
for i in $( ls ); do echo item: $i 
done

##### Examples of while
while true; do echo 'hello'; done

#### Conditionals
if [ "foo" = "foo" ]; then
    echo "equal"
else
    echo "different"
fi

##### Iterating with xargs: -I runs the command once per item; -0 handles blanks in names
ls *.fasta | xargs -I% sh -c 'head %; echo "\n---\n"'
#sh -c runs the command string with sh instead of bash
#https://www.baeldung.com/linux/xargs-multiple-arguments
my_variable="ggplot2"

echo "My favorite R package is ${my_variable}"
my_var="chr1"
echo "${my_var}_1.vcf.gz"

#script

#STDIN -> STDOUT+STDERR
#I/O redirections: |,<,>
ps -aef #show a list of all processes running
bash sam_run my_file.bam out_stats.txt
#!/bin/bash
samtools stats $1 > $2
#!/bin/bash
samtools stats $2 > $1 #output first

#to hack SIGPIPE error in Jupyter
cleanup () {
    :
}

trap "cleanup" SIGPIPE
#safety
set -u #unset variables as an error and exit





grep -v "^>" ../data/tb1.fasta | grep --color -i "[^ATCG]"
CCCCAAAGACGGACCAATCCAGCAGCTTCTACTGCTAYCCATGCTCCCCTCCCTTCGCCGCCGCCGACGC




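echo "Start!"

maxproc=4        #maximum number of simultaneous jobs (example value)
destdir=results  #output folder (example value)

for fname in *_R1.fastq.gz
do
    echo "${fname}"
    while [ $(jobs | wc -l) -ge "$maxproc" ]
    do
        sleep 1
    done
    base=${fname%_R1*}
    echo starting new job with ongoing=$(jobs | wc -l)
    bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" >"$destdir/${base}_R1_R2.sam" &
done
wait #wait for the remaining background jobs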



#if to check for file existence
if [ -f hello.txt ]; then
    cat hello.txt
else
    echo "no such file"
fi

if [ ! -f "data/file" ]; then
    wget url -O data/file
fi

#test for conditional evaluation
test -f file #or [ -f file ]

#Loops
for i in SRRR*;
do
    echo ${i} #inserted in a line with other characters
done
for file in copy-SRR534005*
do
    echo "$file" "${file//SRR534005/chicken}" #preview the new name
    mv "$file" "${file//SRR534005/chicken}"
done

for FILE in $(ls *ipynb); do
    echo $FILE
done

for ((i=0; i<5; i++)); do
    echo $i
done

#test -a is equivalent to &&
#test -o is equivalent to ||
#=~ matches regular expression patterns

for FILE in $(ls); do
    if [[ "${FILE}" =~ ^Bash.*sh ]]; then
        echo $FILE
    fi
done

COUNTER=10
while [ $COUNTER -gt 0 ]; do
    echo $COUNTER
    COUNTER=$(($COUNTER - 1))
done

#-lt less than
#-gt greater than
#== equality
#!= inequality

#braces in loops
for NUM in {000..005}; do
    echo mkdir EXPT-${NUM}
done
echo foo.{c,cpp,h,hp}
#loop first 10 records
LINE_NUM=0
while read LINE
do
    if [[ $LINE_NUM -lt 10 ]]; then
        echo $LINE
        (( LINE_NUM++ ))
    fi
done < genes.txt

#command expressions $(command), #basename without path, dirname
#remove the file extension Y=${x%.*}
for FILE in $(ls ~/docs/*)
do
    DIR_NAME=$(dirname $FILE)
    FILE_NAME=$(basename $FILE)
    NAME=${FILE_NAME%.*}
    NEW_DIR=$DIR_NAME/$NAME
    NEW_FILE=${NAME}-copy.txt
    mkdir -p $NEW_DIR
    cat $FILE | head -3 | tail -2  > $NEW_DIR/$NEW_FILE
done
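The name-handling pieces used in these loops (basename, dirname, ${x%.*}, ${x//a/b}) are easier to compare side by side. A sketch with a made-up path; nothing here touches the file system:

```shell
path="/home/user/docs/sample_R1.fastq.gz"   # hypothetical path

file_name=$(basename "$path")     # strip the directory
dir_name=$(dirname "$path")       # keep only the directory
no_ext=${file_name%.*}            # remove the shortest .suffix
no_all_ext=${file_name%%.*}       # remove the longest .suffix
base=${file_name%_R1*}            # remove everything from _R1 on
renamed=${file_name//sample/chicken}   # replace text inside the name

echo "$file_name | $dir_name | $no_ext | $no_all_ext | $base | $renamed"
```

A single % trims the shortest match from the end and %% the longest, which is what separates sample_R1.fastq from sample_R1.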



for FILE in $(ls $DATA_DIR/*_MA_J_S20_*gz)
do
    FIRST=$(zcat $FILE 2>/dev/null | head -1)
    echo $(basename $FILE)
    echo $FIRST
    echo
done
#script
#!/path/to/shell or bash path/file
#!/bin/bash or which bash

chmod +x /path/to/script

cat > echo.sh << 'EOF'
#!/bin/bash
# $# gives you the number of arguments from the command line
echo $# 
for ARG in "$@"; do
    echo $ARG
done
EOF


bash echo.sh
for file in *.fastq
do
    echo ${file}
done

#script with input specifications
bash wordcount.sh *R1.fastq
for file in ${@}
do
    wc ${file}
done
#structure: bin, data, results, manuscript


cat > count_lines.sh << 'EOF'
#!/bin/bash

for FILE in $(ls)
do
    wc -l "${FILE}"
done
EOF

AWK

#tutorial

cat file.txt | awk -F '\t' '$1=="Mt"' | head -3
cat file.txt | awk -F '\t' '$1=="2" {print $3}' | sort | uniq -c
cat docs/echo.txt | awk 'BEGIN {FS = ","} ; $5 == "M"' > docs/male.txt
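The filter-then-count pattern can be tested on a few invented tab-separated rows (chromosome, id, feature). This is a sketch; the rows and the variable name awk_counts are assumptions, and the last awk only tidies the uniq -c output:

```shell
awk_counts=$(printf 'Mt\tx\texon\n2\ty\tgene\n2\tz\texon\n2\tw\texon\n' \
    | awk -F '\t' '$1 == "2" {print $3}' \
    | sort | uniq -c \
    | awk '{print $2 "=" $1}')
echo "$awk_counts"
```

The pattern $1 == "2" selects the rows and the action {print $3} chooses what to output; both must sit inside the same pair of single quotes.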

Additional sources

Seq

The seq command generates a range of numbers, one per line:
seq 3      # 1 to 3
seq 2 5    # 2 to 5
seq 5 2 9  # from 5 to 9 in steps of 2 (first, increment, last)
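Because `seq` is an ordinary command, its bounds can come from variables, which makes it handy inside `$(...)` in a `for` loop. A short sketch; the variable `last` and the `sample_` prefix are hypothetical:

```shell
#!/bin/bash
last=9

# Odd numbers from 1 up to $last: first=1, increment=2, last=$last
for i in $(seq 1 2 "$last"); do
    echo "sample_$i"
done
```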

BASH tutorial - Notes

https://xie186.github.io/Novice2Expert4Bioinformatics/r-introduction.html https://gist.github.com/nathanhaigh/3521724

More information
https://github.com/raynamharris/Shell_Intro_for_Transcriptomics/tree/master
https://practicalcomputing.org/
https://explainshell.com/
https://data-skills.github.io/unix-and-bash/01-unix-git-intro/index.html

https://wikis.utexas.edu/display/bioiteam/Scott’s+list+of+linux+one-liners
https://www.tutorialspoint.com/using-gzip-and-gunzip-in-linux
https://www.thegeekstuff.com/2012/12/linux-tr-command/
https://www.gnu.org/software/gawk/manual/html_node/Comparison-Operators.html

https://genome.ucsc.edu/FAQ/FAQformat#format1

https://gist.github.com/DannyArends/04d87f5590090dfe0dc6b42e5e1bbe15 RNA Sequencing - Setup and Prerequisites https://www.youtube.com/watch?v=PlqDQBl22DI&t=14s

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19080

My index for genome STAR alignment
https://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/Human/
https://data.broadinstitute.org/Trinity/
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/

https://ubuntu.com/tutorials/command-line-for-beginners#4-creating-folders-and-files

http://linuxcommand.org/tlcl.php

https://xie186.github.io/Novice2Expert4Bioinformatics/install-bioinformatics-software-in-linux.html https://hackr.io/blog/basic-linux-commands https://youtu.be/7-G_dQr7B44 Install MUSCLE using binaries: https://youtu.be/UJx34KVLIgI Install PhyML using binaries: https://github.com/vappiah/bioinfoscr… Gene extractor script is available here https://itol.embl.de/ iTOL browser https://youtu.be/2tMQYi_12IQ Python codes explained Installing muscle https://www.drive5.com/muscle/downloads.htm

https://jlsteenwyk.com/ClipKIT/advanced/index.html https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001007

From https://cyverse-leptin-rna-seq-lesson-dev.readthedocs-hosted.com/en/latest/section-3.html


https://www.freecodecamp.org/espanol/news/grep-command-tutorial-how-to-search-for-a-file-in-linux-and-unix-with-recursive-find/
https://tldp.org/LDP/Bash-Beginners-Guide/html/sect_09_02.html
https://datacarpentry.org/wrangling-genomics/
https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands
https://sanbomics.com/2022/01/08/complete-rnaseq-alignment-guide-from-fastq-to-count-table/
https://cyverse-leptin-rna-seq-lesson-dev.readthedocs-hosted.com/en/latest/section-8.html

https://gist.github.com/darencard/e1933e544c9c96234e86d8cbccc709e0

https://www.thegeekstuff.com/2010/03/awk-arrays-explained-with-5-practical-examples/ https://notebook.community/crazyhottommy/scripts-general-use/Shell/Awk_anotates_vcf_with_bed

Another blog: https://biowize.wordpress.com/2015/03/26/the-fastest-darn-fastq-decoupling-procedure-i-ever-done-seen/

Sources for scripts
https://jeroenjanssens.com/dsatcl/
https://missing.csail.mit.edu/2020/course-shell/
https://swcarpentry.github.io/shell-novice/04-pipefilter/index.html
https://datascienceatthecommandline.com/2e/chapter-2-getting-started.html?q=stdin#combining-command-line-tools

wget http://kandurilab.org/users/santhilal/courses/UNIX/materials/course_file_repo.zip

OneDrive

wget -qO - https://download.opensuse.org/repositories/home:/npreining:/debian-ubuntu-onedrive/xUbuntu_22.04/Release.key | gpg --dearmor | sudo tee /usr/share/keyrings/obs-onedrive.gpg > /dev/null

echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/obs-onedrive.gpg] https://download.opensuse.org/repositories/home:/npreining:/debian-ubuntu-onedrive/xUbuntu_22.04/ ./" | sudo tee /etc/apt/sources.list.d/onedrive.list