Bash in VS Code

Bourne-Again SHell

Author

Daniel Enriquez-Vera

Published

December 11, 2024

1. Unix Facts

A. Basics and definitions

Most bioinformatics pipelines rely on tools (awk, bash, perl, etc.) that come from the UNIX world; for simple text processing they are often faster than R or Python.
Most computer clusters run Linux (an open-source operating system).
GUI: Graphical User Interface (classic interface)
Shell: Command line interface (terminal)
On Windows: VirtualBox is an option
Unix-like systems: Ubuntu (Linux), macOS

(simple) Shell << Python (advanced)

Steps for Linux installation in Windows:

Open "Turn Windows features on or off"
Check "Windows Subsystem for Linux"

Types:

Bourne shell (sh) - old stuff
Korn shell (ksh) - old stuff
Bourne-again shell (bash) - Linux

# = 1. Which shell are we using?
echo $0
# = 2. What shells have we available?
cat /etc/shells
# = 3. Where is bash located?
which bash
# = 4. Create an empty file
touch peru.sh
# = 5. To check my file permissions
ls -al
# = 6. System variables (usually uppercase)
echo $BASH
echo $BASH_VERSION
echo $HOME
echo $PWD

B. Linux file system

Tree file system: path starts with a forward slash /

Absolute: /starts/from/the/root
Relative: to the working directory:
- . (current)
- ~ (home), same as cd with no argument
- .. (one level up), ../.. (two levels up)
- - (previous directory; cd - also prints it)
- ~- (previous directory)

Tip: To know where I am: pwd or readlink -f .

C. Command Syntax

command [options] [file or arguments]

prompt: [user]$
STDIN: Standard input or 0
STDOUT: Standard output or 1 (command > file)
STDERR: Standard error or 2 (2> redirects errors only; command &> logfile sends both STDOUT and STDERR to logfile)
Basic commands: ls, cd, cp, mv and mkdir -p
Be careful with spaces, spelling and case; always use auto-completion (TAB key)
More information: man command, --help, whatis
Example: htlvlab$ cat -n xd.txt (cat -n numbers the output lines)
For intermediate file creation: program1 input.txt | tee intermediate-file.txt | program2 > results.txt
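
As a minimal sketch of the tee pattern above: the file contents are made up, and sort and uniq are only stand-ins for program1 and program2 (assumptions for illustration). The intermediate result is saved while the pipeline keeps running:

```bash
# Work in a scratch directory so nothing real is overwritten
cd "$(mktemp -d)"

# Hypothetical input: one sample name per line, with a duplicate
printf 'sampleB\nsampleA\nsampleB\n' > input.txt

# sort stands in for program1 and uniq for program2;
# tee writes the sorted lines to a file AND passes them down the pipe
sort input.txt | tee intermediate-file.txt | uniq > results.txt

cat intermediate-file.txt   # three sorted lines
cat results.txt             # two unique lines
```

The intermediate file is handy for debugging: if the final result looks wrong, the output of the first step is still on disk.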

# ls [option] [file]
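ls -a #include hidden files
ls -t #sort by time; -d lists directories themselves; -h human-readable sizes (with -l)
ls -l -a #long listing (permissions, owner, size, date), including hidden files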
ls -tlh
ls -l | grep -Ev '^d' #showing only files
ls -l | grep -E '^d' # only folders

# it is possible to combine
ls -la 

# use tree to see all the structure
tree -a

Practice

# Try the following
date
echo hello world! # add > file to overwrite or >> file to append
pwd 
cd
man ls
cat file.txt # show the content; with several files, concatenate them

# remove - BE CAREFUL!
rm -rf folder
rmdir folder #for directories (empty)

Downloading

wget http://address

D. Wildcards

# * - any number of any characters
ls *.xlsx
# ? - exactly one character (a dot is literal in wildcards)
ls ?asta 
# character ranges and brace expansion
ls [a-z]*.fasta
ls {abc,bcd}.fasta
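
A small sketch of how these patterns expand, using made-up file names in a scratch directory (all names here are assumptions for illustration):

```bash
# Work in a scratch directory with a few hypothetical files
cd "$(mktemp -d)"
touch abc.fasta bcd.fasta Xyz.fasta basta pasta notes.xlsx

ls *.xlsx                # notes.xlsx
ls ?asta                 # basta pasta (exactly one character before "asta")
ls {abc,bcd}.fasta       # abc.fasta bcd.fasta
ls [[:lower:]]*.fasta    # names starting with a lowercase letter

# Store an expansion in an array to count the matches
one_char=( ?asta )
echo "${#one_char[@]} names match ?asta"
```

The class [[:lower:]] is used instead of [a-z] here because, depending on the locale, a range such as [a-z] may also match uppercase letters.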

Practice

Usual project organization: {bin,data,results,manuscripts}

# Create a whole folder structure, avoid spaces
mkdir -p project_htlv/{raw_data,metadata,curated_data,output}

mkdir results-$(date +%F)

Wildcard list

[:alpha:]	any upper/lowercase letter or [a-zA-Z]
[:digit:]	any digits or [0-9].
[:alnum:]	any alphabet and digit or [a-zA-Z0-9]
[:upper:]	uppercase or [A-Z]
[:lower:]	lowercase letter or [a-z]
[:blank:]	space and tab
[:print:]	printable characters
[:cntrl:]	non-printable characters
[:space:]	white-space
[:xdigit:]	hexadecimal digits
[:ascii:]	ASCII characters
[:word:]	alphanumeric characters including underscore (_)

Inside a pattern these classes go within an extra pair of brackets, e.g. [[:digit:]].

'^It' #line starts with It
'[0-9]+' #line with one or more digits
[!a-d]* #negates the set (wildcards)
[^A-Z] #matches anything not in the set
'[[:punct:]]' #punctuation such as , or .
[abc] #a or b or c
(cat|dog) #cat or dog

Visualization

View or merge huge files: cat, zcat; also more, less, head and tail (use q to exit less or more)

head -n 5 file.txt 
head -5 file.txt

# Example
x="Hola amiguitos!"
echo "$x" > hello.txt
cat hello.txt

## Exercise: using cat, concatenate a list of fasta files

E. System

Check jobs running

ps aux

#dynamic realtime view
top 

#kill a process
kill number #by process ID (PID)
kill -9 number #force kill
kill -9 %<num> #by job number (see jobs)

#disk space
df -h

Clusters

#For clusters
myquota

#memory 
free -h

#who is logged
w
who -a
whoami

#uptime
uptime

#network
ifconfig

#User and passwd
adduser xxxid
adduser xxxide -g bioinf

Copy from the server

rsync
scp

F. Scripting

All programs exit with an exit status, stored in $? (0: success, nonzero: error)

.sh is the usual extension for bash executable files.


echo """this is my first text,
If I write a haiku,
use triple marks""" > xd.txt; echo $?
echo "Hola, I speak spanish"
cat -n xd.txt #show the file with line numbers

Tips: Always avoid errors

Check that files exist.
Any error may cause your script to abort 😰.
Variable names should not start with a number.
Be careful with strings and spaces!

Shebang - a header to call bash

First line - our bash path

#!/bin/bash

All the BASH options! (Sanity check)

set -e: terminate the script if any command exits with a nonzero exit status.
set -u: treat a reference to an unset variable as an error, so the command is not run.
set -o pipefail: a pipeline returns a nonzero status if any program in it fails, not only the last one (combined with set -e, the script then stops).

echo "rm $VAR1/NOTSET*.*" #with VAR1 unset this expands to rm /NOTSET*.*
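
To see what these options change, the sketch below runs tiny commands in child shells (bash -c) and records their behaviour; the variable names default_status, pipefail_status and unset_result are chosen here only for illustration:

```bash
# Exit status of a pipeline whose FIRST command fails
default_status=$(bash -c 'false | true; echo $?')
pipefail_status=$(bash -c 'set -o pipefail; false | true; echo $?')
echo "without pipefail: $default_status, with pipefail: $pipefail_status"

# With set -u, referencing the unset VAR1 is an error, so the risky
# "rm $VAR1/NOTSET*.*" line is never even built
if bash -c 'set -u; unset VAR1; echo "rm $VAR1/NOTSET*.*"' 2> /dev/null
then
    unset_result="ran"
else
    unset_result="stopped"
fi
echo "command with unset variable: $unset_result"
```

Running the checks in child shells keeps the current session untouched, which is convenient when experimenting in an interactive terminal.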

My first script

First, create a file:

touch myscript.sh

Add a shebang:

#!/bin/bash
echo "I am writing this non-sense script for you - DEV"

Then, make it executable:

chmod +x myscript.sh #execute permission (x) for all users
chmod u+x myscript.sh #execute permission (x) only for the owner (u)

Finally: Execute it!

bash myscript.sh

# set variables
dirpath='PATH'
fasta='file'
echo ${dirpath} #always use ${}
variables=$(pwd) #no spaces around =; $(command) stores the command output
echo "hello ${NAME1}"
((NUM++)) #double parentheses for mathematical expressions

Nano

Control + o = save
Control + x = exit

2. Data management

A. File management

# File Size
du -sh file
du -ah file

# Copy
cp filea.txt directory/filea.txt
# Move
mv filea.txt directory/filea.txt

Compressed files

Check compressed files

zcat file.gz | less

# Check if is valid   
gzip -t file.gz

# Compress multiple files, to keep the original -k
gzip *.fa

# Decompress
gunzip file.gz
gunzip SRR*_1R2.fastq.gz

# Decompress and keep the original
gunzip -c file.gz > file

Tar

tar zxvf file.tar.gz
#z gzip (de)compression
#x extract
#v verbosely
#f the following is a filename
#c create an archive
#t view list

tar -tvf file.tar.gz # viewing without extraction

#create a tar
tar zcvf file.tar.gz file1 file2 file3 folder1

Zip files

zip file.zip file1 file2 #archive name first, then the files to add
unzip file.zip

tar.bz2

tar -jxvf file.tar.bz2

#to create
tar -cvjSf file.tar.bz2 file1 file2 folder1 folder2

File integrity

Reason: files may be corrupted after downloading.


#integrity for data transfer checksum
md5sum *fa > MD5_checksum
#to check
md5sum -c MD5_checksum
#to compare them with the original file
diff -u file1 file2 | cat
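
A minimal sketch of the checksum workflow, assuming GNU md5sum is available (as on most Linux systems) and using a made-up FASTA file; the check variable is introduced here only for illustration:

```bash
# Work in a scratch directory with a hypothetical FASTA file
cd "$(mktemp -d)"
printf '>seq1\nACGT\n' > example.fa

# Record the checksum, then verify it
md5sum *.fa > MD5_checksum
md5sum -c MD5_checksum      # reports example.fa: OK

# Simulate corruption by appending one byte, then verify again
printf 'N' >> example.fa
if md5sum -c MD5_checksum > /dev/null 2>&1
then
    check="passed"
else
    check="failed"
fi
echo "after modification the check $check"
```

Because md5sum -c returns a nonzero exit status when any file does not match, it can be used directly in an if statement or together with set -e.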

B. File manipulation

cut, sort, uniq, wc, grep, paste

sort -n lenghts.txt #sort numerically
head -n 5 lenghts.txt #first 5 lines
wc *.fastq | sort -n | head -1

grep: Global regular expression print

Search inside the file for a string or pattern.

grep "Hola" file.txt | wc -l
grep -w #exact word
grep -c # count number of lines
grep -n #line numbers
grep -v # do not match
grep -E #extended regex: no need to escape special characters; -Ev also inverts the match
grep --color
grep -l #print only the names of the matching files
grep -A 1 #prints one line after
grep -o '[A-Z]*' # print only the matching part (here it drops the numbers)
grep -i "a line" | cut -f 1-3,5 # case insensitive
paste - - - - < file | cut -f 1-2 | head -2 # merge every 4 lines into one row with 4 tab-separated columns
grep -E 'ab+c' # one or more of the preceding character, e.g. abbbbbbc
grep -E 'ab*cd' # zero or more of the preceding character
grep -E 'ab{1,5}cd' #number of repetitions of the preceding character set
grep '\<the\>' # word boundaries
grep -E '([0-9]+).*\1' # back-reference: the same number appears twice
grep -E '([0-9]+)_([0-9]+)_\1_\2' # both numbers repeated
grep "[24]" #lines containing 2 or 4

### Practice
#fastq convert to fasta
paste - - - - <data/example.fastq |cut -f 1-2 | sed 's/^@/>/' | tr "\t" "\n"  | head -4
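
To make the conversion concrete, here is a sketch on a tiny made-up FASTQ file (the read names and sequences are assumptions for illustration). paste - - - - joins every four lines into one tab-separated row, so the header and the sequence end up in columns 1 and 2:

```bash
# Work in a scratch directory
cd "$(mktemp -d)"

# Hypothetical FASTQ with two reads (4 lines per read)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > example.fastq

# Keep header and sequence, turn the leading @ into >, and split the row again
paste - - - - < example.fastq | cut -f 1-2 | sed 's/^@/>/' | tr '\t' '\n' > example.fasta

cat example.fasta
```

Joining the four lines first matters: a quality line can itself start with @, and working on column 1 only avoids rewriting it by mistake.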

## More advanced: interleave paired-end reads into one file, then split them again

paste <(paste - - - - < reads-1.fastq) \
      <(paste - - - - < reads-2.fastq) | \
      tr '\t' '\n' > reads-int.fastq
paste - - - - - - - - < reads-int.fastq \
    | tee >(cut -f 1-4 | tr '\t' '\n' > reads-1.fastq) \
    | cut -f 5-8 | tr '\t' '\n' > reads-2.fastq

wc -l #lines 
wc -w #words 
wc -c #bytes (-m for characters)

# Translate 
echo "santhilal" | tr a-z A-Z

fastafile=xxxx.fa
grep -v "^>" $fastafile | grep --color -i "[^ATCG]"

echo "There are $(grep -c '^>' $fastafile) entries in my FASTA file."

# Examples, grep -w "...." = 4 characters
cat H0.list | grep "^USP*" # Here '^' means 'line starts with'; P* means zero or more P
cat H0.list | grep "USP*"
cat H0.list | grep "\.[0-9]$" # Here '$' means 'line ends with'

# real life examples
cat hg38_gene_annotation_mart_export.txt | grep -v "^#" | cut -f1 | sort | uniq | wc -l # This counts total number of unique genes in the annotation

cat hg38_gene_annotation_mart_export.txt | grep -v "^#" | cut -f7 | sort | uniq | wc -l # This counts total number of unique transcripts

cat hg38_gene_annotation_mart_export.txt | grep -v "^#" | grep -w "protein_coding" | cut -f1 | sort | uniq | wc -l # Counts only total number of protein coding genes

# Zcat to open .gz files

zcat Homo_sapiens.GRCh38.87_badStructure.table.gz | sed 's/gene_id "//' | sed 's/gene_name "//' | sed 's/gene_biotype "//' | sed 's/"//g' | sed '1i\Geneid\tGeneSymbol\tChromosome\tClass\tStrand\tLength' | sed 's/ //g' | head


# convert DNA to RNA
cat file.txt | tr T U
#tr -d 'AT' (delete)
tr '\t' ',' #tsv to csv
tr a-z A-Z #lowercase to uppercase
tr ACTG TGAC #complement

# convert reverse complement
grep -o '[ACTG]*' junk.txt | tr ACGT TGCA | rev
grep '[CG]' -o file | wc -l #count only CG

#find unique features
zcat $DATA_DIR/$GTF | grep -v '^#' | cut -f 3 | sort | uniq
#counts
zcat $DATA_DIR/$GTF | grep -v '^#' | cut -f 3 | sort | uniq -c
#first 3 exons
zcat $DATA_DIR/$GTF | grep -v '^#' | awk '$3 == "exon"' | head -3
#attributes
zcat $DATA_DIR/$GTF | grep -v '^#' | awk '$3 == "gene"' | cut -f 9 > genes.txt

find

# Find any files with "Linux" and ".Rmd" in the file names, -iname, -maxdepth 1, -ctime +1, -mtime -1, -mmin -15
find . -type f -name "*Linux*.Rmd"

# Count the number of files
find . -type f | wc -l
find . -type f -size +10M
find . -maxdepth 1 -type f -size +1M

# Do something with the file or folder (d)
find . -type f -name "*fa*.pl" -exec ls -l {} +

cut

cut: choose columns from a tab-delimited file (spreadsheet): cut -f (column numbers to choose) file.txt

cut extracts columns separated by a delimiter (-d; the default is tab)
cut -d " " -f 2 file.txt #column 2; also 1,3,5 or 1-5

uniq

Recognizes when two or more adjacent lines are the same.

uniq -c counts the repeats
uniq -w N uses only the first N characters to decide whether lines are equal

sort haiku.txt | uniq -c 
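
Since uniq only compares adjacent lines, sorting first changes the result. The sketch below shows the difference on a made-up list of feature types (file name and content are assumptions for illustration):

```bash
# Work in a scratch directory
cd "$(mktemp -d)"

# Hypothetical feature column, not sorted
printf 'gene\nexon\nexon\ngene\nexon\n' > features.txt

uniq -c features.txt          # counts adjacent runs only: gene, exon, gene, exon
sort features.txt | uniq -c   # true totals: 3 exon, 2 gene

# Most frequent feature first
top_feature=$(sort features.txt | uniq -c | sort -rn | head -n 1 | awk '{print $2}')
echo "most frequent: $top_feature"
```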

history # command history

# Example first all step by step
ls ~ > file_list.txt
wc -l file_list.txt
rm file_list.txt

# piping the data from one command to another
ls ~ | wc -l

Piping

&&: Continue only if the first command completed successfully.
||: Continue only if the first command failed.

\: continue the command on the next line; |: pipe the output to the next command

# Follow the output of a file as it is written
tail -f logfile

true && echo "works"

##  Example
for file in "$@"
do 
    wc "${file}"
done


for file in "$@"
do
    output=$(basename "${file}" .fastq)-wordcount.txt #per-file output name (not used yet)
    wc "../data/${file}" >> ../data/wordcount.txt
done

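mkdir dataseq

# sra_list.txt is a hypothetical file with one SRA accession per line
while read -r inputsra
do
    fasterq-dump -p --split-files "$inputsra" -O ./dataseq
done < sra_list.txt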

3. Virtual Environment for Python

Virtual environments allow us to manage every project independently (avoiding version conflicts). Basically, two options:

Conda
Virtualenv or venv

Installation: (only the first time.)

venv: built-in tool
virtualenv: more advanced and efficient features.

# Tip: in a notebook, every bash command needs a prefix (% or !)
!pip install virtualenv
!echo "Virtual environments allow us to work on every project with independent package versions"

A. Create a virtual environment

# Option A
# a ".venv" environment has been created (WOS)
!python -m venv .venv

# Option B
!virtualenv .venv 

# Option C
# Control + Shift + P, venv
# create environment, select interpreter

## To delete (deactivate it, select another interpreter, then remove the folder)
!rmdir /s /q .venv # Windows; on Linux/macOS: rm -rf .venv

B. Activate venv

# Windows
# Select interpreter
# in the terminal .venv\Scripts\activate.ps1

## Mac
# source env_name/bin/activate

## deactivate
## deselect the interpreter

C. Install modules for Python

Now every module will be downloaded and installed separately from the global environment (a specific version can also be requested with package==version). In an activated terminal the ! or % prefix is not needed at this step.

pip install ipykernel pandas jupyter pyyaml

4. Variables

We use a dollar sign ($) to access a variable.

results_dir="results/"

echo $results_dir

# = To be more specific with beginning and end
echo ${results_dir}

# = Make it more robust with ""
echo "${results_dir}_abc/"

A. Variables and arguments

An editor: Yes, we can use any (nano, vim, emacs, etc).

echo '#!/bin/bash 
echo "script name: $0"
echo "first arg: $1"
echo "second arg: $2"
echo "third arg: $3" ' > args.sh

bash args.sh arg1 arg2 arg3

$0 = script name.
$1 = argument 1.
$# = number of arguments.

echo '#!/bin/bash 
if [ "$#" -lt 3 ] 
then
    echo "error: too few arguments, you provided $#, 3 required"
    echo "usage: script.sh arg1 arg2 arg3"
    exit 1
fi
echo "script name: $0"
echo "first arg: $1"
echo "second arg: $2"
echo "third arg: $3" ' > args.sh

Practice

bash args.sh arg1 arg2

#Passing arguments
echo $1 $2 $3

#script name
echo $0 

# = another form to pass; array indices start at 0, so args[0] is the first argument
args=("$@")
echo ${args[0]} ${args[1]} ${args[2]}  

#the same result
echo $@

#number of arguments
echo $#

Practice

## using read
echo "Enter name:"
read name
echo "Entered name: $name"

## if multiple variables
echo "Enter names:"
read name1 name2 name3
echo "Names: $name1, $name2, $name3"

## to use the same line
read -p "username:" user_var
echo "username: $user_var"

#to use a hidden or silent -sp
read -p "username:" user_var
read -sp "password:" pass_var
echo
echo "username: $user_var"
echo "password: $pass_var"

Practice

This is how arrays work in bash:

#store the inputs in an array with read -a
read -a names
echo "names: ${names[0]}, ${names[1]}"

#if read is given no variable name, the input goes to REPLY
read
echo "Name: $REPLY"

5. Conditionals

An exit status of 0 means success (true); anything else means failure (false). The if statement tests a command's exit status directly, so no extra syntax is needed to call it.

if [command]
then
    [if-statement] #runs if the exit status is 0 (true)
elif [command]
then
    [elif-statement] #optional
else
    [else-statement] #optional
fi

Practice

If a pipeline is used as the condition with set -o pipefail, a nonzero exit from any command in it makes the condition false and execution skips to the next block.

#!/bin/bash
if grep "pattern" some_file.txt > /dev/null 
then
    # commands to run if "pattern" is found
    echo "found 'pattern' in 'some_file.txt'" 
fi
# We can also negate our program's exit status with !: 
if ! grep "pattern" some_file.txt > /dev/null
then
    echo "did not find 'pattern' in 'some_file.txt'" 
fi
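
A runnable sketch of the same idea on a made-up file. It uses grep -q, which prints nothing and only sets the exit status, as an alternative to redirecting to /dev/null; the variables header and ambiguous are introduced here only for illustration:

```bash
# Work in a scratch directory with a hypothetical FASTA file
cd "$(mktemp -d)"
printf '>seq1\nACGT\n' > some_file.txt

# The condition is true when grep exits with 0 (pattern found)
if grep -q "^>" some_file.txt
then
    header="found"
else
    header="missing"
fi

# ! inverts the exit status: true when the pattern is NOT found
if ! grep -q "N" some_file.txt
then
    ambiguous="none"
else
    ambiguous="present"
fi

echo "header: $header, ambiguous bases: $ambiguous"
```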

6. Expressions

An expression can be a string comparison, a numeric comparison, a file operator or a logical operator, and it is written as [ expression ] (with spaces inside the brackets):

A. String Comparisons:

= compare if two strings are equal; if [ "$a" = "$b" ]
!= compare if two strings are not equal; if [ "$a" != "$b" ]
-n evaluate if string length is > 0
-z evaluate if string length is = 0
== is equal to if [ "$a" == "$b" ]

B. Expressions for strings:

< is less than, in ASCII alphabetical order if [[ "$a" < "$b" ]]
> is greater than, in ASCII alphabetical order if [[ "$a" > "$b" ]]

Examples:

[ s1 = s2 ] # (true if s1 same as s2, else false)
[ s1 != s2 ] # (true if s1 not same as s2, else false)
[ s1 ]  # (true if s1 is not empty, else false)
[ -n s1 ] #   (true if s1 has a length greater than 0, else false)
[ -z s2 ] #  (true if s2 has a length of 0, otherwise false)

C. Expressions for numbers

-eq compare if two numbers are equal
-ge compare if one number is >= to a number, (("$a" >= "$b"))
-le compare if one number is <= to a number, (("$a" <= "$b"))
-ne compare if two numbers are not equal
-gt compare if one number is > another number, (("$a" > "$b"))
-lt compare if one number is < another number, (("$a" < "$b"))

Practice

#Examples: 
[ n1 -eq n2 ]  #(true if n1 same as n2, else false)
[ n1 -ge n2 ]  #(true if n1 greater than or equal to n2, else false)
[ n1 -le n2 ]  #(true if n1 less than or equal to n2, else false)
[ n1 -ne n2 ]  #(true if n1 is not same as n2, else false)
[ n1 -gt n2 ]  #(true if n1 greater than n2, else false)
[ n1 -lt n2 ]  #(true if n1 less than n2, else false)

echo -e "Enter the name of the file: \c"
read file_name
if [ -e "$file_name" ] #exists?
then
    echo "$file_name found"
else
    echo "$file_name not found"
fi

D. Practice

echo -e "Enter the name of the file: \c"
read file_name
if [ -f "$file_name" ]
then
    if [ -w "$file_name" ]
    then
        echo "Type some text data, to quit press control + d"
        cat >> "$file_name" #append the typed text to the file
    else
        echo "$file_name is not writable"
    fi
else
    echo "$file_name does not exist"
fi

7. Basic operations in Bash

Operations: + - * / % (remainder after division). With expr, * must be escaped as \*

num1=20
num2=5 #example value

echo $((num1 + num2))
echo $(expr $num1 + $num2)
echo $(($num1 + $num2))

# working with decimals (bc or awk)
echo "1.5+2.3" | bc
echo "scale=2; 20.5/5" |bc
echo "scale=2; sqrt($num1)" | bc -l #-l loads the math library with more functions
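
Since bc is not installed on every system, awk is a common alternative for decimals. A minimal sketch with the same numbers as above (the variable names ratio and total are chosen here for illustration):

```bash
# Division with two decimals, equivalent to: echo "scale=2; 20.5/5" | bc
ratio=$(awk 'BEGIN { printf "%.2f", 20.5 / 5 }')

# Shell values are passed into awk with -v
total=$(awk -v a=1.5 -v b=2.3 'BEGIN { printf "%.1f", a + b }')

echo "ratio=$ratio total=$total"
```

The BEGIN block runs without reading any input, so no file or pipe is needed.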

A. Basic operations

Tips: for floating numbers -> bc (basic calculator)

`test` command

Exits with either 0 or 1; supports the standard comparison operators (independently of if)

test "ATG" = "ATG" ; echo "$?"
test "ATG" = "atg" ; echo "$?" 
test 3 -lt 1; echo "$?"
test 3 -le 3; echo "$?"

File test operators

-d directory; -b block special file; -c character special file
-s file exists and is not empty; -f regular file; -e file exists; -h symbolic link
-r readable?; -w writable?; -x executable?

test -d some_directory ; echo $? # is this a directory? 
test -f some_file.txt ; echo $? # is this a file?
test -r some_file.txt ; echo $? # is this file readable?
#test+if
if test -f some_file.txt
    then [...]
fi
if [ -f some_file.txt ]
    then [...]
fi

Operators

-a AND; -o OR; ! negation; () group

age=25
if [ "$age" -gt 18 ] && [ "$age" -lt 30 ]
then
    echo "valid age"
else
    echo "age not valid"
fi
### Alternatives
##  [ "$age" -gt 18 -a "$age" -lt 30 ]
##  [[ "$age" -gt 18 && "$age" -lt 30 ]]

Practice

Tip: leave a space after [ and before ]

#!/bin/bash
set -e
set -u
set -o pipefail

if [ "$#" -ne 1 -o ! -r "$1" ]
  then echo "usage: script.sh file_in.txt"
  exit 1 
fi

B. Case patterns

LANG=C #C locale: ranges such as [a-z] and [A-Z] follow ASCII order
echo -e "Enter some xxxx: \c"
read value 

case $value in
    pattern1)  #[a-z], [A-Z], [0-9], ? (one special character), * (anything else)
        statement ;; #echo "this is the $value"
    pattern2)
        statement ;;
    ...
    *)
        statement ;;
esac

More case patterns

#!/usr/bin/bash
echo -e "Enter a character: \c"
read value
#environment variable to set the language
LANG=C
case $value in
    [a-z] )
        echo "lower case" ;;
    [A-Z] )
        echo "upper case" ;;
    [0-9] )
        echo "digit" ;;
    ? )
        echo "special" ;;
    * )
        echo "unknown input" ;;
esac

C. Arrays

A set of elements

#!/usr/bin/bash
os=("ubuntu" "windows" "kali") #elements are separated by spaces, not commas
os[3]="mac" #it will add mac to the array
unset 'os[2]' #remove the third one
# gaps in the array are possible

echo "${os[@]}" #all elements printed
echo "${os[1]}" #to print windows (the second element)
echo "${!os[@]}" #prints the indices (0 1 3)
echo "${#os[@]}" #prints the length of the array
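
The gap left by unset is easy to miss, so here is a short sketch that records the indices and the length and then loops over the remaining elements (the variable names indices and count are chosen here for illustration):

```bash
# Elements are separated by spaces
os=("ubuntu" "windows" "kali")
os[3]="mac"       # append at index 3
unset 'os[2]'     # remove "kali"; index 2 is now a gap

indices="${!os[*]}"   # the indices that are still in use
count="${#os[@]}"     # number of elements
echo "indices: $indices"
echo "count: $count"

# Quoting "${os[@]}" keeps each element as one word
for name in "${os[@]}"
do
    echo "os: $name"
done
```

Looping over "${os[@]}" rather than counting from 0 to the length avoids touching the missing index.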

D. Loops

It's better to have a list of files.
Tip: avoid spaces, tabs, newlines, or special characters (*) in file names

echo ${sample_files[@]}
echo ${sample_names[@]}

#text with 3 columns, and the third column is filenames
sample_files=($(cut -f 3 ../file.txt))
echo ${sample_files[@]}

#then use basename to remove extension
basename -s .fastq seqs/ABC001.fastq

E. While loops

n=1
while [ $n -le 10 ]
#other option (( n <= 10 ))
do
    echo "$n"
    n=$((n+1))
    #other option (( ++n  ))
    
done

Tip: force a pause

sleep 1 #pause for 1 second
gnome-terminal & #open a terminal
xterm & #open a terminal

Practice

#!/bin/bash

set -e
set -u
set -o pipefail

# specify the input samples file, where the third column is the path to each sample FASTQ file
sample_info=samples.txt

# create a Bash array from the third column of $sample_info
sample_files=($(cut -f 3 "$sample_info"))

for fastq_file in ${sample_files[@]} 
do
    # strip .fastq from each file, and add suffix "-stats.txt" to create an output filename
    results_file="$(basename $fastq_file .fastq)-stats.txt"
    
    # run fastq_stat on a file, writing results to the filename we've created above
    fastq_stat $fastq_file > stats/$results_file
done

Practice

#!/bin/bash
set -e
set -u
set -o pipefail

# specify the input samples file, where the third column is the path to each sample FASTQ file
sample_info=samples.txt

# our reference
reference=zmays_AGPv3.20.fa

# create a Bash array from the first column, which are 
# sample names. Because there are duplicate sample names 
# (one for each read pair), we call uniq 

sample_names=($(cut -f 1 "$sample_info" | uniq))

for sample in ${sample_names[@]}
do
    # create an output file from the sample name
    results_file="${sample}.sam"

    bwa mem $reference ${sample}_R1.fastq ${sample}_R2.fastq > $results_file 
done

F. While read

while read p
do 
    echo $p
done < file.sh
#another option
cat file.sh | while read p
do 
    echo $p
done
#another option for files with special characters
while IFS= read -r p
do
    echo "$p"
done < file.sh

G. Until loop

#runs while the condition is false
n=1
until [ $n -ge 10 ]
#also possible with until (( n >= 10 ))
do
    echo $n
    n=$(( n+1 ))
done

## For loop

for variable in 1 2 3 4 5 #list of values
for variable in {1..10..2} #{start..end..increment}
for variable in file1 file2 file4
for OUTPUT in $(Linux-command-here)
for (( EXP1; EXP2; EXP3 ))
for (( i=0; i<5; i++ ))
for command in ls pwd date

For loop examples

for item in * #everything inside the folder
do
    if [ -d "$item" ] #directories only
    then 
        echo "$item"
    fi
done

# Another example
#!/bin/bash
for (( i=1 ; i<=10 ; i++ ))
do
    if [ $i -eq 3 -o $i -eq 6 ]
    then
        continue #skip 3 and 6; break would leave the loop
    fi
    echo "$i"
done

Select loop for easy menus

select varName in list #Presented with numbers
do 
    command1
    command2
    ..
    commandN
done


select name in mark john tom ben
do
    echo "$name selected"
done

select name in mark john tom ben
do
    case $name in
    mark)
        echo "mark selected"
        ;;
    john)
        echo "john selected"
        ;;
    *)
        echo "Error"
        ;;
    esac
done

H. Break and continue

to exit the loop prematurely

for (( i=1 ; i<=10 ; i++ ))
do
    if [ $i -gt 5 ]
    then 
        break #continue is also possible (to skip exceptions)
    fi
    echo "$i"
done

8. Functions

function name(){
    Commands
}

#OR

name () {
    Commands
}

#Example
function Hello(){
    echo "Hello"
}

Practice

usage(){
    echo "you need to provide an argument : "
    echo "usage : $0 file_name"
}
is_file_exist() {
    local file="$1"
    [[ -f "$file" ]] && return 0 || return 1
} 

[[ $# -eq 0 ]] && { usage; exit 1; }

if is_file_exist "$1"
then
    echo "File found"
else
    echo "File not found"
fi
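
Another small sketch of functions that communicate through their exit status and their output. The function names is_fastq and count_fastq and the file names are made up for illustration:

```bash
# Returns 0 (true) when the name ends in .fastq or .fq
is_fastq() {
    local file="$1"
    [[ "$file" == *.fastq || "$file" == *.fq ]]
}

# Prints how many of its arguments look like FASTQ files
count_fastq() {
    local n=0 f
    for f in "$@"
    do
        if is_fastq "$f"
        then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

n_fastq=$(count_fastq a.fastq b.txt c.fq)
echo "FASTQ-like names: $n_fastq"
```

The exit status of a function is that of its last command, so is_fastq needs no explicit return; count_fastq hands back a value by printing it, which the caller captures with $( ).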

Local variables

function printx(){
    local name=$1 #it will use the variable only for this function
    echo "the name is $name"
}

Read-only

It can't be overwritten

var=31
readonly var
## for functions
readonly -f function_name

Signals and traps

All signals can be trapped except SIGKILL and SIGSTOP

trap "echo Exit signal is detected" SIGINT # or 2
echo "pid is $$"
while (( COUNT < 10 ))
do
    sleep  10
    (( COUNT ++ ))
done
exit 0
###############
kill -9 pid #SIGKILL (9) cannot be trapped
trap "echo Exit command is detected" 0 #0 (EXIT) runs when the script exits
echo "Hello world"
exit 0
###example
trap "rm -f $file && echo file deleted; exit" 0 2 15
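
A sketch of the cleanup pattern, run inside a subshell so it can be tried safely in a terminal. The variable names and the exit code 3 are assumptions for illustration. Note the single quotes: the trap command is expanded when it fires, whereas with double quotes the variable is expanded when the trap is defined.

```bash
tmp_file=$(mktemp)

# The subshell plays the role of a script; its EXIT trap fires when it ends,
# even though it ends with an error status
sub_status=0
(
    trap 'rm -f "$tmp_file"' EXIT
    echo "intermediate data" > "$tmp_file"
    exit 3
) || sub_status=$?

if [ -e "$tmp_file" ]
then
    state="still there"
else
    state="removed"
fi
echo "subshell exited with $sub_status; temp file $state"
```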

Debugging a script

bash -x script.sh

###in the shebang
#!/bin/bash -x
###or only for part of the script
set -x
# commands to trace (e.g. lines 1 to 10)
set +x

#### Globbing
#!/bin/bash
set -e
set -u
set -o pipefail
for file in *.fastq
do
    echo "$file: " $(bioawk -c fastx 'END {print NR}' $file)
done

Find

ls *.fq | process_fq #may contain errors
process_fq *.fq #it has some limits

##### Find - find path expressions
find / | head 
find . -maxdepth 1 #only this directory


touch path/absolute/xbc{A,B,C}_R{1,2}.fastq
find path

##### Find expressions
find path/to/folder -name "filex*fastq" -type f
# d for directories
# f for files
find path/to/folder \( -name "filex*fastq" -or -name "otrox*fastq" \) -type f #parentheses make -type f apply to both names
#another alternative
find path/to/folder -type f "!" -name "filex*fastq" -and "!" -name "*temp"

## to remove these files
find path/to/folder -name "*-temp.fastq" -exec rm {} \;

XARGS

xargs reads data from stdin (input) and executes the given command one or more times, passing that input as arguments. If no command is given, it runs echo. It is useful with programs that take multiple arguments (rm, touch, mkdir).

find path/to/folder -name "*-temp.fastq" | xargs -n 1 rm
#other example
find zmays-snps/data/seqs -name "*-temp.fastq" > files-to-delete.txt
cat files-to-delete.txt
cat files-to-delete.txt | xargs rm
######
find . -name "*.fastq" | xargs basename -s ".fastq" | xargs -I{} fastq_stat --in {}.fastq --out ../summaries/{}.txt 
## Notice that the BSD xargs will only replace up to five instances of the string specified 
## by -I by default, unless more are set with -R.
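
File names with spaces break the plain find | xargs pattern, because xargs splits its input on blanks. A sketch of the usual workaround with NUL-separated names, on made-up files in a scratch directory:

```bash
# Work in a scratch directory; one name contains a space on purpose
cd "$(mktemp -d)"
touch "sample A-temp.fastq" "sampleB-temp.fastq" "keep.fastq"

# -print0 and -0 separate the names with NUL bytes, so spaces are safe
find . -name "*-temp.fastq" -print0 | xargs -0 rm

remaining=$(find . -type f | wc -l)
echo "files left: $remaining"
```

Before running rm through xargs on real data, it is worth replacing rm with echo rm once to preview what would be deleted.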

XARGS in parallel

We can launch multiple background processes with Bash for loops by adding an ampersand (&) at the end of the command (e.g., for filename in *.fastq; do program "$filename" & done). xargs can also run commands simultaneously with the -P option.

find . -name "*.fastq" | xargs basename -s ".fastq" | xargs -P 6 -I{} fastq_stat --in {}.fastq --out ../summaries/{}.txt 

GNU parallel

find . -name "*.fastq" | parallel --max-procs=6 'program {/.} > {/.}-out.txt'

Make a list of files

ls -1 | sed 's/\.fastq\.gz//g' > ../PRJEB41550.txt

9. Bioinformatics software installation

bwa-mem2 and fastp

Bwa-mem2

#sudo root - be extremely careful, control + D to exit
sudo su

sudo apt-get update

# answer yes
sudo apt-get upgrade -y
sudo apt-get dist-upgrade -y
sudo apt-get autoremove -y
sudo apt-get autoclean -y


sudo apt-get install unzip

## Samtools
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install -y libncurses-dev libbz2-dev liblzma-dev
sudo apt-get install libssl-dev
#### Option 2
sudo apt-get install libcurl4-openssl-dev
sudo apt-get install zlib1g-dev

#Steps: Install
sudo ...

export PATH=$PATH:/home/dev105/devapps/STAR/source

Installation

#Download samtools
wget  samtools bcftools htslib #LINK
tar xvjf samtoolsxxxx 
cd samtoolsxxx
./configure
make
sudo make install

##Export to the path
export PATH="$PATH:/home/labatl/devapps/bwa-mem2"
export PATH="$PATH:/home/labatl/devapps/htslib/htslib-1.20"
export PATH="$PATH:/home/labatl/devapps/htslib/bcftools-1.20"
export PATH="$PATH:/home/labatl/devapps/htslib/samtools-1.20"
#bwa-mem2 source: https://github.com/bwa-mem2/bwa-mem2
export PATH="$PATH:/home/labatl/devapps"
#Add it to
nano ~/.bashrc
## Source it
source ~/.bashrc

#Another option
nano ~/.bashrc
##Write at the end:
export PATH="$PATH:filepath"

SED

sed filters and replaces text efficiently in a pipeline

sed 's/hello/world/' input.txt > output.txt
cat input.txt | sed 's/hello/world/' - > output.txt

-i: overwrite the same file (edit in place)
-n: suppress automatic output and print only the lines selected with p: sed -n '45p' file.txt or sed -n '1p ; $p' one.txt two.txt three.txt
-s: treat each input file separately instead of as one continuous stream
-e or -f: give the script as an expression or as a script file: sed --expression='s/hello/world/' input.txt > output.txt or sed --file=myscript.sed input.txt > output.txt
-z: separate lines by NUL characters instead of newlines

sed -n 1,2p Homo_sapiens.GRCh38.87_annotation.table
sed -n "1~2p" file.txt #-n: do not print by default; extract only odd-numbered lines
zcat $DATA_DIR/$FASTQ | sed -n '2~4p;' | head | tr -d '\n' #every 4th line starting at line 2 (the sequences)
sed -n '2000,2005p' file.txt #print lines 2000 to 2005
sed "s/[0-9][0-9]*\.[[:space:]]//g" file.txt # g global
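
The first~step address used above is a GNU sed extension, so this sketch assumes GNU sed (the default on Linux). It runs on a tiny made-up FASTQ file; the variable names sequences and headers are chosen here for illustration:

```bash
# Work in a scratch directory
cd "$(mktemp -d)"

# Hypothetical FASTQ with two reads (4 lines per read)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > example.fastq

# 2~4p: print line 2 and then every 4th line after it (the sequences)
sequences=$(sed -n '2~4p' example.fastq | paste -sd ' ' -)
echo "sequences: $sequences"

# 1~4 restricts the substitution to header lines; p prints the changed lines
headers=$(sed -n '1~4s/^@/>/p' example.fastq | paste -sd ' ' -)
echo "headers: $headers"
```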

Manual: https://www.gnu.org/software/sed/manual/sed.html Extended REGEX: http://austingroupbugs.net/view.php?id=528

Loops

for is used for iterating over a fixed number of items (which may be unknown at the time of coding).
while is used for iterating until a certain condition is met

#### Loops
##### Examples of for
for file in "$@"
do
    wc "../data/${file}" >> ../data/wordcount.txt
done

for file in "$@"
do
    output=$(basename "${file}" .fastq)-wordcount.txt 
    wc "../data/${file}" >> "../results/${output}"
done

# for loops
for i in $( ls ); do echo item: $i 
done

##### Examples of while
while true; do echo 'hello'; done

#### Conditionals
if [ "foo" = "foo" ]; then
    echo "equal"
else
    echo "different"
fi

##### Iterating with xargs: -I handles the items one by one; -0 (with NUL-separated input) avoids problems with blanks
ls *.fasta | xargs -I% sh -c 'head %; echo "\n---\n"'
#sh -c runs the command in sh instead of bash
#https://www.baeldung.com/linux/xargs-multiple-arguments
my_variable="ggplot2"

echo "My favorite R package is ${my_variable}"
my_var="chr1"
echo "${my_var}_1.vcf.gz"

#script

#STDIN -> STDOUT+STDERR
#I/O redirections: |,<,>
ps -aef #show a list of all processes running
bash sam_run my_file.bam out_stats.txt
#!/bin/bash
samtools stats $1 > $2
#!/bin/bash
samtools stats $2 > $1 #output first

#to hack SIGPIPE error in Jupyter
cleanup () {
    :
}

trap "cleanup" SIGPIPE
#safety
set -u #unset variables as an error and exit





grep -v "^>" ../data/tb1.fasta | grep --color -i "[^ATCG]"
CCCCAAAGACGGACCAATCCAGCAGCTTCTACTGCTAYCCATGCTCCCCTCCCTTCGCCGCCGCCGACGC




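echo "Start!"

maxproc=4        #assumed limit of simultaneous jobs
destdir=results  #assumed output folder
for fname in *_R1.fastq.gz #assumed naming of the paired files
do
    echo "${fname}"
    while [ $(jobs -r | wc -l) -ge "$maxproc" ]
    do
        sleep 1
    done
    base=${fname%_R1*}
    echo starting new job with ongoing=$(jobs -r | wc -l)
    bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" >"$destdir/${base}_R1_R2.sam" &
done
wait #until the remaining background jobs finish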



#if to check for file existence
if [ -f hello.txt ]; then
    cat hello.txt
else
    echo "no such file"
fi

if [ ! -f "data/file" ]; then
    wget url -O data/file
fi

#test for conditional evaluation
#test -f file or [ -f file ]

#Loops
for i in SRR*
do
    echo ${i} #inserted in a line with other characters
done
for file in copy-SRR534005*
do
    echo "$file" "${file//SRR534005/chicken}" #preview the new name
    mv "$file" "${file//SRR534005/chicken}"
done

for FILE in $(ls *ipynb); do
    echo $FILE
done

for ((i=0; i<5; i++)); do
    echo $i
done

#test -a is like &&
#test -o is like ||
#=~ matches regular expression patterns (inside [[ ]])

for FILE in $(ls); do
    if [[ "${FILE}" =~ ^Bash.*sh ]]; then
        echo $FILE
    fi
done

COUNTER=10
while [ $COUNTER -gt 0 ]; do
    echo $COUNTER
    COUNTER=$(($COUNTER - 1))
done

#-lt less than
#-gt greater than
#== equality
#!= inequality

#braces in loops
for NUM in {000..005}; do
    echo mkdir EXPT-${NUM}
done
echo foo.{c,cpp,h,hp}
#loop first 10 records
LINE_NUM=0
while read LINE
do
    if [[ $LINE_NUM -lt 10 ]]; then
        echo $LINE
        (( LINE_NUM++ ))
    fi
done < genes.txt

#command expressions $(command), #basename without path, dirname
#remove the file extension Y=${x%.*}
for FILE in $(ls ~/docs/*)
do
    DIR_NAME=$(dirname $FILE)
    FILE_NAME=$(basename $FILE)
    NAME=${FILE_NAME%.*}
    NEW_DIR=$DIR_NAME/$NAME
    NEW_FILE=${NAME}-copy.txt
    mkdir -p $NEW_DIR
    cat $FILE | head -3 | tail -2  > $NEW_DIR/$NEW_FILE
done
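
The name manipulations used in the loop above can be tried on a single made-up path (the path and the variable names below are assumptions for illustration); nothing is read from disk, only the string is processed:

```bash
file="data/seqs/ABC001_R1.fastq.gz"

file_name=$(basename "$file")      # ABC001_R1.fastq.gz
dir_name=$(dirname "$file")        # data/seqs
no_ext="${file_name%.*}"           # shortest match from the end: ABC001_R1.fastq
no_exts="${file_name%%.*}"         # longest match from the end:  ABC001_R1
sample="${file_name%_R1*}"         # ABC001
renamed="${file_name//ABC/XYZ}"    # replace every ABC: XYZ001_R1.fastq.gz

echo "$dir_name | $file_name | $no_ext | $no_exts | $sample | $renamed"
```

A single % removes the shortest matching suffix and %% the longest, which is why only %% strips a double extension such as .fastq.gz in one step.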



for FILE in $(ls $DATA_DIR/*_MA_J_S20_*gz)
do
    FIRST=$(zcat $FILE 2>/dev/null | head -1)
    echo $(basename $FILE)
    echo $FIRST
    echo
done
#script
#!/path/to/shell or bash path/file
#!/bin/bash or which bash

chmod +x /path/to/script

cat > echo.sh << 'EOF'
#!/bin/bash
# "$#" gives you the number of arguments from the command line
echo $# 
for ARG in $(seq 0 $#); do
    echo $ARG
done
EOF


bash echo.sh
for file in *.fastq
do
    echo ${file}
done

#script with input specifications
bash wordcount.sh *R1.fastq
for file in ${@}
do
    wc ${file}
done
#structure: bin, data, results, manuscript


cat > count_lines.sh << 'EOF'
#!/bin/bash

for FILE in $(ls)
do
    wc -l "${FILE}"
done
EOF

AWK

#tutorial

cat file.txt | awk -F '\t' '$1=="Mt"' | head -3
cat file.txt | awk -F '\t' '$1=="2" {print $3}' | sort | uniq -c
cat docs/echo.txt | awk 'BEGIN {FS = ","} ; $5 == "M"' > docs/male.txt
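
A sketch of the pattern and pattern-plus-action forms on a made-up tab-separated table (the columns chromosome, gene and biotype, and all values, are assumptions for illustration):

```bash
# Work in a scratch directory
cd "$(mktemp -d)"

# Hypothetical table: chromosome, gene, biotype
printf 'Mt\tgeneA\tprotein_coding\n2\tgeneB\tlincRNA\n2\tgeneC\tprotein_coding\n2\tgeneD\tprotein_coding\n' > file.txt

# Pattern only: print the whole matching rows
awk -F '\t' '$1 == "Mt"' file.txt

# Pattern plus action: print column 3 of the matching rows, then count
awk -F '\t' '$1 == "2" {print $3}' file.txt | sort | uniq -c

# Conditions can be combined with && and ||
n_coding=$(awk -F '\t' '$1 == "2" && $3 == "protein_coding"' file.txt | wc -l)
echo "protein-coding genes on chromosome 2: $n_coding"
```

The action must sit inside the same quoted program as the pattern; placed outside the quotes, the shell would treat it as separate arguments.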

Additional sources

Seq

The seq command generates a range of numbers:
seq 3 (1 to 3)
seq 2 5 (2 to 5)
seq 5 2 9 (5 to 9 in steps of 2)

BASH tutorial - Notes

https://xie186.github.io/Novice2Expert4Bioinformatics/r-introduction.html
https://gist.github.com/nathanhaigh/3521724

More information
https://github.com/raynamharris/Shell_Intro_for_Transcriptomics/tree/master
https://practicalcomputing.org/
https://explainshell.com/
https://data-skills.github.io/unix-and-bash/01-unix-git-intro/index.html

https://wikis.utexas.edu/display/bioiteam/Scott’s+list+of+linux+one-liners
https://www.tutorialspoint.com/using-gzip-and-gunzip-in-linux
https://www.thegeekstuff.com/2012/12/linux-tr-command/
https://www.gnu.org/software/gawk/manual/html_node/Comparison-Operators.html

https://genome.ucsc.edu/FAQ/FAQformat#format1

https://gist.github.com/DannyArends/04d87f5590090dfe0dc6b42e5e1bbe15
RNA Sequencing - Setup and Prerequisites: https://www.youtube.com/watch?v=PlqDQBl22DI&t=14s

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19080

My index for genome STAR alignment
https://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/Human/
https://data.broadinstitute.org/Trinity/
https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/

https://ubuntu.com/tutorials/command-line-for-beginners#4-creating-folders-and-files

http://linuxcommand.org/tlcl.php

https://xie186.github.io/Novice2Expert4Bioinformatics/install-bioinformatics-software-in-linux.html https://hackr.io/blog/basic-linux-commands https://youtu.be/7-G_dQr7B44 Install MUSCLE using binaries: https://youtu.be/UJx34KVLIgI Install PhyML using binaries: https://github.com/vappiah/bioinfoscr… Gene extractor script is available here https://itol.embl.de/ iTOL browser https://youtu.be/2tMQYi_12IQ Python codes explained Installing muscle https://www.drive5.com/muscle/downloads.htm

https://jlsteenwyk.com/ClipKIT/advanced/index.html https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001007

From https://cyverse-leptin-rna-seq-lesson-dev.readthedocs-hosted.com/en/latest/section-3.html

https://www.freecodecamp.org/espanol/news/grep-command-tutorial-how-to-search-for-a-file-in-linux-and-unix-with-recursive-find/
https://tldp.org/LDP/Bash-Beginners-Guide/html/sect_09_02.html
https://datacarpentry.org/wrangling-genomics/
https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands
https://sanbomics.com/2022/01/08/complete-rnaseq-alignment-guide-from-fastq-to-count-table/
https://cyverse-leptin-rna-seq-lesson-dev.readthedocs-hosted.com/en/latest/section-8.html

https://gist.github.com/darencard/e1933e544c9c96234e86d8cbccc709e0

https://www.thegeekstuff.com/2010/03/awk-arrays-explained-with-5-practical-examples/ https://notebook.community/crazyhottommy/scripts-general-use/Shell/Awk_anotates_vcf_with_bed

Another blog: https://biowize.wordpress.com/2015/03/26/the-fastest-darn-fastq-decoupling-procedure-i-ever-done-seen/

Sources for scripts
https://jeroenjanssens.com/dsatcl/
https://missing.csail.mit.edu/2020/course-shell/
https://swcarpentry.github.io/shell-novice/04-pipefilter/index.html
https://datascienceatthecommandline.com/2e/chapter-2-getting-started.html?q=stdin#combining-command-line-tools

wget http://kandurilab.org/users/santhilal/courses/UNIX/materials/course_file_repo.zip

Onedrive

wget -qO - https://download.opensuse.org/repositories/home:/npreining:/debian-ubuntu-onedrive/xUbuntu_22.04/Release.key | gpg --dearmor | sudo tee /usr/share/keyrings/obs-onedrive.gpg > /dev/null

echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/obs-onedrive.gpg] https://download.opensuse.org/repositories/home:/npreining:/debian-ubuntu-onedrive/xUbuntu_22.04/ ./" | sudo tee /etc/apt/sources.list.d/onedrive.list
