# Tip: in a notebook cell, prefix shell commands with ! (or a % magic)
!pip install virtualenv
!echo "Virtual environments let every project work with independent package versions"
Bash in VS Code
1. Unix Facts
A. Basics and definitions
- Most bioinformatics pipelines rely on UNIX tools (awk, bash, perl, etc.), which are often faster than R or Python for simple text processing.
- Computer clusters usually run Linux (an open-source operating system).
- GUI: Graphical User Interface (classic interface)
- Shell: Command line interface (terminal)
- In Windows: VirtualBox is an option
- Flavours of Unix: Ubuntu (Linux), macOS
(simple) Shell << Python (advanced)
Steps for Linux installation in Windows:
- Open "Turn Windows features on or off"
- Check "Windows Subsystem for Linux"
Types:
- Bourne shell (sh) - old stuff
- Korn shell (ksh) - old stuff
- Bourne Again shell (bash) - Linux
# = 1. Which shell are we using?
echo $0
# = 2. What shells have we available?
cat /etc/shells
# = 3. Where is bash located?
which bash
# = 4. Create an empty file
touch peru.sh
# = 5. To check my file permissions
ls -al
# = 6. System variables (usually uppercase)
echo $BASH
echo $BASH_VERSION
echo $HOME
echo $PWD
B. Linux file system
Tree file system: every path starts from the root, a forward slash /
- Absolute: /starts/from/the/root
- Relative: to the working directory:
. (current), ~ (home; cd with no argument also goes home), .. (one level up), ../.. (two levels up), - (previous directory; cd - also prints it), ~- (previous directory)
Tip: To know where I am: pwd or readlink -f .
C. Command Syntax
command [options] [file or arguments]
- prompt: [user]$
- STDIN: Standard input or 0
- STDOUT: Standard output or 1 (to a file with >)
- STDERR: Standard error or 2 (command &> logfile for both streams, 2> for errors only)
- Basic commands: ls, cd, cp, mv and mkdir -p
- Be careful with spaces, spelling and case; always use auto-completion with the TAB key
- More information: man command, --help, whatis
- Example: htlvlab$ cat -1 xd.txt
- For intermediate file creation:
program1 input.txt | tee intermediate-file.txt | program2 > results.txt
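The tee pattern above can be tried with standard tools. This is a minimal sketch, assuming sort and uniq -c as stand-ins for program1 and program2 and a throwaway directory from mktemp; none of these names come from the notes.

```shell
#!/bin/bash
# Sketch: keep an intermediate file while the pipeline keeps flowing.
# sort and uniq -c are placeholders for program1 and program2 (assumption).
workdir=$(mktemp -d)
printf 'b\na\nb\n' > "$workdir/input.txt"

sort "$workdir/input.txt" \
  | tee "$workdir/intermediate-file.txt" \
  | uniq -c > "$workdir/results.txt"

# intermediate-file.txt holds the sorted lines, results.txt the counts
intermediate_lines=$(wc -l < "$workdir/intermediate-file.txt" | tr -d ' ')
result_lines=$(wc -l < "$workdir/results.txt" | tr -d ' ')
echo "intermediate: ${intermediate_lines} lines, results: ${result_lines} lines"
```

tee writes a copy of its input to the file and passes the same data on, so the intermediate step can be inspected without rerunning program1.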
# ls [option] [file]
ls -a #hidden
ls -t #sort by time; -d lists directories themselves; -h human-readable sizes (with -l)
ls -l -a #characteristics
ls -tlh
ls -l | grep -Ev '^d' #showing only files
ls -l | grep -E '^d' # only folders
# it is possible to combine
ls -la
# use tree to see all the structure
tree -a
Practice
# Try the following
date
echo hello world! # redirect with > to overwrite a file or >> to append
pwd
cd
man ls
cat # show the content and concat
# remove - BE CAREFUL!
rm -rf folder
rmdir #for (empty) directories
Downloading
wget http://address
D. Wildcards
# * - any number of any characters
ls *.xlsx
# ? - exactly one character (in a glob a dot is literal; . means "any character" only in a regex)
ls ?asta
# character ranges and brace expansion
ls [a-z]*.fasta
ls {abc,bcd}.fasta
Practice
Usual project organization: {bin,data,results,manuscripts}
# Create a whole folder structure, avoid spaces
mkdir -p project_htlv/{raw_data,metadata,curated_data,output}
mkdir results-$(date +%F)
Wildcard list
| Class | Matches |
| --- | --- |
| [:alpha:] | any upper/lowercase letter or [a-zA-Z] |
| [:digit:] | any digit or [0-9] |
| [:alnum:] | any letter or digit or [a-zA-Z0-9] |
| [:upper:] | uppercase letter or [A-Z] |
| [:lower:] | lowercase letter or [a-z] |
| [:blank:] | space and tab |
| [:print:] | printable characters |
| [:cntrl:] | control (non-printable) characters |
| [:space:] | white-space |
| [:xdigit:] | hexadecimal digits |
| [:ascii:] | ASCII characters |
| [:word:] | alphanumeric characters including underscore (_) |
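'^It' # line starts with It
'[0-9]+' # line containing one or more digits (extended regex)
[!a-d]* # glob: negates the set, names that do not start with a-d
[^A-Z] # match anything not in the set
'[[:punct:]]' # punctuation, such as a comma or a period
[abc] # a or b or c (inside brackets no | is needed)
(cat|dog) # cat or dog (extended regex)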
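The patterns above can be checked quickly with grep. This is a small sketch on made-up sample lines (the text is an assumption, not data from the notes), counting matching lines with grep -c.

```shell
#!/bin/bash
# Sketch: try the patterns on a tiny, made-up sample.
sample=$(printf 'It works\nitem 42\ndog 7\ncat, dog.\n')

starts_with_it=$(printf '%s\n' "$sample" | grep -c '^It')      # lines starting with It
with_digits=$(printf '%s\n' "$sample" | grep -Ec '[0-9]+')     # lines containing digits
with_punct=$(printf '%s\n' "$sample" | grep -c '[[:punct:]]')  # lines with punctuation
cat_or_dog=$(printf '%s\n' "$sample" | grep -Ec '(cat|dog)')   # lines with cat or dog

echo "It:$starts_with_it digits:$with_digits punct:$with_punct cat|dog:$cat_or_dog"
```

Note that '^It' is case sensitive, so the line "item 42" is not counted; grep -i would include it.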
Visualization
View or merge huge files with cat and zcat; page through them with more, less, head and tail (use q to exit)
head -n 5 file.txt
head -5 file.txt
# Example
x="Hola amiguitos!"
echo "$x" > hello.txt
cat hello.txt
## Exercise: using cat, concatenate a list of fasta files
E. System
Check jobs running
ps aux
#dynamic realtime view
top
#kill a process
kill number
kill -9 number
kill -9 %<num>
#disk space
df -h
Clusters
#For clusters
myquota
#memory
free -h
#who is logged
w
who -a
whoami
#uptime
uptime
#network
ifconfig
#User and passwd
adduser xxxid
adduser xxxide -g bioinf
Copy from the server
rsync
scp
F. Scripting
All programs finish with an exit status, stored in $? (0: success, nonzero: error)
.sh extension for bash executable files.
echo """this is my first text,
If I write a haiku,
use triple marks""" > xd.txt; echo $?
echo "Hola, I speak spanish"
cat -1 xd.txt
Tips: Always avoid errors
- Check that files exist.
- Any error may cause your script to abort 😰.
- Variable names should not start with a number.
- Be careful with strings and spaces!
Shebang - a header to call bash
- First line - our bash path
#!/bin/bash
All the BASH options! (Sanity check)
set -e: terminate the script if any command exits with a nonzero exit status.
set -u: treat a reference to an unset variable as an error instead of running the command.
set -o pipefail: a pipeline returns a nonzero status if any program in it fails, not only the last one.
echo "rm $VAR1/NOTSET*.*" #if VAR1 is unset, this expands to rm /NOTSET*.*
My first script
- First, create a file:
touch myscript.sh
- Add a shebang:
#!/bin/bash
echo "I am writing this non-sense script for you - DEV"
- Then, make it executable:
chmod +x myscript.sh #execute permission (x) for everyone
chmod u+x myscript.sh #execute permission (x) only for the owner (u)
- Finally: Execute it!
bash myscript.sh
# set variables
dirpath='PATH'
fasta='file'
echo ${dirpath} #always use ${}
variables=$(pwd) #no spaces around = ; $( ) captures the output of a command
echo "hello ${NAME1}"
((NUM++)) #double parentheses for mathematical expressions
Nano
Control + o = save, Control + x = exit
2. Data management
A. File management
# File Size
du -sh file
du -ah file
# Copy
cp filea.txt directory/filea.txt
# Move
mv filea.txt directory/filea.txt
Compressed files
# Check compressed files
zcat file.gz | less
# Check if is valid
gzip -t file.gz
# Compress multiple files, to keep the original -k
gzip *.fa
# Decompress
gunzip file.gz
gunzip SRR*_1R2.fastq.gz
# Decompress and keep the original
gunzip -c file.gz > file
Tar
tar zxvf file.tar.gz
#z filter through gzip
#x extract
#v verbosely
#f the following is a filename
#c create an archive
#t list the contents
tar -tvf file.tar.gz # viewing without extraction
#create a tar
tar zcvf file.tar.gz file1 file2 file3 folder1
Zip files
zip file.zip file1 file2
unzip file.zip
tar.bz2
tar -jxvf file.tar.bz2
#to create
tar -cvjSf file.tar.bz2 file1 file2 folder1 folder2
File integrity
Reason: files may be corrupted after downloading.
#integrity for data transfer checksum
md5sum *fa > MD5_checksum
#to check
md5sum -c MD5_checksum
#to compare them with the original file
diff -u file1 file2 | cat
B. File manipulation
cut, sort, uniq, wc, grep, paste
sort -n lenghts.txt #sort numerically
head -n 1 lenghts.txt #first line
wc *.fastq | sort -n | head -1
grep: Global regular expression print
Search inside the file for a string or pattern.
grep "Hola" file.txt | wc -l
grep -w #exact word
grep -c # count number of lines
grep -n #line numbers
grep -v # do not match
grep -E #no need to escape special characters, -Ev invert option
grep --color
grep -l #files
grep -A 1 #prints one line after
grep -o '[A-Z]*' # print only the matching parts (runs of uppercase letters, dropping numbers)
grep -i "a line" | cut -f 1-3,5 # case insensitive
paste - - - - < file | cut -f 1-2 | head -2 # merge every 4 lines into one row with 4 columns
grep -E 'ab+c' # one or more of the preceding character: abbbbbbcd
grep -E 'ab*cd' # zero or more of the preceding character
grep -E 'ab{1,5}cd' # number of repetitions of the preceding character
grep '\<the\>' # word boundaries
grep -E '([0-9]+).*\1' # \1 repeats what the first group matched
grep -E '([0-9]+)_([0-9]+)_\1_\2' # both groups repeated
grep "[24]" # lines containing a 2 or a 4
### Practice
#fastq convert to fasta
paste - - - - <data/example.fastq |cut -f 1-2 | sed 's/^@/>/' | tr "\t" "\n" | head -4
## More advanced: from paired files to one interleaved file, and back
paste <(paste - - - - < reads-1.fastq) \
<(paste - - - - < reads-2.fastq) | \
tr '\t' '\n' > reads-int.fastq
paste - - - - - - - - < reads-int.fastq \
| tee >(cut -f 1-4 | tr '\t' '\n' > reads-1.fastq) \
| cut -f 5-8 | tr '\t' '\n' > reads-2.fastq
wc -l #lines
wc -w #words
wc -c #bytes (use -m for characters)
# Translate
echo "santhilal" | tr a-z A-Z
fastafile=xxxx.fa
grep -v "^>" $fastafile | grep --color -i "[^ATCG]"
echo "There are $(grep -c '^>' $fastafile) entries in my FASTA file."
# Examples, grep -w "...." = 4 characters
cat H0.list | grep "^USP*" # Here '^' means 'line starts with'; in a regex P* means zero or more P
cat H0.list | grep "USP*"
cat H0.list | grep "\.[0-9]$" # '$' anchors the end of the line
# real life examples
cat hg38_gene_annotation_mart_export.txt | grep -v "^#" | cut -f1 | sort | uniq | wc -l # This counts total number of unique genes in the annotation
cat hg38_gene_annotation_mart_export.txt | grep -v "^#" | cut -f7 | sort | uniq | wc -l # This counts total number of unique transcripts
cat hg38_gene_annotation_mart_export.txt | grep -v "^#" | grep -w "protein_coding" | cut -f1 | sort | uniq | wc -l # Counts only total number of protein coding genes
# Zcat to open .gz files
zcat Homo_sapiens.GRCh38.87_badStructure.table.gz | sed 's/gene_id "//' | sed 's/gene_name "//' | sed 's/gene_biotype "//' | sed 's/"//g' | sed '1i\Geneid\tGeneSymbol\tChromosome\tClass\tStrand\tLength' | sed 's/ //g' | head
# convert DNA to RNA
cat file.txt | tr T U
#tr -d 'AT' (delete)
tr '\t' ',' #tsv to csv
tr a-z A-Z
tr ACTG TGAC #complement
# reverse complement
grep -o '[ACTG]*' junk.txt | tr ACGT TGCA | rev
grep -o '[CG]' file | wc -l #count only C and G
#find unique features
zcat $DATA_DIR/$GTF | grep -v '^#' | cut -f 3 | sort | uniq
#counts
zcat $DATA_DIR/$GTF | grep -v '^#' | cut -f 3 | sort | uniq -c
#first 3 exons
zcat $DATA_DIR/$GTF | grep -v '^#' | awk '$3 == "exon"' | head -3
#attributes
zcat $DATA_DIR/$GTF | grep -v '^#' | awk '$3 == "gene"' | cut -f 9 > genes.txt
find
# Find any files with "Linux" and ".Rmd" in the file names, -iname, -maxdepth 1, -ctime +1, -mtime -1, -mmin -15
find . -type f -name "*Linux*.Rmd"
# Count the number of files
find . -type f | wc -l
find . -type f -size +10M
find . -maxdepth 1 -type f -size +1M
# Do something with the file (f) or folder (d)
find . -type f -name "*fa*.pl" -exec ls -l {} +
cut
cut: choose columns from a tab-delimited file (spreadsheet): cut -f (column number to choose) file.txt
cut extracts columns separated by delimiters (-d)
cut -d " " -f 2 file.txt #column 2 or 1,3,5 or 1-5
uniq
Recognizes when two or more adjacent lines are the same.
- uniq -c to count
- uniq -w N compares only the first N characters to decide whether lines are equal
sort haiku.txt | uniq -c
history # command history
# Example first all step by step
ls ~ > file_list.txt
wc -l file_list.txt
rm file_list.txt
# piping the data from one command to another
ls ~ | wc -l
Piping
&&: Continue only if the first command completes successfully.
||: Continue only if the first command fails.
|: pass the output to the next command; a trailing | also continues the command on the next line
# Redirecting the output to print
tail -f
true && echo "works"
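A minimal sketch of how && and || react to the exit status of the command on their left, using the true and false builtins; the variable names are only for illustration.

```shell
#!/bin/bash
# Sketch: && runs the right-hand side only on success, || only on failure.
on_success="not run"
on_failure="not run"
skipped="not run"

true && on_success="ran"     # true succeeds, so the right-hand side runs
false || on_failure="ran"    # false fails, so the right-hand side runs
false && skipped="ran"       # false fails, so the right-hand side is skipped

echo "on_success=$on_success on_failure=$on_failure skipped=$skipped"
```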
## Example
for file in ${@}
do
wc ${file}
done
for file in ../data/${@}
do
output=$(basename ${file} .fastq)-wordcount.txt
wc ${file} >> ../data/wordcount.txt
done
mkdir dataseq
while read -r inputsra
do
fasterq-dump -p --split-files "$inputsra" -O ./dataseq
done < accessions.txt #hypothetical file with one SRA accession per line
3. Virtual Environment for Python
A virtual environment allows us to manage every project independently (avoiding version conflicts). Basically, two options:
- Conda
- Virtualenv or venv
Installation: (only the first time.)
- venv: built-in tool
- virtualenv: more advanced and efficient features.
A. Create a virtual environment
# Option A
# a ".venv" environment has been created (WOS)
!python -m venv .venv
# Option B
!virtualenv .venv
# Option C
# Control + Shift + P, venv
# create environment, select interpreter
## To delete (deactivate it first, select another interpreter, then remove the folder)
!rmdir /s /q .venv
B. Activate venv
# Windows
# Select interpreter
# in the terminal .venv\Scripts\activate.ps1
## Mac
# source env_name/bin/activate
## deactivate
## deselect the interpreter
C. Install modules for Python
Now every module will be downloaded and installed separately from the global environment (a specific version can also be specified); avoid using ! or % at this step and run it in the activated terminal.
pip install ipykernel pandas jupyter pyyaml
4. Variables
- We use a dollar sign ($) to access a variable.
results_dir="results/"
echo $results_dir
# = To be more specific with beginning and end
echo ${results_dir}
# = Make it more robust with ""
echo "${results_dir}_abc/"
A. Variables and arguments
An editor: Yes, we can use any (nano, vim, emacs, etc).
echo '#!/bin/bash
echo "script name: $0"
echo "first arg: $1"
echo "second arg: $2"
echo "third arg: $3" ' > args.sh
bash args.sh arg1 arg2 arg3
$0 = script name. $1 = argument 1. $# = number of arguments.
echo '#!/bin/bash
if [ "$#" -lt 3 ]
then
echo "error: too few arguments, you provided $#, 3 required"
echo "usage: script.sh arg1 arg2 arg3"
exit 1
fi
echo "script name: $0"
echo "first arg: $1"
echo "second arg: $2"
echo "third arg: $3" ' > args.sh
Practice
bash args.sh arg1 arg2
#Passing arguments
echo $1 $2 $3
#script name
echo $0
# = another form to pass, here index 0 is the first argument
args=("$@")
echo ${args[0]} ${args[1]} ${args[2]}
#the same result
echo $@
#number of arguments
echo $#
Practice
## using read
echo "Enter name:"
read name
echo "Entered name: $name"
## if multiple variables
echo "Enter names:"
read name1 name2 name3
echo "Names: $name1, $name2, $name3"
## to use the same line
read -p "username:" user_var
echo "username: $user_var"
#to hide the typed input (silent mode) use -sp
read -p "username:" user_var
read -sp "password:" pass_var
echo
echo "username: $user_var"
echo "password: $pass_var"
Practice
This is how arrays work in bash:
#read the inputs into an array with read -a
read -a names
echo "names: ${names[0]}, ${names[1]}"
#if no variable name follows read, the input goes to REPLY
read
echo "Name: $REPLY"
5. Conditionals
An exit status of 0 means success (true); anything else means failure (false). An if can call a command directly, with no extra syntax.
if [command]
then
[if-statement] #if it is true or 0
elif [command]
then
[elif-statement] #optional
else
[else-statement] #optional
fi
Practice
- If a pipeline is used with set -o pipefail, a nonzero exit from any command in it makes the condition false, so the if skips to the next block.
#!/bin/bash
if grep "pattern" some_file.txt > /dev/null
then
# commands to run if "pattern" is found
echo "found 'pattern' in 'some_file.txt'"
fi
# We can also negate our program's exit status with !:
if ! grep "pattern" some_file.txt > /dev/null
then echo "did not find 'pattern' in 'some_file.txt'"
fi
6. Expressions
- An expression can be a string comparison, a numeric comparison, a file operator or a logical operator, and it is written as [ expression ]:
A. String Comparisons:
- = compare if two strings are equal; if [ "$a" = "$b" ]
- != compare if two strings are not equal; if [ "$a" != "$b" ]
- -n evaluate if string length is > 0
- -z evaluate if string length is = 0
- == is equal to; if [ "$a" == "$b" ]
B. Expressions for strings:
- < is less than, in ASCII alphabetical order; if [[ "$a" < "$b" ]]
- > is greater than, in ASCII alphabetical order; if [[ "$a" > "$b" ]]
Examples:
[ s1 = s2 ] # (true if s1 same as s2, else false)
[ s1 != s2 ] # (true if s1 not same as s2, else false)
[ s1 ] # (true if s1 is not empty, else false)
[ -n s1 ] # (true if s1 has a length greater than 0, else false)
[ -z s2 ] # (true if s2 has a length of 0, otherwise false)
C. Expressions for numbers
- -eq compare if two numbers are equal
- -ge compare if one number is >= to a number, (("$a" >= "$b"))
- -le compare if one number is <= to a number, (("$a" <= "$b"))
- -ne compare if two numbers are not equal
- -gt compare if one number is > another number, (("$a" > "$b"))
- -lt compare if one number is < another number, (("$a" < "$b"))
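The two notations for numeric comparison can be mixed freely in bash. This is a small sketch, assuming a helper function named compare that is not part of the notes.

```shell
#!/bin/bash
# Sketch: the same kind of numeric test written with [ ] and with (( )).
# compare is a hypothetical helper used only for this illustration.
compare() {
    local a=$1 b=$2
    if [ "$a" -lt "$b" ]; then      # test-style operator
        echo "less"
    elif (( a == b )); then         # arithmetic evaluation
        echo "equal"
    else
        echo "greater"
    fi
}

compare 3 10
compare 7 7
compare 10 3
```

The numeric operators matter here: compared as strings with < inside [[ ]], "10" sorts before "3".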
Practice
#Examples:
[ n1 -eq n2 ] #(true if n1 same as n2, else false)
[ n1 -ge n2 ] #(true if n1 greater than or equal to n2, else false)
[ n1 -le n2 ] #(true if n1 less than or equal to n2, else false)
[ n1 -ne n2 ] #(true if n1 is not same as n2, else false)
[ n1 -gt n2 ] #(true if n1 greater than n2, else false)
[ n1 -lt n2 ] #(true if n1 less than n2, else false)
echo -e "Enter the name of the file: \c"
read file_name
if [ -e "$file_name" ] #exists?
then
echo "$file_name found"
else
echo "$file_name not found"
fi
D. Practice
echo -e "Enter the name of the file: \c"
read file_name
if [ -f "$file_name" ]
then
if [ -w "$file_name" ]
then
echo "Type some text data, to quit press control + d"
cat >> "$file_name"
else
echo "$file_name is not writable"
fi
else
echo "$file_name does not exist"
fi
7. Basic operations in Bash
Operations: + - * / % (remainder after division). With expr, * must be escaped as \*
num1=20
num2=5 #example value
echo $((num1+ num2))
echo $(expr $num1 + $num2)
echo $(($num1 + $num2))
# working with decimals (bc or awk)
echo "1.5+2.3" | bc
echo "scale=2; 20.5/5" | bc
echo "scale=2; sqrt($num1)" | bc -l #-l loads the library with more mathematical functions
A. Basic operations
- Tips: for floating-point numbers -> bc (basic calculator)
test command
- Exits with either 0 or 1, supports standard comparison operators (independently)
test "ATG" = "ATG" ; echo "$?"
test "ATG" = "atg" ; echo "$?"
test 3 -lt 1; echo "$?"
test 3 -le 3; echo "$?"
Logical operators
- -d dir; -b a block special file; -c character file
- -s non-empty file; -f file; -e file exists; -h link
- -r readable?; -w writable?; -x executable?
test -d some_directory ; echo $? # is this a directory?
test -f some_file.txt ; echo $? # is this a file?
test -r some_file.txt ; echo $? # is this file readable?
#test+if
if test -f some_file.txt
then [...]
fi
if [ -f some_file.txt ]
then [...]
fi
Operators
-a AND; -o OR; ! negation; () group
age=25
if [ "$age" -gt 18 ] && [ "$age" -lt 30 ]
then
echo "valid age"
else
echo "age not valid"
fi
### Alternatives
## [ "$age" -gt 18 -a "$age" -lt 30 ]
## [[ "$age" -gt 18 && "$age" -lt 30 ]]
Practice
- Tip: Leave a space after [ and before ]
#!/bin/bash
set -e
set -u
set -o pipefail
if [ "$#" -ne 1 -o ! -r "$1" ]
then echo "usage: script.sh file_in.txt"
exit 1
fi
B. Case patterns
LANG=C #set the locale so that ranges such as [a-z] and [A-Z] are case sensitive
echo -e "Enter some xxxx: \c"
read value
case $value in
pattern1) #[a-z] [A-Z] [0-9] ? (special), * unknown
statement ;; #echo "this is the $value"
pattern2)
statement ;;
*)
statement ;;
esac
More case patterns
#!/usr/bin/bash
echo -e "Enter a character: \c"
read value
#environment variable to set the language
LANG=C
case $value in
[a-z] )
echo "lower case" ;;
[A-Z] )
echo "upper case" ;;
[0-9] )
echo "digit" ;;
? )
echo "special" ;;
* )
echo "unknown input" ;;
esac
C. Arrays
A set of elements
#!/usr/bin/bash
os=("ubuntu" "windows" "kali") #elements are separated by spaces, not commas
os[3]="mac" #it will add mac to the array
unset "os[2]" #remove the third one
# gaps in the array are possible
echo "${os[@]}" #all elements printed
echo "${os[1]}" #to print windows (second element)
echo "${!os[@]}" #prints the indices (0 1 3)
echo "${#os[@]}" #prints the length of the array
D. Loops
It's better to have a list of files. Tip: avoid spaces, tabs, newlines, or special characters (*) in file names
echo ${sample_files[@]}
echo ${sample_names[@]}
#text with 3 columns, and the third column is filenames
sample_files=($(cut -f 3 ../file.txt))
echo ${sample_files[@]}
#then use basename to remove extension
basename -s .fastq seqs/ABC001.fastq
E. While loops
n=1
while [ $n -le 10 ]
#other option (( n <= 10 ))
do
echo "$n"
n=$((n+1))
#other option (( ++n ))
done
- Tip: force a pause
sleep 1 #pause for 1 second
gnome-terminal & #open a terminal
xterm & #open a terminal
Practice
#!/bin/bash
set -e
set -u
set -o pipefail
# specify the input samples file, where the third column is the path to each sample FASTQ file
sample_info=samples.txt
# create a Bash array from the third column of $sample_info
sample_files=($(cut -f 3 "$sample_info"))
for fastq_file in "${sample_files[@]}"
do
# strip .fastq from each file, and add suffix "-stats.txt" to create an output filename
results_file="$(basename "$fastq_file" .fastq)-stats.txt"
# run fastq_stat on a file, writing results to the filename we've created above
fastq_stat "$fastq_file" > "stats/$results_file"
done
Practice
#!/bin/bash
set -e
set -u
set -o pipefail
# specify the input samples file, where the third column is the path to each sample FASTQ file
sample_info=samples.txt
# our reference
reference=zmays_AGPv3.20.fa
# create a Bash array from the first column, which are
# sample names. Because there are duplicate sample names
# (one for each read pair), we call uniq
sample_names=($(cut -f 1 "$sample_info" | uniq))
for sample in "${sample_names[@]}"
do
# create an output file from the sample name
results_file="${sample}.sam"
bwa mem "$reference" "${sample}_R1.fastq" "${sample}_R2.fastq" > "$results_file"
done
F. While read
while read p
do
echo $p
done < file.sh
#another option
cat file.sh | while read p
do
echo $p
done
#another option for files with special characters
while IFS= read -r p
do
echo "$p"
done < file.sh
G. Until for loop
#Runs as long as the condition is false
n=1
until [ $n -ge 10 ]
#also possible with until (( n >= 10 ))
do
echo $n
n=$(( n+1 ))
done
## For loop
for variable in 1 2 3 4 5 #list of values
for variable in {1..10..2} #{start..end..increment}
for variable in file1 file2 file4
for OUTPUT in $(Linux-command-here)
for (( EXP1; EXP2; EXP3 ))
for (( i=0; i<5; i++ ))
for command in ls pwd date
Until for loop
for item in * #all inside the folder
do
if [ -d "$item" ]
then
echo "$item"
fi
done
# Another example
#!/bin/bash
for (( i=1 ; i<=10 ; i++ ))
do
if [ $i -eq 3 -o $i -eq 6 ]
then
continue #skip 3 and 6
fi
echo "$i"
done
H. Break and continue
break exits the loop prematurely; continue skips to the next iteration
for (( i=1 ; i<=10 ; i++ ))
do
if [ $i -gt 5 ]
then
break #continue is also possible (to skip exceptions)
fi
echo "$i"
done
8. Functions
function name(){
Commands
}
#OR
name () {
Commands
}
#Example
function Hello(){
echo "Hello"
}
Practice
usage(){
echo "you need to provide an argument : "
echo "usage : $0 file_name"
}
is_file_exist() {
local file="$1"
[[ -f "$file" ]] && return 0 || return 1
}
[[ $# -eq 0 ]] && { usage; exit 1; }
if is_file_exist "$1"
then
echo "File found"
else
echo "File not found"
fi
Local variables
function printx(){
local name=$1 #it will use the variable only for this function
echo "the name is $name"
}
Read-only
it can't be overwritten
var=31
readonly var
## for functions
readonly -f function_name
Signals and traps
Any signal can be trapped except SIGKILL and SIGSTOP
trap "echo Exit signal is detected" SIGINT # or 2
echo "pid is $$"
while (( COUNT < 10 ))
do
sleep 10
(( COUNT ++ ))
done
exit 0
###############
kill -9 pid
trap "echo Exit command is detected" 0 #if it received a 0 value
echo "Hello world"
exit 0
###example
trap "rm -f $file && echo file deleted; exit" 0 2 15
Debugging a script
bash -x script.sh
###in the shebang
#!/bin/bash -x
###or only for part of the script
set -x
# commands to trace, e.g. from line 1 to 10
set +x
#### Globbing
#!/bin/bash
set -e
set -u
set -o pipefail
for file in *.fastq
do
echo "$file: " $(bioawk -c fastx 'END {print NR}' "$file")
done
Find
ls *.fq | process_fq #may contain errors
process_fq *.fq #it has some limits
##### Find - find path expressions
find / | head
find . -maxdepth 1 #only this directory
touch path/absolute/xbc{A,B,C}_R{1,2}.fastq
find path
##### Find expressions
find path/to/folder -name "filex*fastq" -type f
# d for directories
# f for files
find path/to/folder \( -name "filex*fastq" -or -name "otrox*fastq" \) -type f
#another alternative
find path/to/folder -type f "!" -name "filex*fastq" -and "!" -name "*temp"
## to remove these files
find path/to/folder -name "*-temp.fastq" -exec rm {} \;
XARGS
xargs reads data from stdin and executes the command given as its argument one or more times, based on that input. If no command is given, it runs echo. It suits programs that take multiple arguments (rm, touch, mkdir).
find path/to/folder -name "*-temp.fastq" | xargs -n 1 rm
#other example
find zmays-snps/data/seqs -name "*-temp.fastq" > files-to-delete.txt
cat files-to-delete.txt
cat files-to-delete.txt | xargs rm
######
find . -name "*.fastq" | xargs basename -s ".fastq" | xargs -I{} fastq_stat --in {}.fastq --out ../summaries/{}.txt
## Notice that the BSD xargs will only replace up to five instances of the string specified
## by -I by default, unless more are set with -R.
XARGS
We can launch multiple background processes with Bash for loops by adding the ampersand (&) at the end of the command (e.g., for filename in *.fastq; do program "$filename" & done). xargs can also run commands simultaneously with the -P option
find . -name "*.fastq" | xargs basename -s ".fastq" | xargs -P 6 -I{} fastq_stat --in {}.fastq --out ../summaries/{}.txt
GNU parallel
find . -name "*.fastq" | parallel --max-procs=6 'program {/.} > {/.}-out.txt'
Make a list of files
ls -1 | sed 's/\.fastq\.gz//g' > ../PRJEB41550.txt
9. Bioinformatics software installation
bwa-mem2 and fastp
Bwa-mem2
#sudo root - be extremely careful, control + D to exit
sudo su
sudo apt-get update
# answer yes
sudo apt-get upgrade -y
sudo apt-get dist-upgrade -y
sudo apt-get autoremove -y
sudo apt-get autoclean -y
sudo apt-get install unzip
## Samtools
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install -y libncurses-dev libbz2-dev liblzma-dev
sudo apt-get install libssl-dev
#### Option 2
sudo apt-get install libcurl4-openssl-dev
sudo apt-get install zlib1g-dev
#Steps: Install
sudo ...
export PATH=$PATH:/home/dev105/devapps/STAR/source
Installation
#Download samtools
wget samtools bcftools htslib #LINK
tar xvjf samtoolsxxxx
cd samtoolsxxx
./configure
make
sudo make install
##Export to the path
export PATH="$PATH:/home/labatl/devapps/bwa-mem2"
export PATH="$PATH:/home/labatl/devapps/htslib/htslib-1.20"
export PATH="$PATH:/home/labatl/devapps/htslib/bcftools-1.20"
export PATH="$PATH:/home/labatl/devapps/htslib/samtools-1.20"
# bwa-mem2: https://github.com/bwa-mem2/bwa-mem2
export PATH="$PATH:/home/labatl/devapps"
#Add it to
nano ~/.bashrc
## Source it
source ~/.bashrc
#Another option
nano ~/.bashrc
##Write at the end:
export PATH="$PATH:filepath"
SED
Allows filtering and replacing text efficiently in a pipeline
sed 's/hello/world/' input.txt > output.txt
cat input.txt | sed 's/hello/world/' - > output.txt
- -i: overwrite the same file
- -n: suppress output and print only the lines specified with p: sed -n '45p' file.txt or sed -n '1p ; $p' one.txt two.txt three.txt
- -s: treat each input file separately instead of as one continuous stream (GNU sed)
- -e or -f: give the script explicitly: sed --expression='s/hello/world/' input.txt > output.txt or sed --file=myscript.sed input.txt > output.txt
- -z: separate lines by NUL characters, so a set of lines can be handled together
sed -n 1,2p Homo_sapiens.GRCh38.87_annotation.table
sed -n "1~2p" file.txt #-n suppresses printing; extract only odd-numbered lines (GNU sed)
zcat $DATA_DIR/$FASTQ | sed -n '2~4p;' | head | tr -d '\n' #every 4th line starting at line 2 (the sequences)
sed -n '2000,2005p' #range of lines
sed "s/[0-9][0-9]*\.[[:space:]]//g" file.txt # g global
Manual: https://www.gnu.org/software/sed/manual/sed.html
Extended REGEX: http://austingroupbugs.net/view.php?id=528
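A short sketch of the substitution and line-selection forms listed above, run on made-up text so nothing on disk is touched. Only portable sed syntax is used here; the first~step addresses are a GNU extension.

```shell
#!/bin/bash
# Sketch: substitute text and select lines with sed (sample text is made up).
sample=$(printf 'hello one\nhello two\nhello three\nhello four\n')

replaced_first=$(printf '%s\n' "$sample" | sed 's/hello/world/' | head -n 1)
second_line=$(printf '%s\n' "$sample" | sed -n '2p')        # print only line 2
first_and_last=$(printf '%s\n' "$sample" | sed -n '1p;$p')  # first and last line

echo "$replaced_first"
echo "$second_line"
echo "$first_and_last"
```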
Loops
- for is used for iterating over a fixed number of items (may be unknown at time of coding).
- while is used for iterating until a certain condition is met
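The difference can be seen side by side in a small sketch: for walks a known list, while repeats until its condition stops holding. The numbers are arbitrary examples.

```shell
#!/bin/bash
# for: iterate over a fixed list of items
total=0
for n in 1 2 3 4; do
    total=$((total + n))
done

# while: iterate until a condition is no longer met
steps=0
value=100
while [ "$value" -gt 1 ]; do
    value=$((value / 2))     # integer division
    steps=$((steps + 1))
done

echo "sum=$total halvings=$steps"
```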
#### Loops
##### Examples of for
for file in ../data/${@}
do
wc ${file} >> ../data/wordcount.txt
done
for file in ../data/${@}
do
output=$(basename ${file} .fastq)-wordcount.txt
wc ${file} >> ../results/${output}
done
##### Another for loop
for i in *; do echo "item: $i"
done
##### Examples of while
while true; do echo 'hello'; done
#### Conditionals
if [ "foo" = "foo" ]; then
echo "equal"
else
echo "not equal"
fi
##### Iterating with xargs -I one by one -0 avoid blanks
ls *.fasta | xargs -I% sh -c 'head %; echo "\n---\n"'
#sh -c "to call shell instead of bash"
#https://www.baeldung.com/linux/xargs-multiple-arguments
my_variable="ggplot2"
echo "My favorite R package is ${my_variable}"
my_var="chr1"
echo "${my_var}_1.vcf.gz"
#script
#STDIN -> STDOUT+STDERR
#I/O redirections: |,<,>
ps -aef #show a list of all processes running
bash sam_run my_file.bam out_stats.txt
#!/bin/bash
samtools stats $1 > $2
#!/bin/bash
samtools stats $2 > $1 #output first
#to hack SIGPIPE error in Jupyter
cleanup () {
:
}
trap "cleanup" SIGPIPE
#safety
set -u #unset variables as an error and exit
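The effect of the safety options can be observed without changing the current shell by running each case in a subshell. This is a sketch; the variable name not_defined_anywhere is made up for the demonstration.

```shell
#!/bin/bash
# Sketch: observe set -o pipefail and set -u inside subshells.

# Without pipefail, a pipeline reports the status of its last command (true).
status_default=0
( false | true ) || status_default=$?

# With pipefail, a failure anywhere in the pipeline is reported.
status_pipefail=0
( set -o pipefail; false | true ) || status_pipefail=$?

# With set -u, expanding an unset variable is an error and the subshell stops.
status_unset=0
( set -u; unset not_defined_anywhere; echo "$not_defined_anywhere" ) 2>/dev/null || status_unset=$?

echo "default=$status_default pipefail=$status_pipefail unset=$status_unset"
```

The `|| status=$?` form keeps the sketch usable even when the surrounding script itself runs under set -e.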
grep -v "^>" ../data/tb1.fasta | grep --color -i "[^ATCG]"
CCCCAAAGACGGACCAATCCAGCAGCTTCTACTGCTAYCCATGCTCCCCTCCCTTCGCCGCCGCCGACGC
echo "Start!"
for fname in *_R1.fastq.gz
do
echo "${fname}"
while [ $(jobs | wc -l) -ge "$maxproc" ]
do
sleep 1
done
base=${fname%_R1*}
echo starting new job with ongoing=$(jobs | wc -l)
bwa-0.7.5a/bwa mem -t 4 human_g1k_v37.fasta "${base}_R1.fastq.gz" "${base}_R2.fastq.gz" >"$destdir/${base}_R1_R2.sam" &
done
#if to check for file existence
if [ -f hello.txt ]; then
cat hello.txt
else
echo "no such file"
fi
if [ ! -f "data/file" ]; then
wget url -O data/file
fi
#test for conditional evaluation
test -f file or [ -f file ]
#Loops
for i in SRRR*;
do
echo ${i} #inserted in a line with other characters
done
for file in copy-SRR534005*;
do echo "$file" "${file//SRR534005/chicken}"; #preview the new name
mv "$file" "${file//SRR534005/chicken}";
done
for FILE in *ipynb; do
echo "$FILE"
done
for ((i=0; i<5; i++)); do
echo $i
done
#inside test: -a means AND (like &&), -o means OR (like ||)
#=~ matches regular expression patterns
for FILE in *; do
if [[ "${FILE}" =~ ^Bash.*sh ]]; then
echo $FILE
fi
done
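The =~ operator also captures groups, which is handy for pulling a sample name out of a file name. This is a sketch, assuming a naming scheme like SRR534005_R1.fastq and a hypothetical helper sample_of; bash stores the captured groups in the BASH_REMATCH array.

```shell
#!/bin/bash
# Sketch: regex matching with [[ =~ ]] and capture groups (bash only).
# The pattern and the sample_of helper are assumptions for illustration.
pattern='^(SRR[0-9]+)_R([12])\.fastq$'

sample_of() {
    local name=$1
    if [[ "$name" =~ $pattern ]]; then   # the regex must stay unquoted
        echo "${BASH_REMATCH[1]} read ${BASH_REMATCH[2]}"
    else
        echo "no match"
    fi
}

sample_of SRR534005_R1.fastq
sample_of notes.txt
```

Keeping the regex in a variable avoids quoting surprises, since a quoted right-hand side is matched as a literal string.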
COUNTER=10
while [ $COUNTER -gt 0 ]; do
echo $COUNTER
COUNTER=$(($COUNTER - 1))
done
-lt #less than
-gt #greater than
== #equality
!= #inequality
#braces in loops
for NUM in {000..005}; do
echo mkdir EXPT-${NUM}
done
echo foo.{c,cpp,h,hp}
#loop first 10 records
LINE_NUM=0
while read LINE
do
if [[ $LINE_NUM -lt 10 ]]; then
echo $LINE
(( LINE_NUM++ ))
fi
done < genes.txt
#command substitution $(command); basename strips the path, dirname keeps only the path
#remove the file extension: Y=${x%.*}
for FILE in ~/docs/*
do
DIR_NAME=$(dirname "$FILE")
FILE_NAME=$(basename "$FILE")
NAME=${FILE_NAME%.*}
NEW_DIR=$DIR_NAME/$NAME
NEW_FILE=${NAME}-copy.txt
mkdir -p "$NEW_DIR"
head -3 "$FILE" | tail -2 > "$NEW_DIR/$NEW_FILE"
done
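The pieces used in the loop above can be inspected one at a time. This is a minimal sketch on a hypothetical path (no file needs to exist) showing what dirname, basename and the % and %% expansions return.

```shell
#!/bin/bash
# Sketch: split a path into parts; the path is hypothetical.
path="/home/user/docs/sample_R1.fastq.gz"

dir_name=$(dirname "$path")      # everything before the last /
file_name=$(basename "$path")    # everything after the last /
no_ext=${file_name%.*}           # remove the shortest suffix matching .*
no_all_ext=${file_name%%.*}      # remove the longest suffix matching .*
sample=${file_name%_R1*}         # remove from _R1 to the end

echo "$dir_name | $file_name | $no_ext | $no_all_ext | $sample"
```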
for FILE in $(ls $DATA_DIR/*_MA_J_S20_*gz)
do
FIRST=$(zcat $FILE 2>/dev/null | head -1)
echo $(basename $FILE)
echo $FIRST
echo
done
#script
#!/path/to/shell or bash path/file
#!/bin/bash or which bash
chmod +x /path/to/script
cat > echo.sh << 'EOF'
#!/bin/bash
# "$#" gives you the number of arguments from the command line
echo $#
# brace expansion {0..$#} does not expand variables, so use seq
for ARG in $(seq 0 $#); do
echo $ARG
done
EOF
bash echo.sh
for file in *.fastq
do
echo ${file}
done
#script with input specifications
bash wordcount.sh *R1.fastq
for file in ${@}
do
wc ${file}
done
#structure
bin, data, results, manuscript
cat > count_lines.sh << 'EOF'
#!/bin/bash
for FILE in $(ls)
do
wc -l "${FILE}"
done
EOF
AWK
#tutorial
cat file.txt | awk -F '\t' '$1=="Mt"' | head -3
cat file.txt | awk -F '\t' '$1=="2" {print $3}' | sort | uniq -c
cat docs/echo.txt | awk 'BEGIN {FS = ","} ; $5 == "M"' > docs/male.txt
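The awk one-liners above follow a pattern { action } shape: rows are selected by a condition on a column and, optionally, only some columns are printed. This sketch runs on a made-up tab-separated table (chromosome, feature, gene), which is an assumption and not data from the notes.

```shell
#!/bin/bash
# Sketch: filter rows and count a column with awk (the table is made up).
table=$(printf 'Mt\tgene\tatp6\n2\tgene\tabc1\n2\texon\tabc1\n2\texon\tabc2\n')

# rows whose first column is Mt (no action given: print the whole row)
mt_rows=$(printf '%s\n' "$table" | awk -F '\t' '$1 == "Mt"' | wc -l | tr -d ' ')

# feature counts on chromosome 2: select rows, print column 2, then count
chr2_features=$(printf '%s\n' "$table" \
  | awk -F '\t' '$1 == "2" {print $2}' \
  | sort | uniq -c \
  | awk '{print $2 "=" $1}' | paste -sd ' ' -)

echo "Mt rows: $mt_rows"
echo "chr2 features: $chr2_features"
```

The action has to sit inside the same quoted awk program as the condition; placed outside the quotes, the shell would pass {print as a separate argument.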
Additional sources
Seq
seq generates a range of numbers:
seq 3 # 1 2 3
seq 2 5 # 2 3 4 5
seq 5 2 9 # 5 7 9 (start, increment, end)
BASH tutorial - Notes
https://xie186.github.io/Novice2Expert4Bioinformatics/r-introduction.html https://gist.github.com/nathanhaigh/3521724
More information https://github.com/raynamharris/Shell_Intro_for_Transcriptomics/tree/master https://practicalcomputing.org/ https://explainshell.com/ https://data-skills.github.io/unix-and-bash/01-unix-git-intro/index.html
[https://wikis.utexas.edu/display/bioiteam/Scott’s+list+of+linux+one-liners] [ https://www.tutorialspoint.com/using-gzip-and-gunzip-in-linux ] https://www.thegeekstuff.com/2012/12/linux-tr-command/ https://www.gnu.org/software/gawk/manual/html_node/Comparison-Operators.html
https://genome.ucsc.edu/FAQ/FAQformat#format1
https://gist.github.com/DannyArends/04d87f5590090dfe0dc6b42e5e1bbe15 RNA Sequencing - Setup and Prerequisites https://www.youtube.com/watch?v=PlqDQBl22DI&t=14s
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE19080
My index for genome star alignment https://labshare.cshl.edu/shares/gingeraslab/www-data/dobin/STAR/STARgenomes/Human/ https://data.broadinstitute.org/Trinity/ https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
https://ubuntu.com/tutorials/command-line-for-beginners#4-creating-folders-and-files
http://linuxcommand.org/tlcl.php
https://xie186.github.io/Novice2Expert4Bioinformatics/install-bioinformatics-software-in-linux.html https://hackr.io/blog/basic-linux-commands https://youtu.be/7-G_dQr7B44 Install MUSCLE using binaries: https://youtu.be/UJx34KVLIgI Install PhyML using binaries: https://github.com/vappiah/bioinfoscr… Gene extractor script is available here https://itol.embl.de/ iTOL browser https://youtu.be/2tMQYi_12IQ Python codes explained Installing muscle https://www.drive5.com/muscle/downloads.htm
https://jlsteenwyk.com/ClipKIT/advanced/index.html https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001007
From https://cyverse-leptin-rna-seq-lesson-dev.readthedocs-hosted.com/en/latest/section-3.html
https://cyverse-leptin-rna-seq-lesson-dev.readthedocs-hosted.com/en/latest/section-3.html https://www.freecodecamp.org/espanol/news/grep-command-tutorial-how-to-search-for-a-file-in-linux-and-unix-with-recursive-find/ https://tldp.org/LDP/Bash-Beginners-Guide/html/sect_09_02.html https://datacarpentry.org/wrangling-genomics/ https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html#creating-an-environment-with-commands https://sanbomics.com/2022/01/08/complete-rnaseq-alignment-guide-from-fastq-to-count-table/ https://cyverse-leptin-rna-seq-lesson-dev.readthedocs-hosted.com/en/latest/section-8.html
https://gist.github.com/darencard/e1933e544c9c96234e86d8cbccc709e0
https://www.thegeekstuff.com/2010/03/awk-arrays-explained-with-5-practical-examples/ https://notebook.community/crazyhottommy/scripts-general-use/Shell/Awk_anotates_vcf_with_bed
Another blog: https://biowize.wordpress.com/2015/03/26/the-fastest-darn-fastq-decoupling-procedure-i-ever-done-seen/
Sources for scripts https://jeroenjanssens.com/dsatcl/ https://missing.csail.mit.edu/2020/course-shell/ https://swcarpentry.github.io/shell-novice/04-pipefilter/index.html https://datascienceatthecommandline.com/2e/chapter-2-getting-started.html?q=stdin#combining-command-line-tools
wget http://kandurilab.org/users/santhilal/courses/UNIX/materials/course_file_repo.zip
Onedrive
wget -qO - https://download.opensuse.org/repositories/home:/npreining:/debian-ubuntu-onedrive/xUbuntu_22.04/Release.key | gpg --dearmor | sudo tee /usr/share/keyrings/obs-onedrive.gpg > /dev/null
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/obs-onedrive.gpg] https://download.opensuse.org/repositories/home:/npreining:/debian-ubuntu-onedrive/xUbuntu_22.04/./" | sudo tee /etc/apt/sources.list.d/onedrive.list