Here is a useful tip for getting even more value out of QIAGEN CLC Microbial Genomics Module when performing OTU clustering: you can use the latest version of the SILVA OTU database with minimal effort outside of QIAGEN CLC Genomics Workbench, even before that version is released through the Microbial Genomics Module. The SILVA databases are updated more often than the corresponding QIIME versions, which the module's downloader currently relies on. To avoid waiting for QIIME updates, the newest SILVA database can be used with the Create Annotated Sequence List tool after just a bit of reformatting.
SILVA releases are available on the FTP server https://ftp.arb-silva.de/, where each release is stored in a separate folder. Here we focus on the latest release, release_138, and more specifically on the non-redundant database at 99% sequence similarity. If you are interested in another version, please consult the corresponding README file and change the surl and the corresponding turl at the top of the script accordingly. To download the correct files and format them for import into the QIAGEN CLC Genomics Workbench, the following script may be used:
To run this script, you need a standard installation of Python 3. Copy and paste the content above, modify the URLs if necessary, save it to a file and execute it on your system. For example, you may save the file as “get_silva.py”, then open a terminal and navigate to the folder where the script is located. Finally, execute it with:
python3 get_silva.py
Depending on your connection, this script will run for about 5 to 10 minutes. It downloads three files and processes each of them as follows:
- The most recent NCBI Taxonomy: taxdmp.zip. The script loads the taxids, parent ids, ranks and names of the taxonomy into memory.
- Taxonomy mappings from SILVA: taxmap_embl-ebi_ena_ssu_ref_nr99_138.txt.gz. The script uses this file to get the mapping from the SILVA names to taxids in the NCBI taxonomy. Note that the SILVA database is updated biannually, whereas the corresponding NCBI taxonomy is updated daily, so there is not always a one-to-one correspondence between the final taxonomies and the original SILVA taxonomies.
- The SILVA rRNA database: SILVA_138_SSURef_NR99_tax_silva.fasta.gz. The script strips the provided taxonomies from the headers in this file, keeps the sequence names and translates U to T in the sequences.
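The reformatting of the rRNA file can be sketched as follows. This is a minimal illustration and not the script itself: the function names clean_fasta and clean_fasta_file are hypothetical, and it assumes that each SILVA header holds the sequence name, followed by whitespace and the taxonomy string.

```python
import gzip


def clean_fasta(lines):
    """Yield FASTA lines with headers reduced to the sequence name and U replaced by T.

    Assumption: headers look like ">NAME taxonomy;string;here", so everything
    after the first whitespace is the taxonomy that should be dropped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Keep only ">NAME"
            yield line.split(None, 1)[0]
        else:
            # SILVA stores RNA; translate to the DNA alphabet, preserving case
            yield line.replace("U", "T").replace("u", "t")


def clean_fasta_file(src_path, dst_path):
    """Apply clean_fasta to a gzipped FASTA file and write a gzipped result."""
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        for out_line in clean_fasta(src):
            dst.write(out_line + "\n")
```

With the file names from this post, a call such as clean_fasta_file("SILVA_138_SSURef_NR99_tax_silva.fasta.gz", "SILVA_138_SSURef_NR99_tax_silva.fa.gz") would produce the reformatted FASTA file.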
For each taxid assigned to an rRNA sequence, a seven-level lineage is constructed from the allowed ranks. The script writes two files to the folder where it is executed:
- SILVA_138_SSURef_NR99_tax_silva.fa.gz: FASTA file with the rRNA sequences and the sequence names in the header
- SILVA_138_SSURef_NR99_tax_silva.txt: A tab-separated file connecting the name of an rRNA sequence to its taxonomy in QIIME format
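The lineage construction described above can be sketched roughly like this. It is an illustration under stated assumptions, not the script itself: the helper names are hypothetical, the Greengenes-style prefixes (k__, p__, ...) are one common QIIME convention and may differ from what the script writes, and the top level is taken from the NCBI rank "superkingdom" (or "domain" in newer dumps). The parsing relies on the NCBI taxdump layout, where nodes.dmp and names.dmp separate fields with tab, pipe, tab.

```python
# Assumed mapping from NCBI ranks to QIIME-style prefixes (seven levels)
RANK_PREFIX = {
    "superkingdom": "k__", "domain": "k__", "phylum": "p__", "class": "c__",
    "order": "o__", "family": "f__", "genus": "g__", "species": "s__",
}
PREFIX_ORDER = ["k__", "p__", "c__", "o__", "f__", "g__", "s__"]


def parse_dmp_line(line):
    """Split one line of an NCBI taxdump .dmp file into its fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the end-of-record marker
    return line.split("\t|\t")


def load_taxonomy(nodes_lines, names_lines):
    """Return dicts taxid -> parent taxid, taxid -> rank, taxid -> scientific name."""
    parent, rank, name = {}, {}, {}
    for line in nodes_lines:
        fields = parse_dmp_line(line)
        parent[fields[0]] = fields[1]
        rank[fields[0]] = fields[2]
    for line in names_lines:
        fields = parse_dmp_line(line)
        if fields[3] == "scientific name":
            name[fields[0]] = fields[1]
    return parent, rank, name


def qiime_lineage(taxid, parent, rank, name):
    """Walk from taxid up to the root and collect the seven allowed ranks."""
    levels = {}
    seen = set()  # guards against the root, which is its own parent
    while taxid in parent and taxid not in seen:
        seen.add(taxid)
        prefix = RANK_PREFIX.get(rank.get(taxid))
        if prefix and prefix not in levels:
            levels[prefix] = name.get(taxid, "")
        taxid = parent[taxid]
    # Ranks missing from the lineage are written as an empty level
    return "; ".join(p + levels.get(p, "") for p in PREFIX_ORDER)


def mapping_line(seq_name, lineage):
    """One row of the tab-separated taxonomy file: sequence name, tab, lineage."""
    return f"{seq_name}\t{lineage}"
```

Levels that the NCBI lineage does not provide are left empty in this sketch, which keeps every row at seven levels.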
These two files can now be used with the Create Annotated Sequence List tool:
- Import the SILVA_138_SSURef_NR99_tax_silva.fa.gz file using a standard import, or drag and drop the file into the CLC Genomics Workbench
- Run the Create Annotated Sequence List tool on the resulting CLC file in the Workbench and click “Next”
- Select SILVA_138_SSURef_NR99_tax_silva.txt as the taxonomy file
- Set the similarity percentage to 99% (if you have selected the NR99 version of SILVA, otherwise this should be adjusted)
- Click “Next” and, in the “Select input file and map columns to attributes” step, set Separator to “Tab” under Parsing
- Click “Next” and “Finish”
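If the tool reports sequences without a taxonomy, a quick consistency check of the two files outside the Workbench may help. The sketch below is only a suggestion; check_mapping is a hypothetical helper that assumes the taxonomy file has exactly two tab-separated columns (sequence name and lineage) and that each FASTA header starts with the sequence name.

```python
def check_mapping(fasta_lines, mapping_lines):
    """Compare FASTA headers with the tab-separated taxonomy file.

    Returns a tuple (missing, malformed):
      missing   - sorted sequence names present in the FASTA but not in the mapping
      malformed - 1-based line numbers of mapping rows that are not "name<TAB>lineage"
    """
    fasta_names = set()
    for line in fasta_lines:
        if line.startswith(">") and line[1:].strip():
            # The sequence name is the first whitespace-delimited token of the header
            fasta_names.add(line[1:].split()[0])

    mapped, malformed = set(), []
    for number, line in enumerate(mapping_lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2 or not fields[0]:
            malformed.append(number)
            continue
        mapped.add(fields[0])

    return sorted(fasta_names - mapped), malformed
```

To run it on the real files, the gzipped FASTA can be opened with gzip.open(path, "rt") and the taxonomy file with open(path); both file objects can be passed directly as the line iterables.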
Now you have version 138 of the SILVA database available for OTU clustering. Quick and easy, right?
For questions about this or other tips, tricks or functionalities related to QIAGEN CLC Microbial Genomics Module or QIAGEN CLC Genomics Workbench, contact us at email@example.com.