When working with large datasets, only the files with code should be pushed to Github repositories, not the data itself. The raw data inputs or analysis output should either be kept in a local directory that is never committed, or for best practices, they should be stored in Box (download).

Once installed, you can navigate Box folders from Mac Finder or Windows Explorer. Files in Box Drive are also identified by a local path.

First, check that you have access to the imlab-data folder on Box. It should also appear in the Box Drive folder in Finder or Explorer, but you may need to first request access permission from Haky.

Next, create a directory in the data-Github folder, Box/imlab-data/data-Github, with the same name as the Github repository for the analysis. Copy all input files into the data-Github folder and create an output folder.

For example, I made a folder in data-Github, rat-genomic-analysis, and added the genotype data and model. Then I ran PrediXcan and output the results directly to /Users/sabrinami/Box/imlab-data/data-Github/rat-genomic-analysis/Results/PrediXcan

conda activate imlabtools

python3 /Users/sabrinami/Github/MetaXcan/software/Predict.py \
--model_db_path /Users/sabrinami/Box/imlab-data/data-Github/rat-genomic-analysis/sql/Vo_output_db.db  \
--text_genotypes /Users/sabrinami/Box/imlab-data/data-Github/rat-genomic-analysis/output/genotypes/chr1_20.txt.gz  \
--text_sample_ids /Users/sabrinami/Box/imlab-data/data-Github/rat-genomic-analysis/output/genotypes/samples.txt \
--prediction_output /Users/sabrinami/Box/imlab-data/data-Github/rat-genomic-analysis/Results/PrediXcan/Vo__predict.txt \
--prediction_summary_output /Users/sabrinami/Box/imlab-data/data-Github/rat-genomic-analysis/Results/PrediXcan/Vo__summary.txt \
--verbosity 1 \
--throw

here is a semi-automated way to generate and open the data folder in box

Copy pasted it to the beginning of your Rmd file

suppressMessages(library(tidyverse))
suppressMessages(library(glue))
PRE = "/Users/haekyungim/Library/CloudStorage/Box-Box/LargeFiles/imlab-data/data-Github/web-data"
##PRE="/Users/temi/Library/CloudStorage/Box-Box/imlab-data/data-Github/web-data"
## COPY THE DATE AND SLUG fields FROM THE HEADER
SLUG="correlation-between-ptrs-and-rat-height-bmi" ## copy the slug from the header
bDATE='2022-07-07' ## copy the date from the blog's header here
DATA = glue("{PRE}/{bDATE}-{SLUG}")
if(!file.exists(DATA)) system(glue::glue("mkdir {DATA}"))
WORK=DATA

## move data to DATA
#tempodata=("~/Downloads/tempo/gwas_catalog_v1.0.2-associations_e105_r2022-04-07.tsv")
#system(glue::glue("cp {tempodata} {DATA}/"))
system(glue("open {DATA}")) ## this will open the folder 

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The source code is licensed under MIT.

Suggest changes

If you find any mistakes (including typos) or want to suggest changes, please feel free to edit the source file of this page on Github and create a pull request.

Citation

For attribution, please cite this work as

Sabrina Mi (2020). Working with Large Files. ImLab Notes. /post/2020/11/23/working-with-large-files/

BibTeX citation

@misc{
  title = "Working with Large Files",
  author = "Sabrina Mi",
  year = "2020",
  journal = "ImLab Notes",
  note = "/post/2020/11/23/working-with-large-files/"
}