Data Management
Users of the ALS are responsible for meeting their data management obligations to their home institutions and granting agencies. Except as noted below for data stored at NERSC, the ALS does not provide specific resources to manage data that are generated through user experiments. Because the ALS does not have a facility-wide data archiving service or staff to manage the data, the user must generally make arrangements to copy data to their own storage systems or move the data to their home institutions.
Staff from the ALS, ESnet, and Berkeley Lab's IT Division have developed protocols for managing data in real time; these are described below under Data Transfer.
The ALS facility provides infrastructure such as networks and computers at the beamlines located on the ALS experimental floor. These resources may differ at individual beamlines, and users should consult with the beamline staff to understand the capabilities at the beamline that they will be using.
Individual beamlines may have specific resources and data management practices to help users meet their data management needs and obligations. Users should consult beamline staff when formulating data management plans and strategies. The beamline resources do not substitute for the user’s responsibility for their data. Data storage on beamline equipment is only temporary and cannot be relied upon for archival purposes.
The ALS is participating in a data pilot program with the National Energy Research Scientific Computing Center (NERSC), where user data sets may be stored. If users have data at NERSC, then the data management strategy and policies of NERSC must be followed. Users should consult beamline staff to determine if that beamline is storing data at NERSC.
If you have questions or require assistance, please contact the beamline staff or the ALS User Services Group.
Data Transfer
Working at the ALS generates huge amounts of data, and for many years users have had to carry hard drives and USB drives between the ALS and their home institutions in order to acquire and analyze experimental data. To avoid the physical transport of data and to make real-time analysis possible, staff at the ALS, ESnet, and Berkeley Lab’s IT Division have collaborated to implement several best practices that allow the fast and secure transfer of data over the network to a user’s home institution. A case study performed by ESnet demonstrated improved workflow and data export for the x-ray tomography beamline.
Setting Up and Implementing Network Data Transfer
For researchers planning to use network data transfer, the following resources are available for assistance in setting up and implementing the workflow:
- To speak with a beamline scientist who has implemented the tools described below, contact Dula Parkinson.
- To obtain and use the best file transfer tools or equipment, contact hpcshelp@lbl.gov
- To connect your beamline to the Lab’s fast ScienceDMZ network, or to debug networking issues at LBNL, contact lblnet@lbl.gov
- To debug national network issues, or to find contact information for offsite campus or IT groups, contact engage@es.net
Achieving Faster Data Transfer
There are three main ways for users and system administrators to achieve faster data transfer:
1. Use the right file transfer tools
Instead of FTP or scp, use tools that have been designed specifically for high-speed data transfer. We recommend GridFTP or Globus Online. GridFTP is good if you want to automate transfers, but requires significant setup. Globus Online has a graphical user interface and is easy to use. Using a fast transfer tool is the simplest thing you can do to increase data transfer speeds. LBNL extensively uses both of these transfer tools and provides an overview from the 2014 LabTech workshop, with information on how to get additional help.
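As a rough illustration of automating a transfer, the sketch below builds a command line for the Globus command-line interface rather than using the web interface. It assumes the Globus CLI is installed and that you are already logged in; the endpoint IDs, paths, and label are hypothetical placeholders, and the helper function name is ours. The script only prints the command so that it can be reviewed before it is run.

```python
import shlex


def build_globus_transfer_command(source_endpoint, source_path,
                                  dest_endpoint, dest_path,
                                  label="ALS beamline export"):
    """Return the argument list for a recursive `globus transfer` call.

    Endpoints are given as ENDPOINT_ID:PATH pairs, which is the form
    the Globus CLI expects for the source and the destination.
    """
    return [
        "globus", "transfer",
        f"{source_endpoint}:{source_path}",
        f"{dest_endpoint}:{dest_path}",
        "--recursive",          # copy the whole directory tree
        "--label", label,       # human-readable name for the task
    ]


# Hypothetical endpoint IDs and paths; replace with your own.
cmd = build_globus_transfer_command(
    "SOURCE-ENDPOINT-UUID", "/data/scan_001/",
    "DEST-ENDPOINT-UUID", "/archive/scan_001/",
)

# Print the command for review instead of executing it.
print(shlex.join(cmd))
```

Once the printed command looks right, it could be run from a shell or passed to `subprocess.run(cmd, check=True)` in a scheduled script.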
2. Use capable file transfer servers
Data can only be transferred as fast as it can be read from the source disk and written to the destination disk. Most systems aren’t tuned for high-speed data transfer out of the box. Systems that are tuned for high-speed data transfer are called Data Transfer Nodes (DTNs). Beamline 8.3.2 has recently implemented such a DTN based on the reference specification provided by ESnet, which, along with a new network designed by ESnet and LBLnet, has resulted in a more than 10-fold improvement in data transfer speeds.
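The point that a transfer runs only as fast as its slowest stage can be sketched with a quick estimate. The function name and all of the rates below are hypothetical examples, not measurements from any beamline; substitute figures measured on your own systems.

```python
def estimate_transfer(dataset_gb, read_mb_s, network_mb_s, write_mb_s):
    """Estimate the bottleneck rate (MB/s) and the transfer time (hours).

    The sustained rate is limited by the slowest of three stages:
    reading from the source disk, crossing the network, and writing
    to the destination disk.
    """
    bottleneck_mb_s = min(read_mb_s, network_mb_s, write_mb_s)
    seconds = dataset_gb * 1000 / bottleneck_mb_s  # 1 GB = 1000 MB
    return bottleneck_mb_s, seconds / 3600


# Hypothetical example: a 2 TB data set, a single source disk at
# 150 MB/s, a 10 Gb/s network (about 1250 MB/s), and a destination
# array that can write at 400 MB/s.
rate, hours = estimate_transfer(2000, 150, 1250, 400)
print(f"Bottleneck: {rate} MB/s, estimated time: {hours:.1f} hours")
```

In this made-up case the source disk, not the network, sets the pace, which is why a tuned DTN with fast storage matters as much as a fast network link.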
3. Ensure that the end-to-end network isn’t the bottleneck
If you are using fast data transfer tools between two fast data transfer nodes, the final thing to ensure is that the end-to-end network is not impeding the transfer. This becomes even more important over long distances. The need to resend just a small amount of data can dramatically increase transfer times. Unfortunately, this can also be the most complicated area to understand and correct. There are three main areas to consider:
Use capable network switches
For large, long-distance data transfers, packet loss is a significant problem. Network switches (sometimes loosely called hubs) are a notorious cause of retransmitted data. This can happen when several network connections on one side of the switch share a single connection on the other side. In this case, it’s important to have switches with enough memory to hold packets from one connection long enough for the packets from the other connections to move through the switch. LBNL or home institution networking professionals can recommend good switches for your environment and scientific application.
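To get a feel for why a small amount of packet loss hurts long-distance transfers so much, the sketch below uses the widely cited Mathis approximation for a single TCP stream, in which throughput is bounded by roughly MSS / (RTT × sqrt(loss rate)). This is a simplified model with the constant factor taken as 1, not a description of any particular transfer tool, and the segment size, round-trip times, and loss rate are hypothetical example values.

```python
import math


def mathis_throughput_mbps(mss_bytes, rtt_ms, loss_rate):
    """Approximate upper bound on single-stream TCP throughput in Mb/s.

    Simplified Mathis model: rate <= MSS / (RTT * sqrt(p)), where p is
    the fraction of packets lost. The constant factor is taken as 1.
    """
    rtt_s = rtt_ms / 1000.0
    bits_per_second = (mss_bytes * 8) / (rtt_s * math.sqrt(loss_rate))
    return bits_per_second / 1e6


# Hypothetical example: the same 0.01% packet loss on a local path
# (1 ms round trip) and on a cross-country path (80 ms round trip).
for rtt_ms in (1, 80):
    mbps = mathis_throughput_mbps(mss_bytes=1460, rtt_ms=rtt_ms,
                                  loss_rate=0.0001)
    print(f"RTT {rtt_ms:>2} ms: about {mbps:.1f} Mb/s")
```

Under these assumptions, the same loss rate that still allows more than 1,000 Mb/s on the local path limits the 80 ms path to about 15 Mb/s, which is why switches that drop packets matter far more over long distances.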
Avoid firewalls
Firewalls are commonly used to secure networks. Because they generally inspect every packet that flows through them, they can create bottlenecks for big science data transfers. There is a secure alternative to using firewalls, commonly referred to as the ScienceDMZ. It works by establishing a fast, dedicated, but secure path around the firewall. You’ll generally need one at each of the two facilities you are transferring data between. LBNL personnel can help you use the lab’s ScienceDMZ. ESnet personnel may also be able to provide some help implementing a ScienceDMZ at your home institution. See the help contacts above.
Use a “healthy” network path
It is extremely difficult to know which network path your data is taking between LBNL and your home institution, or whether that path is “healthy.” This issue is best left to the networking professionals (see above) once you have confirmed that none of the critical items above is the problem (good data transfer tools and nodes, good switches, and no firewalls). While network debugging is beyond the scope of this brief article, one of the tools ESnet finds indispensable in network path analysis is perfSONAR.
Involve Your Local Experts!
If Network Data Transfer would significantly increase your productivity but you don’t run your data servers yourself, please get your system and network administrators involved in the process.