Hey Sebastian-

It may help to consider the other pieces you will need aside from compute nodes, such as nodes for proxies and databases, networking gear (switches, cables), and so on. http://usegalaxy.org/production has some details, and there are high-level pieces explained at http://wiki.g2.bx.psu.edu/Events/GDC2010?action=AttachFile&do=get&target=GDC2010_building_scalable.pdf

You should also talk to your institution's IT folks about power requirements, how those costs are passed on, off-site backup storage (though it sounds like you're counting on RAID 5/6), etc. It would also help if other folks could share their experiences benchmarking their own systems, along with the tools they've been using. The Galaxy Czars conference call could help here - you could bring this up at the next meeting.

I've answered inline, but in general I think the main bottleneck for your planned architecture will be disk I/O. The next bottleneck may be the network - if your disk farm sits behind a 1 Gbps (125 MB/s) connection, then it doesn't matter whether the disks can write 400+ MB/s. (Nate also covered this in his presentation.) You may want to consider InfiniBand over Ethernet - I think the Galaxy Czars call would be really helpful in this respect.
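To put rough numbers on that network-versus-disk comparison, here's a quick back-of-the-envelope sketch in Python. The 400 MB/s figure is just the example from above, and the link speeds are common options - substitute whatever your storage and network vendors actually quote:

def gbps_to_mb_per_s(gbps):
    # 1 gigabit/s = 1000 megabits/s = 125 megabytes/s, ignoring protocol overhead
    return gbps * 1000.0 / 8

disk_array_write_mbs = 400.0          # assumed aggregate write speed of the disk farm
for link_gbps in (1, 10, 40):         # e.g. GigE, 10 GigE, and a faster interconnect
    net_mbs = gbps_to_mb_per_s(link_gbps)
    bound = "network" if net_mbs < disk_array_write_mbs else "disk"
    print("%2d Gbps link: %6.1f MB/s over the wire vs. %.0f MB/s to disk -> %s-bound"
          % (link_gbps, net_mbs, disk_array_write_mbs, bound))

If the network number comes out smaller, faster disks won't buy you anything until the interconnect is upgraded.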
1. Using the described bioinformatics software: where are the potential system bottlenecks? (connections between CPUs, RAM, HDDs)
One way to get a better idea is to start with existing resources, create a sample workflow or two, and measure performance. Again, the Galaxy Czars call could be a good bet.
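As a rough sketch of what that measurement could look like (untested, and the bwa command line is only a placeholder - use whichever tool and data you actually plan to run), the snippet below times a child process and compares CPU time to wall-clock time. A big gap usually means the job spent most of its time waiting on I/O rather than computing:

import resource
import subprocess
import time

# Placeholder command - substitute a representative tool and real input data.
cmd = ["bwa", "aln", "reference.fa", "sample.fastq"]

start = time.time()
with open("sample.sai", "wb") as out:
    subprocess.call(cmd, stdout=out)
wall = time.time() - start

usage = resource.getrusage(resource.RUSAGE_CHILDREN)
cpu = usage.ru_utime + usage.ru_stime      # user + system CPU time of the child

print("wall time: %.1f s, CPU time: %.1f s (%.0f%% utilization)"
      % (wall, cpu, 100.0 * cpu / wall))
# Utilization well below 100% of one core suggests the job spent much of
# its time waiting on disk or network I/O rather than computing.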
2. What is the expected ratio of integer to floating-point calculations that will load the CPU cores?
This also depends on the tools being used. It would matter more if your architecture used specialized hardware (such as GPUs or FPGAs), but it should be a secondary concern.
3. Regarding the architectural differences (strengths, weaknesses): would an AMD or an Intel system be more suitable?
I really can't say whether one processor line is more suitable than the other, but I think having enough RAM per core is more important. Nate's presentation shows that main.g2.bx.psu.edu has 4 GB of RAM per core.
4. How much I/O (read and write) can be expected at the memory controllers? Which tasks are most I/O intensive (regarding RAM and/or HDDs)?
Workflows currently write all of their output to disk and read all of their input from disk, so this gets back to the earlier point about benchmarking.
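Since everything is staged through disk, one crude way to size the storage is to take a representative job, add up the bytes it reads and writes, divide by its runtime, and multiply by the number of jobs you expect to run at once. The file names, runtime, and job count below are only illustrative assumptions to show the arithmetic:

import os

# Illustrative assumptions - replace with measurements from your own test runs.
input_files  = ["sample.fastq", "reference.fa"]    # what a representative job reads
output_files = ["sample.bam"]                      # what the same job writes
job_wall_time_s = 3600.0                           # measured runtime of that job
concurrent_jobs = 32                               # jobs you expect to run at once

bytes_moved = sum(os.path.getsize(f) for f in input_files + output_files)
per_job_mbs = bytes_moved / (1024.0 ** 2) / job_wall_time_s
aggregate_mbs = per_job_mbs * concurrent_jobs

print("per-job average: %.1f MB/s" % per_job_mbs)
print("aggregate for %d concurrent jobs: %.1f MB/s" % (concurrent_jobs, aggregate_mbs))
# Compare the aggregate figure against both the disk array's sustained
# throughput and the network link feeding it (see the earlier sketch).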
5. Roughly divided into mapping and clustering jobs: how much main memory can a single job be expected to require (given, e.g., Illumina exome data at 50x coverage)? As far as I know, mapping should need around 4 GB, and clustering much more (possibly high double-digit GB).
Nate's presentation shows that main.g2.bx.psu.edu provides 24 to 48 GB per 8-core reservation, which is in line with the 4 GB per core mentioned above.
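To turn those figures into a node spec, a quick sanity check is to see how many concurrent jobs of each type fit into a node's RAM. The per-job numbers below just follow your own estimates (about 4 GB for mapping, and 64 GB as one point in the "high double digits" range for clustering); the node configurations are only examples:

# Per-job memory figures follow the estimates above (mapping ~4 GB,
# clustering much more); the node configurations are only examples.
job_ram_gb = {"mapping": 4, "clustering": 64}

for node_cores, node_ram_gb in [(8, 48), (16, 128)]:
    print("node with %d cores / %d GB RAM:" % (node_cores, node_ram_gb))
    for job_type in sorted(job_ram_gb):
        by_ram = node_ram_gb // job_ram_gb[job_type]
        slots = min(node_cores, by_ram)    # limited by whichever runs out first
        print("  %-10s %2d concurrent jobs (RAM allows %d, cores allow %d)"
              % (job_type, slots, by_ram, node_cores))

The arithmetic just makes the point that mapping jobs tend to end up core-limited while clustering jobs end up RAM-limited on typical nodes.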
6. HDD access (R/W) happens mainly in larger blocks rather than as masses of short operations - correct?
Again, this all depends on the tools being used, and some benchmarks would help here (see the P.S. below for one quick way to check the access pattern). This question sounds like it's mostly about choosing a filesystem - is that right? If so, you may want to consider a compressing filesystem such as ZFS or Btrfs, and you may also want to look at distributed filesystems like Ceph or Gluster (now part of Red Hat). I know that Ceph can run on top of XFS and Btrfs, but you should look into Btrfs's churn rate - it may still be evolving quickly. Again, a ping to the Galaxy Czars call may help with any and possibly all of these questions.

Good luck!

-Scott
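P.S. If you want a rough feel for question 6 before settling on a filesystem, a tiny microbenchmark along these lines (large sequential writes versus many small writes on the same storage) can be revealing. The sizes and the /tmp path are arbitrary assumptions - point it at the storage you're actually evaluating, and treat it as a sanity check rather than a substitute for benchmarking the real tools:

import os
import time

TEST_DIR = "/tmp"              # assumption: point this at the storage under test
TOTAL_BYTES = 512 * 1024 ** 2  # 512 MB per test - adjust to taste

def timed_write(path, block_size, sync_every):
    """Write TOTAL_BYTES to path in block_size chunks, fsync'ing periodically."""
    block = b"\0" * block_size
    start = time.time()
    with open(path, "wb") as f:
        for i in range(TOTAL_BYTES // block_size):
            f.write(block)
            if (i + 1) % sync_every == 0:
                os.fsync(f.fileno())
        os.fsync(f.fileno())
    elapsed = time.time() - start
    os.remove(path)
    return TOTAL_BYTES / (1024.0 ** 2) / elapsed    # MB/s

# Large sequential blocks - roughly what streaming FASTQ/SAM/BAM looks like.
seq = timed_write(os.path.join(TEST_DIR, "seq.tmp"), 4 * 1024 ** 2, sync_every=16)
# Many small writes - closer to metadata-heavy or random-access patterns.
small = timed_write(os.path.join(TEST_DIR, "small.tmp"), 4 * 1024, sync_every=256)

print("large sequential blocks: %6.1f MB/s" % seq)
print("small writes:            %6.1f MB/s" % small)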