Surface Pro 8 - Speculation and Rumors

Discussion in 'Microsoft' started by dstrauss, Jul 9, 2020.

  1. daddyfish

    daddyfish Scribbler - Standard Member

    Messages:
    252
    Likes Received:
    159
    Trophy Points:
    56
    What Apple has done is impressive, but what AMD does is impressive as well (especially considering their limited budget compared to these two behemoths), and I feel many people tend to look at Apple's new CPUs through slightly too rose-tinted glasses. E.g., take a look at this single-core comparison (it should be noted that the i7 runs at 28 W and, IIRC, the MacBook Pro CPUs have a similar power draw):
    Cinebench R23 (Single-Core)
    • Intel Core i7-1165G7: 1504
    • Apple M1: 1498

    Now, the M1 takes the lead in multi-core, but so does AMD compared to Intel.
     
  2. desertlap

    desertlap Scribbler - Standard Member Senior Member

    Messages:
    2,626
    Likes Received:
    3,295
    Trophy Points:
    181
    Ok, I'm going to climb on a bit of a soapbox this morning.:D

    I'll get to some particulars in a moment, but what Apple has done with the M1 in concert with Big Sur is truly impressive in multiple ways and may turn out to be a game changer in several key areas.

    As background, one of my core job functions is explaining to customers why a given chipset and OS is best suited to the device(s) they are looking to us to provide. E.g., customer A may be best suited to a higher-clocked dual-core i3 running Windows, while another may be best suited to an 8-core ARM chip running a purpose-built Linux distro.

    So now a few generalities. First of all, all chips at their core level (pun intended) are RISC: in other words, ultimately they just flip bits between 0 and 1.

    Generality number two: for a good part of microprocessor history, because of limitations in raw clock speed, as well as in operating systems and apps, not to mention supporting hardware such as disk drives (think 5400 RPM spinners) and RAM (prior to DDR or low-latency switching), by far the most effective way to improve the performance of a system was to incorporate optimized code on the chip itself to perform some of the more complex but common tasks required by the OS or application. That is essentially the modern definition of CISC.

    The epitome of that, in my opinion, is what Intel did with MMX: offloading huge chunks of video encoding/decoding to the chip instead of relying on OS code to perform the task. It really is one of the shining examples of great CISC design, as my daughter learned in her processor design theory class.
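
    To make that "do it in one instruction on the chip" idea concrete, here is a minimal sketch of my own (not Intel's actual MMX-era code, and using the later SSE2 intrinsics since plain MMX is long obsolete): a single packed add processes eight 16-bit values where scalar code would need eight separate adds.

    Code:
    #include <emmintrin.h>  /* SSE2 intrinsics, the successor to MMX */
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* Two sets of eight 16-bit samples, e.g. a tiny slice of pixel or audio data. */
        int16_t a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        int16_t b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
        int16_t out[8];

        /* Scalar version would be: for (int i = 0; i < 8; i++) out[i] = a[i] + b[i]; */

        /* Packed version: one instruction adds all eight 16-bit lanes at once. */
        __m128i va = _mm_loadu_si128((const __m128i *)a);
        __m128i vb = _mm_loadu_si128((const __m128i *)b);
        __m128i vs = _mm_add_epi16(va, vb);
        _mm_storeu_si128((__m128i *)out, vs);

        for (int i = 0; i < 8; i++)
            printf("%d ", out[i]);   /* prints 11 22 33 44 55 66 77 88 */
        printf("\n");
        return 0;
    }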

    Generality number three: for most of the personal computer's original ascendance, on both Windows and the original Mac operating systems, cranking the clock speed was the most effective way to boost performance. And the previously mentioned CISC-based instructions of course benefited too.

    But around the time of the ascendance of the 486 and the IBM/Motorola PowerPC, chip makers started hitting limits on just cranking the clock speed or, for that matter in Intel's case, on adding more CISC instructions to their processors.

    That was when you started to see serious efforts on the hardware side to add more cores and multi-threading, and corresponding work in software to "divide the work of the app" up to utilize those additional cores and threads.

    So, generality number four: writing efficient multicore/multithreaded apps or an OS is REALLY HARD, likely due to the way human brains work, at least when it comes to creating sets of instructions (which is what apps essentially are).
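
    To give a feel for why (a toy C/pthreads sketch of my own, nothing from any real OS or app; build with -pthread): even summing an array across four threads already forces you to decide how to split the work and how to combine the partial results without the threads trampling each other.

    Code:
    #include <pthread.h>
    #include <stdio.h>

    #define N        1000000
    #define NTHREADS 4

    static long data[N];
    static long partial[NTHREADS];   /* one slot per thread avoids a shared-counter data race */

    static void *worker(void *arg) {
        long t = (long)arg;
        long chunk = N / NTHREADS;
        long sum = 0;
        for (long i = t * chunk; i < (t + 1) * chunk; i++)   /* each thread sums its own slice */
            sum += data[i];
        partial[t] = sum;
        return NULL;
    }

    int main(void) {
        for (long i = 0; i < N; i++) data[i] = 1;

        pthread_t tid[NTHREADS];
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, worker, (void *)t);

        long total = 0;
        for (long t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);   /* only combine after that thread is done */
            total += partial[t];
        }
        printf("total = %ld\n", total);   /* 1000000 */
        return 0;
    }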

    It is only relatively recently that development tools have incorporated, in some cases, almost AI-level smarts to compile apps to take advantage of those cores; the Android and iOS development environments, for instance.

    Which brings me to generality number five: mobile-device-oriented operating systems have the benefit of being much more recent inventions, and thus not tied to a lot of legacy design ideas and needs, and they additionally have much bigger drivers of other requirements, such as power efficiency.

    And that, in my opinion, is why right now you are seeing the largest leaps in "innovation" and "real world" performance there.

    So (this is getting quite long, apologies), to tie it back to the M1 and Big Sur: all of those learnings have culminated in what Apple released this month.

    To that point, with iOS and the A-series chips, Apple for a large part of their initial development focused on single-core performance. That is one huge reason Apple has had such stellar benchmarks in some cases, especially compared to comparable low-power Intel chips or some Qualcomm chips.

    This was also an acknowledgement of the aforementioned difficulties developers have with creating effective multicore, multithreaded applications.

    However, Apple already had the benefit of seeing what both Intel and Windows had gone through, and thus focused on optimizing the OS to leave the highest-performance core(s) free for apps to use, and on providing the tools (libraries and frameworks) to aid in this task so that apps can better utilize these same chips.

    Additionally, Apple had significant constraints that factored into this. For instance, you obviously can't put a 45-watt-TDP chip in an iPhone.

    Qualcomm/Android is on much the same journey; I'd argue that to date, for whatever reasons, Apple has just been better at it.

    So, to try and wrap this up: the two remarkable things Apple has done with the M1 are in the performance-per-watt realm AND the amazing cross-integration between the OS and the chip it runs on.

    So this is long even for me and rivals @lovelaptops posts :eek::D, but there you go :)

    My Saturday morning soliloquy :p
     
    Last edited: Nov 21, 2020 at 11:05 AM
  3. desertlap

    desertlap Scribbler - Standard Member Senior Member

    Messages:
    2,626
    Likes Received:
    3,295
    Trophy Points:
    181
    BTW: What I somehow neglected to say, which was the point of my voluminous post above, is that now, more than ever, different factions are going to use various benchmarks to "prove" the validity of their specific agenda.

    I think this board is smart enough generally to see through that, and take the various results and map them to what they actually use a computer for.
     
    darkmagistric likes this.
  4. Steve S

    Steve S Pen Pro - Senior Member Super Moderator

    Messages:
    8,161
    Likes Received:
    3,659
    Trophy Points:
    331
    <<...Writing efficient multicore/multithread apps or an OS is REALLY HARD...It is only relatively recently, that development tools have incorporated...to then compile apps to take advantage of those cores...>>

    I would argue that this is the fundamental reason that computer system performance (where system means the hardware, the OS and the application) has been lagging in recent years. It isn't just that the OS isn't able to optimally allocate assembly level instructions to the cores, it's also that applications aren't coded optimally to be so allocated...
     
    sonichedgehog360 and desertlap like this.
  5. desertlap

    desertlap Scribbler - Standard Member Senior Member

    Messages:
    2,626
    Likes Received:
    3,295
    Trophy Points:
    181
    Yup. If we could have 500 GHz single-core processors, we likely wouldn't be having these discussions. Dang physics! :)

    EDIT: To your point, the fact that both Windows and some Linux distros even have a feature called core affinity speaks exactly to that.
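
    For anyone curious what that looks like from a program's point of view, here's a minimal Linux-only sketch of my own using sched_setaffinity (Windows exposes the same idea via SetProcessAffinityMask and the Task Manager "Set affinity" option):

    Code:
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);                 /* this process may now run only on core 0 */

        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return 1;
        }
        printf("pinned to core 0; the scheduler won't migrate this process elsewhere\n");
        return 0;
    }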
     
    Last edited: Nov 21, 2020 at 11:39 AM
    sonichedgehog360 likes this.
  6. Marty

    Marty Pen Pro - Senior Member Senior Member

    Messages:
    3,417
    Likes Received:
    3,347
    Trophy Points:
    231
    Very nice overview, @desertlap. But we gotta go deeper than that for our Saturday morning readers, right? So let's dive into the guts of how Apple achieves this in the hardware, shall we? ;)

    First off, one particular finding by Anandtech struck me, regarding the memory architecture on the M1:

    (Anandtech)
    "Besides the additional cores on the part of the CPUs and GPU, one main performance factor of the M1 that differs from the A14 is the fact that’s it’s running on a 128-bit memory bus rather than the mobile 64-bit bus. Across 8x 16-bit memory channels and at LPDDR4X-4266-class memory, this means the M1 hits a peak of 68.25GB/s memory bandwidth...

    Inside of the M1, the results are ground-breaking: A single Firestorm achieves memory reads up to around 58GB/s, with memory writes coming in at 33-36GB/s. Most importantly, memory copies land in at 60 to 62GB/s depending if you’re using scalar or vector instructions. The fact that a single Firestorm core can almost saturate the memory controllers is astounding and something we’ve never seen in a design before."
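
    (For anyone checking where that 68.25GB/s figure comes from: 8 channels x 16 bits = 128 bits = 16 bytes per transfer, and 16 bytes x ~4,266 million transfers per second ≈ 68.25GB/s.)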


    The M1 is built on a 4+4 core big.LITTLE design (4x 3.2GHz "Firestorm" performance cores and 4x 2.1GHz "Icestorm" efficiency cores), much like the Kryo 495 reference design in Qualcomm's SQ1/2 chips. On paper, these architectures appear much the same (with similar clocks, memory bandwidth and TDPs).

    But what Apple has done is optimize each single performance core to be able to fully utilize the memory bandwidth of the entire SoC (even though it is supposed to be a multi-core architecture). This means that each core is able to act very much like a monolithic Intel core and handle high-burst workloads from a single complex task, whereas the SQ1/2 cores (and most other ARM chips) seem designed for more balanced loads.
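
    If you want a crude feel for that kind of number on your own machine, here's a rough sketch of my own (nothing like Anandtech's proper methodology, and the result depends heavily on the compiler, libc and OS): just time a big single-threaded copy and divide.

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    int main(void) {
        size_t bytes = (size_t)1 << 30;          /* 1 GiB, far larger than any cache */
        char *src = malloc(bytes), *dst = malloc(bytes);
        if (!src || !dst) return 1;
        memset(src, 1, bytes);                    /* touch the pages so they're really mapped */
        memset(dst, 0, bytes);

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        memcpy(dst, src, bytes);                  /* one core vs. the whole memory system */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        /* a copy reads and writes every byte, so count 2x the buffer as traffic */
        printf("~%.1f GB/s copy bandwidth (check byte: %d)\n",
               2.0 * bytes / secs / 1e9, dst[bytes - 1]);
        free(src);
        free(dst);
        return 0;
    }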

    Another major difference between the M1 and the SQ1/2 is in cache size (think super-fast CPU-exclusive memory that saves time from fetching from main memory, RAM). The SQ design has 2MB of shared L3 cache, while the M1 features a monstrous 12MB of shared "L2" cache. For comparison, a typical consumer 10th-gen Intel processor features just 4-8MB of shared cache.

    And why do I put Apple's "L2" cache figure in quotes? Well, a typical tiered cache architecture looks like this:

    (ExtremeTech diagram of a typical tiered L1/L2/L3 cache hierarchy)

    L1/L2 cache is typically ultrafast/expensive memory private to each core, while L3 cache is typically much slower and shared between cores.
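
    You can actually "see" those tiers from ordinary code. Here's a rough sketch of my own (a real measurement would use random pointer chasing to defeat the prefetchers, so treat the output as illustrative): walk working sets of a few different sizes and watch the time per access jump as each one falls out of a cache level into the next.

    Code:
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Repeatedly stride through a buffer of 'size' bytes; return rough ns per access. */
    static double ns_per_access(size_t size) {
        size_t n = size / sizeof(size_t);
        size_t *buf = malloc(size);
        volatile size_t sink = 0;                 /* keeps the loads from being optimized away */
        if (!buf) return 0.0;
        for (size_t i = 0; i < n; i++) buf[i] = i;

        size_t passes = ((size_t)256 << 20) / size;   /* keep total work roughly constant */
        if (passes == 0) passes = 1;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t p = 0; p < passes; p++)
            for (size_t i = 0; i < n; i += 16)    /* 128-byte stride: each access hits a new cache line */
                sink += buf[i];
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        free(buf);
        return secs * 1e9 / ((double)passes * (n / 16));
    }

    int main(void) {
        /* 256 KiB sits in a private cache, 8 MiB straddles a typical shared cache, 256 MiB is plain DRAM. */
        size_t sizes[] = {(size_t)256 << 10, (size_t)8 << 20, (size_t)256 << 20};
        for (int i = 0; i < 3; i++)
            printf("%6zu KiB working set: %.2f ns/access\n",
                   sizes[i] >> 10, ns_per_access(sizes[i]));
        return 0;
    }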

    Apple calling their shared cache "L2" instead of L3 is actually a subtle dig at the rest of the computing world (read: Intel and Qualcomm), saying, "Hey, our shared cache is as fast as the private caches on your slow-ass SoCs!" :p

    So to sum up, TL;DR: not only is Apple's basic memory architecture far more optimized for high single-core performance, they often don't even need to hit that (better-architected) memory, because they have six times the processor cache, which itself is also faster than the caches of other ARM and x86 CPUs.

    Now how's that for laying down the gauntlet? (or smack in the face :))
     
    lovelaptops, dstrauss, JoeS and 4 others like this.
  7. sonichedgehog360

    sonichedgehog360 AKA Hifihedgehog Senior Member

    Messages:
    2,188
    Likes Received:
    1,759
    Trophy Points:
    181
    It is incredible when you note that Ryzen 5000 does not even come remotely close in the bandwidth arena. What Apple is achieving is what you normally see in a GPU-style design, and, as I will explain below, that design philosophy comes at a cost in latency.

    (memory bandwidth comparison charts)

    There is a drawback, however (not surprisingly, Andrei Frumusanu glosses over this point, since he has a bit of a reputation for being an Apple fanboy), to upping the bandwidth to insanity levels, and that takes us back to the design philosophies of GPUs and CPUs. CPUs are meant to be fast in ultra-low-latency tasks with less parallelism, whereas GPUs give up razor-thin access times in exchange for opening the floodgates to mammoth-wide parallelism. You can see below the concessions the M1 made in latency compared to the Ryzen 5000/Zen 3 series.


    (memory latency comparison charts: M1 vs. Ryzen 5000/Zen 3)


    One can argue that, on one hand, this helps the integrated GPU pull several factors ahead of the competition, and, to its credit, the M1 does fulfill that design objective with an insane landslide victory on the graphics front. On the other hand, the CPU cores are going to see minimal, if any, performance benefit from all that bandwidth (for them, it is akin to installing a firehose as a water fountain), given the linearity and simplicity of the tasks they deal with compared to a GPU's highly parallel, data-heavy tasks.

    It is the classic monster-truck-versus-sports-car analogy in action here in Apple's optimization of the data pipeline, where they give up lightning-quick cornering in favor of white-knuckle raw throughput. Apple has done some amazing things by giving up some traditionally CPU-favoring latency for the GPU-favoring bandwidth they have baked into their design. I hope AMD takes notes and tries to mimic some of this in the future, since they are a video game hardware company as well as an APU company that could use these tricks to up the ante for their potent integrated graphics architecture.
     
    Last edited: Nov 21, 2020 at 1:31 PM
    lovelaptops, Marty and Steve S like this.
  8. desertlap

    desertlap Scribbler - Standard Member Senior Member

    Messages:
    2,626
    Likes Received:
    3,295
    Trophy Points:
    181
    Kudos!!! @Marty

    I was trying both for at least some brevity and not to go too far into the weeds, but on that last item I should have known my audience :)

    One more very top-level thing: Apple, in the best ways possible, utilized what has been learned about what works and what doesn't with processors, supporting hardware, applications and operating systems, and the result is the M1/Big Sur Mac.

    The thing I'd be asking myself if I were Qualcomm or MS is, "We knew all this too, so how did we so badly blow it with WOA?"

    PS: The one area where I could go absolutely as deep into the weeds as anyone is display tech. That's absolutely because of where I started, but I'll spare you all the tedium that it can be :)
     
    Last edited: Nov 21, 2020 at 3:28 PM
    lovelaptops and Steve S like this.
  9. desertlap

    desertlap Scribbler - Standard Member Senior Member

    Messages:
    2,626
    Likes Received:
    3,295
    Trophy Points:
    181
    @sonichedgehog360

    Those are indeed impressive stats. Once again, though, I feel like you are focused on benchmarks and specific facts that put AMD in the very best light. That's fine as far as it goes, but it misses what is to me the much bigger picture of modern computing devices and their associated operating systems and applications, which you simply have to consider in aggregate as a SYSTEM.

    To talk to one very specific example you mention about the M1: the ability for an individual core to very nearly saturate the available memory bandwidth could indeed create latency issues for the other cores. That is 100% true. But...

    By the same token, that same capability lets an app that, for good or for bad, totally pounds a single core absolutely fly. The top-level user experience becomes, "man, that app really screams on this system."

    And here is where Apple's "aha moment" likely occurred: the OS and the chips are intertwined so tightly that, as much as possible, the best results are achieved.

    So that brings me back to my core (pun intended) issue with what is modern Wintel, and which AMD is locked into as well (though arguably they are the smartest kid in the class).

    The issue is that the multiple legacies of what started as a single-core, CISC-based, single-threaded baseline are what's holding it back today. Not to mention the absolutely ginormous range of devices that 100% led to the earlier unqualified success of Windows, up to today.

    But that same legacy bloats the code, the OS and the user experience. I mean really, how many of us need to support a SCSI disk array with our Dell XPS laptop?

    Perhaps we are at the point where only Apple, because they control the whole widget, could do what they do...

    PS: Though we have not sold one yet, we have one of our custom devices that sort of makes my point. It's actually built with a Ryzen 4500U. But... in all of our measurements it clobbers the same device with a six-core 10th-gen Intel Core i7 chip. The key difference in this case is the highly customized, purpose-built Linux distro that controls it.
     
    Last edited: Nov 21, 2020 at 3:45 PM
    sonichedgehog360 and Steve S like this.
  10. desertlap

    desertlap Scribbler - Standard Member Senior Member

    Messages:
    2,626
    Likes Received:
    3,295
    Trophy Points:
    181
    One last comment on the whole RISC versus CISC thing. I truly think both are perfectly valid approaches to modern computing with both having strengths and weaknesses.

    To use an analogy: say you have two trees to chop down. Who's going to be faster, Paul Bunyan with his mighty axe, or six forest rangers with the fastest, greatest chainsaws? The answer in this case is Paul Bunyan, because the rangers will just get in each other's way...

    OTOH, if you are talking about a whole stand of trees, Paul Bunyan will blow his heart out before he gets done.
     
    Last edited: Nov 21, 2020 at 3:40 PM
    Marty, sonichedgehog360 and Steve S like this.