Blog

2024-11-27

Five Key Takeaways from Mobileye's Driving AI Day

CEO Prof. Amnon Shashua and CTO Prof. Shai Shalev-Shwartz discuss key AI advances in autonomous driving.

Prof. Shai Shalev-Shwartz and Prof. Amnon Shashua
CTO, and President & CEO, speaking at Driving AI


Mobileye has been one of the leaders in navigating the path towards full self-driving systems. Achieving a fully autonomous "eyes-off" system is the goal, but it demands exceptionally high safety standards. Developing such a complex system requires substantial long-term investment. 

To reach this milestone, we focus on maintaining a sustainable business by generating revenue today through our leadership in advanced driver-assistance technologies, all while keeping our eyes on the goal of full autonomy. But how do we get there? Last month, our CEO, Prof. Amnon Shashua, and CTO, Prof. Shai Shalev-Shwartz, took to the stage at Mobileye’s Driving AI 2024 to discuss and share the company’s innovative AI methods for reaching that milestone.

Here are the five main takeaways from the lecture on how Mobileye, through its intelligent leveraging of AI, is solving autonomy one system at a time. 

Watch the full video below.

There's no one way to solve autonomy 

In his opening remarks, Mobileye CEO Prof. Amnon Shashua outlined the company's perspective on the various approaches to solving autonomy. Examples include Waymo's lidar-centric strategy with a compound AI systems (CAIS) approach, Tesla's camera-only, end-to-end AI approach, and Mobileye's own camera-centric method with a CAIS AI model.

Prof. Shashua emphasized the importance of examining each approach against four key pillars for successful autonomy: cost, modularity, geographic scalability, and Mean Time Between Failures (MTBF). While each approach has unique strengths and limitations, none has fully solved autonomy by meeting all four pillars.

With each approach, there are trade-offs. One approach may offer high accuracy but be prohibitively expensive to produce, while another, by itself, may be limited or unreliable. For example, a lidar-centric approach provides high accuracy, resulting in a very high (good) MTBF, but its cost hinders scalability, making it less suitable for the wider market.

Conversely, a camera-only or camera-centric approach is more affordable, but typically makes it harder to reach a high MTBF, while a pure end-to-end approach raises the alignment problem in automotive AV (the difficulty of ensuring that the goals of an AI system or machine-learning model align with human objectives). A balanced approach can bridge these gaps, offering both safety and scalability for the evolving AV market.

A pure end-to-end approach has its limitations  

The End-to-End approach is built on the premise that the more data you feed into the system, the better it gets at mimicking human driving behavior, eventually reaching or even surpassing human-level performance. This method eliminates the need for "glue code," or manual coding. Instead, it's all about data—unsupervised data specifically.  

The transformer-based neural network continuously learns from millions of cars sending driving data, eliminating the need for the manual process of humans labeling or interpreting that data. However, this approach faces three significant challenges: the lack of abstractions, the shortcut learning problem, and the long-tail problem, each of which highlights the limitations of current systems in effectively handling the complexities of real-world driving scenarios.

In his talk, Prof. Shashua illustrated the lack-of-abstractions limitation through the "calculator problem": the difficulty ChatGPT has in reliably handling complex, multi-step calculations due to the limitations of its language-based architecture.

The fix was almost too easy: the model must defer to a calculator tool. In ChatGPT's case, a Python environment was integrated to handle the computation accurately. Prof. Shashua argued that relying solely on unsupervised data in the End-to-End approach to address all the complexities of a safety-critical system like autonomous vehicles is both questionable and risky.
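To make the tool-delegation idea concrete, here is a minimal, hypothetical Python sketch: instead of letting a language model predict the digits of a result, the system routes anything that looks like arithmetic to an exact evaluator and falls back to the model's answer otherwise. The function names and the crude routing check are illustrative assumptions, not any particular product's implementation.

```python
# Minimal sketch of tool delegation: arithmetic goes to an exact evaluator,
# everything else stays with the language model. Names are hypothetical.
import ast
import operator

# Operators the toy calculator is allowed to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
    ast.USub: operator.neg,
}

def calculator_tool(expression: str) -> float:
    """Exactly evaluate a simple arithmetic expression via Python's AST."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expression, mode="eval"))

def answer(query: str, llm_guess: str) -> str:
    """Route arithmetic to the exact tool; otherwise trust the model's guess."""
    if any(ch.isdigit() for ch in query):   # crude "is this math?" check
        try:
            return str(calculator_tool(query))
        except ValueError:
            pass
    return llm_guess

print(answer("123456789 * 987654", "<model guess>"))  # exact, not approximated
```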

This approach also risks embedding undesired or even dangerous driving behaviors in the learning phase. Here, the "alignment problem" in AV becomes evident as the model may prioritize common but incorrect behaviors over rare but correct ones.

For example, human drivers often perform rolling stops at stop signs or engage in rude driving behaviors; the system might learn these as common actions despite them being incorrect. So distinguishing between correct and incorrect actions—especially in rare but correct scenarios—remains a complex challenge.

A practical illustration of these issues can be seen in data collected by the Tesla Full Self-Driving (FSD) tracker. The long-tail problem emerges here: the tracker data indicates that even with extensive data input, the model struggles to adequately address rare events, ultimately affecting the system's overall safety and reliability and hindering its ability to improve MTBF with each new generation.

This problem highlights the limitations of relying on large datasets alone, as rare but critical driving scenarios are often underrepresented.  

Primary, Guardian, and Fallback fusion is critical for a safe system

There are countless decisions we encounter throughout a driving journey, from very simple, binary ones—such as turning left or right, braking or not braking—to more complex and nuanced choices, such as: if I brake, should I do so gently or harshly?

A typical approach involves multiple systems offering their answer and taking the majority rule - two out of three systems say merge left, so merge left. But what if the three systems offer three different suggestions, such as turn left, turn right, or keep straight? The majority rule is not available here as a solution.  

The answer is the Primary, Guardian, and Fallback (PGF) fusion approach, which functions as a layered decision-making model. Here’s how it works:

  • Primary: This system is a standard self-driving system (SDS) that generates a trajectory—a planned route or course of action for the vehicle. It’s essentially the main decision-maker in most situations, outputting the initial suggested path or movement. 
  • Fallback: Like the Primary, the Fallback is another SDS that can generate its own trajectory or alternative route. It serves as a backup in case the Primary system encounters an issue or if the Guardian system detects a potential problem. 
  • Guardian: The Guardian acts as a monitoring layer. Instead of producing a route, it evaluates the Primary system’s trajectory to ensure it meets certain safety and feasibility standards. If the Guardian detects an issue, it can prompt a switch to the Fallback system to ensure safe navigation.

In a binary scenario, such as deciding whether to apply the brakes, consider a triple-sensor system: the camera (Primary), the radar (Guardian), and the lidar (Fallback). We rely on a majority vote: if the camera and radar agree, we follow suit; if they disagree, we defer to the lidar, which will inevitably align with one of the other two, ensuring we follow the majority. In this binary case, PGF is equivalent to a 2-out-of-3 majority vote.

The notion of “majority over three sub-systems” is well-defined only for binary decisions. However, many of the decisions we need to make while driving are not binary decisions.

Most notably, the geometry of the lane we are driving in is not a binary decision, and this geometry has profound implications for RSS (Responsibility-Sensitive Safety) decisions. We therefore propose a generalization of the majority vote, which we call the Primary-Guardian-Fallback (PGF) fusion system. This fused system follows either the Primary or the Fallback system depending on the output of the Guardian system.
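As a rough illustration of how this layered selection could be wired together, here is a small, hypothetical Python sketch of PGF fusion based only on the description above. The trajectory type, the world-state dictionary, and the Guardian's safety check are placeholders, not Mobileye's actual interfaces.

```python
# Conceptual sketch of Primary-Guardian-Fallback (PGF) fusion: two full driving
# policies each propose a trajectory, and a Guardian layer (which proposes
# nothing itself) decides which proposal to execute. All types are placeholders.
from dataclasses import dataclass
from typing import Callable, List, Tuple

Trajectory = List[Tuple[float, float]]  # e.g. a planned sequence of (x, y) points

@dataclass
class PGFFusion:
    primary: Callable[[dict], Trajectory]         # main self-driving system
    fallback: Callable[[dict], Trajectory]        # independent backup system
    guardian: Callable[[dict, Trajectory], bool]  # True if the trajectory is safe/feasible

    def decide(self, world_state: dict) -> Trajectory:
        proposal = self.primary(world_state)
        if self.guardian(world_state, proposal):
            return proposal                       # Primary is validated, follow it
        return self.fallback(world_state)         # otherwise switch to the Fallback

# Toy usage: a Guardian that rejects any trajectory drifting outside the lane.
pgf = PGFFusion(
    primary=lambda s: [(0.0, 0.0), (1.0, 0.2)],
    fallback=lambda s: [(0.0, 0.0), (1.0, 0.0)],
    guardian=lambda s, traj: all(abs(y) <= s["lane_half_width"] for _, y in traj),
)
print(pgf.decide({"lane_half_width": 0.1}))  # Guardian rejects Primary, Fallback is used
```

In the binary braking example above, this structure reduces to the 2-out-of-3 vote: if the camera (Primary) and radar (Guardian) agree, the camera's decision is followed; if they disagree, the lidar-based Fallback effectively casts the deciding vote.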

Transformer efficiency can be increased by up to 100 times

In the realm of object detection, communication is paramount. Feeding images into the tokenization process—transforming them into data the model can understand and process—needs to be executed as precisely as possible. But taking efficiency to the next level, beyond intelligent tokenization alone, means creating a system where tokens can communicate with each other en masse in an effective and organized way.

Bear in mind, the communication process needs to be highly efficient due to the heavy computational demands on a single chip. For the chip to perform seamlessly, data flow must be optimized to prevent performance lags that could ultimately slow down communication. 

In his talk, Mobileye CTO Prof. Shai Shalev-Shwartz explains how we tackle this process and boost transformer efficiency by 100 times with the STAT (Sparse Typed Attention) method. The STAT method optimizes communication between tokens by organizing them into structured groups. Think of thousands of people trying to talk to each other in a giant stadium: this would lead to chaos. The same concept applies to thousands of tokens, which in that scenario would struggle to communicate effectively with each other.

Unless, of course, they’re divided into specific roles—such as 'regular tokens' and 'manager tokens'—to create a more organized communication structure, or in this case, relevant connectivity. This is essentially the idea behind STAT, where introducing structure and parameters improves model efficiency through improved typing and organization. 

So how do we achieve this? We establish order by dividing and grouping the tokens with “managers”, or link tokens, so that regular tokens communicate with their link token (“manager”) rather than with every other token. For example, by grouping 300 regular tokens with 32 link tokens, the regular tokens communicate with the link tokens, and the link tokens communicate with each other. This structured approach significantly reduces complexity and leads to a 100-fold increase in efficiency.
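A back-of-the-envelope count, using the 300 regular / 32 link token figures mentioned above, shows why this grouping cuts the amount of token-to-token communication. The sketch below illustrates the counting argument only; it is not the actual STAT formulation, and the exact savings depend on scale and design, growing as the token count grows.

```python
# Count pairwise attention interactions: full attention vs. grouped attention
# with link ("manager") tokens. Figures taken from the example in the text.

def full_attention_cost(n_tokens: int) -> int:
    # Every token attends to every token.
    return n_tokens * n_tokens

def grouped_attention_cost(n_regular: int, n_link: int) -> int:
    # Regular tokens talk only with link tokens (in both directions),
    # and link tokens talk among themselves.
    return 2 * n_regular * n_link + n_link * n_link

n_regular, n_link = 300, 32
full = full_attention_cost(n_regular + n_link)
grouped = grouped_attention_cost(n_regular, n_link)
print(full, grouped, round(full / grouped, 1))
# The gap widens as the number of tokens grows, which is where the
# order-of-magnitude (up to ~100x) efficiency gains come from.
```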

There’s a sweet spot on the efficiency-to-flexibility spectrum

Prof. Shai Shalev-Shwartz outlined the balance between efficiency and flexibility when it comes to chip operation. Simply put, if we were to create a super-efficient chip with only one built-in purpose, it would be extremely efficient, but also very limited.

On the other hand, a chip designed to handle many tasks will be flexible, but its performance wouldn't be nearly as efficient. That’s the essential trade-off between efficiency and flexibility – especially when talking about a chip on-board a vehicle. However, the EyeQ™6 High chip hits the sweet spot for automated driving, offering the right mix of both flexibility and efficiency. 

To achieve this, the chip contains a variety of components, each with varying degrees of flexibility and efficiency. Prof. Shai Shalev-Shwartz mentions five components, with five distinct architectures, that range from highly specific and efficient to highly flexible. Ranging from two CPUs (MPC and MIPS), which are highly flexible, to the XNN, which is highly efficient and task-specific, with two additional accelerators in between, the chip adapts depending on the operation, moving across the spectrum.

The efficiency of the Mobileye EyeQ6 High in executing demanding AI deep learning tasks is impressive. With a capability of 34 TOPS (tera operations per second), it significantly outperforms its predecessor, the EyeQ5, which offers fewer TOPS. However, simply comparing TOPS figures doesn’t tell the whole story.

The true measure of a chip's effectiveness for automated driving applications lies in its ability to process frames per second across various neural network tasks. For example, the EyeQ6 High can handle over 1,000 frames per second for a pixel-labeling NN, compared to just 91 frames per second on the EyeQ5—a more than tenfold increase in efficiency. This improvement comes not just from higher clock speeds but from a specialized architecture, mostly the XNN, designed for high utilization in specific applications.

Compared to Nvidia's Orin chip, for example, which can reach 275 TOPS, the raw numbers may suggest the Orin is superior. Yet when running a standard ResNet-50 network, the difference in frame throughput is only a factor of two.

This highlights that TOPS alone isn’t a sufficient measure of effectiveness; context and efficiency are key. Overall, the EyeQ6 High’s design focuses on tailored functionality and efficiency, supported by a solid software stack that allocates tasks optimally across its various accelerators. In essence, the EyeQ6 High’s real strength is its smart design, tailored to handle specific tasks efficiently, proving that raw TOPS numbers alone don’t capture true performance.
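As a simple arithmetic illustration of that point, using only the figures quoted above: the Orin's raw TOPS advantage is roughly eight-fold on paper, but since the measured ResNet-50 throughput gap is only about two-fold, the implied per-TOPS efficiency advantage sits with the EyeQ6 High. The short sketch below just redoes this division and assumes nothing beyond the quoted numbers.

```python
# Illustrative arithmetic only, using the figures quoted in the text above.
eyeq6_tops = 34          # EyeQ6 High peak compute (TOPS), from the text
orin_tops = 275          # Nvidia Orin peak compute (TOPS), from the text
resnet50_fps_ratio = 2   # Orin runs ResNet-50 only ~2x faster, from the text

raw_tops_ratio = orin_tops / eyeq6_tops                       # ~8.1x on paper
implied_efficiency_gap = raw_tops_ratio / resnet50_fps_ratio  # ~4x better utilization per TOPS

print(f"raw TOPS ratio: {raw_tops_ratio:.1f}x, "
      f"implied per-TOPS efficiency advantage for EyeQ6 High: {implied_efficiency_gap:.1f}x")
```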

To sum up, Mobileye’s expertise in AI combined with purpose-built hardware optimized for efficiency reflects a clear path toward scalable autonomy. As we continue progressing step-by-step, these breakthroughs bring us closer to the vision of a fully autonomous future. 
